Big Data

9 features of modern data architectures

The last few years have seen a massive change in the data landscape. With the rise of big data, there has been rapid innovation in the tools, skills and roles involved in data systems. Data architectures have evolved beyond monolithic, centralized databases and unwieldy analytic applications to distributed, scalable architectures with simpler, collaborative and interactive analytic tools. In this post, I look at the defining features of modern data architectures. Modern data architectures generally feature the following (though not all of these may be present in the same system):

Kafka - building real-time stream data pipelines

Over the past few years, Kafka has become the most exciting new addition to the big data distributed architecture landscape. Originally developed at LinkedIn, its creators Jay Kreps, Jun Rao and Neha Narkhede have since launched Confluent, a company that develops it under an open-core business model. At its core, Apache Kafka reinvents the database log to provide a highly scalable, fault-tolerant, high-performance distributed system that serves as the data pipeline backbone for stream data processing.
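To make the log-as-pipeline idea concrete, here is a minimal sketch using the third-party kafka-python client (pip install kafka-python). The broker address, topic name and message fields are assumptions purely for illustration:

```python
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: append JSON events to a topic, i.e. a partitioned, replicated log.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": 42, "action": "page_view"})
producer.flush()

# Consumer: read the same log from the beginning, in order, at its own pace.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)  # offsets make the stream replayable
```

Because each consumer tracks its own offset into the log, many downstream systems can read the same pipeline independently, which is what makes Kafka work as a backbone rather than a point-to-point queue.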

Hadoop's small files problem

Small files are a big problem in Hadoop. Hadoop is built to manage big data, and by design HDFS stores very large files in a distributed cluster with streaming access to the data. For reference, a typical HDFS block is 64 MB or 128 MB. Each small file (a few MB or less) occupies a block of its own, and the metadata for every block is held in the NameNode's memory, so millions of small files spread across the nodes of the cluster add up to serious overhead.
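One common mitigation is to compact many small files into a few block-sized ones before analysis. Here is a minimal PySpark sketch of that idea; the paths and the target file count are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read a directory full of small text files as one logical dataset...
df = spark.read.text("hdfs:///data/incoming/small-files/*")

# ...and rewrite it as a handful of large files, cutting the number of
# blocks (and NameNode metadata entries) the cluster has to track.
df.coalesce(4).write.mode("overwrite").text("hdfs:///data/compacted/")
```

Alternatives such as Hadoop archives (HAR files) or SequenceFiles attack the same problem at the storage layer.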

What roles do you need in your data science team?

Over the past few weeks, we’ve had several conversations in our data lab about data engineering problems and the day-to-day problems we face with unsupervised data scientists who find it difficult to deploy their code into production. The opinions from the business seemed to cluster around a tacit definition of data scientists as researchers, primarily from statistics or mathematics backgrounds, who are experienced in machine learning algorithms and often in some domain areas specific to our business, (e.

An introduction to Data Science

I presented a talk last week introducing Data Science and associated topics to some enthusiasts. Here’s a slide deck I created quickly with markdown using Swipe - a start-up building HTML5 presentation tools. The contents include:

- Data scientist skills
- Data science: enablers and barriers
- Big data analytics
- Data science lifecycle
- Use cases
- Tools and technology
- Project approach
- Machine learning
- Skills and roles
- Learning resources

Here are the slides:

Why Spark is the big data platform of the future

Apache Spark has created a lot of buzz recently. In fact, beyond the buzz, Apache Spark has seen phenomenal adoption and has been marked out as the successor to Hadoop MapReduce. Google Trends confirms the hockey-stick growth in interest in Apache Spark. All leading Hadoop vendors, including Cloudera, now include Apache Spark in their Hadoop distributions. So what exactly is Spark, and why has it generated such enthusiasm? Apache Spark is an open-source big data processing framework designed for speed and ease of use.
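As a taste of that ease of use, here is the canonical word count in PySpark; the input path is an assumption for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///data/books/*.txt")  # distributed dataset (RDD)
    .flatMap(lambda line: line.split())      # one record per word
    .map(lambda word: (word, 1))             # pair each word with a count
    .reduceByKey(lambda a, b: a + b)         # aggregate counts in parallel
)
print(counts.take(10))
```

The equivalent job in classic Hadoop MapReduce takes dozens of lines of Java and a round trip to disk between stages, which is a large part of why Spark's in-memory, high-level API has generated such enthusiasm.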

A gentle introduction to Machine Learning

Machine learning is a big part of big data and data science. It is a subset of artificial intelligence, a branch of science notorious for requiring advanced knowledge of mathematics. In practice, though, most data scientists don’t try to build a Chappie, and there are simpler, practical ways to get started with machine learning. Machine learning in practice involves making predictions from data. Notable examples include Amazon’s “customers also bought” product recommendations, Gmail’s priority inbox, and email spam filters.
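To show how little code a first prediction task needs, here is a minimal spam-filter sketch using scikit-learn; the tiny inline dataset is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A toy labelled dataset; in practice you would train on thousands of emails.
emails = [
    "win a free prize now",              # spam
    "limited offer, claim your cash",    # spam
    "meeting moved to 3pm",              # ham
    "draft report attached for review",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Learn per-class word frequencies, then predict labels for unseen text.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["claim your free cash prize"]))  # -> ['spam']
```

The point is not this particular model but the workflow: labelled examples in, a fitted model out, predictions on new data.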

Designing the future - Data Innovation Labs

With the ongoing Big Data revolution and the impending Internet of Things revolution, there has been renewed enthusiasm for “innovation” around data. Similar to the Labs concept pioneered by Google (think Gmail Beta based on Ajax, circa 2004), more and more organizations, business communities, governments and countries are setting up labs to foster innovation in data and analytics technologies. The idea behind these “data innovation labs” is to develop avant-garde data and analytics technologies and products in an agile fashion and move quickly from concept to production.

Set up a Hadoop Spark cluster in 10 minutes with Vagrant

With the big 3 Hadoop vendors - Cloudera, Hortonworks and MapR - each providing their own Hadoop sandbox virtual machines (VMs), trying out Hadoop today has become extremely easy. For a developer, it is extremely useful to download one of these VMs, get started with Hadoop and practice data science right away. However, alongside core Apache Hadoop, these vendors package their own software into their distributions, mostly for orchestration and management, which can be a pain given the multiple scattered open-source projects within the Hadoop ecosystem.

The data science project lifecycle

What does the typical data science project life-cycle look like? This post looks at practical aspects of implementing data science projects. It also assumes a certain level of maturity in big data (more on big data maturity models in the next post) and data science management within the organization. The life cycle presented here therefore differs, sometimes significantly, from purist definitions of ‘science’ which emphasize the hypothesis-testing approach. In practice, the typical data science project life-cycle resembles more of an engineering view, imposed by constraints of resources (budget, data and skills availability) and time-to-market considerations.

BI in the digital era

Some time back I presented a webinar on BrightTalk. The slides for the talk have now been uploaded to Slideshare. The talk focused on the changes in digital technology disrupting businesses, the effect of Big Data, the FOMO (fear of missing out) effect on big business - and what all this means for the way we do business intelligence in the digital era. Key themes:

- Disruption in traditional IT with cloud computing
- Changing economics and changing business models
- Rise of Big Data
- Tech changes to manage Big Data - distributed computing
- Shift from “current-state” to “next-state” questions
- Introducing Data Science
- Challenges - regulatory, data privacy
- Dangers of data science - over-fitting, interpretation
- Managing big data projects
- Data Science MOOCs (massive open online courses), tools and resources

Big Data Basics - Part 4 : NoSQL and NewSQL explained

This is the fourth part of a series of posts on big data. Read the previous posts here: Part-1, Part-2 and Part-3. With the ongoing data explosion, and the improvement in technologies able to deal with it, businesses are turning to leverage this big data, mining insights to gain competitive advantage, reinvent business models and create new markets. A huge share of these “big data” volumes comes from system logs, user-generated content on social media like Twitter or Facebook, sensor data and the like.

Basics of Big Data - Building a Hadoop data warehouse

This is the 3rd part of a series of posts on Big Data. Read Part-1 (What is Big Data) and Part-2 (Hadoop). Traditionally, data warehouses have been built with relational databases as the backbone. With the new challenges (3Vs) of Big Data, relational databases have been falling short of the requirements of handling:

- New data types (unstructured data)
- Extended analytic processing
- Throughput (TB/hour loading) with immediate query access

The industry has turned to Hadoop as a disruptive solution for these very challenges.
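Here is a minimal sketch of the schema-on-read pattern at the heart of a Hadoop warehouse, expressed through Spark's Hive support; the table name, columns and HDFS path are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hadoop-dwh")
    .enableHiveSupport()  # use the Hive metastore for table definitions
    .getOrCreate()
)

# Project a schema onto raw files already sitting in HDFS: no upfront
# transformation, and "loading" is just copying files into the directory.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts STRING,
        user_id STRING,
        url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'hdfs:///data/raw/web_logs'
""")

spark.sql("SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url").show()
```

This schema-on-read approach is one reason Hadoop can absorb unstructured data at high loading rates while keeping it immediately queryable.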

Basics of Big Data – Part 2 - Hadoop

As discussed in Part 1 of this series, Hadoop is the foremost among the tools currently being used for deriving value out of Big Data. The process of gaining insights from data through business intelligence and analytics essentially remains the same. However, with the huge variety, volume and velocity of data (the 3Vs of Big Data), it has become necessary to re-think the data management infrastructure. Hadoop, originally designed to be used with the MapReduce algorithm to solve parallel processing constraints in distributed architectures (e.
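The MapReduce idea itself is compact: a map step emits key-value pairs, the framework sorts and groups them by key, and a reduce step aggregates each group. As a sketch of that model, here is the classic word count written as two Hadoop Streaming scripts; the file names are illustrative:

```python
# mapper.py - emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py - Hadoop delivers mapper output sorted by key, so identical
# words arrive consecutively; summing each run gives the per-word total.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Submitted through the hadoop-streaming JAR, these two scripts run in parallel across the cluster, with the framework handling input splitting, shuffling and fault tolerance.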

Basics of Big Data - Part 1

You can’t miss all the buzz about Big Data! Over the past few years, the buzz around the cloud and Big Data shaping much of the future of computing, and of IT and analytics in particular, has grown incessantly strong. As with most buzzwords, hijacked by marketing to suit their own products’ storylines yet still managing to confuse business users and IT staff alike, Big Data means several things to several people.

Thrive or Survive - the changing rules for databases

Not since the late seventies, when Larry Ellison’s Relational Software Inc. (RSI) turned out the first commercially available RDBMS - Oracle - has there been such a rapid changing of the rules (read: disruption) in the database industry. With Web 2.0 pushing enterprise adoption, and the ensuing information explosion in the maze of audio, video, data and ever-growing data warehouses, it seems that conventional relational database systems are growing tired. With estimates of unstructured data being anywhere between 80% and 95% of all business data, and the ever-changing requirements imposed by Web 2.