Data science

What roles do you need in your data science team?

Over the past few weeks, we’ve had several conversations in our data lab regarding data engineering problems and the day-to-day problems we face with unsupervised data scientists who find it difficult to deploy their code into production. The opinions from the business seemed to cluster around a tacit definition of data scientists as researchers, primarily from statistics or mathematics backgrounds, who are experienced in machine learning algorithms and often in some domain areas specific to our business, (e.

An introduction to Data Science

I presented a talk last week introducing Data Science and associated topics to some enthusiasts. Here’s a slide deck I created quickly with markdown using Swipe - a start-up building HTML5 presentation tools. The contents include: data scientist skills; data science enablers and barriers; big data analytics; the data science lifecycle; use cases; tools and technology; project approach; machine learning; skills and roles; and learning resources. Here are the slides:

A gentle introduction to Machine Learning

Machine learning is a big part of big data and data science. It is a subset of artificial intelligence, a branch of science notorious for requiring advanced knowledge of mathematics. In practice though, most data scientists don’t try to build a Chappie, and there are simpler, practical ways to get started with machine learning. Machine learning in practice involves making predictions based on data. Notable examples include Amazon’s product recommendations with the “customers also bought” scroll-list, Gmail’s priority inbox, or any email spam-filter feature.
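To make "predictions based on data" concrete, here is a minimal sketch of a spam filter using a word-count naive Bayes classifier. It is only an illustration under stated assumptions: the training messages, the labels and the function names `train` and `predict` are invented for this example and are not taken from the post, and a real spam filter would need far more data and care.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented purely for illustration.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_messages = sum(label_counts.values())
    scores = {}
    for label, n_messages in label_counts.items():
        # Start from the log prior: how common this label is overall.
        score = math.log(n_messages / total_messages)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            count = word_counts[label][word] + 1
            score += math.log(count / (n_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, label_counts = train(TRAINING)
prediction_spam = predict("claim your free prize", word_counts, label_counts)
prediction_ham = predict("agenda for the team meeting", word_counts, label_counts)
```

The point of the sketch is the shape of the workflow: learn counts from labelled examples, then score new messages against what was learned.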

Designing the future - Data Innovation Labs

With the ongoing Big data revolution, and the impending Internet of Things revolution, there has been a renewed enthusiasm in “innovation” around data. Similar to the Labs concept started by Google (think Gmail Beta based on Ajax, circa 2004), more and more organizations, business communities, governments and countries are setting up Labs to foster innovation in data and analytics technologies. The idea behind these “data innovation labs” is to develop avant-garde data and analytics technologies and products in an agile fashion and move quickly from concept to production.

Set up a Hadoop Spark cluster in 10 minutes with Vagrant

With each of the big three Hadoop vendors (Cloudera, Hortonworks and MapR) providing its own Hadoop sandbox virtual machines (VMs), trying out Hadoop today has become extremely easy. For a developer, it is extremely useful to download and get started with one of these VMs and try out Hadoop to practice data science right away. However, alongside core Apache Hadoop, these vendors package their own software into their distributions, mostly for orchestration and management, which can be a pain due to the multiple scattered open-source projects within the Hadoop ecosystem.

The data science project lifecycle

What does the typical data science project lifecycle look like? This post looks at practical aspects of implementing data science projects. It also assumes a certain level of maturity in big data (more on big data maturity models in the next post) and data science management within the organization. Therefore the lifecycle presented here differs, sometimes significantly, from purist definitions of ‘science’ which emphasize the hypothesis-testing approach. In practice, the typical data science project lifecycle resembles more of an engineering view, imposed by constraints of resources (budget, data and skills availability) and time-to-market considerations.

BI in the digital era

Some time back I presented a webinar on BrightTalk. The slides for the talk have now been uploaded to Slideshare. The talk focused on changes in digital technology disrupting businesses, the effect of Big Data, the FOMO (fear of missing out) effect on big business, and what these mean for the way we do business intelligence in the digital era. Key themes: disruption in traditional IT with cloud computing; changing economics and changing business models; the rise of Big Data; tech changes to manage Big Data (distributed computing); the shift from “current-state” to “next-state” questions; introducing Data Science; challenges (regulatory, data privacy); dangers of data science (over-fitting, interpretation); managing big data projects; and Data Science MOOCs (massive open online courses), tools and resources.

A Brief Introduction to Statistics - Part 1

What is Statistics? Collected observations are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. Each observation in data is called a case. Characteristics of the case are called variables. In a matrix or table analogy, a case is a row while a variable is a column. [Figure: Statistics - Correlation (courtesy: xkcd.com)] Types of variables: Numerical - can be discrete or continuous, and can take a wide range of numerical values.
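The row and column analogy above can be sketched in a few lines of Python. This is a hedged illustration only: the survey records and the variable names (`respondent`, `age`, `height_cm`) are hypothetical and do not come from the post.

```python
from statistics import mean

# Hypothetical data set: each dict is one case (a row in the table).
cases = [
    {"respondent": "A", "age": 34, "height_cm": 170.2},
    {"respondent": "B", "age": 29, "height_cm": 165.0},
    {"respondent": "C", "age": 42, "height_cm": 181.5},
]

# The keys shared by every case are the variables (the columns).
variables = list(cases[0].keys())

# Selecting one variable across all cases gives a single column.
ages = [case["age"] for case in cases]            # discrete numerical variable
heights = [case["height_cm"] for case in cases]   # continuous numerical variable

# A simple summary of a numerical variable.
mean_age = mean(ages)
```

Here `age` is recorded in whole years, so it behaves as a discrete numerical variable, while `height_cm` can take any value in a range and is treated as continuous.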

SPC – Using statistics to get insight from BI

There is a well-known adage that if you keep doing the same thing and expect different results, that is a sure sign of idiocy. In the BI world too, we come across several instances where people take it for granted that the ‘BI tool’ will magically generate insight and spur ‘intelligence’ rather than ‘idiocy’. Yet the very practices of reporting the same measures, or of creating reports for metrics just because the tool now makes them available, without applying any ‘intelligence’ to what will generate insight, are a major cause of BI failures.