Maloy Manna

Hadoop's small files problem

2016/02/16

Big Data / architecture Big Data / architecture / Hadoop

Reading time: 2 minutes

Small files are a big problem in Hadoop.

Hadoop is built to manage big data, which means HDFS is designed to store very large files across a distributed cluster and to provide streaming access to that data. For reference, a typical block in HDFS is 64 MB or 128 MB. Each small file (a few MB or less) is stored in its own block, so a collection of small files can end up in blocks spread across different nodes of the cluster.
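To make the block arithmetic concrete, here is a minimal sketch in Python. It assumes the 128 MB block size mentioned above and that blocks are never shared between files; the function name `blocks_needed` and the file sizes are hypothetical, chosen only for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: one of the typical HDFS block sizes


def blocks_needed(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count blocks when each file starts its own block (no sharing between files)."""
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)


one_large_file = [1024]        # a single 1 GB file
many_small_files = [1] * 1024  # the same 1 GB split into 1,024 files of 1 MB

print(blocks_needed(one_large_file))    # 8
print(blocks_needed(many_small_files))  # 1024
```

The same volume of data needs 8 blocks as one large file but 1,024 blocks as small files, which is the kind of overhead the post is about.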

5 principles of lean project management

2016/02/02

project management lean / project management

Reading time: 2 minutes

Lean practices like Kanban boards have become really popular in project management, especially on projects using agile methods. But what exactly is lean project management?

Lean project management is, roughly, the application of lean manufacturing principles to project management. These principles were developed at Toyota, whose famous Toyota Production System uses kanban and the concepts of just-in-time (JIT) and “pull” to optimize flow and minimize inventory.

Data processing with Spark in R & Python

2015/11/18

Reading time: 2 minutes

I recently gave a talk on data processing with Apache Spark using R and Python. tl;dr - the slides and presentation can be accessed below (free registration):

As noted in my previous post, Spark has become the de facto standard for big data applications and has been adopted quickly by the industry. See the blog post on Cloudera’s One Platform initiative by CEO Mike Olson for their commitment to Spark.

What roles do you need in your data science team?

2015/06/24

Big Data / Data science data engineer / data science team / data scientist / product manager

Reading time: 7 minutes

Over the past few weeks, we’ve had several conversations in our data lab about data engineering problems and the day-to-day problems we face with unsupervised data scientists who find it difficult to deploy their code into production.

Data scientist

The opinions from the business seemed to cluster around a tacit definition of data scientists as researchers, primarily from statistics or mathematics backgrounds, who are experienced in machine learning algorithms and often in domain areas specific to our business (e.g. actuaries in insurance), but who do not necessarily have the skills to write production-ready code. The somewhat opposing strain of thought came from the developers and data engineers, who often quoted Cloudera’s Director of Data Science, Josh Wills, famous for his “definition of a data scientist” tweet: “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

An introduction to Data Science

2015/05/20

Big Data / Data science / Machine Learning big data / data science / introduction / machine learning

Reading time: 1 minute

I presented a talk last week introducing Data Science and associated topics to some enthusiasts.
Here’s a slide deck I created quickly with markdown using Swipe - a start-up building HTML5 presentation tools.
The contents include:

Data scientist skills
Data science: enablers and barriers
Big data analytics
Data science lifecycle
Use cases
Tools and technology
Project approach
Machine learning
Skills and roles
Learning resources

Here are the slides:
