Hadoop is designed to manage big data and by design this means HDFS is designed to store very large files in a distributed cluster with streaming access to this data. For reference, a typical block in HDFS is 64 MB or 128 MB. Each small file (few MB or less) is stored in a block and multiple small files could be stored in blocks across different nodes of the distributed cluster.
The use of lean practices like Kanban boards has become really popular in project management, especially those using agile methods. But what exactly is Lean project management ?
The application of lean manufacturing principles to project management can be roughly translated as lean project management. These principles were developed at Toyota, with the famous Toyota Production System employing kanban and the concepts of just-in-time (JIT) and “pull” to optimize flow and minimize inventory.
I recently gave a talk on data processing with Apache Spark using R and Python. tl;dr - the slides and presentation can be accessed below (free registration):
As noted in my previous post, Spark has become the defacto standard for big data applications and has been adopted quickly by the industry. See Cloudera’s One Platform initiative blog post by CEO Mike Olson for their commitment to Spark.
Over the past few weeks, we’ve had several conversations in our data lab regarding data engineering problems and day to day problems we face with unsupervised data scientists who find it difficult to deploy their code into production.
The opinions from business seemed to cluster around a tacit definition of data scientists as researchers, primarily from statistics or mathematics backgrounds, who are experienced in machine learning algorithms and often in some domain areas specific to our business, (e.g. actuaries in insurance), but not necessarily having skills of writing production-ready code.
The key driver behind the somewhat opposing strain of thought came from the developers and data engineers who often quoted Cloudera’s Director of Data Science - Josh Wills - famous for his “definition of a data scientist tweet”:
“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
I presented a talk last week introducing Data Science and associated topics to some enthusiasts. Here’s a slide deck I created quickly with markdown using Swipe - a start-up building HTML5 presentation tools. The contents include: