Maloy Manna

Data, Tech, Cloud Security & Agile Project Management

Azure Control plane and Data plane

Azure operations can be divided into two categories:

Control plane (or management plane) - used to manage resources in Azure subscriptions, e.g. creating a virtual machine or a storage account. All control plane requests are sent to the Azure Resource Manager URL; for Azure global, the URL is https://management.azure.com.

Data plane - used to access the capabilities exposed by instances of those resource types, e.g. using Remote Desktop Protocol (RDP) to interact with a virtual machine, or reading and writing data in a storage account.
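As a minimal sketch of the difference (using Python's requests library; the subscription ID, storage account, container, blob name, bearer tokens and API version strings below are placeholders/assumptions, not values from the post):

import requests

# Placeholders - substitute real values and valid OAuth bearer tokens.
SUBSCRIPTION_ID = "<subscription-id>"
ARM_TOKEN = "<token scoped to https://management.azure.com>"
STORAGE_TOKEN = "<token scoped to https://storage.azure.com>"

# Control plane: list resource groups via the Azure Resource Manager URL.
arm_url = (f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
           f"/resourcegroups?api-version=2021-04-01")
resp = requests.get(arm_url, headers={"Authorization": f"Bearer {ARM_TOKEN}"})
print(resp.status_code, resp.json())

# Data plane: read a blob directly from the storage account's own endpoint,
# not from management.azure.com.
blob_url = "https://<account>.blob.core.windows.net/<container>/<blob>"
resp = requests.get(blob_url, headers={
    "Authorization": f"Bearer {STORAGE_TOKEN}",
    "x-ms-version": "2020-10-02",
})
print(resp.status_code, len(resp.content), "bytes")

The point of the sketch is only the routing: management operations go to the Resource Manager endpoint, while data operations go to the endpoint of the resource instance itself.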

9 features of modern data architectures

The last few years have seen a massive change in the data landscape. With the rise of big data, there has been rapid innovation in the tools, skills and roles involved in data systems. Data architectures have evolved beyond monolithic, centralized databases and unwieldy analytic applications to distributed, scalable architectures with simpler, collaborative and interactive analytic tools. In this post, I look at the defining features of modern data architectures. Modern data architectures generally feature the following (though not all of these may be present in the same system):

Kafka - building real-time stream data pipelines

Over the past few years, Kafka has become one of the most exciting new additions to big data distributed architectures. Originally developed at LinkedIn, its creators Jay Kreps, Jun Rao and Neha Narkhede have since launched the company Confluent, which commercializes it with an open-core business model. At its core, Apache Kafka reinvents the database log to provide a highly scalable, fault-tolerant, high-performance distributed system, which serves as the data pipeline backbone for stream data processing.
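A minimal sketch of Kafka as a pipeline backbone, assuming the kafka-python client, a broker running at localhost:9092 and a topic named "events" (all assumptions for illustration):

from kafka import KafkaProducer, KafkaConsumer

# Producer appends records to the topic's distributed, replicated log.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"page_view,user=42")
producer.flush()

# An independent consumer can replay the same log from the beginning,
# which is what makes the log usable as a shared pipeline backbone.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.offset, record.value)

Because consumers track their own offsets in the log, many downstream systems can read the same stream independently and at their own pace.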

Hadoop's small files problem

Small files are a big problem in Hadoop. Hadoop is designed to manage big data: HDFS stores very large files in a distributed cluster with streaming access to the data. For reference, a typical HDFS block is 64 MB or 128 MB. Each small file (a few MB or less) still occupies its own block, and multiple small files end up scattered as blocks across different nodes of the cluster. Since the NameNode keeps metadata for every file and block in memory, a large number of small files consumes far more NameNode memory than the same data stored in a few large files.
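A back-of-the-envelope sketch of why this matters (the ~150 bytes per namespace object figure is the commonly cited rough estimate, not an exact value), comparing 1 TB stored as 128 MB files versus 1 MB files:

# Rough NameNode memory estimate: each file and each block is a namespace
# object held in NameNode memory, commonly estimated at ~150 bytes apiece.
BYTES_PER_OBJECT = 150
DATA = 1 * 1024**4            # 1 TB of raw data
BLOCK = 128 * 1024**2         # 128 MB HDFS block size

def namenode_bytes(file_size):
    n_files = DATA // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK))  # ceiling division
    n_objects = n_files * (1 + blocks_per_file)       # 1 file obj + block objs
    return n_objects * BYTES_PER_OBJECT

for size_mb in (128, 1):
    size = size_mb * 1024**2
    print(f"{size_mb:>4} MB files -> ~{namenode_bytes(size) / 1024**2:.1f} MB "
          "of NameNode heap")

With these assumptions, 1 TB in 128 MB files needs only a few MB of NameNode heap, while the same data in 1 MB files needs on the order of hundreds of MB, and the gap grows linearly with the number of files.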

Data processing with Spark in R & Python

I recently gave a talk on data processing with Apache Spark using R and Python. tl;dr - the slides and presentation can be accessed below (free registration). As noted in my previous post, Spark has become the de facto standard for big data applications and has been adopted quickly by the industry. See Cloudera's One Platform initiative blog post by CEO Mike Olson for their commitment to Spark. In data science, R has seen rapid adoption, not only because it is open source and free compared to costly SAS, but also because of the huge number of statistical and graphical packages it provides.
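As a minimal PySpark sketch of the kind of data processing covered in the talk (the input file and column names are placeholders, and the SparkSession API assumes Spark 2.x or later):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-processing-demo").getOrCreate()

# Read a CSV into a DataFrame; "events.csv", "amount" and "country" are
# illustrative names only.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Typical processing: filter, aggregate, sort.
summary = (df.filter(F.col("amount") > 0)
             .groupBy("country")
             .agg(F.count("*").alias("n"), F.avg("amount").alias("avg_amount"))
             .orderBy(F.desc("n")))

summary.show(10)
spark.stop()

The same DataFrame operations are available from R through SparkR or sparklyr, which is what makes Spark a convenient common platform for both languages.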