The Magician we know as Hadoop

The 21st century is all about data. The world's technology leaders have known the significance of data all along: the more data, the more power. All the tech giants have data as their backbone. We all know this now, don't we? Even if you are just a beginner in anything related to data, you have probably heard this quote many times:
Without data, you’re just another person with an opinion. -W. Edwards Deming.
An opinion doesn't even count as an opinion if it isn't backed by either data or mathematics. So the underlying message is: the more data, the merrier your business.
Data volume is growing at an exponential rate, and the more data you gather, the more storage you need. Data is physical, after all; if you are gathering it, you have to store it somewhere. Humans produce roughly 2.5 quintillion bytes of data every day. An organization that stores even a chunk of this enormous output cannot keep it on a single server, because no manufacturer builds servers or storage resources capable of holding data at this scale on a permanent basis.
Let's talk about this problem in detail with reference to a tech giant. Since we are discussing the problem of Big Data, let's use as our example one of the biggest data gatherers in the world.

Arguably the world's most popular social media network, with more than two billion monthly active users worldwide, Facebook stores enormous amounts of user data, making it a massive data wonderland. It was estimated that there would be more than 183 million Facebook users in the United States alone by October 2019. Facebook also sits among the top 100 public companies in the world, with a market value of approximately $475 billion.
The fact of the matter is that most experts agree that over 90% of all data in the world has been created in the past few years, and a major portion of it has something to do with social media.

Facebook revealed some big, big stats on big data to a few reporters at its HQ in 2012, including that its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It’s pulling in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half-hour. Deep breaths…
500+ TB of data each day, and that figure was recorded back in 2012, almost a decade ago! Imagine the data stored each day today, in 2020. Awestruck, eh?

So let's try to get an idea of how Facebook is capable of storing such enormous amounts of data for decades, and how it does so efficiently.
So we had a problem, a big problem of Big Data. The world needed a savior, and data needed room to settle in, maybe a condo. On the first of April (in 2006, when Apache Hadoop made its first release), a magician arrived, not to fool you, but to free the world from its big-data storage misery.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
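To make those "simple programming models" concrete, here is a minimal sketch of the classic WordCount job written against Hadoop's Java MapReduce API. This is essentially the standard introductory example, not Facebook's code; the input and output paths are hypothetical command-line arguments. The map phase emits (word, 1) pairs from each chunk of input, and the reduce phase sums the counts for each word across the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the node holding a split of the input and
  // emits (word, 1) for every word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would package this as a JAR and submit it with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`; the framework takes care of splitting the input, scheduling the map and reduce tasks across the cluster, and re-running any task that fails.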

Hadoop follows a master-slave architecture for data storage and distributed data processing, using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data with Hadoop MapReduce is the JobTracker. The slave nodes are the other machines in the Hadoop cluster; they store the data and perform the heavy computations. Every slave node runs a TaskTracker daemon and a DataNode, which synchronize with the JobTracker and the NameNode respectively. In a Hadoop deployment, the master and slave machines can be set up in the cloud or on premises.
This is a high-level intuition of the Hadoop Architecture.
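To give a feel for how a client interacts with this architecture, here is a small, hypothetical sketch using Hadoop's standard FileSystem Java API. The NameNode address and file path below are placeholders for illustration. Conceptually, the client asks the NameNode (the master) for metadata and block placement, while the file's actual blocks are streamed to and replicated across DataNodes (the slaves).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points the client at the NameNode (the storage master).
    // "namenode-host:9000" is a placeholder for your cluster's NameNode address.
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

    FileSystem fs = FileSystem.get(conf);

    // The create call goes through the NameNode for metadata, but the file's
    // blocks are written to DataNodes and replicated among them.
    Path file = new Path("/user/demo/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("Hello, HDFS!");
    }

    // Metadata queries like this one are answered by the NameNode alone.
    System.out.println("File exists in HDFS: " + fs.exists(file));
    fs.close();
  }
}
```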
“Facebook runs the world’s largest Hadoop cluster,”
says Jay Parikh, Vice President of Infrastructure Engineering at Facebook.
Facebook's Hadoop cluster spans more than 4,000 machines and stores hundreds of millions of gigabytes of data.
Facebook Inc. analytics chief Ken Rudin says, “Big Data is crucial to the company’s very being.” He goes on to say that, “Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. Facebook even designs its hardware for this purpose. Hadoop is just one of many Big Data technologies employed at Facebook.”
Even though Hadoop plays an important role in storing Big Data, it is just one of the tools used in data processing at Facebook. This is a world of integrations; Facebook also uses tools such as Hive, Cassandra, and Prism for data processing.
In this article, I just wanted to introduce you to Hadoop. We will have a more detailed discussion about Hadoop in another article in the near future.