Hadoop Architecture

Brinda Potluri
6 min read · Oct 23, 2023


Interested in learning Big Data concepts from the basics to advanced, just through short articles? Hit the follow button!

Firstly, let’s see why we need Hadoop architecture.

The volume of data is enormous, and it can come in any format: structured (database tables), semi-structured (JSON, CSV, XML), or unstructured (videos, images). The speed at which the data is generated is huge, and the data may not always be of high quality, so we need to be able to handle that too. Finally, we need to extract value from the data to drive business success.

Our traditional systems are not built to handle this; therefore, we need robust technology like Hadoop.

Let’s dive into Hadoop Architecture!

Hadoop is an open-source framework that has a collection of tools and technologies to solve our big data problems! It has 3 core components:

  1. HDFS: Hadoop Distributed File System for distributed storage.
  2. MapReduce: Distributed processing (now considered outdated). MapReduce code is written in Java and is extremely hard and lengthy to write, so it has popularly been replaced by Apache Spark.
  3. YARN: Yet Another Resource Negotiator (Resource Manager).

Some other components of Hadoop are:

  1. Sqoop: Think of scooping ice cream; Sqoop scoops data out of a source and into a destination (cloud alternative: Azure Data Factory).
  2. Pig: A scripting language mainly used to clean the data.
  3. Hive: Provides a SQL-like interface to query the data. However, Hive is not built for random access: if you have 1 million records and want to fetch the 600,000th one, Hive will take a very long time to execute the query. In scenarios like this, where the requirement is clear and random access is needed, go for NoSQL; this is where HBase comes into the picture (see the short HBase sketch after this list).
  4. Oozie: Workflow scheduler (cloud alternative: Azure Data Factory).
  5. HBase: A NoSQL database (cloud alternative: Cosmos DB).
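
To make the Hive vs. HBase point concrete, here is a minimal sketch of a key-based lookup using the HBase Java client. The table name, row key, and column names are hypothetical; the point is that HBase fetches a single record directly by its row key instead of scanning everything the way a Hive query would.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "students", the row key "600000", and the "info:name" column are all hypothetical.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("students"))) {
            // Fetch a single record directly by its row key, with no full table scan.
            Result result = table.get(new Get(Bytes.toBytes("600000")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```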

HDFS:

HDFS follows a master-slave architecture where the name node is the master and the data nodes are the slaves. The name node holds the metadata needed to keep track of where the data lives, while the data nodes hold the actual data. When a client requests a file, the request first goes to the name node; once the file's location is known, the client is directed to the data nodes.
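
Here is a minimal sketch (not from the article) of what that client interaction looks like with the HDFS Java API. The name node address and the file path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client first contacts the name node (fs.defaultFS) to look up block locations,
        // then streams the blocks from the data nodes that hold them.
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```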

HDFS stores data in the form of blocks, each with a default size of 128 MB (a single node can hold more than one block). Let’s consider a scenario where we have 400 MB of data (in reality the data will be in petabytes; this example is just for simplicity), which means the file is split into 4 blocks (size breakdown: 128, 128, 128, 16). If you were thinking, why can’t I just increase the block size and wrap it up in just 2 blocks (size breakdown: 256, 144), you are right! You can do that, but there are downsides.

By increasing the block size, you reduce parallelism, but on the bright side, you reduce the burden on the name node, since it has fewer blocks to track. You can also decrease the block size (size breakdown: 64, 64, 64, 64, 64, 64, 16), which increases parallelism but at the same time increases the burden on the name node. To maintain a balance, the sweet spot is the default of 128 MB.
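
If you want to play with the arithmetic yourself, here is a tiny sketch (not from the article) that reproduces the block breakdowns above for a 400 MB file.

```java
public class BlockMath {
    // Print how many blocks a file needs for a given block size, and the size of the last block.
    static void breakdown(long fileMb, long blockMb) {
        long fullBlocks = fileMb / blockMb;
        long remainder = fileMb % blockMb;
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);
        System.out.printf("%d MB file, %d MB blocks -> %d blocks (last block %d MB)%n",
                fileMb, blockMb, totalBlocks, remainder > 0 ? remainder : blockMb);
    }

    public static void main(String[] args) {
        breakdown(400, 128); // 4 blocks: 128, 128, 128, 16
        breakdown(400, 256); // 2 blocks: 256, 144
        breakdown(400, 64);  // 7 blocks: 64 x 6, 16
    }
}
```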

Fault Tolerance:

What if a data node fails?

There is a concept in HDFS called the replication factor: by default, 3 copies of the data are stored on 3 different nodes. If one node fails, a copy of the data on another node is still available.
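
As a hedged sketch (not from the article), here is how the replication factor can be set with the HDFS Java API. In practice it is usually configured in hdfs-site.xml through the dfs.replication property, and the file path below is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // default: 3 copies on 3 different nodes
        try (FileSystem fs = FileSystem.get(conf)) {
            // Replication can also be changed per file after it is written (path is hypothetical).
            fs.setReplication(new Path("/data/sample.txt"), (short) 3);
        }
    }
}
```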

What if the name node fails?

If the name node fails, the entire Hadoop cluster is effectively dead. In older Hadoop versions, the secondary name node only kept checkpoints of the metadata; since Hadoop 2, High Availability lets a standby name node become active if the primary fails. On top of that, the latest versions allow more than one active name node, each managing a part of the namespace, to reduce the burden on a single name node; this is called name node federation, and it helps us overcome performance issues.

What if a data node fails? How will the name node know about it?

Every 3 seconds, each data node sends a heartbeat to the name node to inform it that it is alive. If a data node fails to send heartbeats for about 30 seconds, the name node marks it as stale, and after a longer silence (roughly 10 minutes by default) it is declared dead and its blocks are re-replicated elsewhere.
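
These timings come from a few HDFS settings. The sketch below (not from the article) just spells them out; they are normally set in hdfs-site.xml, and the values shown are the usual defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Data nodes send a heartbeat every 3 seconds.
        conf.setLong("dfs.heartbeat.interval", 3);
        // After 30 seconds without heartbeats a data node is marked stale (value in ms).
        conf.setLong("dfs.namenode.stale.datanode.interval", 30_000);
        // The name node re-checks liveness every 5 minutes (value in ms); a data node
        // that stays silent for roughly 10 minutes is declared dead.
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300_000);
        System.out.println("Heartbeat interval (s): " + conf.get("dfs.heartbeat.interval"));
    }
}
```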

MapReduce:

MapReduce is a programming paradigm: traditional programming languages work well when our data sits on a single machine, whereas MapReduce is specialized for data that is stored in a distributed manner.

There are 2 phases in MapReduce: Map and Reduce. Both take key-value pairs as input and produce key-value pairs as output: (K, V) → Map → (K, V); (K, V) → Reduce → (K, V). Example: (roll number, name). If you give them anything other than key-value pairs, they won’t recognize it.

Whatever code the user writes is shipped to all the machines, and when that code runs on a machine, we say a mapper is running. All those machines run in parallel. We say the code goes to the data, rather than the data to the code, because the code is only kilobytes in size while the data is humongous.

Once all the machines have finished their mappers, the mapper outputs are sent to one machine among them; it can be any data node. On that machine, the Reduce code runs, aggregating all the intermediate results into the final output. So the output from a mapper is not the final output; it is an intermediate output that helps us get to the result quickly.

Therefore, it is the mappers that give us parallelism (not the reducer), but that parallelism is still limited by the number of nodes, and for each block there is 1 mapper.

You must be wondering how all the data gets converted to key-value form for the mapper. The Record Reader converts each line into a key-value pair, using the line’s byte offset in the file (usually a big number) as the key and the entire line as the value.

The Record Reader’s input is an entire line, say ‘my name is brinda’, and its output is (59874804, my name is brinda). The mapper takes that as input and, since we want to find the frequency of each word, emits each word along with the number 1; here the key is the word and the value is 1. All of this output is then moved to one machine so that a reducer can work on it: the output is shuffled and sorted so that identical words end up together, and finally it is handed to the reducer, which produces the final output.
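
Here is a condensed word-count sketch of the mapper and reducer just described, using the Hadoop MapReduce Java API. The class names are illustrative, and the job setup (driver) code is omitted to keep it short.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // The Record Reader hands each line to the mapper as (byte offset, line text).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // e.g. ("brinda", 1)
            }
        }
    }

    // After shuffle and sort, the reducer receives each word together with all of its 1s.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // e.g. ("brinda", 2) if it appeared twice
        }
    }
}
```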

Hit the clap and comment your views if you got any value from this article (you can clap up to 50 times!). Your appreciation means a lot to me :)

Feel free to connect and message me on my LinkedIn.

References:

  1. Sumit Sir’s Big Data
