More on MapReduce

Brinda Potluri
4 min read · Nov 7, 2023


Interested in learning Big Data concepts from the basics to advanced, just through small articles? Hit the follow button!

In the previous ‘Hadoop Architecture’ article, we saw what MapReduce is, what its 2 phases are, why it uses key-value pairs, and how a record reader works.

In this article, let’s dig deeper into ‘Map’ and ‘Reduce’ with an example, and also look at why you might not choose MapReduce!

  • Using MapReduce for small data doesn’t make sense, because the shuffling of data between the Map and Reduce phases itself takes a lot of time.
  • We should try to do more work at the mapper end (shift the computation towards the mappers) rather than at the reducer end, because mappers run on multiple machines in parallel, whereas the reducer does the aggregations while running on only 1 machine. A mapper-side pre-aggregation sketch follows this list.
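
Here is a minimal sketch of that idea, assuming a Hadoop Streaming-style Python mapper and an input where each line looks like “viewer,viewed_profile” (both the format and the names are illustrative, not from the original post):

```python
# Hypothetical mapper that pre-aggregates counts locally (the idea behind
# Hadoop's combiner), so far less data has to be shuffled to the reducer.
import sys
from collections import Counter

def mapper(lines):
    local_counts = Counter()
    for line in lines:
        viewer, viewed = line.strip().split(",")  # assumed "viewer,viewed_profile" format
        local_counts[viewed] += 1                 # count locally instead of emitting (viewed, 1) per line
    for profile, count in local_counts.items():
        print(f"{profile}\t{count}")              # emit partial sums

if __name__ == "__main__":
    mapper(sys.stdin)
```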

Example:

Let’s find out the number of LinkedIn profile views from the data below:

Let’s check how many people viewed Brinda’s, Ajay’s, and Geetha’s profiles.

→ Below is the O/P from the record reader and I/P to the mapper:
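
(The original post shows the sample data as an image. Since it isn’t reproduced here, the snippet below uses a made-up sample in the same spirit, with the record reader emitting a (byte offset, line) pair per record; the names and offsets are purely illustrative.)

```python
# Hypothetical input file "views.txt", one "viewer,viewed_profile" record per line.
# The record reader turns each line into a (byte offset, line text) key-value pair.
record_reader_output = [
    (0,  "Kiran,Brinda"),
    (13, "Ajay,Brinda"),
    (25, "Kiran,Ajay"),
    (36, "Geetha,Brinda"),
    (50, "Brinda,Geetha"),
]
```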

→ The O/P from Mapper will be as below:
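
(Again as an illustrative sketch: a Hadoop Streaming-style mapper in Python could emit (viewed profile, 1) for every record. The field layout is assumed from the made-up sample above.)

```python
# Hypothetical mapper: for every input record, emit (viewed_profile, 1).
# With TextInputFormat, Hadoop Streaming passes only the line text on stdin.
import sys

for line in sys.stdin:
    viewer, viewed = line.strip().split(",")
    print(f"{viewed}\t1")

# For the sample records this would emit:
#   Brinda  1
#   Brinda  1
#   Ajay    1
#   Brinda  1
#   Geetha  1
```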

The above (K, V) pairs are shuffled (which simply means sending the data to the reducer) and then sorted at the reducer. If we don’t have a reducer, we don’t have to shuffle (there is no reducer to send data to) or sort (sorting happens at the reducer).

→ Below are the 2 steps of sorting that will happen in the reducer:
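
(Continuing the made-up sample, the two steps would look roughly like this: sort the shuffled pairs by key, then group the values belonging to each key.)

```python
# Step 1 (hypothetical): sort the shuffled (key, value) pairs by key.
sorted_pairs = [("Ajay", 1), ("Brinda", 1), ("Brinda", 1), ("Brinda", 1), ("Geetha", 1)]

# Step 2 (hypothetical): group the values for each key before the reduce logic runs.
grouped = {"Ajay": [1], "Brinda": [1, 1, 1], "Geetha": [1]}
```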

→ The final O/P of the total profile views is as follows:
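
(A matching Hadoop Streaming-style reducer in Python would then sum the values per key. For the made-up sample the totals come out as Ajay 1, Brinda 3, Geetha 1; the real numbers in the original post will differ.)

```python
# Hypothetical reducer: input arrives sorted by key, so we can sum values key by key.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.strip().split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")   # emit (profile, total views)
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")           # emit the last key's total
```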

The tough part is deciding what you want the mapper to output.

Number of blocks = number of mappers. The framework decides the number of mappers from the input splits (by default, one split per block), not us. By default we get 1 reducer, but as developers we can either increase the number of reducers or decrease it (even to 0).

Varying the number of Reducers:

Let’s say all the mappers finish their job in 2 minutes and the reducer takes 10 minutes (12 minutes in total). You may then consider shifting more of the workload onto the mappers, and if that still doesn’t cut the time down significantly, increase the number of reducers.

After increasing the workload on the mappers, let’s say the mappers now take 3 minutes and the reducer takes 5 minutes. To improve further, let’s add 1 more reducer (2 in total); maybe each reducer now takes 2.5 minutes, and since they also run in parallel, the total time would be 3 + 2.5 = 5.5 minutes.

When there are tasks like filters, which can be done entirely by the mappers, we don’t need any reducer, so set the number of reducers to 0. This means we don’t even have to shuffle and sort for a reducer, and the output will be files written directly by the mappers, one per mapper (if we consider a data size of 1000 MB with the default 128 MB block size, we need 8 blocks, and therefore 8 mappers and 8 output files). A map-only filter sketch follows below.
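
(As a sketch of such a map-only job, again assuming the made-up “viewer,viewed_profile” format: the mapper just filters records, and with the number of reducers set to 0, each mapper’s output is written straight to its own output file with no shuffle or sort.)

```python
# Hypothetical map-only job: keep only the records where Brinda's profile was viewed.
# No reducer is needed, so there is no shuffle and no sort.
import sys

for line in sys.stdin:
    viewer, viewed = line.strip().split(",")
    if viewed == "Brinda":    # a pure filter done entirely in the mapper
        print(line.strip())
```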

  • If we have just 1 reducer, then all the mappers’ output goes to that one reducer without any further logic. But if we have more than 1, say 2 reducers, which output from which mapper goes to which reducer? This is where the concept of partitioning comes into the picture. The output from each mapper is divided into 2 parts/partitions: partition 1 and partition 2. All the partition-1 outputs from all mappers are sent to reducer 1 as input, and all the partition-2 outputs from all mappers are sent to reducer 2.
  • How is a mapper’s output divided into these 2 partitions, and what is the logic behind it? By default, a system-defined hash function on the key decides the partition; you can also write your own partitioning function (see the sketch after this list).
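
Here is a rough Python sketch of the default idea, assuming 2 reducers. Hadoop’s real default (HashPartitioner in Java) uses the key’s hashCode, so this is only an approximation of the logic:

```python
# Sketch of hash partitioning: hash the key and take it modulo the number of
# reducers, so every occurrence of the same key lands on the same reducer.
import hashlib

def partition(key: str, num_reducers: int = 2) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

for k in ["Brinda", "Ajay", "Geetha"]:
    print(f"{k} -> partition {partition(k)} -> reducer {partition(k) + 1}")
```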

Why not MapReduce?

  1. Performance: it involves a lot of disk I/O (intermediate results are repeatedly written to and read from disk).
  2. It’s very hard to write such lengthy, complex code even for simple tasks.
  3. It does not support real-time processing; it supports only batch processing.
  4. We need to learn the whole ecosystem (Pig, Hive, Sqoop, etc.) thoroughly if we want to master it.
  5. Every problem we have, no matter how different it is, has to be framed/fitted into the Map and Reduce model, which is challenging.

Hit the clap and comment your views if you got any value from this article (you can clap up to 50 times!); your appreciation means a lot to me :)

Feel free to connect and message me on my LinkedIn.

References:

  1. Sumit Sir’s Big Data
