Everything you need to know about MapReduce
All the key insights from the paper MapReduce: Simplified Data Processing on Large Clusters from Google.
Table of contents
The Motivation
The Model
The MapReduce Implementation
Support Features
Intro
You were a Data Engineer in 2009.
Your company had terabytes of data stored in HDFS.
You wrote the logic to process that data.
You ran it on a single machine.
It took half a day to finish.
So, how did you optimize that?
You realized you had to use more machines.
But how could you reliably parallelize the computation?
Remember you were in 2009:
One year before the release of Google BigQuery (2010)
Three years before the development of Apache Spark (2012)
Four years before the release of Amazon Redshift (2013)
Five years before the release of Snowflake on AWS (2014)
The most viable choice for you was MapReduce, the large-scale parallel processing framework first introduced by Google in 2004 and later brought to the open-source world through Apache Hadoop, with strong backing from Yahoo.
This week, we will learn about this framework through the classic paper from Google: MapReduce: Simplified Data Processing on Large Clusters.
The Motivation
At Google, hundreds of computations process large amounts of data. Most of them are conceptually straightforward. However, the input data is too large to process on a single machine; the computation must be distributed across hundreds or thousands of machines to finish in a reasonable amount of time. Here come the challenges:
How to parallelize the computation?
How to distribute the data efficiently?
How to handle failures?
To solve this, Google designed a new abstraction that lets engineers express the simple computations they want to perform while hiding the messy details of parallelization. The model is inspired by the map and reduce primitives found in Lisp and many other functional languages. Google's primary contributions in this work are:
A simple and powerful interface for defining parallel computations.
Automatic parallelization and distribution of large-scale computations.
An implementation that achieves high performance on large clusters of commodity machines.
The Model
The model has two functions, both defined by the user:
Map: It takes an input key/value pair, processes it, and produces intermediate key/value pairs. The library groups all intermediate values associated with the same key and passes them to the Reduce tasks.
Reduce: It receives the intermediate values for a key from the Map tasks, supplied via an iterator. It merges those values using the logic defined in the Reduce function (e.g., Count, Sum, ...). A Reduce invocation typically produces zero or one output value.
Once these functions are defined, the MapReduce program is parallelized and executed on a large cluster of commodity machines. The runtime handles data partitioning, fault tolerance, and inter-machine communication without user intervention.
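To make the model concrete, here is the canonical word-count example from the paper, sketched in Python rather than the paper's C++-style pseudocode. The `emit` callback is a stand-in for the library's EmitIntermediate/Emit calls, not a real API.

```python
# Word count expressed as user-defined Map and Reduce functions.
# `emit` stands in for the library's EmitIntermediate/Emit calls.

def word_count_map(key, value, emit):
    """key: document name, value: document contents."""
    for word in value.split():
        emit(word, 1)  # intermediate pair: (word, 1)

def word_count_reduce(key, values, emit):
    """key: a word, values: an iterator over all counts emitted for that word."""
    emit(key, sum(values))  # total number of occurrences of the word
```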
The MapReduce Implementation
Execution Overview
The flow
The system automatically partitions the input data into a set of M splits so that Map invocations can be distributed across multiple machines and processed in parallel. Reduce invocations are distributed by partitioning the intermediate key space into R buckets using a partitioning function (e.g., a hash of the key). Users can configure the number of partitions (R) and the partitioning function.
A typical MapReduce execution goes through the following steps:
The MapReduce library splits the input files into M pieces, typically 16 to 64 megabytes (MB) each (the split size is configurable by the user).
It then starts copies of the program on multiple machines; you can think of this as many machines running the MapReduce process.
One of these processes is the master. The rest are workers that are assigned work by the master.
There are M Map tasks and R Reduce tasks; the master picks idle workers and assigns each one a Map or a Reduce task.
A Map worker reads the corresponding input split and passes each key/value pair to the user-defined Map function (e.g., multiply each value by X). The worker buffers the intermediate key/value pairs produced by the Map function in memory.
Periodically, the worker writes the buffered pairs to its local disk, partitioned into R regions by the partitioning function, and informs the master of their locations on disk.
The master will notify the Reduce workers about these locations. When notified, the Reduce workers use remote procedure calls to read the buffered data from the local disks of the Map workers.
After reading all the intermediate data, the Reduce worker sorts it by the intermediate keys so that all occurrences of the same key are grouped together. Sorting is needed because a single Reduce task typically handles many different keys from many Map workers; it ensures that the values belonging to the same key sit next to each other.
The Reduce worker then iterates over the sorted intermediate data and, for each unique key, passes the key and the corresponding set of intermediate values to the user's Reduce function (e.g., Average, Sum, ...). The output of the Reduce function is appended to a final output file for this Reduce partition.
When all Map and Reduce tasks have been completed, the master wakes up the user program; at this point, the MapReduce call in the user code returns.
After successful completion, the output of the MapReduce execution is available in the R output files, one per Reduce partition. Keeping the results in separate files lets users feed them into another MapReduce program or a different distributed application.
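To build intuition for how these steps fit together, below is a toy, single-process simulation of the flow (splitting, mapping, hash-partitioning into R buckets, sorting, grouping, reducing). Everything here, including the `run_mapreduce` name, is illustrative; a real run involves many machines, local disks, and RPCs that this sketch ignores.

```python
from collections import defaultdict
from itertools import groupby

def run_mapreduce(inputs, map_fn, reduce_fn, R=3):
    """Toy, in-memory imitation of the execution flow described above.

    inputs:    list of (key, value) input records (the "M splits").
    map_fn:    user Map function, returns a list of (k, v) pairs.
    reduce_fn: user Reduce function, takes (k, iterator of values) and
               returns the reduced value.
    R:         number of Reduce partitions (output "files").
    """
    # Map phase: each input record produces intermediate (k, v) pairs,
    # which are partitioned into R buckets with hash(k) mod R.
    buckets = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            buckets[hash(k) % R].append((k, v))

    # Reduce phase: each bucket is sorted by key so that all values for
    # the same key are adjacent, then grouped and reduced.
    outputs = {}
    for r in range(R):
        pairs = sorted(buckets[r], key=lambda kv: kv[0])
        outputs[r] = [
            (k, reduce_fn(k, (v for _, v in group)))
            for k, group in groupby(pairs, key=lambda kv: kv[0])
        ]
    return outputs  # one output list per Reduce partition

# Example: word count over two "documents".
docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
word_map = lambda name, text: [(w, 1) for w in text.split()]
word_reduce = lambda word, counts: sum(counts)
print(run_mapreduce(docs, word_map, word_reduce, R=2))
```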
Master Data Structures
The metadata
For each Map and Reduce task, the master stores its state (idle, in-progress, or completed) and, for non-idle tasks, the identity of the worker machine. The master also acts as the "broker" between Map and Reduce workers: for each completed Map task, it stores the locations and sizes of the intermediate file regions produced by that task. Map workers report this information to the master as they finish, and the master pushes it incrementally to workers with in-progress Reduce tasks.
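The paper does not spell out the exact data structures, but a minimal sketch of what the master might track could look like this (all names are illustrative):

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"

@dataclass
class MapTask:
    split_id: int
    state: TaskState = TaskState.IDLE
    worker_id: str | None = None  # machine running (or that ran) the task
    # For completed Map tasks: location and size of each intermediate
    # file region, to be forwarded to Reduce workers.
    intermediate_files: list[tuple[str, int]] = field(default_factory=list)

@dataclass
class ReduceTask:
    partition_id: int
    state: TaskState = TaskState.IDLE
    worker_id: str | None = None
```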
Fault Tolerance
One of the ultimate goals of MapReduce is to reliably process large amounts of data on many machines. So what happens when a failure occurs?
The failure of the worker
The master checks each worker's health by pinging it periodically.
If there is no response from a worker within a certain amount of time, the master marks the worker as failed.
Any Map task the failed worker completed or had in progress is reset to the idle state so the master can reschedule it on another machine. Completed Map tasks must be re-executed because their output is stored on the failed machine's local disk and is therefore inaccessible.
In-progress Reduce tasks on the failed worker are also reset to idle so they can be rescheduled. Completed Reduce tasks do not need to be re-executed, since their output is stored in a global file system.
When a Map task is first executed by worker A and later re-executed by worker B (because A failed), all Reduce workers are notified of the re-execution. Any Reduce task that has not already read the intermediate data from worker A will read it from worker B instead.
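A rough sketch of how the master might react to a failed worker, reusing the illustrative task structures from the Master Data Structures sketch above (no locking or RPCs; it only shows which tasks get reset):

```python
def handle_worker_failure(failed_worker, map_tasks, reduce_tasks, notify_reducers):
    """Reset the failed worker's tasks so the master can reschedule them.

    map_tasks / reduce_tasks: iterables of MapTask / ReduceTask objects.
    notify_reducers: callback that tells Reduce workers to re-read the
    intermediate data from wherever the Map task gets re-executed.
    """
    for task in map_tasks:
        # Both in-progress AND completed Map tasks go back to idle: their
        # intermediate output lived on the failed machine's local disk.
        if task.worker_id == failed_worker and task.state != TaskState.IDLE:
            task.state = TaskState.IDLE
            task.worker_id = None
            task.intermediate_files.clear()
            notify_reducers(task.split_id)

    for task in reduce_tasks:
        # Only in-progress Reduce tasks need rescheduling; completed ones
        # already wrote their output to the global file system.
        if task.worker_id == failed_worker and task.state == TaskState.IN_PROGRESS:
            task.state = TaskState.IDLE
            task.worker_id = None
```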
The failure of the master
The master writes periodic checkpoints of its metadata.
If the master dies, a new master can be started from the last checkpoint state.
Locality
Google conserves network bandwidth by taking advantage of the fact that the input data, managed by GFS, is stored on the local disks of the cluster's machines. GFS divides each file into 64 MB blocks and stores redundant replicas of each block (three by default) on different machines. The MapReduce master tries to schedule a Map task on a machine that holds a replica of the corresponding input split; if it cannot, it attempts to schedule the task near a replica of that task's input data. As a result, most workers read their input data locally and consume no network bandwidth.
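The scheduling preference can be pictured as a simple fallback chain; the sketch below is purely illustrative (the paper does not describe the master's scheduling code, and "near" is approximated here as "same network switch"):

```python
def pick_map_worker(replica_hosts, same_switch_hosts, idle_workers):
    """Illustrative scheduling preference for a Map task:
    1. an idle worker that holds a replica of the input split,
    2. otherwise an idle worker near a replica (e.g., same network switch),
    3. otherwise any idle worker."""
    for preferred in (replica_hosts, same_switch_hosts):
        for worker in idle_workers:
            if worker in preferred:
                return worker
    return next(iter(idle_workers), None)  # None if no worker is idle
```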
Task Granularity
MapReduce divides the Map phase into M pieces and the Reduce phase into R pieces (R is constrained by the user, since it determines the number of output files). Ideally, M and R should be much larger than the number of available workers; the paper reports typical runs with M = 200,000 and R = 5,000 on 2,000 worker machines. Having each worker perform many different tasks improves dynamic load balancing and speeds up recovery when a worker fails: the Map tasks it has completed can be spread out across all the other worker machines for re-execution. There are practical limits on the size of M and R, though, to avoid too much overhead on the master, since it must make O(M + R) scheduling decisions and keep O(M * R) state in memory.
Backup Tasks
A common reason for increased overall latency of a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few Map or Reduce tasks. Stragglers can arise for many reasons, such as a bad disk or multiple tasks scheduled on the same machine causing resource contention. Google's solution: when a MapReduce operation is close to finishing, the master schedules backup executions of the remaining in-progress tasks. These backup tasks run alongside the primary ones, and a task is marked as completed whenever either the primary or the backup execution finishes. So if the primary worker is slowed down, a backup task on a healthy machine can finish the work instead.
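A minimal sketch of the idea, again reusing the illustrative task structures from earlier (the trigger condition and the `assign` callback are assumptions, not the paper's actual policy):

```python
def schedule_backups(tasks, idle_workers, assign):
    """Illustrative: once only a few tasks remain in progress, schedule a
    backup execution of each on an idle worker. `assign(task, worker)`
    stands in for the master handing the task to a worker; the task is
    marked completed when either the primary or the backup finishes."""
    remaining = [t for t in tasks if t.state == TaskState.IN_PROGRESS]
    if len(remaining) <= len(idle_workers):   # operation is close to done
        for task, worker in zip(remaining, idle_workers):
            assign(task, worker)              # backup runs alongside the primary
```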
Support Features
Although MapReduce’s basic functionality is sufficient for most needs, Google has found a few extensions helpful.
Partitioning Function
The default partitioning function hashes the key (e.g., hash(key) mod R). For situations where other logic is needed, users of the MapReduce library can provide a custom partitioning function.
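The paper's example uses URLs as output keys: if users want all entries for the same host to end up in the same output file, a partitioning function like hash(Hostname(urlkey)) mod R does the job. A small Python sketch of the default and such a custom partitioner (illustrative, not the library's actual API):

```python
from urllib.parse import urlparse

def default_partition(key, R):
    # Default behavior: hash the whole key into one of R buckets.
    return hash(key) % R

def host_partition(url_key, R):
    # Custom behavior: hash only the hostname, so every URL from the
    # same host lands in the same Reduce partition / output file.
    return hash(urlparse(url_key).netloc) % R

# URLs from the same host map to the same bucket:
assert host_partition("http://example.com/a", 8) == host_partition("http://example.com/b", 8)
```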
Ordering Guarantees
Within a given partition, MapReduce guarantees that the intermediate key/value pairs are processed in increasing key order. This makes it easy to generate a sorted output file per partition, which is useful when users find it convenient to have the data sorted or need to support efficient lookups by key.
Combiner Function
The Combiner function is executed on the Map worker and performs partial merging of intermediate data before it is sent over the network. Typically, the same code is used for both the Combiner and the Reduce function; the only difference is how the MapReduce library handles the output:
The output of a Reduce function is written to the final output file.
The output of a Combiner function is written to an intermediate file that will be sent to a Reduce task.
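The paper's word-count example shows why this helps: word frequencies roughly follow a Zipf distribution, so each Map task emits huge numbers of ("the", 1) pairs; a Combiner sums them on the Map worker before they are written out and shipped across the network. A sketch of that local partial merge (the `combine_locally` helper is illustrative, not part of the library's API):

```python
from collections import defaultdict

def combine_locally(intermediate_pairs):
    """Illustrative partial merge on the Map worker: collapses the many
    (word, 1) pairs a Map task produces into one (word, partial_count)
    pair per word before anything is written to the intermediate files."""
    partial = defaultdict(int)
    for word, count in intermediate_pairs:
        partial[word] += count  # same summing logic as the Reduce function
    return list(partial.items())

# e.g. [("the", 1), ("the", 1), ("fox", 1)] -> [("the", 2), ("fox", 1)]
print(combine_locally([("the", 1), ("the", 1), ("fox", 1)]))
```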
Input and Output Types
The MapReduce library supports reading input data in several different formats. For example, the text type treats each line as a record: the key is the line's offset in the file and the value is the contents of the line. Users can add support for a new input type by implementing a simple reader interface. Likewise, MapReduce supports a set of output types for producing data in different formats and allows users to define new output types.
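As an illustration, a reader for the text type described above might look like the following sketch (the interface and names are assumptions; the paper only requires that a reader produce key/value pairs):

```python
from typing import Iterator, Tuple

class TextLineReader:
    """Illustrative reader: yields (byte offset, line contents) pairs,
    matching the text input type described above."""

    def __init__(self, path: str):
        self.path = path

    def read(self) -> Iterator[Tuple[int, str]]:
        offset = 0
        with open(self.path, "rb") as f:
            for raw_line in f:
                yield offset, raw_line.decode("utf-8").rstrip("\n")
                offset += len(raw_line)  # key is the byte offset of the line
```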
Side-effects
MapReduce also allows Map or Reduce tasks to produce auxiliary files as additional outputs; the application writer is responsible for making such side effects atomic and idempotent (typically by writing to a temporary file and renaming it atomically once it is complete).
Skipping Bad Records
Sometimes there are bugs in user code that cause the Map or Reduce function to crash deterministically on certain records, preventing a MapReduce program from completing. The usual course of action is to fix the bug, but sometimes that is not possible; the bug may live in a third-party library whose source code is unavailable. In other cases, it is simply acceptable to ignore a few records. Google provides an optional execution mode in which the MapReduce library detects which records cause crashes and skips them to make forward progress.
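In the paper, each worker installs a signal handler that reports the sequence number of the offending record to the master, and once the master has seen more than one failure on a particular record, it tells workers to skip it. Below is a much-simplified, single-process sketch of that idea (all names are illustrative):

```python
from collections import Counter

failure_counts = Counter()  # record id -> number of observed crashes
MAX_FAILURES = 1            # skip after more than one failure, as in the paper

def apply_map_with_skipping(map_fn, records):
    """Run map_fn over records, skipping any record that has already crashed
    more than MAX_FAILURES times (a stand-in for the master's decision)."""
    outputs = []
    for record_id, (key, value) in enumerate(records):
        if failure_counts[record_id] > MAX_FAILURES:
            continue                        # "master" said: skip this record
        try:
            outputs.extend(map_fn(key, value))
        except Exception:
            failure_counts[record_id] += 1  # report the crash to the "master"
    return outputs
```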
Status Information
The master runs an internal HTTP server and exports a set of status pages for monitoring and tracking MapReduce programs. The status pages show the progress of the computation: completed tasks, in-progress tasks, input size, intermediate data size, output size, and so on. The pages also contain links to the standard error and standard output files generated by each task, and the top-level status page shows which workers have failed and which Map and Reduce tasks they were processing when they failed. Users can use this data to estimate how long the computation will take and whether more resources should be added.
Counter
The MapReduce library provides a counter facility to count occurrences of various events. For example, user code may want to count the total number of words processed. To use this feature, the user creates a named counter object and increments it appropriately in the Map and/or Reduce function. The workers periodically report the counter values to the master, piggybacked on the response to the master's ping (health-check) request. The master aggregates the counter values from successful Map and Reduce tasks and returns them to the user code when the program completes. When aggregating, the master discards duplicate counts caused by duplicate executions of the same Map/Reduce task (backup tasks or re-execution after failures).
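From the user's point of view, the counter facility looks roughly like the sketch below, written in Python for consistency with the earlier examples (the paper shows a C++ GetCounter("uppercase") call; the propagation and de-duplication are handled by the library, not the user):

```python
class Counter:
    """Illustrative named counter; in the real library, its value is
    periodically piggybacked onto the worker's ping response and
    aggregated (with duplicates discarded) by the master."""

    def __init__(self, name: str):
        self.name = name
        self.value = 0

    def increment(self, amount: int = 1):
        self.value += amount

# Example adapted from the paper: count how many words are capitalized.
uppercase = Counter("uppercase")

def count_uppercase_map(name, contents, emit):
    for word in contents.split():
        if word[:1].isupper():
            uppercase.increment()
        emit(word, 1)
```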
Outro
In this article, I've noted down all the key insights from the paper MapReduce: Simplified Data Processing on Large Clusters. As you can see, despite the straightforwardness of its two functions, Map and Reduce, the framework provides a very efficient and robust way to parallelize computation on a large cluster of commodity machines. One more fact before ending the article: Dremel, the processing engine behind BigQuery, was developed with inspiration from MapReduce.
Now, it’s time to say goodbye.
See you on the next blog ;)
References
[1] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters (2004).
Before you leave
Leave a comment or contact me via LinkedIn or Email if you:
Are interested in this article and want to discuss it further.
Would like to correct any mistakes in this article or provide feedback.
This includes writing mistakes like grammar and vocabulary; I happily admit that I'm not so proficient in English :D
👋 Junaid here! I hope you enjoyed this guest post with a deep dive into MapReduce. The author spent five days preparing it so you can read it in five minutes. If you found it valuable, let's support him by sharing ♻️ and subscribing 🔔.