After we have stored piles and piles of data in HDFS (a distributed storage system spread across an expandable cluster of individual slave nodes), the first question that comes to mind is “How can we analyse or query this data?” Transferring all of it to a central node for processing won’t work: we would wait forever for the data to cross the network, and then wait again while everything was processed serially. So what’s the solution? The solution is MapReduce!
In its early days, Google faced this exact problem with its distributed Google File System (GFS), and devised the MapReduce data processing model as the best possible solution. Google needed to be able to grow its data storage and processing capacity, and the only feasible model was a distributed system. As we already know, there are a number of benefits to storing data in the Hadoop Distributed File System (HDFS): it is low cost, fault-tolerant, and easily scalable, to name just a few. In Hadoop, MapReduce integrates with HDFS to provide exactly the same benefits for data processing.
At first glance, the strengths of Hadoop sound too good to be true, and for the most part they genuinely are strong. But there is a cost here: writing applications for distributed systems is completely different from writing the same code for centralized systems. For an application to take advantage of the distributed slave nodes in the Hadoop cluster, its logic must run in parallel.
At its core, MapReduce is a programming model for processing data sets that are stored in a distributed manner across a Hadoop cluster’s slave nodes. The key concept here is divide and conquer. Specifically, we want to break a large data set into many smaller pieces and process them in parallel with the same algorithm. With the Hadoop Distributed File System (HDFS), the files are already divided into bite-sized pieces; MapReduce is what we use to process all of those pieces.
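The divide-and-conquer idea can be sketched outside of Hadoop entirely. This is a minimal illustration, not a Hadoop API: a data set is split into fixed-size blocks (the way HDFS splits files), the same algorithm runs on each block independently, and the partial results are combined at the end. The block size and the summing task are invented for the example.

```python
# Hypothetical divide-and-conquer sketch, mirroring how HDFS splits a
# file into blocks and MapReduce runs the same code on each block in
# parallel. None of these names are Hadoop APIs.
from concurrent.futures import ThreadPoolExecutor

records = list(range(1, 101))   # a toy "data set": the numbers 1..100
block_size = 25                 # HDFS-style fixed-size splits
blocks = [records[i:i + block_size]
          for i in range(0, len(records), block_size)]

def process_block(block):
    # The same algorithm is applied to every piece independently.
    return sum(block)

# Process the blocks in parallel, then combine the partial results.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_block, blocks))

total = sum(partials)
print(total)  # → 5050
```

Each block is self-contained, so no block ever needs to see another block’s data; that independence is what lets the real framework scale out across slave nodes.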
With MapReduce, we need to break our application up into phases. The first phase is the map phase, in which every record in the data set is processed individually. Suppose, for example, we want to count flights per airline: here, we extract the carrier code from the data record the map function is assigned, and then emit a key/value pair, with the carrier code as the key and the integer one as the value. The map operation is run against every record in the data set. After every record is processed, all the values must be grouped together for each key (the carrier code, in this example) and then sorted by key. This is known as the shuffle and sort phase. Finally, there is the reduce phase, where we add up the total number of ones for each key, which gives us the final result set.
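The three phases can be simulated in plain Python to make the data flow concrete. This is a sketch, not Hadoop code: the sample flight records and their field layout are invented for illustration, and the shuffle and sort is done with an ordinary dictionary rather than across a network.

```python
# Minimal simulation of the map, shuffle-and-sort, and reduce phases,
# counting flights per carrier code. Sample records are hypothetical.
from collections import defaultdict

records = [
    "2015-01-01,AA,JFK,LAX",
    "2015-01-01,UA,ORD,SFO",
    "2015-01-02,AA,LAX,JFK",
]

# Map phase: each record is processed individually and emits a
# (carrier_code, 1) key/value pair.
mapped = []
for record in records:
    carrier = record.split(",")[1]
    mapped.append((carrier, 1))

# Shuffle and sort phase: group all the values together by key,
# then order the groups by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: add up the ones for each key to get the result set.
counts = {key: sum(values) for key, values in sorted(grouped.items())}
print(counts)  # → {'AA': 2, 'UA': 1}
```

In a real cluster the map calls run on many slave nodes at once and the framework performs the shuffle and sort for you; only the map and reduce logic is yours to write.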