Hadoop Distributed Processing with MapReduce

MapReduce is a model for processing large data sets that are distributed across a cluster. The data is expressed as key-value pairs, and processing proceeds in two phases: a map phase followed by a reduce phase. User-defined MapReduce jobs run on the compute nodes of the cluster.

A MapReduce job generally runs through several stages. During the map phase, the input data is broken down into a large number of fragments, and each fragment is assigned to a map task. These map tasks are distributed across the cluster. Every map task processes the key-value pairs from its allotted fragment to produce a set of intermediate key-value pairs. This intermediate data is sorted by key and partitioned into a number of fragments that matches the number of reduce tasks. In the reduce phase, every reduce task processes the fragment allotted to it and produces output key-value pairs. The reduce tasks are also distributed across the cluster, and they write their results to HDFS when they finish.
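The flow described above can be imitated on a single machine. The following is only a minimal sketch in plain Python, not Hadoop code: the names `map_fn`, `reduce_fn`, `partition` and `run_job` are hypothetical, the word-count task is chosen purely for illustration, and the toy partition function merely stands in for whatever partitioner a real cluster would use.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all counts collected for one word."""
    return word, sum(counts)

def partition(key, num_reducers):
    """Pick a reduce task for a key (deterministic toy hash, an assumption)."""
    return sum(key.encode("utf-8")) % num_reducers

def run_job(lines, num_splits=2, num_reducers=2):
    # Break the input into fragments, one per (simulated) map task.
    splits = [lines[i::num_splits] for i in range(num_splits)]

    # Map phase: each task turns its fragment into intermediate pairs,
    # which are routed to the partition of the reduce task that owns the key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for offset, line in enumerate(split):
            for key, value in map_fn(offset, line):
                partitions[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each task walks its own partition in sorted key order.
    output = {}
    for part in partitions:
        for key in sorted(part):
            out_key, out_value = reduce_fn(key, part[key])
            output[out_key] = out_value
    return output

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(text))
```

Because every occurrence of a key lands in the same partition, changing `num_splits` or `num_reducers` changes how the work is divided but not the final counts.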

In earlier Hadoop releases (before version 2), the MapReduce framework had a single master service, known as the JobTracker, and several slave services, known as TaskTrackers, one per node in the cluster. When we submit a MapReduce job to the JobTracker, the job is placed on a queue and then executes according to the scheduling rules defined by the administrator. As we might expect, the JobTracker handles the assignment of map and reduce work to the TaskTrackers.
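To make the queue-and-assign idea concrete, here is a toy model in plain Python. It is a sketch under strong assumptions, not how the JobTracker is implemented: the `dispatch` function, the job and node names, and the round-robin assignment are all hypothetical simplifications. A real JobTracker hands out tasks in response to TaskTracker heartbeats and takes free slots and data locality into account.

```python
from collections import deque
from itertools import cycle

def dispatch(jobs, trackers):
    """Assign tasks to trackers: jobs in FIFO order, tasks round-robin.

    jobs     -- list of (job_name, [task_name, ...]) in submission order
    trackers -- list of tracker (node) names
    """
    queue = deque(jobs)              # submitted jobs wait in a queue
    next_tracker = cycle(trackers)   # simplification: plain round-robin
    assignments = {tracker: [] for tracker in trackers}
    while queue:
        job_name, tasks = queue.popleft()
        for task in tasks:
            assignments[next(next_tracker)].append((job_name, task))
    return assignments

if __name__ == "__main__":
    jobs = [("job1", ["map-0", "map-1", "reduce-0"]), ("job2", ["map-0"])]
    for node, work in dispatch(jobs, ["node-a", "node-b"]).items():
        print(node, work)
```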

Hadoop 2 introduces a more powerful resource management system called YARN (short for Yet Another Resource Negotiator). YARN provides generic scheduling and resource management services, so that we can run more than just MapReduce applications on our Hadoop cluster. The JobTracker/TaskTracker design, by contrast, was only meant to run MapReduce.

Key-value data is the foundation of MapReduce. It supports a powerful programming model that is surprisingly widely applicable, as shown by the popularity of Hadoop and MapReduce across a wide range of industries and problem scenarios. Much data is either intrinsically key-value in nature or can be transformed into that form. The model is simple, yet its semantics are straightforward enough that programs defined in terms of it can be executed by a framework such as Hadoop. The data model alone is not what makes Hadoop useful, however; its real power lies in how it applies techniques such as parallel execution and “divide and conquer” strategies.
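As a small illustration of both points, the sketch below first turns raw text records into key-value pairs and then applies divide and conquer: the records are split into chunks, each chunk is processed independently, and the partial results are merged. Everything here is an assumption made for the example: the `city,temperature` record format, the function names, and the use of a thread pool from Python's standard library as a stand-in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def to_pair(record):
    """Turn a 'city,temperature' line into a (key, value) pair."""
    city, temp = record.split(",")
    return city.strip(), float(temp)

def chunk_max(records):
    """Conquer one chunk: highest temperature seen per city."""
    best = {}
    for city, temp in map(to_pair, records):
        best[city] = max(temp, best.get(city, temp))
    return best

def merge(partials):
    """Combine the per-chunk results into one answer."""
    merged = {}
    for partial in partials:
        for city, temp in partial.items():
            merged[city] = max(temp, merged.get(city, temp))
    return merged

def max_by_city(records, num_chunks=3):
    # Divide: give every worker its own slice of the records.
    chunks = [records[i::num_chunks] for i in range(num_chunks)]
    # Workers run independently; threads stand in for cluster nodes here.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        return merge(pool.map(chunk_max, chunks))

if __name__ == "__main__":
    data = ["Oslo, 4.5", "Cairo, 31.0", "Oslo, 7.0", "Cairo, 28.5", "Lima, 19.0"]
    print(max_by_city(data))
```

The approach works because taking a maximum can be done per chunk and then again over the partial results, so no worker needs to see the whole data set.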


Anonymous User
