Distributed processing with MapReduce
consists of a sequence of operations applied to data sets that are distributed
across a cluster. The data consists of key-value pairs, and the overall process
involves two phases: a map phase followed by a reduce phase.
User-defined MapReduce jobs run on the compute nodes in the cluster.
A MapReduce job runs through several phases. During the map phase, input data
is split into a large number of fragments, and each fragment is
assigned to a map task. The map tasks are then distributed across the
cluster. Each map task processes the key-value pairs from its assigned fragment
to produce a set of intermediate key-value pairs. This intermediate data
set is sorted by key, and the sorted data is partitioned into a
number of fragments that matches the number of reduce tasks. In the
reduce phase, each reduce task processes the fragment assigned
to it and produces output key-value pairs. The reduce tasks are also
distributed across the cluster and write their results to HDFS when finished.
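
The phases described above can be sketched on a single machine. The following Python example is only an illustration of the data flow, not Hadoop code: it counts words, and the function names (split_input, map_task, partition, reduce_task, run_job) and the use of a CRC32 hash to pick a partition are assumptions made for this sketch. In a real cluster the map and reduce tasks run on different nodes and the results are written to HDFS rather than returned as dictionaries.

```python
from collections import defaultdict
from zlib import crc32


def split_input(lines, num_fragments):
    """Break the input into fragments, one per map task."""
    return [lines[i::num_fragments] for i in range(num_fragments)]


def map_task(fragment):
    """Emit an intermediate (word, 1) pair for every word in the fragment."""
    return [(word.lower(), 1) for line in fragment for word in line.split()]


def partition(pairs, num_reducers):
    """Sort intermediate pairs by key and assign each key to one reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in sorted(pairs):
        # crc32 is a stable hash, so a given key always lands in the same partition.
        index = crc32(key.encode("utf-8")) % num_reducers
        partitions[index].append((key, value))
    return partitions


def reduce_task(sorted_pairs):
    """Sum the values for each key in one partition."""
    totals = defaultdict(int)
    for key, value in sorted_pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines, num_map_tasks=3, num_reduce_tasks=2):
    """Run the map, sort/partition, and reduce steps in sequence."""
    fragments = split_input(lines, num_map_tasks)
    intermediate = [pair for fragment in fragments for pair in map_task(fragment)]
    partitions = partition(intermediate, num_reduce_tasks)
    # One output dictionary per reduce task, analogous to one output file each.
    return [reduce_task(p) for p in partitions]


if __name__ == "__main__":
    data = ["the quick brown fox", "the lazy dog", "the fox"]
    for i, output in enumerate(run_job(data)):
        print(f"reducer {i}: {output}")
```

Because every occurrence of a key is routed to the same partition, each reduce task can compute the final total for its keys without consulting the other reduce tasks.
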
In earlier (pre-version 2) Hadoop releases, the MapReduce framework had
a single master service known as the JobTracker and several slave node services
known as TaskTrackers, one per node in the cluster. When we submit a MapReduce
job to the JobTracker, the job is dispatched onto a queue and then executes in
accordance with the scheduling rules defined by the analyst or administrator. As we might
expect, the JobTracker handles the assignment of map and reduce work to the
TaskTrackers. In Hadoop 2, a more powerful resource management system called YARN (short
for Yet Another Resource Negotiator) is in place. YARN provides generic scheduling and
resource management services, so that we can run more than just MapReduce
applications on our Hadoop cluster. The JobTracker-TaskTracker design was
only meant to run MapReduce.
Key-value data is the foundation of MapReduce
operations and supports a powerful programming model that is
surprisingly widely applicable, as shown by the popularity of Hadoop and
MapReduce across a wide range of industries and challenging scenarios. Most
data is either intrinsically key-value in nature or can be transformed into
that form. It is one of the simplest models, with wide applicability and semantics
straightforward enough that programs defined in terms of it can be executed
by a framework such as Hadoop. However, the data
model alone is not what makes Hadoop useful; its real power lies in
how it applies effective techniques
such as parallel execution and divide-and-conquer strategies.
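
As a small illustration of both points, the Python sketch below turns raw text records into key-value pairs and then totals them with a divide-and-conquer approach. The "region,amount" record layout, the sample data, and the function names are hypothetical, chosen only for this example. Worker threads stand in for cluster nodes, so the sketch shows the structure of the computation rather than any real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales records stored as "region,amount" text lines.
RECORDS = ["north,10", "south,5", "north,7", "east,3", "south,8", "east,1"]


def to_key_value(record):
    """Turn a raw text record into a (key, value) pair."""
    region, amount = record.split(",")
    return region, int(amount)


def sum_chunk(chunk):
    """Conquer step: total the amounts per region within one chunk."""
    totals = {}
    for key, value in map(to_key_value, chunk):
        totals[key] = totals.get(key, 0) + value
    return totals


def merge(partials):
    """Combine step: merge the per-chunk totals into one result."""
    result = {}
    for partial in partials:
        for key, value in partial.items():
            result[key] = result.get(key, 0) + value
    return result


def total_by_region(records, num_chunks=3):
    """Divide the records into chunks, total each chunk, then merge."""
    # Divide step: split the records into independent chunks.
    chunks = [records[i::num_chunks] for i in range(num_chunks)]
    # Each chunk is handed to its own worker thread.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(pool.map(sum_chunk, chunks))
    return merge(partials)


if __name__ == "__main__":
    print(total_by_region(RECORDS))
```

The chunks share no state, which is what allows them to be processed independently; only the final merge needs to see all of the partial results.
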