MapReduce in Hadoop
Although MapReduce as a technology is relatively new, it builds upon much of the
fundamental work from both mathematics and computer science, particularly
approaches that seek to express an operation that is then applied to each
element in a set of data. Indeed, the individual concepts of functions called
map and reduce come straight from functional programming languages, where they
were applied to lists of input data.
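
As a minimal illustration of these functional-programming roots (plain Java, not Hadoop code), the sketch below applies map and reduce to an in-memory list using the standard java.util.stream API; the input values and the squaring function are arbitrary choices for the example.

import java.util.List;

public class MapReduceOnLists {
    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4);

        int result = input.stream()
                .map(x -> x * x)          // map: apply a function to every element
                .reduce(0, Integer::sum); // reduce: fold the mapped values into one result

        System.out.println(result);       // 1 + 4 + 9 + 16 = 30
    }
}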
A key underlying concept is that of divide and conquer, in which a single
problem is broken into multiple simpler subtasks. This approach becomes even
more powerful when the subtasks are executed in parallel; in the perfect case,
a task that takes 1,000 minutes could be processed in 1 minute by 1,000
parallel subtasks.
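
As a rough single-machine sketch of this idea, the example below lets the Java runtime divide a summation over a range into parallel subtasks via a parallel stream; the range size is an arbitrary illustrative value, and any real speedup depends on the hardware and the workload.

import java.util.stream.LongStream;

public class ParallelSum {
    public static void main(String[] args) {
        long total = LongStream.rangeClosed(1, 1_000_000)
                .parallel() // the runtime splits the range into independent chunks
                .sum();     // the partial sums from each chunk are combined

        System.out.println(total); // 500000500000
    }
}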
MapReduce is a processing paradigm that builds upon these principles; it
provides a series of transformations from a source to a result data set. In the
simplest case, the input data is fed to the map function and the resultant
temporary data is then fed to a reduce function. The developer defines only the
data transformations; Hadoop's MapReduce framework handles the process of
applying these transformations to the data sets across the cluster in parallel.
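
To make this concrete, here is a minimal sketch of the two transformations a developer supplies, in the shape of the classic word-count example written against Hadoop's org.apache.hadoop.mapreduce API; the class names are invented for this sketch.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The mapper receives (byte offset, line of text) pairs and emits (word, 1)
    // for every token it finds; it knows nothing about the rest of the job.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reducer receives each word together with all the counts emitted for
    // it, and aggregates them into the final (word, total) pairs.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}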
Though the underlying ideas may not be novel, a major strength of Hadoop is in
how it has brought these principles together into an accessible and
well-engineered platform. Unlike classic relational databases (RDBMSs), which
require structured data with well-defined schemas, MapReduce is best suited to
semi-structured and unstructured data.
Instead of data conforming to rigid schemas, the only requirement is that the
data be presented to the map function as a series of key-value pairs. The
output of the map function is a set of other key-value pairs, and the reduce
function performs aggregation to collect the final set of results.
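
To make this flow concrete, here is the shape of the data at each stage of the word-count sketch above; the two-line sample input is invented for illustration, and the map input keys are byte offsets into the file:

map input:     (0, "the cat sat")   (12, "the dog sat")
map output:    ("the", 1) ("cat", 1) ("sat", 1) ("the", 1) ("dog", 1) ("sat", 1)
reduce input:  ("cat", [1]) ("dog", [1]) ("sat", [1, 1]) ("the", [1, 1])
reduce output: ("cat", 1) ("dog", 1) ("sat", 2) ("the", 2)

Between the two phases, the framework groups the map output by key, which is why each call to the reduce function receives a single key together with every value emitted for it.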
Hadoop provides a standard specification (that is, an interface) for the map
and reduce functions, and implementations of these are usually referred to as
mappers and reducers. A typical MapReduce job comprises a number of mappers and
reducers, and it is not unusual for several of these to be extremely simple.
The developer focuses on expressing the transformation between the source and
result data sets, while the Hadoop framework manages every aspect of job
execution, parallelization, and coordination across the cluster. Once the
developer defines the key criteria for the job, everything else becomes the
responsibility of the system.
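
As a sketch of what defining those key criteria looks like in code, the driver below configures and submits a job through the standard org.apache.hadoop.mapreduce.Job API; it assumes the word-count mapper and reducer sketched earlier, and the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The key criteria: which mapper and reducer to run, the output types,
        // and where the input and output data live.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Everything beyond this point (scheduling, parallelization, retries,
        // moving data between mappers and reducers) is handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}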