the beginning of the Hadoop’s history, MapReduce has been the complete game
changer in town when it comes to deal with data processing. The availability of
MapReduce has been the primary reason behind the success of Hadoop and at the
meantime is a major factor in limiting further adoption.
empowers skilled programmers and coders to write distributed applications
without considering the complexities involved in the underlying distributed
computing infrastructure. This is a great deal: Hadoop and the MapReduce
framework resolves all sorts of complexity that application developers don’t
need to manage. For instance, the ability to transparently scale out the
cluster by adding nodes and the automated failover mechanism for both data storage
and data processing sub-tasks happen with almost zero impact on applications.
face of the coin here is that, however MapReduce hides and manages a tremendous
amount of complexity, we can’t afford to forget what it is: an interface for
parallel programming. This is a very useful skill — and act as a barrier to
wider adoption. There are rarely good MapReduce programmers in the industry,
and not everyone got the skill to master it.
Hadoop’s beginning days (Hadoop 1 and before), we could able top only run
MapReduce applications on the clusters. But in Hadoop 2, the YARN component
changed every aspect by taking over resource management and scheduling from the
MapReduce framework, and enables a generic interface to facilitate applications
to run on a Hadoop cluster. To conclude, this means MapReduce is now just one
of many application frameworks we can use as a tool to develop and run
applications on Hadoop. Although it’s definitely possible to run applications using
other frameworks on Hadoop, but it doesn’t mean that we can start forgetting about
MapReduce. At the time, MapReduce is still the only production-ready data
processing framework available for Hadoop.
other useful frameworks will eventually become available, MapReduce is
comprised of almost a decade of maturity under its belt (with almost 4,000 JIRA
issues completed, involving hundreds of developers, if we still keeping track).
There’s no argument: MapReduce is Hadoop’s most mature and powerful framework
for data processing. In addition, a significant amount of MapReduce code is now
in use and is unlikely to go anywhere soon. Long story short: MapReduce is an
important part of the Hadoop story.
the parallel strategies requires an altogether different approach. What’s more,
as the computation problems get even a little more difficult, they become even
harder when they need to be parallelized. For wider-ranging complex jobs such as
statistical processing or text extraction, and especially for processing
unstructured data, we need to use MapReduce.