The Hadoop Java API underwent a major change in the 0.20 release, and the resulting API is the primary interface in version 1.0. Although the previous API was certainly functional, the developers felt that it was unwieldy and unnecessarily complex. The new API, sometimes referred to as context objects, is seen as the future of MapReduce development in Java.
In Hadoop 0.20 and later, most of the important MapReduce implementation classes and interfaces live in the org.apache.hadoop.mapreduce package or one of its subpackages. In most cases, implementing a MapReduce task means providing task-specific subclasses of the Mapper and Reducer base classes found in this package.
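To make the context-object idea concrete, here is a minimal plain-Java sketch of the pattern. It deliberately does not use the Hadoop classes: the Context, Mapper, and Reducer types below are hypothetical stand-ins that only mirror the general shape of their org.apache.hadoop.mapreduce counterparts, and the run method is an assumed in-memory substitute for the framework. The point it illustrates is that task-specific subclasses never return their results; they write key/value pairs to the context they are handed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the context-object pattern; these are NOT the Hadoop classes.
class ContextSketch {
    // Hypothetical stand-in for the framework's Context: collects emitted pairs by key.
    static class Context<K, V> {
        final Map<K, List<V>> grouped = new TreeMap<>();

        void write(K key, V value) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
    }

    // Hypothetical base classes that mirror the shape of Mapper and Reducer.
    abstract static class Mapper<KI, VI, KO, VO> {
        abstract void map(KI key, VI value, Context<KO, VO> context);
    }

    abstract static class Reducer<KI, VI, KO, VO> {
        abstract void reduce(KI key, Iterable<VI> values, Context<KO, VO> context);
    }

    // Task-specific subclass: emits (word, 1) for every word in a line.
    static class TokenMapper extends Mapper<Long, String, String, Integer> {
        @Override
        void map(Long offset, String line, Context<String, Integer> context) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(word, 1);
                }
            }
        }
    }

    // Task-specific subclass: sums the counts collected for one word.
    static class SumReducer extends Reducer<String, Integer, String, Integer> {
        @Override
        void reduce(String word, Iterable<Integer> counts, Context<String, Integer> context) {
            int sum = 0;
            for (int count : counts) {
                sum += count;
            }
            context.write(word, sum);
        }
    }

    // Tiny in-memory "framework": map every line, group by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Context<String, Integer> mapOutput = new Context<>();
        TokenMapper mapper = new TokenMapper();
        long offset = 0;
        for (String line : lines) {
            mapper.map(offset, line, mapOutput);
            offset += line.length() + 1;
        }

        Context<String, Integer> reduceOutput = new Context<>();
        SumReducer reducer = new SumReducer();
        for (Map.Entry<String, List<Integer>> entry : mapOutput.grouped.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue(), reduceOutput);
        }

        Map<String, Integer> result = new TreeMap<>();
        reduceOutput.grouped.forEach((word, sums) -> result.put(word, sums.get(0)));
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(run(Arrays.asList("to be or", "not to be")));
    }
}
```

In a real job the framework supplies the context and performs the grouping between the two phases; only the two subclasses would be ours to write.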
The Hadoop MapReduce API is implemented in Java, so MapReduce applications are generally Java-based. The following list specifies the key components of a MapReduce application that we can develop:
·Driver (mandatory): The application shell that’s invoked from the client. It configures an instance of the MapReduce Job class (which we do not customize) and submits it to the ResourceManager (or the JobTracker if we are using Hadoop 1).
·Mapper class (mandatory): The Mapper class we implement needs to define the formats of the key/value pairs it takes as input and produces as output as it processes each record. We typically override a single method, map, which is where we code how every record will be processed and which key/value pairs to output. To output key/value pairs from the mapper task, we write them to an instance of the Context class.
·Reducer class (optional): The Reducer class can be omitted in map-only applications, which have no reduce phase.
·Combiner class (optional): A combiner can often be the same class as the reducer; in some cases, however, it needs to be different. (Remember, for instance, that a reducer may not be able to run multiple times over a data set without changing the results.)
·Partitioner class (optional): We can replace the default partitioner to handle specific needs, such as a secondary sort on the values for each key, or the rare cases in which sparse data leads to imbalanced output files from the mapper tasks.
·RecordReader and RecordWriter classes (optional): Hadoop supports some standard data formats (for instance, text files, sequence files, and databases), which are sufficient for most cases. If we are dealing with specially formatted data, implementing our own classes for reading and writing that data can greatly simplify our mapper and reducer code.
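Two of the optional components above are easier to reason about with a small worked example. The following plain-Java sketch, which again uses no Hadoop classes (every name in it is hypothetical), shows why a summing reducer can safely double as a combiner while an averaging reducer cannot, and how a hash-based partitioner assigns a key to a reduce task. The partition formula (mask the sign bit, then take the modulus) is, as far as we are aware, the same arithmetic that Hadoop's default HashPartitioner uses; treat the rest as an illustration rather than the framework's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration; none of these are Hadoop classes.
class CombinerPartitionerSketch {
    // Summing is associative and commutative, so applying it to partial
    // results and then again to those results gives the same answer.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Averaging is not: a mean of partial means weights each split equally,
    // regardless of how many values the split contained.
    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    // Hash-based partitioning: clear the sign bit so that negative hash codes
    // still map to a valid partition, then take the modulus.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Values emitted for one key by two different (hypothetical) mapper tasks.
        int combinedSum = sum(Arrays.asList(sum(Arrays.asList(1, 2, 3)), sum(Arrays.asList(10))));
        int directSum = sum(Arrays.asList(1, 2, 3, 10));
        System.out.println(combinedSum + " vs " + directSum);   // 16 vs 16

        double meanOfMeans = mean(Arrays.asList(
                mean(Arrays.asList(1.0, 2.0, 3.0)), mean(Arrays.asList(10.0))));
        double trueMean = mean(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        System.out.println(meanOfMeans + " vs " + trueMean);     // 6.0 vs 4.0

        // The same key always lands in the same partition.
        System.out.println(partitionFor("hadoop", 4) == partitionFor("hadoop", 4));
    }
}
```

When the reduce logic is not safe to apply to partial results, as with the mean here, one common approach is a separate combiner that emits intermediate (sum, count) pairs and leaves the final division to the reducer.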