Though MapReduce as a technology is relatively new, it builds upon much of the fundamental work from both mathematics and computer science, particularly approaches that look to express operations that would then be applied to each element in a set of data. Indeed the individual concepts of functions called map and reduce come straight from functional programming languages where they were applied to lists of input data.
Another key underlying concept is that of "divide and conquer approach", here we break a single problem a single problem to form multiple simple specific subtasks. This approach becomes even more powerful when the subtasks are executed in parallel; in a perfect case, a task that takes 1000 minutes could be processed in 1 minute by 1,000 parallel subtasks.
MapReduce is a processing paradigm that builds upon these principles; it provides a series of transformations from a source to a result data set. In the simplest case, the input data is fed to the map function and the resultant temporary data to a reduce function. In software development the developer only defines the data transformations; Hadoop's MapReduce work handles and operate the process of applying different transformations to the data sets across the cluster working with parallel operations. Though the underlying ideas may not be novel, a major strength of Hadoop is in how it has brought these principles together into an accessible and well-engineered platform.
Unlike classic relational databases (RDMS’s) which needs structured data on top of well-defined schemas, MapReduce is best optimized for semi-structured and unstructured data. In contrast to data conforming to rigid schemas, the requirement is distinctive that the data sets be processed to the map function as a series of key value pairs. The result of the map function is a collection of different key value pairs, and the reduce function works on aggregation to collect the final set of outcomes.
Hadoop empowers a standard specification (i.e., interface) for both the map and reduce functions, and implementations of these functions are typically referred to as mappers and reducers. A specific MapReduce job will consists of a numerous mappers and reducers, and it is very usual for many of these to be extremely simple and easy. The analyst focuses on expressing the transition between source and result data sets, and the Hadoop framework manages every aspect of job execution, parallelization, and coordination. The platform takes responsibility and handles every aspect of executing the processing across the data. Once the analyst defines the key criteria for the job, everything else becomes the overhead of the system.