First of all, let’s clarify what we mean by “key-value” pairs by looking at similar concepts in the Java standard API. The java.util.Map interface is the parent of commonly used concrete utility classes such as HashMap and, through retrofitting, even the original Hashtable.
Any Java Map object consists of a set of mappings from keys of a specified type to values of a potentially different type. A HashMap object could, for instance, contain mappings from a person's name (String) to his or her birthday (Date). In the context of Hadoop, we likewise deal with data that comprises keys and their associated values, stored in such a way that the values in a data set can be sorted and rearranged across a set of keys. When working with key/value data, it makes sense to ask questions such as the following:
· Does a particular key have a mapping in the data set?
· What are the different values associated with a particular key?
· What is the complete set of keys?
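These three questions map directly onto methods of java.util.Map. The following minimal sketch uses a HashMap of names to birthdays (stored as plain strings for simplicity; the names and dates are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class KeyValueDemo {
    // Build a small map of names (keys) to birthdays (values).
    static Map<String, String> birthdays() {
        Map<String, String> map = new HashMap<>();
        map.put("Alice", "1990-04-12");
        map.put("Bob", "1985-11-03");
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = birthdays();
        // Does a particular key have a mapping in the data set?
        System.out.println(map.containsKey("Alice")); // true
        // What value is associated with a particular key?
        System.out.println(map.get("Bob"));           // 1985-11-03
        // What is the complete set of keys?
        Set<String> keys = map.keySet();
        System.out.println(keys);
    }
}
```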
Some important features of key/value data will also become apparent as we proceed:
· Keys must be unique, whereas values need not be.
· Every value must be associated with a key, but a key may have no values.
· Careful definition of the key is very important; deciding, for example, whether or not counts are applied case-sensitively will produce different results.
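The case-sensitivity point can be sketched with a small word-counting example (the words here are invented); whether we keep keys as-is or normalize them with toLowerCase changes which keys exist and therefore the counts:

```java
import java.util.HashMap;
import java.util.Map;

public class KeyCaseDemo {
    // Count occurrences of each word, optionally lower-casing the key first.
    static Map<String, Integer> count(String[] words, boolean caseSensitive) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            String key = caseSensitive ? w : w.toLowerCase();
            counts.merge(key, 1, Integer::sum); // add 1 to the running count
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"Apple", "apple", "banana"};
        // Case-sensitive: "Apple" and "apple" are distinct keys.
        System.out.println(count(words, true).size());        // 3
        // Case-insensitive: they collapse into one key counted twice.
        System.out.println(count(words, false).get("apple")); // 2
    }
}
```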
To give you a more practical feel, let's consider a few real-world examples of key/value data:
· A mobile phone contact list relates a name (key) to contact numbers (values).
· A bank account uses a unique account number (key) to identify the account details (value).
· The index page of a book relates a word (key) to the pages on which it occurs (values).
· On a computer file system, filenames (keys) allow access to any sort of data, such as text, images, and sound (values).
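The book-index example also shows that a single key can be associated with multiple values. One way to model that in Java is a Map whose values are lists; this sketch (with invented words and page numbers) uses computeIfAbsent to create the page list the first time a word is seen:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BookIndexDemo {
    // Each word (key) maps to the list of pages (values) where it occurs.
    static Map<String, List<Integer>> index() {
        Map<String, List<Integer>> idx = new HashMap<>();
        // computeIfAbsent creates an empty page list for unseen words.
        idx.computeIfAbsent("hadoop", k -> new ArrayList<>()).add(3);
        idx.computeIfAbsent("hadoop", k -> new ArrayList<>()).add(17);
        idx.computeIfAbsent("mapreduce", k -> new ArrayList<>()).add(5);
        return idx;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index();
        System.out.println(idx.get("hadoop")); // [3, 17]
    }
}
```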
The point is that key/value data is not some constrained or restricted model used only for high-end data mining; it is a very general model that is all around us. The bottom line is that if data can be expressed as key/value pairs, it can be processed by MapReduce.
MapReduce, through its key/value pair interface, provides a level of abstraction whereby analysts need only specify these transformations, and Hadoop manages the complex process of applying them to arbitrarily large data sets.
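To make the idea of key/value transformations concrete without involving Hadoop itself, here is a tiny in-memory sketch of the same contract for the classic word-count problem: a map step emits (word, 1) pairs, and a reduce step groups the pairs by key and sums each group. The class and method names are our own, not Hadoop APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {
    // Map step: turn one input line into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce step: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run the map step over every line, then reduce all emitted pairs.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        Map<String, Integer> result = wordCount(List.of("the cat", "the dog"));
        System.out.println(result.get("the")); // 2
    }
}
```

In real Hadoop the map and reduce steps run in parallel across many machines and the shuffle moves data over the network, but the analyst still only writes the two transformations shown here.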