of Key-Value Pair Data in Hadoop MapReduce
First of all, let’s clarify what we mean by “key-value” pairs by looking at
similar concepts in the Java standard API. The java.util.Map interface is the parent
of commonly used concrete classes such as HashMap and (through some
retrofitting of the library) even the original Hashtable.
For any Java Map object, the contents are a set of mappings from a key
of a specified type to a corresponding value of a potentially different
type. A HashMap object could, for instance, consist of mappings from a
person's name (String) to his or her birthday (Date). In the context
of Hadoop, we are likewise dealing with data that comprises keys related
to associated values. This data is stored in such a manner that the
values in the data set can be sorted and rearranged across a set of keys. If we are
using key/value data, it makes sense to ask questions such as the following:
· Does a particular key have any mapping in the data set?
· What are the values associated with a particular key?
· What is the complete set of keys?
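As a minimal sketch of how these questions look against a plain Java Map, the example below uses the name-to-birthday mapping mentioned earlier. The class name BirthdayLookup, the helper sampleBirthdays, and the sample names and dates are illustrative assumptions, and java.time.LocalDate stands in for Date purely for brevity.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class BirthdayLookup {
    // Builds a small map from a person's name (key) to a birthday (value).
    // The entries are made-up sample data, not taken from any real source.
    static Map<String, LocalDate> sampleBirthdays() {
        Map<String, LocalDate> birthdays = new HashMap<>();
        birthdays.put("alice", LocalDate.of(1990, 4, 12));
        birthdays.put("bob", LocalDate.of(1985, 11, 3));
        return birthdays;
    }

    public static void main(String[] args) {
        Map<String, LocalDate> birthdays = sampleBirthdays();

        // Does a particular key have any mapping in the data set?
        System.out.println(birthdays.containsKey("alice"));

        // What value is associated with a particular key?
        // get() returns null when the key has no mapping.
        System.out.println(birthdays.get("bob"));
        System.out.println(birthdays.get("carol"));

        // What is the complete set of keys?
        Set<String> names = birthdays.keySet();
        System.out.println(names);
    }
}
```

A HashMap does not guarantee any iteration order, so the order in which the key set prints should not be relied upon.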
We also need to know some important features of key/value data, which
are as follows:
· Keys must be unique, whereas values need not be.
· Every value must be associated with a particular key, but a key may have no values.
· Careful definition of the key is very important; for example, when counting
words, deciding whether or not the counts are case-sensitive will give
different results.
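The effect of the key definition can be illustrated with a small word-counting sketch. It assumes words are separated by whitespace, and the names WordKeyDefinition and countWords are hypothetical, chosen only for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordKeyDefinition {
    // Counts occurrences of each whitespace-separated word.
    // When caseSensitive is false, each word is lower-cased before being
    // used as a key, so "Hadoop" and "hadoop" collapse into one key.
    static Map<String, Integer> countWords(String text, boolean caseSensitive) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            String key = caseSensitive ? word : word.toLowerCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum); // add 1 to the key's running count
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Hadoop hadoop HADOOP map Map";
        System.out.println(countWords(text, true));
        System.out.println(countWords(text, false));
    }
}
```

With case-sensitive keys the sample text yields five distinct keys, each with a count of 1; with case-insensitive keys it yields only two, hadoop with a count of 3 and map with a count of 2.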
To give you a more practical feel, let's think of a few examples of real-world data that are key/value pairs:
· A mobile phone contact list relates a name (key) to contact numbers (value).
· A bank account uses a unique account number (key) to relate and identify the account details (value).
· The index page of a book relates a word (key) to the pages on which it occurs (value).
· On a computer file system, filenames (keys) allow access to any sort of
data, such as text and images (values).
We need to recognize that key/value data is not a
constrained or restricted model used only for high-end data mining; it is a very general
model that is all around us. The bottom line is that if the data can be
expressed as key/value pairs, it can be processed by MapReduce.
MapReduce, using its key/value pair interface, provides a level of abstraction whereby
analysts only need to specify the transformations to apply to the data, and Hadoop manages the
complex process of applying them to arbitrarily large data sets.
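To make this division of labour concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop's API: the class MiniMapReduce and its methods run and wordCount are hypothetical names, and the tiny run driver only stands in for the grouping work that the framework would do across a cluster. The analyst's share is limited to the two functions labelled mapper and reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

class MiniMapReduce {
    // Stand-in for the framework: applies the mapper to every input,
    // groups the emitted values by key, then applies the reducer per key.
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapper.apply(input)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<K, R> results = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            results.put(group.getKey(),
                        reducer.apply(group.getKey(), group.getValue()));
        }
        return results;
    }

    // The part an analyst would write: two transformations on key/value pairs.
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map step: emit a (word, 1) pair for each word in a line.
        Function<String, List<Map.Entry<String, Integer>>> mapper = line -> {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
            return pairs;
        };
        // Reduce step: sum all the values collected for one key.
        BiFunction<String, List<Integer>, Integer> reducer =
            (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();
        return run(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or", "not to be")));
    }
}
```

The driver here keeps everything in one process and one TreeMap, which is only workable for small inputs; the point of the sketch is that the mapper and reducer would not need to change if something else took over the grouping and distribution.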