Introduction to Pig in Hadoop
MapReduce programs and the Hadoop Distributed File System (HDFS) provide us
with a powerful distributed computing framework, but they come with one major
drawback — relying on them limits the use of Hadoop to Java programmers who can
think in Map and Reduce terms when writing programs. More developers, data
analysts, data scientists, and all-around good folks could leverage Hadoop if
they had a way to harness the power of Map and Reduce while hiding some of the
Map and Reduce complexities.
with most things in life, where there’s a need, somebody is bound to come up
with an idea meant to fill that need. A growing list of MapReduce abstractions is
now on the market — programming languages and/or tools such as Hive and Pig,
which hide the messy details of MapReduce so that a programmer can concentrate on
the important work.
SQL is the common accepted language for querying structured data, some
developers still prefer writing imperative scripts — scripts that define a set
of operations that change the state of the data — and also want to have more
data processing flexibility than what SQL or HiveQL provides. Again, this need
led the engineers at Yahoo! Research to come up with a product meant to fulfill
that need — and so Pig was born. Pig’s claim to fame was its status as a
programming tool attempting to have the best of both worlds: a declarative query
language inspired by SQL and a low-level procedural programming language that
can generate MapReduce code. This lowers the bar when it comes to the level of
technical knowledge needed to exploit the power of Hadoop.
was initially developed at Yahoo! in 2006 as part of a research project tasked
with coming up with ways for people using Hadoop to focus more on analyzing
large data sets rather than spending lots of time writing Java MapReduce
programs. The goal here was a familiar one: Allow users to focus more on what
they want to do and less on how it’s done. Not long after, in 2007, Pig
officially became an Apache project. As such, it is included in most Hadoop
its name? That one’s easy to figure out. The Pig programming language is
designed to handle any kind of data tossed its way — structured,
semistructured, unstructured data, you name it. Pigs, of course, have a
reputation for eating anything they come across.