Introduction to Pig in Hadoop

Java MapReduce programs and the Hadoop Distributed File System (HDFS) provide us with a powerful distributed computing framework, but they come with one major drawback — relying on them limits the use of Hadoop to Java programmers who can think in Map and Reduce terms when writing programs. More developers, data analysts, data scientists, and all-around good folks could leverage Hadoop if they had a way to harness the power of Map and Reduce while hiding some of the Map and Reduce complexities.

As with most things in life, where there’s a need, somebody is bound to come up with an idea meant to fill that need. A growing list of MapReduce abstractions is now on the market — programming languages and/or tools such as Hive and Pig, which hide the messy details of MapReduce so that a programmer can concentrate on the important work.

Though SQL is the common accepted language for querying structured data, some developers still prefer writing imperative scripts — scripts that define a set of operations that change the state of the data — and also want to have more data processing flexibility than what SQL or HiveQL provides. Again, this need led the engineers at Yahoo! Research to come up with a product meant to fulfill that need — and so Pig was born. Pig’s claim to fame was its status as a programming tool attempting to have the best of both worlds: a declarative query language inspired by SQL and a low-level procedural programming language that can generate MapReduce code. This lowers the bar when it comes to the level of technical knowledge needed to exploit the power of Hadoop.

Pig was initially developed at Yahoo! in 2006 as part of a research project tasked with coming up with ways for people using Hadoop to focus more on analyzing large data sets rather than spending lots of time writing Java MapReduce programs. The goal here was a familiar one: Allow users to focus more on what they want to do and less on how it’s done. Not long after, in 2007, Pig officially became an Apache project. As such, it is included in most Hadoop distributions.

And its name? That one’s easy to figure out. The Pig programming language is designed to handle any kind of data tossed its way — structured, semistructured, unstructured data, you name it. Pigs, of course, have a reputation for eating anything they come across.

Leave Comment