“Simple” often means “elegant” when it comes to those remarkable architectural drawings for the new Silicon Valley mansion we have planned for when the money starts rolling in after we implement Hadoop. The same principle applies to software architecture. Pig is made up of two components:
The language itself: The programming language for Pig is known as Pig Latin, a high-level language that allows us to write data processing and analysis programs.
The Pig Latin compiler: The Pig Latin compiler transforms Pig Latin programs into executable code. The executable code either takes the form of MapReduce jobs, or it spawns a process in which a virtual Hadoop instance is created to run the Pig code on a single node.
The resulting sequence of MapReduce programs enables Pig programs to process and analyze data in parallel, leveraging Hadoop MapReduce and HDFS. Running the Pig job in the virtual Hadoop instance is a useful strategy for testing our Pig scripts before they touch a real cluster.
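As a rough illustration of the two execution paths, the small script below can be run either on a single node or on a cluster. The script name, the input file name, and the alias names are all hypothetical; the `-x local` and `-x mapreduce` options shown in the comments are the standard way to pick an execution mode from the Pig command line.

```pig
-- check_input.pig (hypothetical script name)
--
-- Single-node test run against the local file system:
--   pig -x local check_input.pig
-- Run on the cluster, reading from HDFS:
--   pig -x mapreduce check_input.pig

-- TextLoader reads each input line as a single chararray field.
-- 'sample_input.txt' is a placeholder file name.
lines = LOAD 'sample_input.txt' USING TextLoader() AS (line:chararray);

-- Look at only the first few records while testing.
first_ten = LIMIT lines 10;

DUMP first_ten;
```

Because the script itself is identical in both modes, a script that behaves correctly in a local test run can usually be pointed at the cluster without changes, apart from the input and output paths.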
Pig Application flow
At its core, Pig Latin is a dataflow language, where we define a data stream and a series of transformations that are applied to the data as it flows through our application. This is in contrast to a control flow language (like C or Java), where we write a series of instructions. In control flow languages, we use constructs like loops and conditional logic (like an if statement). We won’t find loops and if statements in Pig Latin. Here is the skeleton of a simple Pig Latin program:
A = LOAD 'mindstick_file.txt';
B = GROUP ... ;
C = FILTER ...;
STORE C INTO 'Results';
Load: We first load (LOAD) the data we want to manipulate. As in a typical MapReduce job, that data is stored in HDFS. For a Pig program to access the data, we first tell Pig what file or files to use. For that task, we use the LOAD 'data_file' command.
Here, the quoted name ('mindstick_file.txt' in the example above) can specify either an HDFS file or an HDFS directory. If a directory is specified, every file located in that directory is loaded into the program.
If the data is stored in a file format that isn’t natively accessible to Pig, we can optionally add a USING clause to the LOAD statement to specify a user-defined function that can read in (and interpret) the data.
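A minimal sketch of these LOAD variations follows. The field names, the comma delimiter, and the directory name 'sales_dir' are assumptions made for illustration; PigStorage is Pig's built-in load function, and a custom loader would appear in the same position after USING.

```pig
-- No USING clause: Pig falls back to PigStorage with tab-delimited fields.
raw = LOAD 'mindstick_file.txt';

-- Comma-delimited input with a declared schema (field names are hypothetical).
sales = LOAD 'mindstick_file.txt' USING PigStorage(',')
        AS (customer:chararray, region:chararray, amount:double);

-- Pointing LOAD at a directory (hypothetical name) reads every file in it.
all_sales = LOAD 'sales_dir' USING PigStorage(',')
            AS (customer:chararray, region:chararray, amount:double);

-- A user-defined loader would replace PigStorage, for example:
--   USING com.example.MyLoader()   (hypothetical class, not a real library)

-- Print the schema Pig has associated with the relation.
DESCRIBE sales;
```

Declaring a schema with AS is optional, but it lets later statements refer to fields by name rather than by position.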
Transform: We run the data through a set of transformations that, way under the hood and far removed from anything we have to concern ourselves with, are translated into a set of Map and Reduce tasks. The transformation logic is where all the data manipulation happens.
Here, we can FILTER out rows that aren’t of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and do much more.
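The transformations named above could be chained as in the sketch below. The second input file 'regions.txt', all field names, and the threshold of 100 are assumptions chosen only to make the example concrete.

```pig
-- Two hypothetical comma-delimited inputs.
sales = LOAD 'mindstick_file.txt' USING PigStorage(',')
        AS (customer:chararray, region:chararray, amount:double);
regions = LOAD 'regions.txt' USING PigStorage(',')
          AS (region:chararray, manager:chararray);

-- FILTER: keep only the rows of interest.
big_sales = FILTER sales BY amount > 100.0;

-- JOIN: combine the two data sets on a shared key.
with_manager = JOIN big_sales BY region, regions BY region;

-- GROUP: collect rows per key so they can be aggregated.
by_region = GROUP big_sales BY region;
totals = FOREACH by_region GENERATE group AS region,
                                    SUM(big_sales.amount) AS total;

-- ORDER: sort the aggregated result, largest total first.
ranked = ORDER totals BY total DESC;

-- Nothing is executed until an output statement asks for a result.
DUMP with_manager;
DUMP ranked;
```

Each statement assigns a new relation rather than modifying an existing one, which is what makes the script read as a flow of data from one step to the next.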
Dump or store: Finally, we either dump (DUMP) the results to the screen or store (STORE) them in a file somewhere.
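The two output options can be sketched as follows, again with hypothetical field names and a hypothetical filter condition. DUMP is mostly useful for inspecting small results while debugging, whereas STORE writes the relation out for later use.

```pig
sales = LOAD 'mindstick_file.txt' USING PigStorage(',')
        AS (customer:chararray, region:chararray, amount:double);
big_sales = FILTER sales BY amount > 100.0;

-- DUMP prints every tuple of the relation to the console.
DUMP big_sales;

-- STORE writes the relation to the output location 'Results'.
-- On Hadoop this is typically a directory of part files, and the job
-- usually fails if that directory already exists.
STORE big_sales INTO 'Results' USING PigStorage(',');
```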