Pig Design Principles in Hadoop

Pig Latin is the language in which Pig programs are written. Pig converts a Pig Latin script into MapReduce jobs that can run on a Hadoop cluster. In designing Pig Latin, the development team followed three core design principles:

Keep it simple.

Pig Latin offers a streamlined way of working with Java MapReduce. Put simply, it is an abstraction that makes it easy to create parallel programs on the Hadoop cluster for data flows and analysis. Complex tasks may need a series of interrelated data transformations; such series are encoded as data flow sequences. Writing data transformations and flows as Pig Latin scripts rather than as Java MapReduce programs makes them much simpler and easier to write, understand, and maintain, because:

1)   We don’t have to write the job in Java

2)   We don’t have to think in terms of MapReduce, and

3)   We don’t need to come up with custom code to support rich data types.

Pig Latin provides a simpler language to exploit our Hadoop cluster, thus making it easier for more people to leverage the power of Hadoop and become productive sooner.
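To make the idea of a data flow sequence concrete, here is a minimal sketch in plain Python that imitates a four-step flow: load, filter, group, aggregate. It is not Pig Latin and does not run on Hadoop. The sample records, the function names, and the file name in the comments are all assumptions made for illustration. The comment above each step shows a roughly equivalent Pig Latin statement.

```python
from collections import defaultdict

# Hypothetical sample records standing in for a file on HDFS: (user, bytes).
RECORDS = [
    ("alice", 120),
    ("bob", 40),
    ("alice", 300),
    ("carol", 150),
    ("bob", 500),
]


def load(records):
    # Pig Latin: logs = LOAD 'logs.tsv' AS (user:chararray, bytes:long);
    return list(records)


def filter_large(logs, min_bytes=100):
    # Pig Latin: big = FILTER logs BY bytes > 100;
    return [(user, size) for user, size in logs if size > min_bytes]


def group_by_user(rows):
    # Pig Latin: grouped = GROUP big BY user;
    groups = defaultdict(list)
    for user, size in rows:
        groups[user].append(size)
    return dict(groups)


def total_per_user(groups):
    # Pig Latin: totals = FOREACH grouped GENERATE group, SUM(big.bytes);
    return {user: sum(sizes) for user, sizes in groups.items()}


def run_flow(records):
    # Each step consumes the output of the previous one, like a Pig data flow.
    return total_per_user(group_by_user(filter_large(load(records))))


if __name__ == "__main__":
    print(run_flow(RECORDS))
```

Each Python function corresponds to a single Pig Latin statement, and none of the statements mentions mappers or reducers. That is the simplification the list above describes.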

Make it smart.

Recall that the Pig Latin compiler does the job of transforming a Pig Latin program into a series of Java MapReduce jobs. The trick is to ensure that the compiler can optimize the execution of these jobs automatically, so that the user can focus on semantics rather than on how to optimize and access the data. For readers with a SQL background, this will sound familiar. SQL is a declarative query language that we use to access structured data stored in an RDBMS. The RDBMS engine first translates the query to a data access method, then inspects the statistics and builds a series of data access strategies. The cost-based optimizer selects the most efficient strategy for execution.

Don’t limit development.

Make Pig extensible so that developers can contribute and add custom functions to address their specific business problems.
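As an illustration of such a custom function, here is a minimal sketch of a user-defined function (UDF) written in Python, one of the languages Pig accepts for UDFs. The function email_domain, the file name udfs.py, and the alias myfuncs are hypothetical. The fallback decorator is an assumption added only so the file also runs outside Pig, and the registration lines in the comments have not been tested against a real cluster.

```python
try:
    # Provided by Pig when the UDF runs under its streaming_python engine.
    from pig_util import outputSchema
except ImportError:
    # Fallback for this sketch (an assumption): lets the file run outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain of an email address, or None."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


# Hypothetical use from a Pig Latin script, assuming this file is udfs.py:
#   REGISTER 'udfs.py' USING streaming_python AS myfuncs;
#   domains = FOREACH users GENERATE myfuncs.email_domain(email);

if __name__ == "__main__":
    print(email_domain("Ann@Example.COM"))
```

The decorator tells Pig which schema the function returns, so the result can be used like any built-in field in later statements.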

Traditional RDBMS data warehouses make use of the ETL data processing pattern, where we extract data from outside sources, transform it to fit our operational needs, and then load it into the end target, whether it’s an operational data store, a data warehouse, or another variant of database. With big data, however, we typically want to reduce the amount of data we move around, so we end up bringing the processing to the data itself. That is why the language for Pig data flows takes a pass on the old ETL approach and goes with ELT instead: extract the data from our various sources, load it into HDFS, and then transform it as necessary to prepare the data for further analysis.
