Pig Design Principles in Hadoop
Pig Latin is the programming language in which Pig programs are written.
Pig converts a Pig Latin script into MapReduce jobs that can be run
within a Hadoop cluster. When designing Pig Latin, the development team
followed three core design principles to keep it elegant:
Pig Latin provides a streamlined way of interacting with Java MapReduce. It
is, in simple words, an abstraction that offers an easy way to create
parallel programs for data flows and analysis on the Hadoop cluster.
Complex tasks may require a series of interrelated data transformations;
such series are encoded as data flow sequences. Writing data
transformations and flows as Pig Latin scripts instead of Java MapReduce
programs makes these programs much simpler and easier to write, understand,
and maintain (a short sketch follows the list below), since:
We don’t have to write the job in Java,
We don’t have to think in terms of MapReduce, and
We don’t need to come up with custom code to support rich data types.
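To make this point concrete, here is a minimal Pig Latin sketch of such a
data flow; the file paths, field names, and schema are hypothetical. The
equivalent Java MapReduce version would need a mapper class, a reducer
class, and a driver.

    -- Load a log, group it by user, and total the bytes per user.
    logs    = LOAD 'hdfs:///data/logs/access.log'
              AS (user:chararray, url:chararray, bytes:long);
    by_user = GROUP logs BY user;
    totals  = FOREACH by_user GENERATE group AS user,
                                       SUM(logs.bytes) AS total_bytes;
    STORE totals INTO 'hdfs:///data/output/bytes_per_user';

Pig turns this handful of statements into the necessary map, shuffle, and
reduce phases on its own.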
Pig Latin provides a simpler language for exploiting our Hadoop cluster,
making it easier for more people to leverage the power of Hadoop and become
productive sooner.
You may recall that the Pig Latin compiler does the work of transforming a
Pig Latin program into a series of Java MapReduce jobs. The trick is that
the compiler optimizes the execution of these MapReduce jobs automatically,
allowing the user to focus on semantics rather than on how to optimize and
access the data. For the SQL types out there, this explanation will sound
familiar. SQL is set up as a declarative
query that we use to access structured data stored in an RDBMS. The RDBMS engine
first translates the query to a data access method and then inspects the
statistics and builds a series of data access strategies. The cost-based
optimizer selects the most efficient strategy for execution.
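As a rough illustration of that division of labor, consider the
hypothetical script below. The filter is written after the join, but Pig’s
logical optimizer is free to push it ahead of the join so that less data is
shuffled, without the author asking for it.

    -- We state what we want; the compiler decides how to run it.
    users  = LOAD 'hdfs:///data/users'  AS (id:int, country:chararray);
    orders = LOAD 'hdfs:///data/orders' AS (uid:int, amount:double);
    joined = JOIN users BY id, orders BY uid;
    de     = FILTER joined BY country == 'DE';
    -- EXPLAIN prints the logical, physical, and MapReduce plans Pig built.
    EXPLAIN de;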
Pig is extensible, so developers can contribute and add custom functions
that address their specific business problems.
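As a sketch of what that extensibility looks like in practice, the jar path
and class name below are hypothetical stand-ins for a user-defined function
(UDF) written in Java:

    -- Register the jar holding the custom code, give the UDF a short
    -- alias, and then call it like any built-in function.
    REGISTER 'hdfs:///libs/my-udfs.jar';
    DEFINE NormalizeUrl com.example.pig.NormalizeUrl();

    logs  = LOAD 'hdfs:///data/logs/access.log'
            AS (user:chararray, url:chararray);
    clean = FOREACH logs GENERATE user, NormalizeUrl(url) AS url;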
RDBMS data warehouses make use of the ETL data processing pattern, where we
extract data from outside sources, transform it to fit our operational needs,
and then load it into the end target, whether it’s an operational data store, a
data warehouse, or another variant of database. With big data, however, we
typically want to reduce the amount of data we move around, so we end up
bringing the processing to the data itself. That’s why the language for Pig
data flows takes a pass on the old ETL approach and goes with ELT instead:
extract the data from our various sources, load it into HDFS, and then
transform it as necessary to prepare the data for further analysis.
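In Pig Latin terms, the ELT pattern boils down to something like the
following hypothetical sketch: the extract and load steps have already
placed the raw files in HDFS, and the transform runs where the data lives.

    -- The raw data already sits in HDFS; transform it in place.
    raw   = LOAD 'hdfs:///data/raw/events' USING PigStorage(',')
            AS (ts:chararray, user:chararray, amount:double);
    valid = FILTER raw BY amount > 0.0;
    STORE valid INTO 'hdfs:///data/prepared/events' USING PigStorage(',');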