Apache Hadoop Ecosystem
There are several other open source components that are typically seen in a Hadoop
deployment. Hadoop is more than MapReduce and HDFS: it also consists of a “family
of related projects” (an ecosystem, really) for distributed computing and
large-scale data processing. Many (but not all) of these projects are hosted by
the Apache Software Foundation. Some of the most popular and widely used are
listed below:
1. Ambari: An integrated, well-designed set of Hadoop administration tools for
installing, monitoring, and managing a Hadoop cluster. It also includes tools
for adding or removing cluster nodes.
2. Avro: A framework for serializing (a kind of transformation) data sets into a
compact (compressed) binary format (a short Java sketch follows this list).
3. Flume: An effective data flow service for moving massive volumes of log data
into the Hadoop system.
4. HBase: A popular and widely used distributed columnar database that uses the
Hadoop Distributed File System (HDFS) for its underlying storage. With HBase, we
can store data in massively large tables with variable column structures and
schemas (see the HBase sketch after this list).
5. HCatalog: An effective service that provides a relational view of data stored
in Hadoop, including a standard approach for tabular data.
6. Hive: A distributed data warehouse for large data sets stored in HDFS. It
uses an SQL-based query language called “HiveQL” (a sketch using Hive's JDBC
driver appears after this list).
7. Hue: An effective Hadoop administrative interface comprising handy GUI tools
for browsing files, issuing Hive and Pig queries, and developing Oozie
workflows.
8. Mahout: A library of machine learning and statistical algorithms that are
implemented in MapReduce and can execute natively on Hadoop.
9. Oozie: A very useful workflow management tool that schedules tasks and chains
them together into Hadoop applications.
10. Pig: A robust platform for analyzing very large data sets that executes on
top of HDFS. It consists of an infrastructure layer (a compiler that generates
sequences of MapReduce programs) and a language layer (a query language known as
Pig Latin).
11. Sqoop: A very handy tool for efficiently moving massive amounts of data
between relational databases (RDBMSs) and HDFS.
12. ZooKeeper: A simple and useful interface for the centralized coordination of
services (such as naming, configuration, and synchronization) used by
distributed applications (see the final sketch after this list).
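
To make the serialization in item 2 concrete, here is a minimal sketch using
Avro's Java API. The User schema, its two fields, and the output file name
users.avro are illustrative assumptions, not anything specified above.

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    import java.io.File;

    public class AvroSketch {
        public static void main(String[] args) throws Exception {
            // A hypothetical record schema with two fields (name, age).
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);

            // Write the record into Avro's compact binary container format.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }

Reading the file back with DataFileReader recovers the same records together
with the schema, which is what makes the format self-describing.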
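Item 4 describes HBase tables with variable column structures. The sketch below
inserts a single cell with the standard HBase Java client; the table name
("users"), the column family ("info"), and a running cluster reachable through
the default configuration are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {
                // Row key "row1"; columns live under a family ("info") and can
                // vary freely from row to row, which is HBase's flexible schema.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);
            }
        }
    }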
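HiveQL (item 6) can also be issued from Java through Hive's JDBC driver. This is
a minimal sketch assuming a HiveServer2 instance listening on the default port
10000, the hive-jdbc driver on the classpath, and a pre-existing users table;
all of these are assumptions, not details from the text above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSketch {
        public static void main(String[] args) throws Exception {
            // Assumes HiveServer2 at localhost:10000 with no authentication.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement();
                 // A HiveQL query; Hive compiles it into jobs over data in HDFS.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT name, COUNT(*) FROM users GROUP BY name")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }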
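Finally, the coordination services in item 12 boil down to a small tree of named
znodes that distributed applications read and watch. The sketch below creates
one with the ZooKeeper Java client; the ensemble address localhost:2181 and the
znode path /app-config are illustrative assumptions.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a (hypothetical) ensemble; the watcher ignores events.
            // A production client would wait for the connected event first.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

            // Store a small piece of shared configuration under a named path.
            zk.create("/app-config", "db=prod".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any other client in the cluster can now read the same value.
            System.out.println(new String(zk.getData("/app-config", false, null)));
            zk.close();
        }
    }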
As with all software development, the Hadoop ecosystem and its commercially
available distributions are constantly evolving, with new and improved
technologies and tools emerging all the time.