There are several other open source components that typically appear in a Hadoop deployment. Hadoop is more than MapReduce and HDFS; it also consists of a "family of related projects" (an ecosystem, really) for distributed, large-scale data processing. Many (but not all) of these projects are hosted by the Apache Software Foundation. Some of the most popular and widely used are listed below:
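The MapReduce model mentioned above can be illustrated with the classic word-count pattern: a map phase emits key/value pairs, a shuffle step groups them by key, and a reduce phase aggregates each group. The sketch below simulates all three steps in plain Python; the function names and sample data are invented for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle step: group intermediate pairs by key, as Hadoop
# does between the map and reduce phases.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result["the"])  # "the" appears three times across the input
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data over the network, but the logical dataflow is exactly this.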
1. Ambari: Ambari is an integrated, well-designed set of Hadoop administration tools for installing, monitoring, and managing a Hadoop cluster. It also includes tools for adding and removing slave nodes.
2. Avro: It is a framework for serializing data sets into a compact, schema-based binary format.
3. Flume: It is a data flow service for moving massive volumes of log data into the Hadoop system.
4. HBase: It is a widely used distributed columnar database that uses the Hadoop Distributed File System (HDFS) for its underlying storage. With HBase, we can store data in massively large tables with variable column structures.
5. HCatalog: It is a service that provides a relational view of data stored in Hadoop, giving applications a standard approach to tabular data.
6. Hive: Hive is a distributed data warehouse for large data sets stored in HDFS. It uses an SQL-based query language called HiveQL.
7. Hue: Hue is a Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie workflows.
8. Mahout: Apache Mahout is a library of machine learning and statistical algorithms implemented in MapReduce so that they execute natively on Hadoop.
9. Oozie: It is a workflow management tool that schedules tasks and chains them together into larger Hadoop applications.
10. Pig: Pig is a robust platform for analyzing very large data sets that executes on top of HDFS. It consists of an infrastructure layer, a compiler that generates sequences of MapReduce programs, and a language layer comprising a query language known as Pig Latin.
11. Sqoop: It is a handy tool for moving massive amounts of data between relational databases (RDBMSs) and HDFS.
12. ZooKeeper: It is a simple, useful interface for the centralized coordination of services (such as naming, configuration, and synchronization) used by distributed applications.
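Avro's claim to a "compact binary format" (item 2 above) comes from keeping the schema outside the data: field names and types are declared once, so each record carries only its raw values. The toy comparison below uses only Python's standard library to show the idea; it is not Avro's actual wire format, and the record layout is an assumption chosen for illustration.

```python
import json
import struct

# A hypothetical record: (user_id, age, score).
record = (123456, 42, 98.5)

# Text encoding: every record repeats its field names as strings.
as_json = json.dumps(
    {"user_id": record[0], "age": record[1], "score": record[2]}
).encode()

# Schema-driven binary encoding: the "schema" (a 4-byte int, a 1-byte
# unsigned char, an 8-byte double) lives outside the data, so each
# record is just 13 fixed bytes.
as_binary = struct.pack("<iBd", *record)

print(len(as_json), len(as_binary))  # the binary record is far smaller
```

Across millions of records, removing the per-record field names and using fixed-width types is exactly where serialization frameworks like Avro save space.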
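ZooKeeper (item 12 above) organizes coordination data as a hierarchy of small named nodes, called znodes, that clients can read, write, and watch for changes. The in-memory sketch below imitates only that idea; the class and method names are invented for illustration, and the real ZooKeeper client API, which is distributed and replicated, differs substantially.

```python
class ToyZNodeTree:
    """A minimal, single-process imitation of ZooKeeper's namespace."""

    def __init__(self):
        self._nodes = {}      # path -> data bytes
        self._watchers = {}   # path -> list of callbacks

    def create(self, path, data=b""):
        self._nodes[path] = data

    def get(self, path):
        return self._nodes[path]

    def watch(self, path, callback):
        """Register a one-shot callback fired on the next write to path."""
        self._watchers.setdefault(path, []).append(callback)

    def set(self, path, data):
        self._nodes[path] = data
        # Like ZooKeeper watches, these fire once and must be re-registered.
        for cb in self._watchers.pop(path, []):
            cb(path, data)

# Example: one service publishes its address; another reacts when it moves.
tree = ToyZNodeTree()
tree.create("/services/db", b"host1:5432")
seen = []
tree.watch("/services/db", lambda path, data: seen.append(data))
tree.set("/services/db", b"host2:5432")   # fires the registered watch
print(seen)
```

This publish-and-watch pattern is how distributed applications use ZooKeeper for naming and configuration: clients discover a value by path and are notified when it changes.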
The Hadoop ecosystem and its commercially available distributions are constantly evolving, with new and improved technologies and tools emerging all the time.