Commercially available distributions of Hadoop bundle different combinations of open source components from the Apache Software Foundation and from several other places. These components are combined into a single packaged product, which saves us the effort of assembling and integrating our own set. In addition to open source software, vendors offer proprietary software, support, consulting services, training, and other related services.
So how do we choose, from the numerous options, the Hadoop distribution that best serves our purpose? Not all Hadoop distributions include the same components (although they all include Hadoop’s core capabilities), and not all components in one distribution are compatible with other distributions.
The criteria for selecting the most appropriate distribution can be articulated as the following set of questions:
•What do we want to achieve with Hadoop?
•How can we use Hadoop to gain business insight?
•What business problems do we want to solve?
•What data will be analyzed?
•Are we willing to use proprietary components, or do we prefer open source offerings?
•Is the Hadoop infrastructure that we are considering flexible enough for all our use cases?
•What existing tools will we want to integrate and operate with Hadoop?
•Do the administrators require management tools? (Hadoop’s core distribution doesn’t include any administrative tools.)
•Will the package or plan we choose allow us to move to another product without obstacles such as vendor lock-in? (Application code that we cannot transfer to other distributions and data stored in proprietary formats are good examples of lock-in.)
•Will the distribution we choose meet our future requirements, insofar as we are able to anticipate those needs?
A basic approach to comparing distributions is to create a feature matrix: a table or chart that lists the specifications and features of every distribution we are considering. We can then base our choice on the set of features and specifications that best suits our requirements for a specific business problem.
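A feature matrix can also be kept in a small script, which makes it easy to re-rank the candidates when the requirements change. The following is only a minimal sketch: the distribution names, the feature names, and the weights are all hypothetical placeholders rather than data about real products, and a simple weighted sum is just one assumed way of turning the matrix into a ranking.

```python
# Hypothetical feature matrix: one entry per candidate distribution.
# Every name and value here is a placeholder, not real product data.
feature_matrix = {
    "distribution_a": {
        "management_tools": True,
        "open_source_only": False,
        "vendor_support": True,
    },
    "distribution_b": {
        "management_tools": False,
        "open_source_only": True,
        "vendor_support": False,
    },
}

# Hypothetical weights: how much each feature matters for one
# specific business problem (higher means more important).
weights = {
    "management_tools": 3,
    "open_source_only": 2,
    "vendor_support": 1,
}


def score(features, weights):
    """Sum the weights of the features that a distribution provides."""
    return sum(w for name, w in weights.items() if features.get(name, False))


def rank(matrix, weights):
    """Return (distribution, score) pairs, best match first."""
    scored = [(name, score(features, weights)) for name, features in matrix.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, total in rank(feature_matrix, weights):
        print(f"{name}: {total}")
```

Changing the weights for a different business problem and running the script again yields a new ranking without rebuilding the table.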
On the other hand, if our requirements consist of prototyping and experimentation, the latest official Apache Hadoop release is likely to be the right choice. The most recent releases have the newest, most exciting features, but if we want stability we don’t need excitement. For stability, look for an older stable release that has been available long enough to have had some incremental releases (these typically include bug fixes and minor features).
Whenever we think about open source Hadoop distributions, we should give a moment’s thought to the concept of open source fidelity: the degree to which a specific distribution is compatible with the open source components on which it depends. A high degree of fidelity enables integration with other products that are designed to be compatible with these open source components.
The open source approach to software development is itself an important part of our Hadoop plans because it promotes compatibility with a host of third-party tools that we can leverage in our own Hadoop deployment. The open source approach also enables engagement with the Apache Hadoop community, which in turn gives us the opportunity to tap into a deeper pool of skills and innovation to enrich our Hadoop experience.