A number of companies offer tools designed to help you get the most out of your Hadoop implementation. Here’s a sampling:
The Amazon Elastic MapReduce (Amazon EMR) web service enables you to easily process vast amounts of data by provisioning as much capacity as you need. Amazon EMR uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Amazon EMR lets you analyse data without having to worry about setting up, managing, or tuning Hadoop clusters.
Amazon's Elastic Compute Cloud (EC2) is basically a server on demand. After you register with AWS and EC2, credit card details are all that's required to gain access to a dedicated virtual machine, on which you can run a variety of operating systems, including Windows and many variants of Linux. Need more servers? Start more. Need more powerful servers? Change to one of the higher-specification (and higher-cost) types offered.
Amazon's Simple Storage Service (S3) is a storage service that provides a simple key/value storage model. You create objects, which can be anything from text files to images to MP3s, through web, command-line, or programmatic interfaces, and you store and retrieve them by key within a hierarchical naming scheme.
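To make the key/value model concrete, here is a minimal in-memory sketch in Python. It is a toy stand-in, not the S3 API: the `ToyBucket` class, its method names, and the sample keys are all hypothetical. It illustrates one way a flat set of keys can be read as a hierarchy, by listing keys under a prefix and grouping them at a delimiter.

```python
class ToyBucket:
    """In-memory stand-in for a bucket: a flat mapping from key to bytes."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data):
        # Store (or overwrite) the object under its full key.
        self._objects[key] = bytes(data)

    def get(self, key):
        # Retrieve an object by its exact key.
        return self._objects[key]

    def list(self, prefix="", delimiter="/"):
        """Return (keys, folders) found directly under the given prefix."""
        keys, folders = [], set()
        for key in sorted(self._objects):
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                # Deeper keys are collapsed into a single folder-like prefix.
                folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                keys.append(key)
        return keys, sorted(folders)


bucket = ToyBucket("example-bucket")  # hypothetical bucket name
bucket.put("logs/2024/01/access.log", b"GET /index.html")
bucket.put("logs/2024/02/access.log", b"GET /about.html")
bucket.put("images/logo.png", b"\x89PNG")
bucket.put("readme.txt", b"hello")

top_keys, top_folders = bucket.list()
month_keys, month_folders = bucket.list(prefix="logs/2024/")
```

In this sketch the storage itself stays flat. The folder-like view comes entirely from how the keys are named and listed.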
Cloud-based deployments of Hadoop applications like those offered by Amazon EMR are somewhat different from on-premise deployments. You would follow these steps to deploy an application on Amazon EMR:
1. Script a job flow in your language of choice, including higher-level options such as Hive (SQL-like queries) or Pig (data-flow scripts).
2. Upload your data and application to Amazon S3, which provides reliable storage for your data.
3. Log in to the AWS Management Console to start an Amazon EMR job flow by specifying the number and type of Amazon EC2 instances that you want, as well as the location of the data on Amazon S3.
4. Monitor the progress of your job flow, and then retrieve the output from Amazon S3 using the AWS Management Console. You pay only for the resources that you consume.
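The steps above use the AWS Management Console, but the same information can also be supplied programmatically. The Python sketch below only assembles a request in the shape that the `run_job_flow` call of boto3's EMR client accepts. It does not contact AWS. The bucket name, script names, S3 paths, instance type, release label, and role names are illustrative assumptions, and the helper `build_job_flow_request` is hypothetical.

```python
def build_job_flow_request(name, bucket, instance_type, instance_count):
    """Assemble keyword arguments shaped for boto3's EMR run_job_flow call."""
    base = f"s3://{bucket}"  # step 2: data and application live in S3
    return {
        "Name": name,
        "LogUri": f"{base}/logs/",
        "ReleaseLabel": "emr-6.15.0",  # assumption: use a release available to you
        "Instances": {
            # Step 3: number and type of EC2 instances.
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Shut the cluster down once the steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "streaming-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    # Hypothetical mapper and reducer scripts uploaded in step 2.
                    "-files", f"{base}/app/mapper.py,{base}/app/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", f"{base}/input/",
                    "-output", f"{base}/output/",
                ],
            },
        }],
        # Assumption: the default EMR roles exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_job_flow_request("demo-job-flow", "example-bucket", "m5.xlarge", 3)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("emr").run_job_flow(**request)
#   cluster_id = response["JobFlowId"]
```

For step 4, the returned cluster ID could then be passed to the client's `describe_cluster` or `list_steps` calls to poll progress, with the results read back from the output location in S3.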
Though Hadoop is an attractive platform for many kinds of workloads, it needs a significant hardware footprint, especially when your data grows to hundreds of terabytes and beyond. This is where Amazon EMR is most practical: as a platform for short-term, Hadoop-based analysis or for testing the viability of a Hadoop-based solution before committing to an investment in on-premise hardware.
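One rough way to reason about the short-term versus on-premise trade-off is a break-even estimate. The Python sketch below is back-of-the-envelope arithmetic only. Every figure in it is a made-up placeholder rather than real pricing, and it ignores power, staffing, storage, and data-transfer costs.

```python
def break_even_hours(hardware_cost, hourly_rate_per_instance, instance_count):
    """Cluster-hours of rental that would cost as much as buying the hardware."""
    cluster_hourly = hourly_rate_per_instance * instance_count
    if cluster_hourly <= 0:
        raise ValueError("hourly cluster cost must be positive")
    return hardware_cost / cluster_hourly


# Placeholder figures, chosen only to show the calculation.
hours = break_even_hours(
    hardware_cost=200_000,          # hypothetical purchase price
    hourly_rate_per_instance=0.25,  # hypothetical rental rate per instance
    instance_count=20,
)
print(f"{hours:,.0f} cluster-hours, about {hours / 24:,.0f} days of continuous use")
```

If the expected usage falls well below the break-even point, renting capacity for the duration of the analysis is likely the cheaper option under these simplified assumptions.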