Tools: Amazon Services
A number of companies offer tools designed to help you get the most out of your
Hadoop implementation. Here’s a sampling:
Amazon Elastic MapReduce (Amazon EMR)
This web service enables you to easily process vast amounts of data by provisioning
as much capacity as you need. Amazon EMR uses a hosted Hadoop framework running
on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2)
and Amazon Simple Storage Service (Amazon S3). Amazon EMR lets you analyse data
without having to worry about setting up, managing, or tuning Hadoop clusters.
Amazon Elastic Compute Cloud (EC2) is
basically a server on demand. After you register with AWS and EC2, credit card
details are all that's required to gain access to a dedicated virtual machine.
It's easy to run a variety of operating systems on your server, including
Windows and many variants of Linux. Need more servers? Start more. Need more
powerful servers? Change to one of the higher-specification (and higher-cost) types.
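As a rough illustration of "start more" versus "change to a higher specification", the sketch below builds the keyword arguments for the `run_instances` call of the boto3 EC2 client. It assumes the boto3 SDK; the image ID `ami-EXAMPLE`, the instance types, and the helper names `ec2_launch_params` and `launch` are placeholders chosen for this example, not values from the text.

```python
def ec2_launch_params(image_id, instance_type="m5.large", count=1):
    """Build keyword arguments for the boto3 EC2 client's run_instances call."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": image_id,            # the machine image determines the operating system
        "InstanceType": instance_type,  # larger types cost more per hour
        "MinCount": count,
        "MaxCount": count,
    }


def launch(params, region="us-east-1"):
    """Start the instances. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the sketch can be read and tested offline

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**params)
    return [inst["InstanceId"] for inst in response["Instances"]]


# "ami-EXAMPLE" is a hypothetical placeholder, not a real image ID.
one_small = ec2_launch_params("ami-EXAMPLE")
# More servers: raise count. More powerful servers: pick a larger instance type.
four_bigger = ec2_launch_params("ami-EXAMPLE", instance_type="m5.2xlarge", count=4)
```

Only `launch` talks to AWS; the parameter dictionaries can be inspected or logged before anything is started (and billed).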
Amazon Simple Storage Service (S3) is a
storage service that provides a simple key/value storage model. You create
objects, which can be anything from text files to images to MP3s, through web,
command-line, or programmatic interfaces, and then store and retrieve your data
based on a hierarchical model.
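One way to picture how a key/value store can still present a hierarchy is to treat each key as a path and group keys by a delimiter. The sketch below is a local simulation in plain Python, not a call to S3 itself: the dictionary `bucket`, the sample keys, and the helper `list_level` are all assumptions made for illustration.

```python
def list_level(store, prefix="", delimiter="/"):
    """Return (objects, sub_prefixes) found directly under prefix,
    much like listing a single directory level."""
    objects, prefixes = [], set()
    for key in sorted(store):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The key continues past this level: report only the "folder".
            prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(prefixes)


# A flat mapping of key -> object bytes standing in for a bucket.
bucket = {
    "logs/2012/01/access.log": b"january log lines",
    "logs/2012/02/access.log": b"february log lines",
    "images/logo.png": b"\x89PNG",
    "readme.txt": b"hello",
}

top_objects, top_prefixes = list_level(bucket)
log_objects, log_prefixes = list_level(bucket, prefix="logs/2012/")
```

Here the storage stays a flat set of keys, and the hierarchy appears only when keys are grouped by prefix and delimiter.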
Deployments of Hadoop applications like those offered by Amazon EMR are
somewhat different from on-premise deployments. You would follow these steps to
deploy an application on Amazon EMR:
1. Script a job flow in your language of choice, including a SQL-like language
   such as Hive or Pig.
2. Upload your data and application to Amazon S3, which provides reliable
   storage for your data.
3. Log in to the AWS Management Console to start an Amazon EMR job flow by
   specifying the number and type of Amazon EC2 instances that you want, as
   well as the location of the data on Amazon S3.
4. Monitor the progress of your job flow, and then retrieve the output from
   Amazon S3 using the AWS Management Console, paying only for the resources
   that you consume.
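The steps above describe the console workflow; the same job flow can also be started programmatically. The following is a minimal sketch assuming the boto3 SDK and its EMR client's `run_job_flow` call. The bucket name `example-bucket`, the script path, the release label, the instance type, and the helper names `build_job_flow` and `submit` are illustrative assumptions, and the role names are the usual EMR defaults, which may differ in your account.

```python
def build_job_flow(name, script_uri, input_uri, output_uri,
                   instance_type="m5.xlarge", instance_count=3,
                   release_label="emr-6.15.0"):
    """Build the request for boto3's EMR run_job_flow with one Hive step."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            # Number and type of EC2 instances for the cluster.
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Shut the cluster down when the step finishes, so billing stops.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "Run Hive script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", script_uri,
                         "-d", "INPUT=" + input_uri,
                         "-d", "OUTPUT=" + output_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


def submit(request, region="us-east-1"):
    """Start the job flow and return its ID. Requires boto3 and credentials."""
    import boto3  # imported lazily so the request can be built offline

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# All S3 locations below are hypothetical placeholders.
request = build_job_flow(
    name="nightly-report",
    script_uri="s3://example-bucket/scripts/report.q",
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
)
```

After `submit(request)` returns a job flow ID, progress could be polled with the EMR client's `describe_cluster(ClusterId=...)` call, and the results read back from the output location on Amazon S3.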
Although Hadoop is an attractive platform for many kinds of workloads, it needs a
significant hardware footprint, especially when your data approaches scales of
hundreds of terabytes and beyond. This is where Amazon EMR is most practical:
as a platform for short-term, Hadoop-based analysis or for testing the viability
of a Hadoop-based solution before committing to an investment in on-premise
hardware.