Pros and Cons of Hadoop
As with any tool, it's important to understand when Hadoop is a good fit for the
problem in question. The architecture choices made within Hadoop enable it to
be the flexible and scalable data processing platform it is today. But, as with
most architecture or design choices, there are consequences that must be
understood. Primary amongst these is the fact that Hadoop is a batch processing
system. When you execute a job across a large data set, the framework will churn
away until the final results are ready. With a large cluster, answers across
even huge data sets can be generated relatively quickly, but the fact remains
that the answers are not generated fast enough to service impatient users.
Consequently, Hadoop alone is not well suited to servicing low-latency queries,
such as those received by a website, a real-time system, or a similar problem domain.
When Hadoop runs jobs on large data sets, the overhead of setting up the job,
determining which tasks will run on each node, and all the other housekeeping
activities that are required is a trivial fraction of the overall execution time.
For jobs on small data sets, however, this fixed execution overhead means that even
trivial MapReduce jobs may take a minimum of 10 seconds.
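A quick back-of-the-envelope calculation makes this point concrete. The sketch below uses the 10-second figure mentioned above together with a hypothetical cluster throughput (the processing rate is an assumed, illustrative number, not a Hadoop benchmark) to show how the fixed overhead dominates small jobs but vanishes for large ones:

```python
# Illustrative only: the ~10-second fixed cost from the text, plus a
# hypothetical aggregate processing rate, to show why job overhead
# dominates small jobs.

JOB_OVERHEAD_S = 10          # assumed fixed setup/scheduling cost per job
PROCESSING_RATE_MB_S = 500   # hypothetical cluster-wide throughput

def overhead_fraction(data_size_mb):
    """Fraction of total job time spent on fixed overhead."""
    processing_s = data_size_mb / PROCESSING_RATE_MB_S
    return JOB_OVERHEAD_S / (JOB_OVERHEAD_S + processing_s)

# A 10 MB job is almost entirely overhead; a 1 TB job is almost
# entirely useful work.
small = overhead_fraction(10)          # ~0.998
large = overhead_fraction(1_000_000)   # ~0.005
```

Under these assumptions, the fixed overhead is over 99 percent of a tiny job's runtime but well under 1 percent of a terabyte-scale job's, which is why Hadoop's design trades small-job latency for large-job throughput.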
But haven't Google and Yahoo both been among the strongest proponents of this method
of computation, and aren't they all about websites where response time is
critical? The answer is yes, and it highlights an important aspect of how to
incorporate Hadoop into any organization or activity or use it in conjunction
with other technologies in a way that exploits the strengths of each. In a paper
(http://research.google.com/archive/googlecluster.html), Google sketched how it
utilized MapReduce at the time: after a web crawler
retrieved updated webpage data, MapReduce processed the huge data set, and from
this, produced the web index that a fleet of MySQL servers used to service
end-user search requests.
The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed
on the same set of servers. Each host that contains data and the HDFS component
to manage it also hosts a MapReduce component that can schedule and execute
data processing. When a job is submitted to Hadoop, an optimization process can,
as far as possible, schedule tasks on the hosts where their data resides,
minimizing network traffic and maximizing performance. A part of the challenge
with Hadoop is in breaking down the overall problem into the best combination
of map and reduce functions.
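To make that decomposition concrete, here is a minimal pure-Python sketch (not the actual Hadoop API) of the classic word-count problem broken into map and reduce functions. The grouping dictionary stands in for the shuffle phase that Hadoop itself performs between the two steps:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for each word in an input line."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for a single word."""
    return (word, sum(counts))

def run_job(lines):
    """Drive the job, simulating Hadoop's shuffle phase with a dict
    that groups mapped values by key before reducing."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

result = run_job(["the quick brown fox", "the lazy dog"])
# result["the"] == 2
```

In a real cluster, many instances of `map_fn` run in parallel against local data blocks, and the framework, not user code, handles the grouping and the distribution of keys to reducers; the hard design work is deciding how to express a given problem in this two-function shape.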