As with any tool, it's important to understand when Hadoop is a good fit for the problem in question. The architecture choices made within Hadoop enable it to be the flexible and scalable data processing platform it is today. But, as with most architecture or design choices, there are consequences that must be understood. Primary amongst these is the fact that Hadoop is a batch processing system. When you execute a job across a large data set, the framework will churn away until the final results are ready. With a large cluster, answers across even huge data sets can be generated relatively quickly, but the fact remains that the answers are not generated fast enough to service impatient users. Consequently, Hadoop alone is not well suited to low-latency queries such as those received on a website, a real-time system, or a similar problem domain.

When Hadoop is running jobs on large data sets, the overhead of setting up the job, determining which tasks are run on each node, and all the other housekeeping activities that are required is a trivial part of the overall execution time. But, for jobs on small data sets, there is an execution overhead that means even simple MapReduce jobs may take a minimum of 10 seconds.

But haven't Google and Yahoo both been among the strongest proponents of this method of computation, and aren't they all about such websites where response time is critical? The answer is yes, and it highlights an important aspect of how to incorporate Hadoop into any organization or activity or use it in conjunction with other technologies in a way that exploits the strengths of each. In a paper,( ,Google sketches how they utilized MapReduce at the time; after a web crawler retrieved updated webpage data, MapReduce processed the huge data set, and from this, produced the web index that a fleet of MySQL servers used to service end-user search requests.

The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage it also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use an optimization process as much as possible to schedule data on the hosts where the data resides, minimizing network traffic and maximizing performance. A part of the challenge with Hadoop is in breaking down the overall problem into the best combination of map and reduces functions.

  Modified On Mar-14-2018 03:29:15 AM

Leave Comment