YARN’s Resource Management
The most important component of YARN is the Resource Manager, which governs and
maintains all the data processing resources in the Hadoop cluster. In other words,
the Resource Manager is a dedicated scheduler whose task is to assign resources
to requesting applications.

Its core tasks are to maintain a global view of all the resources in the
cluster, manage resource requests, schedule those requests, and then assign
the needed resources to the requesting application. The Resource Manager is a
critical component in a Hadoop cluster and should run on a dedicated master node.

The Resource Manager is basically a pure scheduler; it relies on scheduler
modules for the actual scheduling logic. We can choose from the same
schedulers that were available in Hadoop 1, which have all been updated to work
with YARN: FIFO (first in, first out), Capacity, and Fair Share.
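
Which scheduler module the Resource Manager loads is a configuration choice. As
a minimal sketch, assuming a stock Hadoop 2 distribution, selecting the Fair
Share scheduler in yarn-site.xml looks like this:

    <!-- yarn-site.xml: tell the Resource Manager which scheduler module to use -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <!-- Stock alternatives: ...scheduler.fifo.FifoScheduler and
           ...scheduler.capacity.CapacityScheduler -->
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
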
The Resource Manager is completely agnostic with regard to both applications and
frameworks; it doesn't have any dogs in those particular hunts, in other
words. It has no concept of map or reduce tasks, it doesn't even track the
progress of jobs or their individual tasks, and it doesn't handle failovers,
either. In short, the Resource Manager is a complete departure from the
JobTracker daemon we looked at for Hadoop 1 environments. What the Resource Manager
does do is schedule workloads, and it does that job very well. This high degree
of separation of duties, concentrating on only one aspect of the cluster while
ignoring everything else, is exactly what makes YARN much more scalable and
powerful: it can provide a generic platform for applications, and it can support
a multi-tenant Hadoop cluster, multi-tenant in the sense that various business
units can share the same Hadoop cluster.

Each slave node has a Node Manager daemon, which acts as a slave for the Resource
Manager. As with the TaskTracker, every slave node runs a service that binds it
to the processing service (Node Manager) and the storage service (DataNode)
that enable Hadoop to be a distributed system. Every Node Manager tracks the
resources available for data processing on its slave node and sends regular reports to
the Resource Manager.
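
The pool of resources that each Node Manager offers up is declared in its
configuration. A minimal sketch, assuming stock property names in yarn-site.xml
(the values are just example sizes for a slave node with 8 GB of memory and
8 cores to spare):

    <!-- yarn-site.xml on each slave node: resources the Node Manager advertises -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value>  <!-- memory (in MB) available for containers -->
    </property>
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>8</value>  <!-- virtual cores available for containers -->
    </property>
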
The processing resources in a Hadoop cluster are utilized in bite-size pieces
called containers. A container is a grouped set of all the resources needed to
run an application: CPU cores, network bandwidth, memory, and disk space. A
deployed container executes as an individual process on a slave node in a
Hadoop cluster. All container processes running on a slave node are initially
provisioned, monitored,
and tracked by that slave node’s Node Manager daemon.
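
In the YARN API, a container's size is captured in a Resource record. Note that
the stock Resource record in the Hadoop 2 API expresses only memory and virtual
cores directly; the sketch below is a minimal, hypothetical example of
describing a container's capacity in Java:

    import org.apache.hadoop.yarn.api.records.Resource;

    public class ContainerSpec {
        public static void main(String[] args) {
            // A container's capacity: memory in megabytes plus virtual CPU cores.
            Resource capability = Resource.newInstance(1024, 2); // 1 GB, 2 vcores
            System.out.println("Container-sized resource ask: " + capability);
        }
    }
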
Each application running on the Hadoop cluster has its own, dedicated Application
Master instance, which actually executes in a container process on a slave node.
The Application Master sends regular messages to the Resource Manager with its
status and the state of the application's resource needs. Based on the results
of its scheduling, the Resource Manager allocates container resource leases,
which are basically reservations for the resources that containers need, to the
Application Master on specific slave nodes.
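
This request-and-lease cycle is visible in YARN's client API. What follows is a
minimal sketch of the Application Master's side of the conversation, using the
stock AMRMClient class from Hadoop 2; the empty registration arguments and the
container size are assumptions for illustration only:

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AppMasterSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new YarnConfiguration());
            rmClient.start();

            // Introduce this Application Master to the Resource Manager
            // (host name, RPC port, and tracking URL left blank in this sketch).
            rmClient.registerApplicationMaster("", 0, "");

            // Ask for one container: 1 GB of memory and 1 virtual core.
            Resource capability = Resource.newInstance(1024, 1);
            rmClient.addContainerRequest(
                    new ContainerRequest(capability, null, null, Priority.newInstance(0)));

            // Each allocate() call doubles as a status heartbeat; the response
            // carries whatever container leases the Resource Manager has granted.
            AllocateResponse response = rmClient.allocate(0.0f);
            for (Container granted : response.getAllocatedContainers()) {
                System.out.println("Leased container " + granted.getId()
                        + " on node " + granted.getNodeId());
            }
        }
    }
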
The Job History Server is another example of a function that the JobTracker used to
handle, and it has been siphoned off as a self-contained daemon. Any client
requests for a job history or the status of current jobs are served by the Job
History Server.
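
Because the Job History Server is its own daemon, clients need to know where to
find it. A minimal sketch for mapred-site.xml, assuming stock property names and
default ports; the host name is a made-up example:

    <!-- mapred-site.xml: where clients reach the Job History Server -->
    <property>
      <name>mapreduce.jobhistory.address</name>
      <value>historyserver.example.com:10020</value>  <!-- RPC endpoint -->
    </property>
    <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>historyserver.example.com:19888</value>  <!-- web UI -->
    </property>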