YARN’s Resource Manager
The core component of YARN is the Resource Manager, which governs all the data processing resources in the Hadoop cluster. In other words, the Resource Manager is a dedicated scheduler that assigns resources to requesting applications.
Its tasks are to maintain a global view of all the resources in the cluster, to handle resource requests, to schedule those requests, and then to assign the needed resources to the requesting applications. The Resource Manager is a critical component in a Hadoop cluster, and it should run on a dedicated master node.
However, the Resource Manager is a pure scheduler; it relies on pluggable scheduler modules for the actual scheduling logic. We can choose from the same schedulers that were available in Hadoop 1, all of which have been updated to work with YARN: FIFO (first in, first out), Capacity, and Fair Share.
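To make the difference between these policies concrete, here is a toy sketch, written in Python for brevity rather than YARN’s Java, and using invented names (`fifo_order`, `fair_order`) rather than any real scheduler API. It contrasts FIFO ordering with a heavily simplified Fair Share policy over pending memory requests:

```python
# Toy sketch only, not YARN's actual scheduler code. Requests are
# (application, memory_mb) tuples, listed in arrival order.

def fifo_order(requests):
    """FIFO: serve requests strictly in arrival order."""
    return list(requests)

def fair_order(requests):
    """Fair Share (simplified): repeatedly grant the pending request
    whose application currently holds the least memory."""
    allocated = {app: 0 for app, _ in requests}
    pending = list(requests)
    order = []
    while pending:
        # Pick the request from the app with the smallest allocation so far.
        nxt = min(pending, key=lambda r: allocated[r[0]])
        pending.remove(nxt)
        allocated[nxt[0]] += nxt[1]
        order.append(nxt)
    return order

reqs = [("appA", 1024), ("appA", 1024), ("appB", 512)]
print(fifo_order(reqs))  # appA's two requests are served first
print(fair_order(reqs))  # appB is interleaved despite arriving last
```

Under FIFO, appA’s two requests are satisfied before appB gets anything; under the fair policy, appB’s request is granted second because appA already holds resources. The real Capacity and Fair Share schedulers add queues, weights, and preemption on top of this basic idea.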
The Resource Manager is completely agnostic with respect to both applications and frameworks; it has no dog in those particular hunts, in other words. It has no concept of map or reduce tasks, it doesn’t track the progress of jobs or their individual tasks, and it doesn’t handle failover. In short, the Resource Manager is a complete departure from the JobTracker daemon we looked at for Hadoop 1 environments. What the Resource Manager does do is schedule workloads, and it does that job very well. This high degree of separation of duties, concentrating on one aspect and ignoring everything else, is exactly what makes YARN much more scalable than Hadoop 1, able to provide a generic platform for applications, and able to support a multi-tenant Hadoop cluster; multi-tenant in the sense that different business units can share the same Hadoop cluster.
YARN’s Node Manager
Every slave node has a Node Manager daemon, which acts as a slave for the Resource Manager. As with the TaskTracker in Hadoop 1, every slave node runs both a processing service (the Node Manager) and a storage service (the DataNode); together, these services enable Hadoop to be a distributed system. Every Node Manager tracks the resources available for data processing on its slave node and sends regular reports to the Resource Manager.
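The reporting loop can be sketched as follows. This is a toy model in Python, not Hadoop’s actual RPC protocol, and the class and field names (`NodeReport`, `ResourceManagerView`) are invented for illustration: each heartbeat updates the Resource Manager’s global view of free resources on that node.

```python
# Toy model of Node Manager heartbeats, not Hadoop's real protocol.
from dataclasses import dataclass

@dataclass
class NodeReport:
    hostname: str
    free_vcores: int
    free_memory_mb: int

class ResourceManagerView:
    """Global view of cluster resources, rebuilt from node heartbeats."""
    def __init__(self):
        self.nodes = {}

    def heartbeat(self, report: NodeReport):
        # Each heartbeat replaces the last known state for that node.
        self.nodes[report.hostname] = report

    def cluster_free_memory_mb(self):
        return sum(n.free_memory_mb for n in self.nodes.values())

rm = ResourceManagerView()
rm.heartbeat(NodeReport("slave1", free_vcores=4, free_memory_mb=8192))
rm.heartbeat(NodeReport("slave2", free_vcores=8, free_memory_mb=16384))
print(rm.cluster_free_memory_mb())  # 24576
```

The key design point survives the simplification: the Resource Manager never inspects the nodes directly; its cluster-wide view is assembled entirely from the reports the Node Managers push to it.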
The processing resources in a Hadoop cluster are utilized in bite-size pieces called containers. A container is a grouped set of all the resources needed to run an application: CPU cores, network bandwidth, memory, and disk space. A deployed container executes as an individual process on a slave node in a Hadoop cluster.
All container processes running on a slave node are initially provisioned, monitored, and tracked by that slave node’s Node Manager daemon.
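The container idea can be sketched with a small Python toy model. The names here (`Container`, `SlaveNode`, `launch`) are illustrative, not Hadoop classes: a container bundles the resources one piece of work needs, and a node provisions it only if it still has capacity.

```python
# Hypothetical sketch of the container concept; not real YARN classes.
from dataclasses import dataclass

@dataclass
class Container:
    vcores: int
    memory_mb: int

class SlaveNode:
    def __init__(self, vcores, memory_mb):
        self.free_vcores = vcores
        self.free_memory_mb = memory_mb
        self.running = []

    def launch(self, c: Container) -> bool:
        """Provision the container only if the node can cover its whole bundle."""
        if c.vcores <= self.free_vcores and c.memory_mb <= self.free_memory_mb:
            self.free_vcores -= c.vcores
            self.free_memory_mb -= c.memory_mb
            self.running.append(c)  # in real YARN this becomes an OS process
            return True
        return False

node = SlaveNode(vcores=4, memory_mb=4096)
print(node.launch(Container(2, 2048)))  # True
print(node.launch(Container(4, 1024)))  # False: only 2 vcores remain
```

The second launch fails even though memory is available, because a container is granted as an all-or-nothing bundle; a real Node Manager applies the same kind of bookkeeping before starting the container process.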
YARN’s Application Master
Every application running on the Hadoop cluster has its own dedicated Application Master instance, which itself runs in a container process on a slave node.
The Application Master sends regular heartbeat messages to the Resource Manager, reporting its status and the state of the application’s resource needs. Based on the results of its scheduling, the Resource Manager grants the Application Master container resource leases (essentially reservations for the resources containers require) on specific slave nodes.
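This request-and-lease exchange can be sketched as a toy model. To be clear, this is not the real YARN `AMRMClient` protocol: the Python class and method names below are invented, and the sketch reduces scheduling to a first-fit walk over nodes with free memory.

```python
# Simplified sketch of Application Master allocation; invented names,
# not the actual YARN API. Scheduling is reduced to first-fit by memory.
class ResourceManager:
    def __init__(self, node_free_mb):
        self.node_free_mb = dict(node_free_mb)

    def allocate(self, app_id, num_containers, memory_mb):
        """Grant up to num_containers leases, each tied to a specific node."""
        leases = []
        for node, free in self.node_free_mb.items():
            while len(leases) < num_containers and free >= memory_mb:
                free -= memory_mb
                leases.append({"app": app_id, "node": node,
                               "memory_mb": memory_mb})
            self.node_free_mb[node] = free
        return leases

rm = ResourceManager({"slave1": 2048, "slave2": 4096})
leases = rm.allocate("app-42", num_containers=3, memory_mb=1024)
print([l["node"] for l in leases])  # ['slave1', 'slave1', 'slave2']
```

The point the sketch preserves is that a lease names a specific slave node: the Application Master then presents each lease to that node’s Node Manager to have the container actually launched.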
Job History Server
The Job History Server is another example of a function that the JobTracker used to handle and that has been siphoned off into a self-contained daemon. Any client requests for a job history or for the status of current jobs are served by the Job History Server.