Creating a new file in HDFS involves the following steps (refer to the adjoining figure for the components involved):
1. The client sends a request to the NameNode to create a new file. The NameNode identifies how many blocks are required and grants the client a lease for creating these new file blocks in the cluster. As part of the lease, the client is given a limited amount of time to complete the creation task. (This time limit ensures that storage space isn’t tied up by failed client applications.)
2. The client then writes the first copy of each file block to the slave nodes under the lease granted by the NameNode. The NameNode manages write requests and decides where the file blocks and their copies (replicas) are written, balancing availability against performance. The first copy of a file block is written to a node in one rack, whereas the second and third copies are written to a different rack from the first, on two distinct slave nodes within that second rack. This arrangement minimizes cross-rack network traffic while ensuring that the replicas of a block never all share a single point of failure.
3. As each block is written to HDFS, a separate process writes the remaining replicas to the other slave nodes chosen by the NameNode.
4. After the DataNode daemons confirm that the file block replicas have been created, the client application closes the data file and notifies the NameNode, which then releases the lease.
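The rack-aware placement described in step 2 can be illustrated with a small simulation. This is only a sketch of the layout as stated above (one replica on one rack, two replicas on distinct nodes of a second rack), not HDFS's actual placement code. The function name `place_replicas` and the `topology` dictionary are hypothetical, and the random choice of racks and nodes is an assumption made for illustration.

```python
import random


def place_replicas(topology, rng=random):
    """Choose three (rack, node) locations for one block.

    `topology` maps a rack name to the list of slave nodes in that rack.
    Hypothetical helper: it mimics the layout described in the text and
    picks racks and nodes at random, which is an assumption of this sketch.
    """
    racks = sorted(rack for rack in topology if topology[rack])
    if not racks:
        raise ValueError("topology contains no slave nodes")

    # First replica: any node in any non-empty rack.
    first_rack = rng.choice(racks)
    first_node = rng.choice(topology[first_rack])

    # Second and third replicas: two distinct nodes in ONE other rack.
    candidates = [r for r in racks if r != first_rack and len(topology[r]) >= 2]
    if not candidates:
        raise ValueError("need a second rack with at least two slave nodes")
    second_rack = rng.choice(candidates)
    second_node, third_node = rng.sample(topology[second_rack], 2)

    return [
        (first_rack, first_node),
        (second_rack, second_node),
        (second_rack, third_node),
    ]


# Example cluster: two racks, names are made up for illustration.
topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4", "dn5"],
}
replicas = place_replicas(topology, rng=random.Random(42))
for rack, node in replicas:
    print(rack, node)
```

Losing either rack in this layout still leaves at least one replica available, while two of the three copies travel only within a single rack.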
Reading a file from HDFS involves the following steps (again, refer to the adjoining figure for the components involved):
1. The client sends a request to the NameNode for a file. The NameNode identifies which blocks are involved and chooses, on the basis of overall proximity of the blocks to one another and to the client, the most effective and efficient access path.
2. The client then accesses the blocks directly, using the addresses supplied by the NameNode.
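The proximity-based choice in step 1 of the read path can be sketched as follows. The distance values used here (0 for the same node, 2 for the same rack, 4 for a different rack) are an assumption chosen for illustration, and `distance` and `plan_read` are hypothetical names, not part of any HDFS API. The sketch also ignores the proximity of blocks to one another and considers only proximity to the client.

```python
def distance(client, replica):
    """Return an illustrative network distance between two (rack, node) pairs."""
    if client == replica:
        return 0  # same node
    if client[0] == replica[0]:
        return 2  # same rack, different node
    return 4  # different rack


def plan_read(block_locations, client):
    """For each block, in file order, pick the replica closest to the client.

    `block_locations` is a list with one entry per block; each entry is the
    list of (rack, node) pairs holding a replica of that block.
    """
    return [
        min(replicas, key=lambda location: distance(client, location))
        for replicas in block_locations
    ]


# A two-block file; the client runs on dn2 in rack1 (made-up names).
block_locations = [
    [("rack1", "dn1"), ("rack2", "dn3"), ("rack2", "dn4")],
    [("rack2", "dn5"), ("rack1", "dn2"), ("rack1", "dn1")],
]
client = ("rack1", "dn2")
plan = plan_read(block_locations, client)
print(plan)
```

In this example the first block is read from a node in the client's own rack and the second from the client's own node, so neither read crosses racks.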
Over time, uneven data-ingestion patterns (where a few slave nodes have more data written to them than others) and node failures are likely to leave data unevenly distributed across the racks and slave nodes of the cluster. This uneven distribution can hurt performance because the demand on individual slave nodes becomes unbalanced: nodes holding little data are underused, while nodes holding many extra blocks are overused. (Note: Overuse and underuse here refer to disk activity, not to CPU or RAM.) HDFS includes a balancer utility that redistributes blocks from overused slave nodes to underused ones while still honoring the policy of placing replicas on different slave nodes and racks. Hadoop administrators should check HDFS health regularly and invoke the balancer utility whenever they find the data unevenly distributed.
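The idea of overused and underused nodes can be made concrete with a small sketch that compares each node's disk utilization with the cluster-wide average. The function `classify_nodes` and the sample figures are hypothetical. The 10 percent default mirrors the threshold accepted by the `hdfs balancer -threshold` command, but the logic below is a simplification for illustration, not the balancer's actual algorithm.

```python
def classify_nodes(used, capacity, threshold_pct=10.0):
    """Split nodes into overused and underused relative to the cluster average.

    `used` and `capacity` map node name -> bytes (any consistent unit works).
    A node is overused if its utilization exceeds the cluster utilization by
    more than `threshold_pct` percentage points, and underused if it falls
    below the cluster utilization by more than that margin.
    """
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    overused, underused = [], []
    for node in sorted(used):
        node_pct = 100.0 * used[node] / capacity[node]
        if node_pct > cluster_pct + threshold_pct:
            overused.append(node)
        elif node_pct < cluster_pct - threshold_pct:
            underused.append(node)
    return cluster_pct, overused, underused


# Made-up figures: three nodes of equal capacity with very different usage.
capacity = {"dn1": 100, "dn2": 100, "dn3": 100}
used = {"dn1": 90, "dn2": 50, "dn3": 10}
cluster_pct, overused, underused = classify_nodes(used, capacity)
print(cluster_pct, overused, underused)
```

Blocks would then be moved from the nodes in `overused` to those in `underused` until every node falls within the threshold of the cluster average.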