Data Replication in Hadoop: Replicating Data Blocks
(Part – 1)
In HDFS, the data block size
needs to be large enough to warrant the resources dedicated to an individual
unit of data processing. On the other hand, the block size can’t be so large
that the system waits a very long time for a single unit of data
processing to complete its work. Both of these recommendations obviously depend
on the kinds of work being done on the data blocks.
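To make the trade-off concrete, the following minimal sketch counts how many blocks a file is split into for a few candidate block sizes. The 1 GB file size and the 64, 128, and 256 MB block sizes are illustrative assumptions, not values taken from this article, and `count_blocks` is a hypothetical helper rather than part of any Hadoop API.

```python
def count_blocks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return how many fixed-size blocks are needed to hold the file.

    The last block may be only partially filled, so we round up.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


MB = 1024 * 1024
file_size = 1024 * MB  # hypothetical 1 GB file

for block_mb in (64, 128, 256):  # illustrative block sizes
    blocks = count_blocks(file_size, block_mb * MB)
    print(f"{block_mb} MB blocks -> {blocks} units of work")
```

Smaller blocks mean more units of work, each with its own scheduling overhead, while larger blocks mean fewer units that each take longer to finish.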

HDFS is designed to store data on inexpensive, and therefore less
reliable, hardware. Inexpensive has an attractive ring to it from an
infrastructure point of view, but it does raise concerns about the
reliability of the system as a whole, particularly when it comes to ensuring
the high availability of the data. Planning ahead for disaster, the minds
behind HDFS decided to set up the system so that it stores three
(count ’em, three) copies of every data block.

HDFS also assumes that each disk drive and each slave node is inherently unreliable,
so care must be taken in choosing where the three copies of the
data blocks are stored. The figure below shows how data blocks from a large
file are spread across the Hadoop cluster, meaning they are evenly distributed
across the slave nodes in such a way that a copy of each block will still be
available regardless of disk, node, or rack failures.

The file shown in the figure has five data blocks, labelled A, B, C, D, and E.
A closer look reveals two important things:
1. The cluster is made up of two racks with two nodes each.
2. The three copies (instances) of every data block have been spread out
across the different slave nodes.

Every component in the Hadoop cluster is seen as a potential point of failure, so
when HDFS stores and distributes the replicas of the original blocks of the
files across the Hadoop cluster, it tries to make sure that the block replicas are
stored at different failure points.
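The idea of spreading replicas over different failure points can be checked with a small simulation. The sketch below assumes a hypothetical mapping of four slave nodes onto two racks (the rack names and node identifiers are made up for illustration) and tests whether at least one replica survives the loss of any single node or any single rack. It is a toy model, not Hadoop code.

```python
# Hypothetical topology: two racks with two slave nodes each.
# Rack names and the node-to-rack assignment are assumptions for illustration.
NODE_TO_RACK = {
    "slave1": "rack1",
    "slave2": "rack1",
    "slave3": "rack2",
    "slave4": "rack2",
}


def surviving_replicas(replica_nodes, failed_nodes):
    """Return the replica locations that are not on a failed node."""
    return [node for node in replica_nodes if node not in failed_nodes]


def tolerates_any_single_failure(replica_nodes, node_to_rack):
    """True if a replica survives the loss of any one node or any one rack."""
    # Case 1: a single slave node fails.
    for node in node_to_rack:
        if not surviving_replicas(replica_nodes, {node}):
            return False
    # Case 2: an entire rack fails, taking all of its nodes with it.
    for rack in set(node_to_rack.values()):
        failed = {n for n, r in node_to_rack.items() if r == rack}
        if not surviving_replicas(replica_nodes, failed):
            return False
    return True


# Replicas on both racks: survives any single node or rack failure.
print(tolerates_any_single_failure(["slave3", "slave1", "slave2"], NODE_TO_RACK))  # True
# Replicas confined to one rack: lost if that rack fails.
print(tolerates_any_single_failure(["slave1", "slave2"], NODE_TO_RACK))  # False
```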

For instance, take a careful look at Block A. When it needed to
be stored, Slave Node 3 was chosen, and the very first copy of Block A was
stored there. For multiple-rack systems, HDFS then determines that the remaining
two copies of Block A need to be stored in a different rack. Hence, the
second copy of Block A is stored on Slave Node 1. The final
copy can be stored on the same rack as the second copy, but not on the same
slave node, so it gets stored on Slave Node 2.
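The placement walk-through for Block A can be sketched in a few lines. This is a simplified model of the three steps described above, not Hadoop's actual placement code: the topology dictionary, the node and rack names, and the `place_replicas` function are all hypothetical, and the random choice among eligible nodes is an assumption made for illustration.

```python
import random

# Hypothetical topology: two racks with two slave nodes each.
TOPOLOGY = {
    "rack1": ["slave1", "slave2"],
    "rack2": ["slave3", "slave4"],
}


def place_replicas(first_node, topology, rng=None):
    """Pick three replica locations following the steps in the text.

    1. The first copy goes on the node that was chosen initially.
    2. The second copy goes on a node in a different rack.
    3. The third copy goes on the same rack as the second copy,
       but on a different node.
    """
    rng = rng or random.Random()
    node_to_rack = {n: rack for rack, nodes in topology.items() for n in nodes}
    if first_node not in node_to_rack:
        raise ValueError(f"unknown node: {first_node}")

    first_rack = node_to_rack[first_node]
    # Only racks other than the first one, with room for two distinct nodes.
    candidates = [
        rack for rack, nodes in topology.items()
        if rack != first_rack and len(nodes) >= 2
    ]
    if not candidates:
        raise ValueError("no other rack has two nodes available")

    remote_rack = rng.choice(candidates)
    # sample() returns two distinct nodes from the remote rack.
    second, third = rng.sample(topology[remote_rack], 2)
    return [first_node, second, third]


placement = place_replicas("slave3", TOPOLOGY)
print(placement)  # slave3 first, then slave1 and slave2 in some order
```

With this four-node topology the only other rack is `rack1`, so starting from Slave Node 3 the remaining two copies always end up on Slave Nodes 1 and 2, matching the Block A example.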