Distributed File System (HDFS)
is a file system unlike most of us may have encountered before. It is not a
POSIX compliant file system, which basically means it does not provide the same
guarantees as a regular file system. It is also a distributed file system,
meaning that it spreads storage across multiple nodes; lack of such an
efficient distributed file system was a limiting factor in some historical
HDFS have several key features and properties.
HDFS stores files in blocks typically at least 64 MB in size, much larger than
the 4-32 KB seen in most file systems. HDFS is optimized for throughput over
latency; it is very efficient at streaming read requests for large files but
poor at seek requests for many small ones. HDFS is optimized for workloads that
are generally of the write-once and read-many type. Each storage node runs a
process called a Data Node that manages the blocks on that host, and these are
coordinated by a master Name Node process running on a separate host.
of handling disk failures by having physical redundancies in disk arrays or
similar strategies, HDFS uses replication. Each of the blocks
comprising a file is stored on multiple nodes within the cluster, and the HDFS
Name Node constantly monitors reports sent by each Data Node to ensure that
failures have not dropped any block below the desired replication factor. If
this does happen, it schedules the addition of another copy within the cluster.
can be used without MapReduce, as it is intrinsically a large-scale data
storage platform. Though MapReduce can read data from non-HDFS sources, the
nature of its processing aligns so well with HDFS that using the two together
is by far the most common use case.
a MapReduce job is executed, Hadoop needs to decide where to execute the code
most efficiently to process the data set. If the MapReduce-cluster hosts all
pull their data from a single storage host or an array, it largely doesn't
matter as the storage system is a shared resource that will cause contention.
But if the storage system is HDFS, it allows MapReduce to execute data
processing on the node holding the data of interest, building on the principle
of it being less expensive to move data processing than the data itself.
MapReduce model would (in a somewhat simplified and idealized way) perform the
processing in a map function on each piece of data on a host where the data
resides in HDFS and then reuse the cluster in the reduce function to collect
the individual results into the final result set.