HDFS is a file system unlike most of us may have encountered before. It is not a POSIX compliant file system, which basically means it does not provide the same guarantees as a regular file system. It is also a distributed file system, meaning that it spreads storage across multiple nodes; lack of such an efficient distributed file system was a limiting factor in some historical technologies.

 HDFS have several key features and properties. HDFS stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most file systems. HDFS is optimized for throughput over latency; it is very efficient at streaming read requests for large files but poor at seek requests for many small ones. HDFS is optimized for workloads that are generally of the write-once and read-many type. Each storage node runs a process called a Data Node that manages the blocks on that host, and these are coordinated by a master Name Node process running on a separate host.

Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and the HDFS Name Node constantly monitors reports sent by each Data Node to ensure that failures have not dropped any block below the desired replication factor. If this does happen, it schedules the addition of another copy within the cluster.

HDFS can be used without MapReduce, as it is intrinsically a large-scale data storage platform. Though MapReduce can read data from non-HDFS sources, the nature of its processing aligns so well with HDFS that using the two together is by far the most common use case.

When a MapReduce job is executed, Hadoop needs to decide where to execute the code most efficiently to process the data set. If the MapReduce-cluster hosts all pull their data from a single storage host or an array, it largely doesn't matter as the storage system is a shared resource that will cause contention. But if the storage system is HDFS, it allows MapReduce to execute data processing on the node holding the data of interest, building on the principle of it being less expensive to move data processing than the data itself.

The MapReduce model would (in a somewhat simplified and idealized way) perform the processing in a map function on each piece of data on a host where the data resides in HDFS and then reuse the cluster in the reduce function to collect the individual results into the final result set.

  Modified On Dec-16-2017 12:34:39 AM

Leave Comment