The core concept of HDFS is that it can be made up of dozens, hundreds, or even thousands of individual computers, where the system’s files are stored in directly attached disk drives. Each of these individual computers is a self-contained server with its own memory, CPU, disk storage, and installed operating system (typically Linux, though Windows is also supported). Technically speaking, HDFS is a user-space-level file system because it lives on top of the file systems that are installed on all individual computers that make up the Hadoop cluster.
A typical Hadoop cluster is made up of two classes of servers: slave nodes
and master nodes.
1. Slave nodes: It is where the data is stored and processed.
2. Master nodes: It govern the management of the Hadoop cluster.
On each of the master nodes and slave nodes, HDFS runs special services and stores raw data to capture the state of the file system. In the case of the slave nodes, the raw data consist of the blocks stored on the node, and with the master nodes, the raw data consists of metadata that maps data blocks to the files stored in HDFS.
In a Hadoop cluster, each data node (also known as a slave node) runs a background process named DataNode. This background process (also known as a daemon) keeps track of the slices of data that the system stores on its computer. It regularly talks to the master server for HDFS (known as the NameNode) to report on the health and status of the locally stored data.
Data blocks are stored as raw files in the local file system. From the perspective of a Hadoop user, we have no idea which of the slave nodes has the pieces of the file we need to process. From within Hadoop, we don’t see data blocks or how they’re distributed across the cluster — all we see is a listing of files in HDFS. The complexity of how the file blocks are distributed across the cluster is hidden from us — we don’t know how complicated it all is, and we don’t need to know. Actually, the slave nodes themselves don’t even know what’s inside the data blocks they’re storing. It’s the NameNode server that knows the mappings of which data blocks compose the files stored in HDFS.
One core design principle of HDFS is the concept of minimizing the cost of the individual slave nodes by using commodity hardware components. For massively scalable systems, this idea is a sensible one because costs escalate quickly when we need hundreds or thousands of slave nodes.