HDFS Architecture in Hadoop
The core concept of HDFS is that it can be made up of dozens, hundreds, or even
thousands of individual computers, where the system’s files are stored in directly
attached disk drives. Each of these individual computers is a self-contained
server with its own memory, CPU, disk storage, and installed operating system
(typically Linux, though Windows is also supported). Technically speaking, HDFS
is a user-space-level file system because it lives on top of the file systems
that are installed on all individual computers that make up the Hadoop cluster.
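Because HDFS lives in user space, applications reach it through a client library rather than through a kernel-level mount. A minimal sketch in Java, assuming the cluster's configuration files (core-site.xml and so on) are on the classpath and that a directory such as /user/hadoop exists (the path here is purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsDirectory {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster address from configuration files on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // /user/hadoop is a hypothetical directory, used only for illustration
            for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }

Nothing in this program touches the underlying Linux file systems directly; every call goes through the HDFS client, which is exactly what makes HDFS a user-space-level file system.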
A typical Hadoop cluster is made up of two classes of servers: slave nodes, where the data is stored and processed, and master nodes, which govern the management of the Hadoop cluster.
On each of the master nodes and slave nodes, HDFS runs special services and stores raw data to capture the state of the file system. In the case of the slave nodes, the raw data consists of the blocks stored on the node, and with the master nodes, the raw data consists of metadata that maps data blocks to the files stored in HDFS.
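Where this raw data lives on each server is controlled by two standard HDFS configuration properties: dfs.datanode.data.dir (the local directories where a slave node keeps block files) and dfs.namenode.name.dir (the local directories where the master keeps its metadata). A hedged sketch of inspecting these settings through the Java configuration API; the values printed depend entirely on your installation's hdfs-site.xml:

    import org.apache.hadoop.hdfs.HdfsConfiguration;

    public class ShowStorageDirs {
        public static void main(String[] args) {
            // HdfsConfiguration loads hdfs-default.xml and hdfs-site.xml
            HdfsConfiguration conf = new HdfsConfiguration();
            // Local directories where a DataNode keeps its raw block files
            System.out.println("Blocks:   " + conf.get("dfs.datanode.data.dir"));
            // Local directories where the NameNode keeps its metadata (fsimage and edit log)
            System.out.println("Metadata: " + conf.get("dfs.namenode.name.dir"));
        }
    }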
In a Hadoop cluster, each data node (also known as a slave node) runs a background
process named DataNode. This background process (also known as a daemon) keeps
track of the slices of data that the system stores on its computer. It
regularly talks to the master server for HDFS (known as the NameNode) to report
on the health and status of the locally stored data.
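You don't normally talk to these daemons yourself, but the health information they heartbeat to the NameNode is visible from the client side. One sketch, assuming the default file system is HDFS; it uses the DistributedFileSystem API to print a status summary similar to what the hdfs dfsadmin -report command shows:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ReportDataNodes {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // The NameNode aggregates DataNode heartbeats;
                // this asks it for the current per-node status
                for (DatanodeInfo node : dfs.getDataNodeStats()) {
                    System.out.println(node.getHostName() + ": "
                            + node.getRemaining() + " of "
                            + node.getCapacity() + " bytes free");
                }
            }
            fs.close();
        }
    }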
Data blocks are stored as raw files in the local file system. From the perspective of a Hadoop user, we have no idea which of the slave nodes holds the pieces of the file we need to process. From within Hadoop, we don’t see data blocks or how they’re distributed across the cluster; all we see is a listing of files in HDFS. The complexity of how the file blocks are distributed across the cluster is hidden from us, and we don’t need to know. Actually, the slave nodes themselves don’t even know what’s
inside the data blocks they’re storing. It’s the NameNode server that knows the
mappings of which data blocks compose the files stored in HDFS.
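That mapping isn't secret, though: a client can ask the NameNode where a given file's blocks live. A minimal sketch, assuming HDFS is the default file system; the path /user/hadoop/sample.txt is a hypothetical example:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file path, for illustration only
            FileStatus file = fs.getFileStatus(new Path("/user/hadoop/sample.txt"));

            // The NameNode answers this query from its block-to-file metadata
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            for (int i = 0; i < blocks.length; i++) {
                System.out.println("Block " + i + " is stored on "
                        + Arrays.toString(blocks[i].getHosts()));
            }
            fs.close();
        }
    }

This is the same lookup that frameworks like MapReduce perform so they can schedule computation on the nodes that already hold the data.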
A core design principle of HDFS is to minimize the cost of the
individual slave nodes by using commodity hardware components. For massively
scalable systems, this idea is a sensible one because costs escalate quickly
when we need hundreds or thousands of slave nodes.