Just to be clear, storing data in HDFS is not quite the same as saving files on your personal computer. In fact, there are quite a number of differences, most of which stem from optimizations that let HDFS scale out easily across thousands of slave nodes and perform well with batch workloads.
The most noticeable difference at first is file size. Hadoop is designed to work best with a modest number of extremely large files; average file sizes of 500MB or more are the norm.
Here’s an additional bit of background on how data is stored: HDFS has a Write Once, Read Often model of data access. That means the contents of an individual file cannot be modified, other than by appending new data to the end of the file.
Don’t worry, though: There is still a lot we can do with HDFS files, including the following:
· Create a new file
· Append content to the end of a file
· Delete a file
· Rename a file
· Modify file attributes like owner
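As a sketch, these operations map onto the `hdfs dfs` command-line interface roughly as follows. The paths and file names here are hypothetical, and the commands assume a running HDFS cluster:

```
# Create a new file in HDFS by uploading a local one
hdfs dfs -put sales.log /data/sales.log

# Append content to the end of an existing file
hdfs dfs -appendToFile more_sales.log /data/sales.log

# Rename (move) a file
hdfs dfs -mv /data/sales.log /data/sales_2014.log

# Modify file attributes, such as the owner
hdfs dfs -chown hdfs-user /data/sales_2014.log

# Delete a file
hdfs dfs -rm /data/sales_2014.log
```

Note that there is no command here for editing a file in place; consistent with the Write Once, Read Often model, appending is the only way to change a file's contents.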
When we store a file in HDFS, the system breaks it down into a set of individual blocks and stores those blocks on various slave nodes in the Hadoop cluster. This is an entirely normal thing to do, as all file systems break files down into blocks before storing them to disk. HDFS has no idea (and doesn't care) what is stored inside these files, so raw files are not split according to rules that we humans would understand. We humans, for instance, would want record boundaries (the lines that show where a record begins and ends) to be respected. HDFS is blissfully unaware that the final record in one block may be only a partial record, with the rest of its content shunted off to the following block. HDFS only wants to make sure that files are split into evenly sized blocks that match the predefined block size for the Hadoop instance. Not every file we need to store is an exact multiple of the system's block size, so the final data block for a file uses only as much space as is needed.
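To see why this matters, here's a small local-filesystem analogy (not HDFS itself, just the standard `split` utility) showing how fixed-size chunking cuts straight through record boundaries. The 10-byte "block size" is made up for illustration:

```shell
# Three 8-byte records, 24 bytes in total
printf 'record1\nrecord2\nrecord3\n' > records.txt

# Split into fixed 10-byte chunks, just as HDFS splits by bytes, not by records
split -b 10 records.txt chunk_

# The first chunk ends mid-record: it holds "record1" plus the "re" of "record2"
cat chunk_aa
```

The remaining bytes of the second record (`cord2`) land at the start of the next chunk, which is exactly the partial-record situation described above.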
This concept of storing a file as a set of blocks is entirely consistent with how normal file systems work. But what's different about HDFS is the scale. A typical block size you'd see in a file system under Linux is 4KB, whereas a typical block size in Hadoop is 128MB. This value is configurable, and it can be customized both as a new system default and as a custom value for individual files.
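As a quick back-of-the-envelope sketch, here's how a 500MB file would be carved up at the default 128MB block size (pure shell arithmetic; no cluster required):

```shell
file_mb=500    # file size in MB
block_mb=128   # default HDFS block size in MB

full_blocks=$(( file_mb / block_mb ))   # 3 full 128MB blocks
leftover=$(( file_mb % block_mb ))      # 116MB left for the final block

echo "full blocks: $full_blocks, final block: ${leftover}MB"
# The file occupies 4 blocks in total, but the last block stores only
# 116MB rather than a full 128MB: the final block uses only the space it needs.
```

With a 4KB block size, the same 500MB file would need 128,000 blocks, which is one reason Hadoop defaults to such large blocks for its large-file, batch-oriented workloads.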