The reason that folks such as chief financial officers get excited about Hadoop is that it lets us store massive amounts of data across a cluster of low-cost commodity servers; that's music to the ears of financially minded people. Well, HBase offers the same economic bang for the buck: it's a distributed data store that leverages a network-attached cluster of low-cost commodity servers to store and persist data.
HBase persists data by storing it in HDFS, but alternate storage arrangements are possible. For example, HBase can be deployed in standalone mode in the cloud (typically for educational purposes) or on expensive servers if the use case warrants it.
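As a rough sketch, the storage backend is selected through the `hbase.rootdir` property in `hbase-site.xml`. The namenode hostname and port below are placeholders, not values from any real cluster:

```xml
<!-- hbase-site.xml: point HBase at HDFS for a distributed deployment.
     The namenode host/port below are placeholders for your own cluster. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- For a standalone (educational) setup, a local path such as
       file:///tmp/hbase can be used instead, with
       hbase.cluster.distributed set to false. -->
</configuration>
```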
In most cases, though, HBase implementations look pretty much like the one shown here.
As with the data model, understanding the components of the architecture is critical for a successful HBase cluster deployment. So in this post and in a few upcoming posts, we will examine the key components one by one, starting with the RegionServers, the most basic component of the architecture.
RegionServers are the software processes (often called daemons) that we activate to store and retrieve data in HBase. In production environments, each RegionServer is deployed on its own dedicated compute node. When we start using HBase, we create a table and then begin storing and retrieving our data. At some point, though, and perhaps quite quickly in big data use cases, a table grows beyond a configurable limit. When that happens, the HBase system automatically splits the table and distributes the load to another RegionServer.
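The split behavior can be illustrated with a toy sketch. To be clear, this is not HBase's actual split implementation; the class, the row-count limit, and the middle-key split rule are all simplifications made up for illustration:

```python
# Toy illustration of auto-sharding: a "region" holds a sorted key range,
# and when it grows past a configurable limit it splits at its middle key.
# This is a simplified sketch, not HBase's real split logic.

MAX_ROWS_PER_REGION = 4  # stand-in for the configurable size limit

class Region:
    def __init__(self, rows=None):
        self.rows = dict(rows or {})  # row key -> value

    def put(self, key, value):
        self.rows[key] = value

    def should_split(self):
        return len(self.rows) > MAX_ROWS_PER_REGION

    def split(self):
        """Split into two daughter regions at the middle of the key range."""
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        low = Region({k: v for k, v in self.rows.items() if k < mid})
        high = Region({k: v for k, v in self.rows.items() if k >= mid})
        return low, high

# A table starts as one region; writes eventually trigger a split, and the
# daughter regions can then be served by different RegionServers.
region = Region()
for key in ["a", "c", "e", "g", "i"]:
    region.put(key, "value-" + key)

regions = [region]
if region.should_split():
    regions = list(region.split())

print(len(regions))             # 2
print(sorted(regions[0].rows))  # ['a', 'c']
print(sorted(regions[1].rows))  # ['e', 'g', 'i']
```

The point of the sketch is only the shape of the mechanism: growth crosses a threshold, the key range is cut in two, and the resulting pieces are independent units that can be redistributed.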
Through this process, often known as auto-sharding, HBase automatically scales as we keep adding data to the system, a huge advantage compared to the majority of database management systems, which require manual intervention to scale beyond a single server. With HBase, as long as we have another spare, configured server in the rack, scaling is automatic!
Why set a limit on tables and then split them? After all, HDFS is the underlying storage mechanism, so all available disks in the HDFS cluster are available for storing our tables. (Not counting the replication factor, of course.) If we have an entire cluster at our disposal, why limit ourselves to one RegionServer to manage our tables?
Simple: we may have any number of tables, large or small, and we want HBase to leverage all available RegionServers when managing our data so that we take full advantage of the cluster's compute power.
Furthermore, with many clients accessing our HBase system, we will want to use many RegionServers to meet the demand. HBase addresses all of these concerns for us and scales automatically in terms of storage capacity and compute power.
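A rough sketch of why many regions translate into spread-out load: each region covers a half-open row-key range, so finding the RegionServer responsible for a key is a lookup over the sorted region start keys. The region boundaries and server names below are made up for illustration:

```python
import bisect

# Toy sketch of routing a row key to the region (and thus the RegionServer)
# responsible for it. Each region covers a half-open key range, so a lookup
# is a binary search over the sorted start keys. The boundaries and server
# names here are invented, not from a real cluster.

region_start_keys = ["", "g", "n", "t"]       # four regions by start key
region_servers = ["rs1", "rs2", "rs3", "rs4"]  # hosting RegionServers

def server_for(row_key):
    # Find the rightmost region whose start key is <= row_key.
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[idx]

print(server_for("apple"))  # rs1
print(server_for("grape"))  # rs2
print(server_for("zebra"))  # rs4
```

Because different clients typically touch different key ranges, their requests naturally fan out across the RegionServers rather than piling onto one node.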