Introduction and RegionServers (Part 1)
One reason that folks such as chief financial officers are excited by the thought
of using Hadoop is that it lets us store massive amounts of data across a
cluster of low-cost commodity servers, and that’s music to the ears of financially
minded people. Well, HBase offers the same economic bang for the buck: it’s a
distributed data store that leverages a networked cluster of low-cost
commodity servers to store and persist data.
HBase typically persists data by storing it in HDFS, but alternate storage
arrangements are possible. For example, HBase can be deployed in standalone mode,
in the cloud (typically for educational purposes), or on expensive servers if the
use case demands it. In most cases, though, HBase implementations look pretty much
like the commodity-cluster setup just described.
As with the data model, understanding the components of the architecture is critical
for a successful HBase cluster deployment, so in this post and in a few upcoming
posts we will examine the key components one by one. Let’s start with the
RegionServer, the most basic component of the architecture.
RegionServers are the software processes
(usually called daemons) that we activate to store and retrieve data in
HBase. In production environments, each RegionServer is deployed on its own dedicated
compute node. When we start using HBase, we create a table and then begin
storing and retrieving our data. At some point, and perhaps
quite quickly in big data use cases, the table grows beyond a configurable limit.
When that happens, the HBase system automatically splits the table and distributes the
data load to another RegionServer.
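To make that concrete, here is a minimal sketch of storing and retrieving a single cell with the HBase Java client API. The table name mytable, column family cf, and qualifier greeting are hypothetical placeholders, and the sketch assumes an hbase-site.xml on the classpath that points at your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHelloWorld {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("mytable"))) {

            // Store one cell: row key "row1", family "cf", qualifier "greeting"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"),
                          Bytes.toBytes("hello hbase"));
            table.put(put);

            // Retrieve the same cell
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"),
                                           Bytes.toBytes("greeting"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```

Note that nothing in this code changes when a table splits: the client library looks up which RegionServer currently hosts the region for each row key and routes the request there automatically.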
Through this process, often known as auto-sharding, HBase automatically
scales as we keep adding data to the
system. That’s a huge advantage compared to the majority of database management
systems, which need manual intervention to scale the overall system beyond a
single server. With HBase, as long as we have another configured spare server
in the rack, scaling is automatic!
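The limit itself is tunable. Cluster-wide it is governed by the hbase.hregion.max.filesize property, and it can also be set per table. Below is a hedged sketch using the HBase 2.x admin API; the table name and column family are again hypothetical, and the exact moment a split fires also depends on the configured split policy.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableWithSplitLimit {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("mytable"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                // Per-table upper bound on region size (here 10 GB); when a
                // region's store files grow past the threshold the split
                // policy divides the region in two
                .setMaxFileSize(10L * 1024 * 1024 * 1024)
                .build();
            admin.createTable(desc);
        }
    }
}
```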
You may wonder why HBase sets a limit on tables and then splits them. After all, HDFS
is the underlying storage
mechanism, so all available disks in the HDFS cluster are available for storing
our tables. (Not counting the replication factor, of course.) If we have an
entire cluster at our disposal, why limit ourselves to one RegionServer to manage
our tables? We may have any number of tables, large or small, and we will want HBase to
leverage all available RegionServers when managing our data. We want to take
full advantage of the cluster’s compute performance.
Furthermore, with many clients accessing our
HBase system, we will want to use many RegionServers to meet the demand. HBase
addresses all of these concerns for us and scales automatically in terms of
storage capacity and compute power.
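In fact, if we already know a table will be large, we don’t have to wait for automatic splits to kick in: we can pre-split the table at creation time so its regions land on multiple RegionServers right away. Here is a minimal sketch, assuming simple string row keys; the table name, family, and boundary keys are purely illustrative.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("mytable"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                .build();

            // Four boundary keys yield five regions, which the cluster can
            // assign to different RegionServers from the start
            byte[][] splitKeys = {
                Bytes.toBytes("b"), Bytes.toBytes("d"),
                Bytes.toBytes("f"), Bytes.toBytes("h")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```

How well this works depends on row key design: the keys we actually write must spread across the chosen boundaries, or all the traffic still lands in a single region and a single RegionServer.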