Big Data: What is Sparse Data in HBase?

As we might have guessed, Google's BigTable distributed data storage system (DDSS), on which HBase is modeled, was designed to meet the demands of big data. Big data applications store massive amounts of data, but the content of that data is also often variable.

Imagine a traditional table in MindStick's database storing client contact information:

Customer ID | First Name | Middle Name | Last Name

45 DL NY
16 TL CA


A company or individual may require a complete data record for each of its customers or constituents. A good example is a doctor, who needs all of our contact information in order to provide proper care. Other organizations may need only partial contact information, or may learn that information over time. For example, a software service provider such as MindStick may process phone calls or e-mail messages for service requests and requirements. Clients may not want to give a service company all of their contact information. With every interaction over time, however, an organization may learn more about its clients, which enables it to provide better service, for instance by issuing proactive service alerts.

In this context, sparse means that fields in a row can be empty or NULL without bringing HBase to a screeching halt. HBase can handle the fact that we don't (yet) know Hank Moody's middle name and e-mail address, for example.
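To make the idea of a sparse row concrete, here is a minimal sketch in plain Python. It is a conceptual model only, not the HBase client API: the row keys, the column names, and the `get_cell` helper are all assumptions made for illustration, and the second customer is invented.

```python
# Conceptual model only: plain Python dictionaries, not the HBase client API.
# Each row stores just the cells that were actually written.
customers = {
    # Hypothetical row key; middle name and e-mail are not known yet.
    "00001": {"first_name": "Hank", "last_name": "Moody"},
    # Invented second customer with more fields filled in.
    "00002": {
        "first_name": "Jane",
        "middle_name": "Q",
        "last_name": "Doe",
        "email": "jane@example.com",
    },
}


def get_cell(table, row_key, column):
    """Return the stored value, or None when the cell was never written."""
    return table.get(row_key, {}).get(column)


middle = get_cell(customers, "00001", "middle_name")  # None: nothing stored
first = get_cell(customers, "00001", "first_name")    # "Hank"
```

The missing middle name is not stored as a placeholder; the row simply has no entry for it, and a lookup reports that nothing is there.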

Let's move on to our next example, a database for storing satellite images. It turns out that Google uses BigTable to store satellite imagery of the earth. In almost every case, whenever imagery is stored, metadata is also created and stored with it. This metadata may include the street address of the image, or only the latitude and longitude if the picture was captured in the wilderness. Because the metadata is variable in content, some fields will be NULL, and that's fine.

In both of the examples above, the data sets collected can be massive, especially in the satellite data example. Imagery databases are almost always measured in terabytes (TB) and sometimes in petabytes (PB). We already know that HBase is designed for storing big data, but it is also designed to store sparse data records at no extra cost: a cell that has no value is simply not written, so it takes up no space. This is an important factor for big data applications! Reserving space for a few NULL fields in each of a million rows is wasteful, but think about the waste over a quadrillion rows! Thankfully, this was a key consideration for the Google designers and the HBase community. Sparse data is supported with no waste of costly storage space.
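The arithmetic behind that waste can be sketched with a tiny example. This is plain Python, not HBase's storage format; the six column names and the three rows are hypothetical. It compares how many slots a fixed-schema table reserves with how many cells a sparse store actually has to write.

```python
# Conceptual comparison only: plain Python, not HBase's on-disk format.
# Hypothetical column list for a fixed-schema contact table.
COLUMNS = [
    "customer_id", "first_name", "middle_name",
    "last_name", "email", "street_address",
]

# Hypothetical rows; each dict holds only the fields that are known.
known_fields = [
    {"customer_id": "00001", "first_name": "Hank", "last_name": "Moody"},
    {"customer_id": "00002", "first_name": "Jane", "last_name": "Doe",
     "email": "jane@example.com"},
    {"customer_id": "00003", "last_name": "Smith"},
]


def fixed_schema_cells(rows, columns):
    """A fixed schema reserves one slot per column in every row."""
    return len(rows) * len(columns)


def sparse_cells(rows):
    """A sparse store writes only the cells that have a value."""
    return sum(len(row) for row in rows)


fixed = fixed_schema_cells(known_fields, COLUMNS)  # 3 rows * 6 columns
sparse = sparse_cells(known_fields)                # 3 + 4 + 2 cells
empty_slots = fixed - sparse                       # slots holding only NULL
```

Here half of the fixed-schema slots hold nothing. The count is only illustrative: each cell HBase does store also carries its row key, column name, and timestamp, so real byte savings depend on how sparse the data is.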

And it doesn't stop there. Consider the power of a schema-less data store. The table above is a classic customer contact table. When organizations design such tables, they know up front what they want to store. In other words, the schema is fixed; it is defined well before the first byte of information is stored in the table. Now what if, over time, a new field is required for a customer? How about a Twitter handle or a new mobile phone number? We are seemingly stuck with a schema that no longer works for us. HBase solves this issue as well: we can not only skip fields at no cost when we don't have the data, but also dynamically add fields (columns, in the HBase vernacular) over time without having to redesign the schema or disrupt operations. Only the column families that group those columns are declared when the table is created. So we can think of HBase as a schema-less data store; that is, it's fluid, and we can add to, subtract from, or modify the set of columns as we go along.
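Adding a column on the fly can be sketched the same way. This is again plain Python rather than the HBase client API, and the `contact` family, the `put` helper, and the sample values are assumptions for illustration. The sketch mirrors one real HBase rule: column families are declared when a table is created, while the columns inside a family can appear at any time.

```python
from collections import defaultdict

# Conceptual model only: plain Python, not the HBase client API.
# Hypothetical column family, declared up front as in HBase.
COLUMN_FAMILIES = {"contact"}

# row key -> {(family, column): value}; rows start out empty.
table = defaultdict(dict)


def put(row_key, family, column, value):
    """Write one cell; the column name does not need to exist beforehand."""
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table[row_key][(family, column)] = value


def columns_in_use():
    """List every column name that at least one row has written."""
    return sorted({column for cells in table.values() for (_, column) in cells})


put("00001", "contact", "first_name", "Hank")
put("00001", "contact", "last_name", "Moody")
put("00002", "contact", "first_name", "Jane")

# Later, a Twitter handle turns up for one customer (hypothetical value).
# No schema change and no rewrite of the other rows is needed.
put("00001", "contact", "twitter", "@hank")
```

After the last call, only row `00001` has a `twitter` cell; row `00002` is untouched and stores nothing for that column.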
