An HBase data store comprises one or more tables, which are indexed by row keys. Data is stored in rows with columns, and a row can hold multiple versions of each value. Columns are grouped into column families, which must be defined up front when the table is created. Each column family is stored together on disk, which is why HBase is referred to as a column-oriented data store.
Row Key | Column Family: {Column Qualifier: Version: Value} |
001 | CustomerName: {'FN': 1383859182496: 'Sheldon', 'LN': 1383859182858: 'Cooper', 'MN': 1383859183001: 'Wills', 'MN': 1383859182915: 'W'} ContactInfo: {'EA': 1383859183030: 'sh.cooper@mindstick.com', 'SA': 1383859183073: '45 LT NY'} |
002 | CustomerName: {'FN': 1383859183103: 'Hank', 'LN': 1383859183163: 'Moody'} ContactInfo: {'SA': 1383859185577: '16 TL CA'} |
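One way to picture the layout in the sample rows above is as a set of nested sorted maps: row key, then column family, then column qualifier, then version, then value. The following is a minimal sketch in plain Java and does not use the HBase client API. The class name HBaseModelSketch and its methods are hypothetical, and using String keys is a simplifying assumption made for readability.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the HBase client API.
class HBaseModelSketch {
    // row key -> column family -> column qualifier -> version -> value.
    // String keys are a simplification chosen for this sketch.
    final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    // Store one versioned value under (row key, family, qualifier).
    public void put(String rowKey, String family, String qualifier, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            // Versions are kept in descending order so the newest comes first.
            .computeIfAbsent(qualifier, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(version, value);
    }

    // Return the value with the highest version, or null if the cell is absent.
    public String getLatest(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families = rows.get(rowKey);
        if (families == null) return null;
        NavigableMap<String, NavigableMap<Long, String>> qualifiers = families.get(family);
        if (qualifiers == null) return null;
        NavigableMap<Long, String> versions = qualifiers.get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        HBaseModelSketch table = new HBaseModelSketch();
        table.put("002", "CustomerName", "FN", 1383859183103L, "Hank");
        table.put("001", "CustomerName", "MN", 1383859182915L, "W");
        table.put("001", "CustomerName", "MN", 1383859183001L, "Wills");
        System.out.println(table.getLatest("001", "CustomerName", "MN")); // prints "Wills"
        System.out.println(table.rows.firstKey());                        // prints "001"
    }
}
```

In this sketch the middle name 'MN' of row 001 has two versions, and a read returns the one with the larger version number, while rows stay sorted by key regardless of insertion order.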
Suppose we are given two row keys and compare them byte by byte. If the first byte of key 1 is less than the first byte of key 2, then row key 1 will always be stored before row key 2, no matter what follows in the sequence of bytes. However, it is common and beneficial to use printable (ASCII) characters rather than raw numeric values for row keys in HBase, and if we do, we need to know that the Java language represents characters using the Unicode Standard.
We may wonder why we need to bother with this fine detail about row keys. The reason we are stressing this point is that proper row key design is crucial for achieving high performance in HBase; without it, we won’t be able to realize the full value of our HBase cluster. Just keep in mind that sorted row keys can help us access our data faster.
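The byte-by-byte ordering of row keys described above can be sketched in plain Java. The comparator below is a stand-in written for illustration, assuming keys are compared as unsigned bytes; it is not HBase's own code. The names RowKeyOrderSketch, compareKeys, key and zeroPad are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration class; not part of the HBase client API.
class RowKeyOrderSketch {
    // Compare two keys byte by byte, treating each byte as unsigned.
    // If one key is a prefix of the other, the shorter key sorts first.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    // Encode a printable key as bytes.
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Left-pad a numeric id with zeros to a fixed width, e.g. 9 -> "009".
    static String zeroPad(long id, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", id);
    }

    public static void main(String[] args) {
        String[] plain = {"10", "9", "100"};
        Arrays.sort(plain, (p, q) -> compareKeys(key(p), key(q)));
        System.out.println(Arrays.toString(plain));  // prints [10, 100, 9]

        String[] padded = {zeroPad(10, 3), zeroPad(9, 3), zeroPad(100, 3)};
        Arrays.sort(padded, (p, q) -> compareKeys(key(p), key(q)));
        System.out.println(Arrays.toString(padded)); // prints [009, 010, 100]
    }
}
```

The first output shows that printable numeric keys do not sort numerically: "9" lands after "100" because its first byte is larger. Fixed-width keys such as the 001 and 002 used in the sample table keep the byte order and the numeric order in agreement.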
Column Families
The table above shows two column families: CustomerName and ContactInfo. When creating a table in HBase, the developer or DBA needs to define one or more column families using printable characters. Usually, column families remain fixed throughout the life cycle of an HBase table, but new column families can be added and existing ones modified by using administrative commands. At the time of writing, the official recommendation for the number of column families per table is three or fewer. In addition, we should store data with similar access patterns in the same column family; we wouldn’t want a customer’s middle name stored in a separate column family from the first or last name because we generally access all name data at the same time.
Column families are always grouped together on disk, so grouping data with similar access patterns reduces overall disk access and improves performance.
Simond Gear
13-Apr-2017