I think everybody remember his first surfing experience on the World Wide Web,; we just knew that it was an incredible innovation for the IT industry. Having this vast ocean of knowledge at our fingertips was transformation. Times change very quickly, though, and now the Internet is truly just another part of everyday life that many of the folks take for granted. We open our favourite browser and visit a search engine, and — in a matter of seconds — we are learning something new
If we just take a step back and ponder the immensity of the web and, more specifically, how exactly an entity such as Google stores all those references and web pages for our use? If the picture in our mind includes the concept of a database, we are right, but what kind of database? Every database administrator has thought about limits at one time or another. Storing gigabytes (GB), or even terabytes (TB) of data using our database of choice is common, but we may be faced with petabytes (PB) of data as Google was when it sought to index the web. The company’s strategy was to use BigTable — Google researchers even published an important paper outlining their vision of BigTable in 2006.
We may wonder what all this has to do with the history of HBase. Well, HBase is an implementation of Google’s BigTable distributed data storage system (DDSS, for short). After Google’s release of the BigTable paper, Powerset, a company focused on building a natural language processing (NLP) search engine for the Internet, became interested in creating its own implementation of BigTable. So when the University of Michigan’s Mike Cafarella made his first code drop of HBase to the Apache Open Source community in early 2007, Powerset engineers decided to carry the work forward. By 2008 HBase had become a sub-project of Hadoop and in 2010 HBase became an Apache top-level project. HBase, which has an affinity to Hadoop, is referred to as “the Hadoop database” on its Apache web page.
HBase is a Java implementation of Google’s BigTable. Google defines BigTable as a “sparse, distributed, persistent multidimensional sorted map.” HBase data stores consist of one or more tables, which are indexed by row keys. Data is stored in rows with columns, and rows can have multiple versions. Columns are grouped into column families, which must be defined up front during table creation. Column families are stored together on disk, which is why HBase is referred.