Selecting a database which fits in our application requirement is a very daunting task since “no one size fits all”. (Even, my development team took more than 10 days to examine and select the ideal DB used for one of its ambitious projects.) If we want to examine and compare numerous kinds of databases either RDBMS’s or NSQL’s, no discussion would be complete without mentioning the CAP theorem. In computer science, it is also called as Brewer’s theorem after the name of scientist Eric Brewer.
This theorem represents the three kinds of guarantees that architects aim to
provide in their systems:
· Consistency: Same as the C in ACID properties of RDBMS’s, all nodes in the system would have the same view of the data at any time i.e. each read operation would bring us the most recent write.
· Availability: The system always responds to requests i.e. each node (if node failure didn’t occur) always execute queries.
· Partition tolerance: The system remains online if network problems occur between system nodes i.e. even if the communication within the nodes are down, the other two promise are kept(Consistency and Availability)
The CAP theorem states that in distributed networked systems, architects have to choose two of these three guarantees — we can’t promise our users all three. Generally that leaves us with the three possibilities shown in the adjoining equilateral triangle figure:
· Systems using classical relational technologies (RDBMS’s) generally are not partition tolerant, so they can guarantee consistency and availability (CA). That is, if one partition of these traditional relational technologies systems is offline, the whole system is offline.
· Systems where partition tolerance and availability (AP) are of primary importance can’t guarantee consistency, since updates (that destroyer of consistency) could be made on either side of the partition. The key-value data stores Dynamo and CouchDB and the column-family store Cassandra are good examples of partition tolerant/availability (PA) systems.
· Systems where partition tolerance and consistency (CP) are of primary importance can’t guarantee availability since the systems return errors until the partitioned state is resolved/ rectified.
Hadoop-based databases are referred as CP systems (consistent with partition tolerant). With data volumes stored redundantly across several slave nodes, outages to high portions (partitions) of a Hadoop cluster can be tolerated. Hadoop is refereed to be consistent since it has a centralized meta-data store (called as the NameNode) which manages a single, consistent view of data stored in the cluster. We can’t say that Hadoop always guarantees availability, because if the NameNode itself fails applications cannot access data in the cluster.