Zookeeper enables coordination and synchronization with what it calls “znodes,” which are presented as a directory tree and resemble the path names we would see in a Unix file system. Znodes can store data, but not much of it: less than 1 MB by default. The idea is that Zookeeper keeps znodes in memory, and these memory-based znodes give clients quick access to the coordination, status, and other functions needed by distributed applications like HBase. Zookeeper replicates (makes copies of) znodes across the ensemble, so if servers fail, the znode data is still available as long as a majority quorum of servers is still up and running.
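To make the majority-quorum arithmetic concrete, here is a small sketch (plain Python, not part of Zookeeper itself) of how many server failures an ensemble can survive while still keeping znode data available:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of an ensemble of this size."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a majority quorum still remains."""
    return ensemble_size - quorum_size(ensemble_size)

# A typical 5-server ensemble keeps serving znode data as long as
# any 3 servers (a majority) are still up, i.e. 2 can fail.
for n in (3, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

This is also why ensembles are usually sized with an odd number of servers: going from 5 to 6 servers raises the quorum from 3 to 4 without tolerating any additional failures.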
The next crucial Zookeeper concept concerns how znode reads (versus writes) are managed. Every Zookeeper server, including the leader, can serve reads from a client, but only the leader issues atomic znode writes: writes that either completely succeed or completely fail. When a znode write request arrives at the leader, the leader broadcasts the request to its follower nodes and then waits for a majority of followers to acknowledge that the write has completed. Only after these acknowledgements does the leader apply the write itself and report the successful completion status to the client.
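The commit rule just described can be sketched as a toy function (this is a simplification of Zookeeper's actual Zab protocol, not its implementation; the assumption that the leader counts its own acknowledgement is part of the model):

```python
def leader_commits(follower_acks: list, ensemble_size: int) -> bool:
    """Toy model of the leader's commit rule: a znode write succeeds
    only if a majority of the ensemble acknowledges it.
    The leader is assumed to acknowledge its own proposal."""
    votes = 1 + sum(1 for ack in follower_acks if ack)
    return votes > ensemble_size // 2

# 5-server ensemble: leader + 2 follower acks = 3 votes, a majority,
# so the write commits even though 2 followers never answered.
print(leader_commits([True, True, False, False], 5))   # True
print(leader_commits([True, False, False, False], 5))  # False
```

Note the all-or-nothing outcome: the client is told the write succeeded only after a majority has it, which is exactly what makes the write atomic from the client's point of view.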
Znodes enable some very powerful and flexible guarantees. When a Zookeeper client (like an HBase RegionServer) writes or reads a znode, the operation is atomic: it either completely succeeds or completely fails, and there are no partial reads or writes. No competing client can cause the read or write operation to fail. Furthermore, each znode has an access control list (ACL) associated with it for security, carries a version number and timestamps, and can notify clients whenever it is modified.
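The version number mentioned above is what lets competing clients coordinate safely: a conditional write against a stale version is rejected outright. Here is a toy znode class (a sketch of the idea behind Zookeeper's versioned setData, not the real client API):

```python
class Znode:
    """Toy znode: data plus a version that bumps on every write."""
    def __init__(self, data: bytes):
        self.data = data
        self.version = 0

    def set_data(self, data: bytes, expected_version: int) -> None:
        # Modeled on ZooKeeper's versioned setData: if the caller's
        # version is stale, the write is rejected whole, so two
        # competing clients can never interleave partial updates.
        if expected_version != self.version:
            raise RuntimeError("BadVersion: znode changed since last read")
        self.data = data
        self.version += 1

z = Znode(b"state-A")
z.set_data(b"state-B", expected_version=0)      # succeeds, version -> 1
try:
    z.set_data(b"state-C", expected_version=0)  # stale version, rejected
except RuntimeError as err:
    print(err)
```

The losing client simply re-reads the znode (getting the new version) and retries, which is the usual optimistic-concurrency pattern built on znode versions.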
Zookeeper replicates znodes across the ensemble, so if a server fails, the znode data is still available as long as a majority quorum of servers is still up and running. This implies that a write to any znode, accepted by any Zookeeper server, must be propagated across the ensemble. The Zookeeper leader handles this operation.
This znode write mechanism can cause followers to fall behind the leader for short periods. Zookeeper addresses this potential problem by providing a sync command. Clients that cannot tolerate this temporary lack of synchronization within the Zookeeper cluster can issue a sync command before reading any znode.
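A toy model of why sync-before-read matters (a simulation of a lagging follower, not Zookeeper's actual replication code):

```python
class Follower:
    """Toy follower that may lag behind the leader's committed writes."""
    def __init__(self):
        self.applied = {}   # znode writes this follower has applied
        self.pending = []   # writes committed by the leader, not yet applied here

    def read(self, path: str):
        # A plain read returns whatever this follower has locally,
        # which may be stale.
        return self.applied.get(path)

    def sync(self):
        # Modeled on Zookeeper's sync: catch up with the leader
        # before the next read.
        for path, data in self.pending:
            self.applied[path] = data
        self.pending.clear()

f = Follower()
f.pending.append(("/hbase/master", b"master-1"))  # committed, not yet applied
print(f.read("/hbase/master"))   # None: the follower is lagging
f.sync()
print(f.read("/hbase/master"))   # b'master-1': now up to date
```

Most clients skip the sync and accept slightly stale reads in exchange for speed; issuing sync first trades a little latency for a read that reflects every write committed so far.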
In the znode world, we will come across what look like Unix-style pathnames. (Typically they begin with /hbase.) These pathnames, a subset of the znodes that HBase creates in the Zookeeper system, are described in this list:
· master: Holds the name of the primary MasterServer.
· hbaseid: Holds the cluster’s ID.
· root-region-server: Points to the RegionServer holding the -ROOT- table.
· Something called /hbase/rs.
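As a quick illustration of how these pathnames hang together as a tree, here is a hypothetical snapshot of the znodes above (the data strings are placeholders, and the `children` helper is our own, mimicking a directory listing):

```python
# Hypothetical snapshot of the HBase znodes described above.
znodes = {
    "/hbase/master": b"name of the primary MasterServer",
    "/hbase/hbaseid": b"cluster ID",
    "/hbase/root-region-server": b"RegionServer holding -ROOT-",
    "/hbase/rs": b"",  # parent for per-RegionServer child znodes
}

def children(tree: dict, parent: str) -> list:
    """List the direct children of a znode path."""
    prefix = parent.rstrip("/") + "/"
    return sorted({path[len(prefix):].split("/")[0]
                   for path in tree if path.startswith(prefix)})

print(children(znodes, "/hbase"))
# ['hbaseid', 'master', 'root-region-server', 'rs']
```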
So now we may wonder what’s up with this rather vaguely defined /hbase/rs. In the previous post (HBase Architecture: Master Server Part 4), we described the various operations of the MasterServer and mentioned that Zookeeper notifies the MasterServer whenever a RegionServer fails. Now we can take a closer look at how that process actually works in HBase, and we would be right to assume that it has something to do with /hbase/rs.

Zookeeper uses its watches mechanism to notify clients whenever a znode is created, accessed, or changed in some way. The MasterServer, like the RegionServers, is a Zookeeper client and can leverage these znode watches. When a RegionServer comes online in the HBase system, it connects to the Zookeeper ensemble and creates its own unique ephemeral znode under the pathname /hbase/rs. At the same time, the Zookeeper system establishes a session with the RegionServer and monitors that session for events. If the RegionServer breaks the session for whatever reason (by failing to send a heartbeat ping, for example), the ephemeral znode it created is deleted. Deleting the RegionServer’s child node under /hbase/rs causes the MasterServer to be notified so that it can initiate RegionServer failover. The notification is delivered by way of a watch that the MasterServer sets up on the /hbase/rs znode.
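The whole failure-detection dance — ephemeral znode, session expiry, watch notification — can be sketched as a small simulation (a toy model of the mechanism; real Zookeeper watches are one-shot and delivered over a session, and the class and method names here are our own):

```python
class MiniZk:
    """Toy model of ephemeral znodes plus a watch on their parent."""
    def __init__(self):
        self.znodes = set()
        self.watchers = []   # callbacks fired when a child znode is deleted

    def create_ephemeral(self, path: str):
        # A RegionServer registers itself, e.g. under /hbase/rs.
        self.znodes.add(path)

    def watch_children(self, callback):
        # The MasterServer asks to be told about child deletions.
        self.watchers.append(callback)

    def session_expired(self, path: str):
        # The session breaks (missed heartbeats), so the ephemeral
        # znode is deleted and every watcher is notified.
        self.znodes.discard(path)
        for notify in self.watchers:
            notify(path)

failed = []
zk = MiniZk()
zk.watch_children(failed.append)                 # the MasterServer's watch
zk.create_ephemeral("/hbase/rs/regionserver-1")  # a RegionServer comes online
zk.session_expired("/hbase/rs/regionserver-1")   # the RegionServer dies
print(failed)  # ['/hbase/rs/regionserver-1']: master can start failover
```

The key property is that the RegionServer never has to announce its own death: the deletion of its ephemeral znode is performed by Zookeeper as a side effect of the broken session, and the watch turns that deletion into a push notification to the MasterServer.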