Brian's Waste of Time

Wed, 30 Jul 2008

Using Virtual Nodes to Compact Vector Clocks

One hiccup encountered when using vector clocks is that there is no inherent way of reducing the size of the clock. Basically, any node which acts as a causal agent of change has the potential to be forever recorded in the clock. This leads to unbounded clock size, with time. Most systems tend to have a limited number of causal, or lead, nodes providing clock values so it is avoided, but sometimes you don't have that.

When vector clocks are used to track causality in a storage system, such as in Amazon's Dynamo system, it becomes possible to create syncronization points in the history of the element, between storage nodes, if the storage nodes are able to form consensus between themselves on the value of an element at a specific point in the elements history. If we are talking an eventually consistent system, this can be done by using a background syncronization and merge algorithm which merges acausal changes in the background. Alternately, it could be client resolved, in systems like Dynamo, but that isn't my problem, so... I digress.

When the system believes it has a value at a given clock value, where the clock is causally related to the unified value on the other storage nodes holding the element, it can try to achieve concensus about this, and if successful, increment an artifical clock key which we'll call the epoch. If successful, the epoch value subsumes the vector clock values associated with the epoch in the element, shrinking the element's clock.

To run through an example, let's say we have a system which uses three storage nodes for each element. We don't care exactly how these elements values are assigned, except to recognize that it allows for non-causally related changes to occur. At a given point in time the storage nodes may have values for an element A, as follows:

red[red:2, green:1]
blue[red:2, green:1, blue:2]
green[red:3, green:1, blue:1]

A Paxos instance may be executed proposing that epoch:1 be [red:2, green:1]. As each node can agree that [red:2, green:1] comes before its value, it can accept the epoch value. Upon acceptance of the value, the clocks would become:

blue[epoch:1, blue:2]
green[epoch:1, red:3, blue:1]

Assuming a background reconciliation protocol, a system could apply an appropriate heuristic to decide when to atempt to increment the epoch. A good example of such would be after unrelated values have been successfully merged. When it makes sense, and how to back-off to older clock values really depends on the characteristics of the system being designed and how it will be used.

As pointed out in the Dynamo paper, systems where there tend to be a small number of keys in the clock don't generally have this problem, Dynamo avoids it by causing the clock keys to be based on a small number of likely coordinator nodes:

To this end, Dynamo employs the following clock truncation scheme: Along with each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the data item. When the number of (node, counter) pairs in the vector clock reaches a threshold (say 10), the oldest pair is removed from the clock. Clearly, this truncation scheme can lead to inefficiencies in reconciliation as the descendant relationships cannot be derived accurately. However, this problem has not surfaced in production and therefore this issue has not been thoroughly investigated.

In something like that, it may not make a lot of sense -- the problem just doesn't tend to come up. On the other hand, other systems, such as one which uses a user id, or session id, as a clock key would tend to generate larger clocks. This kind of keying can be useful for providing read-what-I-wrote consistency, but that is another discussion :-)

0 writebacks [/src] permanent link