Wednesday, November 04, 2009

Eventual Consistency by Example

Recently, there has been a lot of chitchat about the eventual consistency model as illustrated in the famous Amazon Dynamo paper, and today employed by several non-relational databases such as Voldemort or Cassandra.
Everything starts with this blog post by the Facebook Infrastructure Lead, claiming: "Dynamo: A flawed architecture", where he makes a few points against the eventual consistency model and the related "sloppy" quorum approach.

However, his points seems to be based on a few misconceptions which the Amazon paper doesn't help to clarify, so let's try to spread some light by first giving a few definitions, and then a simple example.


We may sum up the eventual consistency model in the following statement:

Given a total number on T nodes, we choose a subset of N nodes for holding key/value replicas, arrange them in a preference list calling the top node "coordinator", and pick the minimum number of writes (W) and reads (R) that must be executed by the coordinator on nodes belonging to the preference list (including itself) in order to define the write and read as "successful".

Here are the key concepts, extracted from the statement above:
  • N is the number of nodes defining the number of replicas for a given key/value.
  • Those nodes are arranged in a preference list.
  • The node at the top of the preference list is called coordinator.
  • W is the minimum number of nodes where the write must be successfully replicated.
  • R is the minimum number of nodes where the read must be successfully executed.
Writes are executed by the coordinator on the first W available nodes on the preference list.
Reads are executed by the coordinator on the first R available nodes on the preference list.
This is eventually consistent because consistency itself depends on W and R: you may successfully write and read from a number of nodes less than the total number of replicas (N), with conflicts resolved at read time and values asynchronously propagated by hinted-handoff or alike (which could both be a topic for another post).
The same way, this is a sloppy quorum because writes and reads may succeed even if not all replica nodes (N) are up and running, depending again on W and R.

More specifically, values for N, W and R can be tuned in order to:
  • Achieve high write availability by setting: W < N
  • Achieve high read availability by setting: R < N
  • Achieve full consistency by setting: W + R > N
This seems to be the most difficult part to grasp, so ... let's proceed with our example!

An example.

Let me show you the simplest possible example: a cluster of two nodes.

We have:
  • Node1 : the coordinator.
  • Node2 : guess what, the second cluster node.
  • N = 2 : our replication factor.
Let's examine all possible eventual consistency scenarios.

W = 1, R = 1 -> W < N, R < N

You get write/read high availability: you can write/read to/from whatever available node, but you may get inconsistent reads. For example, if you write to node1 and then read from node2 (maybe because node1 has suddenly failed), you'll get an inconsistent read.
With two nodes, this is the only eventually consistent scenario.
Let's see why.

W = 1, R = 2 -> W + R > N

This is a fully consistent scenario.
Here, you can write to whatever node, but you can only read when both nodes are online, more specifically:
  • If you write to node1 and then read from node2 without reading from node1 because suddenly death, the read will fail.
  • Same if writing and reading from the same node: if you write to node1 and then read from node1 without reading from node2 because suddenly death, the read will fail.
  • Finally, the two most misunderstood scenarios:
    • If you write to node2 because node1 is offline, then node1 gets online and you read the same value from it, your read will succeed because you have both nodes online and you will not get any inconsistent value! That's because node2 will have the most up-to-date version which will be "acquired" by node1 and returned to you!
    • If you write to node1, then it goes offline and you write again on node2, you will never lose the old value! That's because you will be able to read only after node1 coming online again: at that time, a read from node1 and node2 will return both values (because divergent), leaving the reconciliation process to the client.
W = 2, R = 1 -> W + R > N
W = 2, R = 2 -> W + R > N

Even easier: consistent reads are guaranteed by the W value equal to the replication factor.

Finally ...

Hope this write-up helped you clarifying a few misconceptions about eventual consistency.
Feedback is obviously highly appreciated.

See you!