Saturday, December 19, 2009

Terrastore and the Cap Theorem

This is an edited version of the original "Terrastore and the Cap Theorem" article, updated to reflect latest Terrastore developments.

Terrastore is a new born document store based on the wonderful Terracotta technology, focused on providing a feature-rich, scalable, yet consistent, data store.

Data stores today are often classified depending on how they deal with availability, partition-tolerance and consistency, and how they resolve the conflict between them.
Such a conflict is described as the CAP theorem, stating that you can only pick two of them and relax the third: the matter isn't trivial, but it's not the focus of this post, so I will not go any further with it and point you at some well written posts such as this one from Julian Browne, and this one from Jeff Darcy.

With the CAP theorem in mind, a few people wisely asked: how does Terrastore relate to it?

Let's discuss!

Terrastore architecture in a nutshell

Terrastore is a distributed/partitioned store with a master-based topology powered by Terracotta.
It can be configured to work with a single cluster, made up of one active master, one or more optional passive masters, and one or more servers; or it can be configured to work with a clusters ensemble: a set of more clusters working together as a single whole.
All Terrastore documents are automatically partitioned between clusters (if more than one is provided as an ensemble) and intra-cluster server nodes, so that every node only holds in its own main memory a subset of all documents.
All Terrastore server nodes are equal peers, talking to each other only for requesting documents actually owned by some other node.
Data replication is managed by the central (Terracotta) master: each active master manages data replication inside its own cluster; so when working as an ensemble, data never cross a cluster boundary and each active master scales independently from others.

Consistency (C)

Terrastore is a consistent store.
What does it mean? Does it support serializable transactions? Is it fully ACID?

Well, Terrastore is somewhat in the middle between ACID and BASE.
ACID means Atomic, Consistent, Isolated and Durable: that is, the flagship characteristics of your old relational database.
BASE means Basically Available, Soft state, Eventually consistent: that is, the flagship characteristics of many distributed non-relational databases currently on the market, which do not (normally) give any time guarantee in regards to data consistency among all distributed nodes (see my previous post about the topic).

Terrastore provides per-document consistency, meaning that all single updates to a single document will be atomic, isolated and durable: you will always see the latest version of a document, but no support is provided for transactions between different documents or between consecutive operations on the same document.

Avoiding cross-operation and cross-document transactions makes Terrastore more scalable than relational databases, while at the same time as safe as them to reason about updates order.
The complexity of a consistent write operation is very low, because no synchronous replication is involved between nodes: it has a best-case complexity of O(1), requiring only a semi-asynchronous replication toward the master, and a worst case of O(2) if the written document should be owned by another node (and hence internally routed there).

Availability (A) and Partition Tolerance (P)

I'll talk about availability and partition tolerance together because you can't really consider the former without taking into account the latter.

Terrastore availability depends on whether we're talking about clusters, masters or servers.

In case of a Terrastore ensemble made up of several clusters, if one or more clusters are completely unreachable all data hosted inside them will be unavailable, but other clusters with related data will remain available.
This means Terrastore can tolerate partitions between clusters: a cluster will continue serving its and other available clusters data, while considering as unavailable data belonging to unreachable clusters.

When moving our focus inside a single cluster, Terrastore will be fully available as far as there's one reachable master and one reachable server.
You can dynamically run and shutdown how many server nodes you want, and they can fail at any rate.
But, if connection between server(s) and master(s) is lost, the cluster will suddenly be unavailable: this means that Terrastore can't tolerate partitions between servers and masters inside a single cluster.

Final words: is Terrastore superior to other solutions?

The answer is obviously: no, it isn't.
Whatever other product-owners say, the truth is just that: there's no silver bullet, nor universal solution.
Just go with a non-distributed non-relational solution if you need super-fast access to your data.
Take a look at a "base"-oriented solution if you have very large amount of data, probably distributed on several data centers, and need both partition-tolerance and full data availability.
Or go with Terrastore if you need a distributed store with consistency features.

Pick your choice.
And be happy.


Andre said...

Hi Sergio,
great writeup. I just have one question: How does Terastore differentiate between a partition and a node failure. You stated:

"Until there will be at least one available node, data will be always automatically re-balanced and served"

The question I have is:
1.) How is the master elected and when is an election triggered?
2.) How do you ensure that only one active partition exists?


Sergio Bossa said...

Hi Andre,

thanks much for your kind words.

Terrastore can tolerate server node failures until there's at least one running server node, as well as master node failures until there's at least one passive master ready to take over.
Terrastore can *not* tolerate partitions between server nodes and master nodes: that is, if server nodes cannot reach any master server anymore, they will stop working.

Details about how active master election works, are explained in the Terracotta documentation, see:
To make a long story short, a network protocol is used, with configurable parameters (to tune depending on latencies) as explained in the link above.

Hope that helps, let me know if you have other questions.

Sergio B.

Vivian Salvatore said...

Cheap D3 Gold I possibly could create complete documents on SCII's principles, nevertheless here I am just mainly attracting an empty.If GW2 Gold do not have in mind the essentials you cannot do anything whatsoever.Knowledge. Information is more critical when compared with virtually other things (apart from essentials) at this point