> It can't tolerate N failures from N+1 *copies of your data* Sorry I got the te...

voidmain · on April 19, 2018

FoundationDB stores 2N+1 copies of some "coordination state" and does a consensus algorithm whenever it is updated. But this state doesn't contain a copy of your data; basically think of it as storing a replication configuration. It's very small and rarely changes.

In the happy case, replication takes place using the replicas and quorum rules specified by this configuration. For example, you might require writes to succeed synchronously against all N+1 replicas of some transaction log. After N failures, there will still be 1 replica remaining with the latest transactions. But in order to proceed after any failures, you have to do a consensus transaction against a majority of replicas of the coordination state, to specify the new set of N+1 replicas you will be using. And you also make sure that the 1 replica you are recovering from knows you are doing it, so that it won't continue to accept writes under the old replication configuration.

There can't be two partitions capable of committing transactions, because (in this case) you need either

(a) All N+1 replicas of the log, so that you can commit synchronously, or (b) A majority (N+1 out of 2N+1) of the replicas of the coordination state, AND 1 replica of the log

Sorry if this isn't a great explanation. Anyway it does work. I expect that you could rephrase this as an optimization of a consensus protocol, though I think it would be hard to build a performant and realistically featureful implementation that way.

infogulch · on April 19, 2018

Nope your explanation made sense, thank you!

When I wrote that I was wondering if it used a second 2N+1 dataset just for coordination & consensus. This has the benefit of separating data from consensus, allowing the N of N+1 data failure. But at the end of the day consistency still comes down to a N of 2N+1 failure tolerance of that second coordination state. It's smaller easier to replicate etc etc but it seems like it still has the same fault tolerance as just replicating the data 2N+1 times. It sounds like it's worked out great in practice for FDB.

But you say it rarely changes... but wouldn't it have to change every time there's a change to the dataset? I feel like this means you have to do even more replication and consensus than just replicating the data without this second consensus state.

voidmain · on April 19, 2018

You only have to write to the coordination state when there is a failure. You can commit millions of transactions in the happy case without ever doing such a write. And failure detector performance and other engineering concerns are usually more of a limitation, in practice, on the performance of recovery than the latency of the coordination state consensus, even when the coordinators are geographically distributed.

infogulch · on April 19, 2018

So the strategy is to optimistically assume that there are no failures and just replicate to all N+1 copies. If there's a failure then back off to the consensus state to coordinate the fix rigorously.

In the best case with no failures this works great. But as the number of failures increases, I feel like due to the extra synchronization there will be an inflection point where the cost of the extra layers of coordination will be higher than just synchronizing the data directly. But due to 'other concerns' that inflection point is pushed back by a lot.

Is that a reasonable characterization?

voidmain · on April 19, 2018

If you expect to have lots of (hopefully very temporary!) node failures, I think FoundationDB has another trick up its sleeve. You can store (say) N+2 replicas of transaction logs, which are also relatively small and (since sequential) efficient. Then you have a write quorum of N+1 and a recovery quorum of 2 logs, and you don't have to do coordination on every failure.

It's certainly true that with enough failures you aren't going to make much progress. I'm not sure that is any less true with plain old state machine replication protocols, though.

ccleve · on April 19, 2018

Yup, this works. I wonder if this is the first commercial application of this insight? http://hh360.user.srcf.net/blog/2016/08/majority-agreement-i...

Can you tell us what consensus algorithm you're using? Raft, or something else? Your own implementation or something off the shelf?

voidmain · on April 20, 2018

I'm not sure.

The coordination consensus is (our own implementation of) disk paxos, which we liked for its operational properties in our context (the coordinators don't need to know about each other or communicate directly). An early version of fdb had a dependency on Zookeeper for this purpose; you can use anything.

AlphaSite · on April 20, 2018

I know VSAN uses these and calls them witnesses.

https://cormachogan.com/2014/04/01/vsan-part-21-what-is-a-wi...

AlphaSite · on April 19, 2018

Maybe not all data is replicated across all zones, so zones provide consensus but replicas provide data. Basically you have witness with enough information to provide consistency but without the full data store.