> It can't tolerate N failures from N+1 copies of your data
Sorry I got the terminology wrong, but that's a distinction without a difference. If it can tolerate N failures from N+1 copies, that means a network partition would allow any one copy to continue chugging along making changes by itself. You have two options: consistency is dropped and you downgrade to eventually consistent (at best), or availability is dropped meaning a single node can't make changes without a majority, which invalidates the N of N+1 failures claim. (Which is where the N failures of 2N+1 copies claim comes from in the first place: after N failures you still have a majority of copies.)
Or there's some other magic quorum protocol I've never heard of that makes the majority problem disappear.
FoundationDB stores 2N+1 copies of some "coordination state" and does a consensus algorithm whenever it is updated. But this state doesn't contain a copy of your data; basically think of it as storing a replication configuration. It's very small and rarely changes.
In the happy case, replication takes place using the replicas and quorum rules specified by this configuration. For example, you might require writes to succeed synchronously against all N+1 replicas of some transaction log. After N failures, there will still be 1 replica remaining with the latest transactions. But in order to proceed after any failures, you have to do a consensus transaction against a majority of replicas of the coordination state, to specify the new set of N+1 replicas you will be using. And you also make sure that the 1 replica you are recovering from knows you are doing it, so that it won't continue to accept writes under the old replication configuration.
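To make the two paths concrete, here's a minimal, hypothetical sketch (not FoundationDB's actual code) of the scheme described above: synchronous write-all replication in the happy case, plus a tiny majority-quorum coordination state that is only touched during recovery. All names here (`Replica`, `Coordinators`, `recover`, etc.) are invented for illustration.

```python
class Replica:
    def __init__(self, name):
        self.name, self.log = name, []
        self.alive, self.fenced = True, False

    def append(self, txn):
        # Accept a write only if up and not fenced off by a recovery
        # that installed a newer configuration.
        if self.alive and not self.fenced:
            self.log.append(txn)
            return True
        return False

class Coordinators:
    """2N+1 copies of a small replication configuration."""
    def __init__(self, n):
        self.copies = [None] * (2 * n + 1)

    def consensus_write(self, config):
        # Stand-in for a real consensus round: succeeds by writing a
        # majority of the copies.
        for i in range(len(self.copies) // 2 + 1):
            self.copies[i] = config

def commit(replicas, txn):
    # Happy path: all N+1 log replicas must acknowledge synchronously.
    # Note that no coordinator is involved here at all.
    if not all(r.append(txn) for r in replicas):
        raise IOError("a replica failed; recovery required")

def recover(coordinators, old_replicas, new_replicas):
    # Failure path: install the new replica set via a consensus write
    # against a majority of coordinators, copy the latest transactions
    # from a surviving old replica, then fence that survivor so it stops
    # accepting writes under the stale configuration.
    coordinators.consensus_write([r.name for r in new_replicas])
    survivor = next(r for r in old_replicas if r.alive)
    for r in new_replicas:
        r.log = list(survivor.log)
    survivor.fenced = True
    return new_replicas

# N = 1: two log replicas, three coordinators.
a, b = Replica("a"), Replica("b")
coords = Coordinators(1)
commit([a, b], "txn1")           # millions of these never touch coords
b.alive = False                  # one failure (N of N+1 tolerated)
replicas = recover(coords, [a, b], [Replica("c"), Replica("d")])
commit(replicas, "txn2")
print(replicas[0].log)           # ['txn1', 'txn2']
```

The fencing step is the important part: once the survivor is fenced, a write-all commit under the old configuration can no longer succeed, so the old and new configurations can't both make progress.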
There can't be two partitions capable of committing transactions, because (in this case) you need either
(a) All N+1 replicas of the log, so that you can commit synchronously, or
(b) A majority (N+1 out of 2N+1) of the replicas of the coordination state, AND 1 replica of the log
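The no-split-brain claim above is small enough to brute-force. This hypothetical check models a partition as splitting the log replicas and coordinators into two disjoint sides, and verifies that no split lets both sides satisfy (a) or (b):

```python
from itertools import combinations

def can_commit(my_logs, all_logs):
    # (a) committing requires ALL N+1 log replicas
    return my_logs == all_logs

def can_recover(my_logs, my_coords, all_coords):
    # (b) recovery requires a majority of the 2N+1 coordinators
    #     AND at least 1 log replica
    return 2 * len(my_coords) > len(all_coords) and len(my_logs) >= 1

for N in (1, 2):
    logs = frozenset(range(N + 1))          # N+1 log replicas
    coords = frozenset(range(2 * N + 1))    # 2N+1 coordinators
    for k in range(len(logs) + 1):
        for a_logs in map(frozenset, combinations(logs, k)):
            for j in range(len(coords) + 1):
                for a_coords in map(frozenset, combinations(coords, j)):
                    b_logs, b_coords = logs - a_logs, coords - a_coords
                    a_ok = (can_commit(a_logs, logs)
                            or can_recover(a_logs, a_coords, coords))
                    b_ok = (can_commit(b_logs, logs)
                            or can_recover(b_logs, b_coords, coords))
                    assert not (a_ok and b_ok), "both sides could proceed"
print("no partition lets both sides proceed")
```

Intuitively: two sides can't both hold all the logs, two coordinator majorities must overlap, and a side doing (a) holds every log replica, leaving none for the other side's (b).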
Sorry if this isn't a great explanation. Anyway it does work. I expect that you could rephrase this as an optimization of a consensus protocol, though I think it would be hard to build a performant and realistically featureful implementation that way.
When I wrote that I was wondering if it used a second 2N+1 dataset just for coordination & consensus. This has the benefit of separating data from consensus, allowing the N-of-N+1 data failure tolerance. But at the end of the day, consistency still comes down to an N-of-2N+1 failure tolerance of that second coordination state. It's smaller, easier to replicate, etc., but it seems like it still has the same fault tolerance as just replicating the data 2N+1 times. It sounds like it's worked out great in practice for FDB.
You say it rarely changes... but wouldn't it have to change every time there's a change to the dataset? If so, that would mean doing even more replication and consensus than just replicating the data without this second consensus state.
You only have to write to the coordination state when there is a failure. You can commit millions of transactions in the happy case without ever doing such a write. And failure detector performance and other engineering concerns are usually more of a limitation, in practice, on the performance of recovery than the latency of the coordination state consensus, even when the coordinators are geographically distributed.
So the strategy is to optimistically assume that there are no failures and just replicate to all N+1 copies. If there's a failure then back off to the consensus state to coordinate the fix rigorously.
In the best case with no failures this works great. But as the number of failures increases, I'd expect the extra synchronization to reach an inflection point where the cost of the extra layers of coordination exceeds the cost of just synchronizing the data directly. Though due to those "other concerns" that inflection point is pushed back by a lot.
If you expect to have lots of (hopefully very temporary!) node failures, I think FoundationDB has another trick up its sleeve. You can store (say) N+2 replicas of transaction logs, which are also relatively small and (since sequential) efficient. Then you have a write quorum of N+1 and a recovery quorum of 2 logs, and you don't have to do coordination on every failure.
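The quorum arithmetic behind that trick is worth spelling out: with N+2 replicas, a write quorum of N+1 and a recovery quorum of 2 must overlap, because (N+1) + 2 > N+2. A hypothetical brute-force check with N = 2 (four replicas, write to any 3, recover from any 2):

```python
from itertools import combinations

N = 2
replicas = set(range(N + 2))
# Every pair of surviving replicas includes at least one replica that
# acknowledged every committed write, so nothing is lost on recovery.
for write_q in combinations(replicas, N + 1):
    for recovery_q in combinations(replicas, 2):
        assert set(write_q) & set(recovery_q), "recovery missed a write"
print("every recovery quorum intersects every write quorum")
```

So up to N failures can be absorbed without touching the coordination state at all, since any 2 survivors are enough to recover.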
It's certainly true that with enough failures you aren't going to make much progress. I'm not sure that is any less true with plain old state machine replication protocols, though.
The coordination consensus is (our own implementation of) disk paxos, which we liked for its operational properties in our context (the coordinators don't need to know about each other or communicate directly). An early version of fdb had a dependency on Zookeeper for this purpose; you can use anything.
Maybe not all data is replicated across all zones, so zones provide consensus but replicas provide data. Basically you have a witness with enough information to provide consistency but without the full data store.