CAP twelve years later: How the rules have changed

This is a reading note. The author of the paper, Eric Brewer, is the proposer of the CAP theorem.

Brewer, E. (2012). CAP twelve years later: How the "rules" have changed. Computer, 45(2), 23-29.

abstract

CAP theorem:

  • C: consistency, equivalent to having a single up-to-date copy of the data
  • A: high availability of that data (for updates)
  • P: tolerance to network partitions

You can have only two of these three desirable properties. This can easily mislead designers.

The theorem only disallows perfect availability and consistency in the presence of partitions, which are rare. (CA during a partition is impossible.)

Designers still need to choose between consistency and availability when partitions are present, and they need to make a trade-off that maximizes the combinations of consistency and availability that make sense for the specific application.

Even if we want to preserve both consistency and availability, we obviously cannot maintain consistency during a network partition. In some sense, the NoSQL movement is about creating choices that focus on availability first and consistency second; databases that adhere to ACID properties (atomicity, consistency, isolation, and durability) do the opposite. The "ACID, BASE, and CAP" sidebar explains this difference in more detail.

ACID, BASE, and CAP

The ACID properties focus on consistency and are the traditional approach of databases. Modern large-scale wide-area systems use a mix of both approaches.

The BASE acronym is a bit more awkward: basically available, soft state, eventually consistent. The relationship between CAP and ACID is complex and often misunderstood, in part because the C and A in ACID represent different concepts than the same letters in CAP, and in part because choosing availability affects only some of the ACID guarantees. The four ACID properties are:

  • Atomicity (A): When the focus is availability, both sides of a partition should still use atomic operations. Moreover, higher-level atomic operations (the kind that ACID implies) actually simplify recovery.
  • Consistency (C): In ACID, the C means that a transaction preserves all the database rules, such as unique keys. In contrast, the C in CAP refers only to a strict subset of ACID consistency. ACID consistency also cannot be maintained across partitions; partition recovery will need to restore ACID consistency. More generally, maintaining invariants during partitions might be impossible, thus the need for careful thought about which operations to disallow and how to restore invariants during recovery.
  • Isolation (I): Isolation is at the core of the CAP theorem: if the system requires ACID isolation, it can operate on at most one side during a partition. Serializability requires communication in general and thus fails across partitions. (Isolation does not allow multiple transactions to run together, so during a partition you cannot be sure whether transactions conflict.)
  • Durability (D): As with atomicity, there is no reason to forfeit durability. In general, running ACID transactions on each side of a partition makes recovery easier and enables a framework for compensating transactions that can be used for recovery from a partition.

The "2 of 3" view is misleading on several fronts. First, because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned. Second, the choice between C and A can occur many times within the same system at very fine granularity; not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user involved. Finally, all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists.

Because partitions are rare, CAP should allow perfect C and A most of the time, but when partitions are present or perceived, a strategy that detects partitions and explicitly accounts for them is in order. This strategy has three steps: detect partitions, enter an explicit partition mode that can limit some operations, and initiate a recovery process to restore consistency and compensate for mistakes made during a partition.

CAP-latency connection

In the classic interpretation, the CAP theorem ignores latency, although in practice latency and partitions are deeply related. Operationally, the essence of CAP takes place during a timeout, a period when the program must make a fundamental decision, the partition decision:

  • cancel the operation and thus decrease availability, or
  • proceed with the operation and thus risk inconsistency.
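The partition decision can be sketched as a timeout handler. This is a minimal illustration, not the paper's code; the function name, the ack queue, and the return strings are all hypothetical:

```python
import queue

def replicate(op, ack_queue, timeout, on_timeout="cancel"):
    """Send op to a replica and wait for an acknowledgment.

    A timeout forces the partition decision:
      - "cancel":  refuse the operation (keep C, reduce availability)
      - "proceed": apply it locally anyway (keep A, risk inconsistency)
    """
    try:
        ack_queue.get(timeout=timeout)  # wait for the remote ack
        return "committed"              # both sides agree: C and A preserved
    except queue.Empty:                 # no ack: a partition, or just latency?
        if on_timeout == "cancel":
            return "cancelled"          # consistent but unavailable
        return "applied-locally"        # available but possibly inconsistent
```

Retrying the wait indefinitely instead of handling the timeout is, in essence, choosing C over A.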

Retrying communication to achieve consistency, for example via Paxos or two-phase commit, just delays the decision. At some point the program must make the decision; retrying communication indefinitely is in essence choosing C over A.

CAP Confusion

If users cannot reach the service at all, there is no choice between C and A, except when part of the service runs on the client.

The scope of consistency reflects the idea that, within some boundary, state is consistent, but outside that boundary all bets are off. For example, within a primary partition it is possible to ensure complete consistency and availability, while outside the partition the service is not available. Paxos and atomic multicast systems typically match this scenario.

With sharding, in which designers prepartition data across nodes, it is highly likely that each shard can make some progress during a partition. Conversely, if the relevant state is split across a partition or global invariants are necessary, then at best only one side can make progress and at worst no progress is possible.

In practice, most groups assume that a datacenter (single site) has no partitions within, and thus design for CA within a single site. However, although partitions are less likely within a datacenter, they are indeed possible, which makes a CA goal problematic. Finally, given the high latency across the wide area, it is relatively common to forfeit perfect consistency across the wide area for better performance.

Thus, a partition is a time bound on communication: failing to achieve consistency within the time bound implies a partition, and thus a choice between C and A for this operation. From network communication alone, it is impossible to tell whether nodes are partitioned or network latency is merely too high.

Sometimes it makes sense to forfeit strong C to avoid the high latency of maintaining consistency over a wide area.

Managing partitions

The challenging case for designers is to mitigate a partition's effects on consistency and availability. The key idea is to manage partitions very explicitly, including not only detection but also a specific recovery process and a plan for all of the invariants that might be violated during a partition. This management has three steps:

  • detect the start of a partition
  • enter an explicit partition mode that may limit some operations
  • initiate partition recovery when communication is restored
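These three steps can be sketched as a small state machine. The class and method names are invented for illustration, and a single boolean stands in for real partition detection:

```python
class PartitionAwareNode:
    """Minimal sketch of the three-step strategy (hypothetical API)."""

    def __init__(self):
        self.mode = "normal"
        self.log = []          # operations recorded while partitioned

    def heartbeat(self, peer_reachable):
        if not peer_reachable and self.mode == "normal":
            self.mode = "partition"    # steps 1-2: detect, enter partition mode
        elif peer_reachable and self.mode == "partition":
            self.mode = "recovering"   # step 3: start partition recovery

    def execute(self, op, risky=False):
        if self.mode == "partition":
            if risky:
                return "delayed"       # limit operations that could violate invariants
            self.log.append(op)        # record extra information for recovery
        return "ok"
```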

Figure 1. The state starts out consistent and remains so until a partition starts. To stay available, both sides enter partition mode and continue to execute operations, creating concurrent states S1 and S2, which are inconsistent. When the partition ends, the truth becomes clear and partition recovery starts. During recovery, the system merges S1 and S2 into a consistent state S' and also compensates for any mistakes made during the partition.


Once the system enters partition mode, two strategies are possible. The first is to limit some operations, thereby reducing availability. The second is to record extra information about the operations that will be helpful during partition recovery. Continuing to attempt communication will enable the system to discern when the partition ends.

which operations should proceed

Deciding which operations to limit depends primarily on the invariants that the system must maintain. For example, for the invariant that keys in a table are unique, designers typically decide to risk that invariant and allow duplicate keys during a partition. Duplicate keys are easy to detect during recovery, and, assuming that they can be merged, the designer can easily restore the invariant.

For an invariant that must be maintained during a partition, however, the designer must prohibit or modify operations that might violate it. (In general, there is no way to tell whether the operation will actually violate the invariant, since the state of the other side is not knowable.)

Essentially, the designer must build a table that looks at the cross product of all operations and all invariants and decide, for each entry, whether that operation could violate the invariant. If so, the designer must decide whether to prohibit, delay, or modify the operation. In practice, these decisions can also depend on the known state, on the arguments, or both. For example, in systems with a home node for certain data, operations can typically proceed on the home node but not on other nodes.
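The cross-product table might look like the following sketch; the operations, invariants, and actions are hypothetical examples, not from the paper:

```python
# Hypothetical decision table: (operation, invariant) -> action during a partition.
# "allow" when the operation cannot endanger the invariant beyond what is
# acceptable; otherwise prohibit, delay, or modify it.
DECISION_TABLE = {
    ("insert", "unique-key"): "allow",       # risk duplicates; mergeable later
    ("withdraw", "balance>=0"): "modify",    # e.g. cap the amount
    ("transfer", "balance>=0"): "prohibit",  # spans nodes; cannot check remotely
}

def decide(operation, invariant, home_node=False):
    # On the data's home node the relevant state is known, so proceed normally.
    if home_node:
        return "allow"
    return DECISION_TABLE.get((operation, invariant), "delay")
```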

The best way to track the history of operations on both sides is to use version vectors, which capture the causal dependencies among operations. The vector’s elements are pairs (node, logical time), with one entry for every node that has updated the object and the time of its last update. Given two versions of an object, A and B, A is newer than B if, for every node in common in their vectors, A’s times are greater than or equal to B’s and at least one of A’s times is greater.

If it is impossible to order the vectors, then the updates were concurrent and possibly inconsistent. Thus, given the version vector history of both sides, the system can easily tell which operations are already in a known order and which executed concurrently. Recent work proved that this kind of causal consistency is the best possible outcome in general if the designer chooses to focus on availability.
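The dominance rule above can be sketched directly. Treating a missing node entry as logical time 0 is an assumption of this sketch; the text only discusses nodes the two vectors have in common:

```python
def compare(a, b):
    """Compare two version vectors (dicts mapping node -> logical time).
    Returns "newer", "older", "equal", or "concurrent"."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither dominates: possibly inconsistent updates
    if a_ahead:
        return "newer"
    if b_ahead:
        return "older"
    return "equal"
```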

partition recovery

At some point, communication resumes and the partition ends. At this point, the system knows the state and history of both sides because it kept a careful log during partition mode. The state is less useful than the history, from which the system can deduce which operations actually violated invariants and what results were externalized, including the responses sent to the user. The designer must solve two hard problems during recovery:

  • the state on both sides must become consistent
  • there must be compensation for the mistakes made during partition mode.

For the state, it is generally easier to start from the state at the time of the partition and replay the full set of operations in a well-defined, deterministic order so that all nodes reach the same state. Similarly, source-code control systems such as the Concurrent Versions System (CVS) start from a shared consistent point and roll forward updates to merge branches.
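Deterministic replay can be sketched by sorting both sides' logs on a shared key, here an assumed (timestamp, node) pair, so that every node applies the same sequence:

```python
def merge_by_replay(ops_side1, ops_side2, initial_state):
    """Merge two partition-mode logs by replaying all operations in a
    well-defined, deterministic order. Hypothetical op format:
    (timestamp, node, function), with distinct timestamps."""
    merged = sorted(ops_side1 + ops_side2)  # identical order on every node
    state = initial_state
    for _ts, _node, fn in merged:
        state = fn(state)
    return state
```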

Most systems cannot always merge conflicts. For example, CVS occasionally has conflicts that the user must resolve manually, and wiki systems with an offline mode typically leave conflicts in the resulting document that require manual editing.

Conversely, some systems can always merge conflicts by constraining the choice of operations. A case in point is text editing in Google Docs,11 which limits operations to applying a style and adding or deleting text. Thus, although the general problem of conflict resolution is not solvable, in practice designers can choose to constrain the use of certain operations during partitioning so that the system can automatically merge state during recovery. Delaying risky operations is one relatively easy implementation of this strategy.

Using commutative operations is the closest approach to a general framework for automatic state convergence; for example, additions are commutative.
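A minimal sketch of the addition example: if each side records the per-node increments it applied during the partition, summing them yields the same merged state in either order, because addition is commutative and associative. The dictionary format is an assumption of this sketch:

```python
def merge_counters(side1, side2):
    """Combine per-node increment counts recorded during a partition.
    Addition commutes, so the merge order does not affect the result."""
    merged = dict(side1)
    for node, count in side2.items():
        merged[node] = merged.get(node, 0) + count
    return merged
```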


Recent work by Marc Shapiro and colleagues at INRIA12,13 has greatly improved the use of commutative operations for state convergence. The team has developed commutative replicated data types (CRDTs), a class of data structures that provably converge after a partition, and describes how to use these structures to

  • ensure that all operations during a partition are commutative, or
  • represent values on a lattice and ensure that all operations during a partition are monotonically increasing with respect to that lattice.

The latter approach converges state by moving to the maximum of each side's values. It is a formalization and improvement of what Amazon does with its shopping cart:14 after a partition, the converged value is the union of the two carts, with union being a monotonic set operation. The consequence of this choice is that deleted items may reappear.
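The cart merge can be sketched as a set union, the monotonic "maximum" on the lattice of sets; the function name is illustrative:

```python
def merge_carts(cart1, cart2):
    """Converge by taking the least upper bound on the lattice of sets,
    i.e. set union. Deletions are not tracked, so an item deleted on one
    side but still present on the other reappears after the merge."""
    return cart1 | cart2
```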

compensation issues in an automated teller machine

In the design of an automated teller machine (ATM), strong consistency would appear to be the logical choice, but in practice A trumps C. The reason is straightforward enough: higher availability means higher revenue. Regardless, ATM design serves as a good context for reviewing some of the challenges involved in compensating for invariant violations.

The essential ATM operations are deposit, withdraw, and check balance. The key invariant is that the balance should be zero or higher. Because only withdraw can violate the invariant, it will need special treatment; the other two operations can always execute.

The ATM system designer could choose to prohibit withdrawals during a partition, since it is impossible to know the true balance at that time, but that would compromise availability. Instead, using stand-in mode (partition mode), modern ATMs limit the net withdrawal to at most k, where k might be $200. Below this limit, withdrawals work completely; when the balance reaches the limit, the system denies withdrawals. Thus, the ATM chooses a sophisticated limit on availability that permits withdrawals but bounds the risk.

When the partition ends, there must be some way to both restore consistency and compensate for mistakes made while the system was partitioned. Restoring state is easy because the operations are commutative, but compensation can take several forms. A final balance below zero violates the invariant; the bank compensates by charging a fee and expecting repayment. Given that the risk is bounded, the problem is not severe.
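The ATM discussion can be sketched end to end: stand-in mode bounds the net withdrawal at k, and recovery applies the commutative operations and compensates with a fee if the balance invariant was violated. The limit, the fee amount, and the function names are illustrative assumptions:

```python
WITHDRAW_LIMIT = 200  # the "k" from the text: max net withdrawal per partition

def atm_withdraw(amount, withdrawn_so_far):
    """Stand-in (partition) mode: allow withdrawals up to a net limit of k.
    Returns (allowed, new net withdrawal)."""
    if withdrawn_so_far + amount > WITHDRAW_LIMIT:
        return False, withdrawn_so_far      # deny: bounds the risk
    return True, withdrawn_so_far + amount

def recover(balance, partition_withdrawals, overdraft_fee=25):
    """Partition recovery: withdrawals commute with deposits, so just apply
    them; if the invariant balance >= 0 was violated, compensate with a fee."""
    final = balance - sum(partition_withdrawals)
    if final < 0:
        final -= overdraft_fee  # the bank charges a fee and expects repayment
    return final
```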

In general, because of communication delays, the banking system depends not on consistency for correctness, but rather on auditing and compensation.

compensating for mistakes

In addition to computing the postpartition state, there is the somewhat harder problem of fixing mistakes made during partitioning. The tracking and limitation of partition-mode operations ensures knowledge of which invariants could have been violated, which in turn enables the designer to create a restoration strategy for each such invariant. Typically, the system discovers the violation during recovery and must implement any fix at that time.

There are various ways to fix the invariants, including trivial ways such as "last writer wins" (which ignores some updates), smarter approaches that merge operations, and human escalation. The idea of compensation is really at the core of fixing such mistakes; designers must create compensating operations that both restore an invariant and, more broadly, correct an externalized mistake.
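The contrast between the trivial fix and a smarter merging fix can be sketched on set-valued state; both functions and the (timestamp, value) update format are hypothetical:

```python
def last_writer_wins(updates):
    """Trivial fix: keep only the value with the latest timestamp;
    earlier concurrent updates are silently ignored."""
    return max(updates)[1]

def merge_all(updates):
    """Smarter fix for set-valued state: merge every concurrent update."""
    merged = set()
    for _ts, value in updates:
        merged |= value
    return merged
```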

Technically, CRDTs allow only locally verifiable invariants, a limitation that makes compensation unnecessary but that somewhat decreases the approach's power. However, a solution that uses CRDTs for state convergence could allow the temporary violation of a global invariant, converge the state after the partition, and then execute any needed compensations.

In a machine context, a computer could execute orders twice during a partition. If the system can distinguish two intentional orders from two duplicate orders, it can cancel one of the duplicates. If the duplicate was externalized, one compensation strategy would be to autogenerate an e-mail to the customer explaining that the system accidentally executed the order twice, that the mistake has been fixed, and to attach a coupon for a discount on the next order. Without the proper history, however, the burden of catching the mistake is on the customer.

Some researchers have formally explored compensating transactions as a way to deal with long-lived transactions. Long-running transactions face a variation of the partition decision: is it better to hold locks for a long time to ensure consistency, or to release them early, exposing uncommitted data to other transactions but allowing higher concurrency? A typical example is trying to update all employee records as a single transaction. Serializing this transaction in the normal way locks all records and prevents concurrency. Compensating transactions take a different approach by breaking the large transaction into a saga, which consists of multiple subtransactions, each of which commits along the way. Thus, to abort the larger transaction, the system must undo each committed subtransaction by issuing a new transaction that corrects for its effects, the compensating transaction.
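A saga can be sketched as a list of (action, compensation) pairs: each subtransaction commits immediately, and on failure the already committed steps are corrected in reverse order by their compensating transactions. The API here is invented for illustration:

```python
def run_saga(steps):
    """Run subtransactions in order, committing each as it completes.
    On failure, issue the compensating transactions for the committed
    steps in reverse order rather than relying on locks or cascading
    aborts. Each step is (action, compensation); an action raises to
    signal failure."""
    committed = []
    try:
        for action, compensation in steps:
            action()
            committed.append(compensation)
    except Exception:
        for compensation in reversed(committed):
            compensation()  # correct the net effect of a committed step
        return "compensated"
    return "committed"
```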

In general, the goal is to avoid aborting other transactions that used the incorrectly committed data (no cascading aborts). The correctness of this approach depends not on serializability or isolation, but rather on the net effect of the transaction sequence on state and outputs. That is, after compensation, does the database essentially end up in a place equivalent to where it would have been had the subtransactions never executed? The equivalence must include externalized actions; for example, refunding a duplicate purchase is hardly the same as not charging that customer in the first place, but it is arguably equivalent. The same idea holds in partition recovery. A service or product provider cannot always undo mistakes directly, but it aims to admit them and take new, compensating actions. How best to apply these ideas to partition recovery is an open problem.

Last modification: April 29, 2025