![]() |
Toolbox snapshot
The Reactive C++ Toolbox
|
Concurrent execution of code is concerned with:
These concerns are typically addressed using locks and fences.
Locks provide mutually exclusive access to a shared resource. In other words, access to a shared resource is serialised, so that it cannot be accessed concurrently by multiple threads. According to Amdahl's law, performance is limited by the amount of sequential code on the critical path, so mutual exclusion must be minimised.
Atomic operations are indivisibility units of work that either happen entirely or not at all. When a shared resource is modified using a single, atomic operation, such as a Compare And Swap (CAS) instruction, mutual exclusion is kept minimal, which increases the potential performance gains.
Memory order is concerned with the order in which memory operations occur both from the reader's and writer's perspective. In the absence of memory ordering constraints, memory operations that occur in program order can be reordered by both the compiler and processor. The following reorderings are theoretically possible:
LoadLoad: loads reordered with other loads.
StoreStore: stores reordered with other stores.
LoadStore: loads reordered with subsequent stores.
StoreLoad: stores reordered with subsequent loads.
Memory order constraints prevent one or more of these reordering from happening across a memory barrier or fence. These constraints are defined by the C11 and C++11 standards using more abstract concepts: Relaxed, Consume, Acquire, Release, Acqrel and Seqcst.
N.B. Few compilers track the dependency chains required to implement Consume; most compilers simply promote Consume to Acquire. Consequently, no further consideration is given to Consume in this document.
Memory ordering semantics, as defined by the C11 and C++11 standards, are summarised below along with their implied fences:
memory order | fence |
---|---|
Relaxed | no-op |
Acquire | LoadLoad+LoadStore |
Release | StoreStore+LoadStore |
Acqrel | LoadLoad+StoreStore+LoadStore |
Seqcst | LoadLoad+StoreStore+LoadStore+StoreLoad |
These concepts are covered in more detail in the following sections.
Standalone fences impose memory order constraints between all memory operations before and after the fence. A StoreLoad fence, for example, prevents any store before the fence from being reordered with any load after the fence.
Atomic memory operations have the added benefit of associating fences with specific memory operations, which loosens the restrictions and provides greater opportunities for optimisation.
In C11 and C++11, not all combinations of atomic memory operation and memory order constraint are valid. Acquire is not valid with store operations, for example, and Release is not valid with load operations.
Relaxed provides no guarantees with respect to memory order, so it should only be used where there are no data dependencies. Sequence numbers and reference counts are most commonly associated with Relaxed semantics.
Acquire prevents LoadLoad and LoadStore reorderings, so that loads before the fence cannot be reordered with loads and stores after the fence. Acquire is typically associated with an atomic load before the fence:
Release prevents StoreStore and LoadStore reorderings, so that stores after the fence cannot be reordered with loads and stores before the fence. Release is typically associated with an atomic store after the fence:
Seqcst prevents all possible reorderings including StoreLoad, which establishes a consistent order of memory operations across all threads. Consider the following example of a ReleaseStore sequence followed by a LoadAcquire:
Notice how sequential consistency can be broken by reordering the store and load operations. The insertion of a StoreLoad fence prevents this:
The equivalent code using C++ atomics follows:
Dekker's Algorithm, described later in this document, depends on stores happening before loads, and so it is often cited as an example use-case for Seqcst.
Unlike Acquire, Release and Seqcst, which are often used with standalone fences, Acqrel is almost exclusively used with atomic Read Modify Write (RMW) operations, such add FetchAdd.
One or more of these reorderings may not be possible in practice on memory architectures with stronger memory models. Intel's "64 Architecture Memory Ordering" white paper, for example, specifies the first four memory ordering principles as follows:
The first three disallow LoadLoad, StoreStore and LoadStore reordering, while the fourth allows StoreLoad reordering. It is important to recognise, however, that without fences, compilers are free to apply any of the four reorderings.
The following table shows instruction sequences typically used on Intel for various memory-model combinations:
Relaxed | Acquire | Release | Acqrel | Seqcst | |
---|---|---|---|---|---|
Load | mov | mov | N/A | N/A | mov |
Store | mov | N/A | mov | N/A | mov; mfence |
FetchAdd | lock add | lock add | lock add | lock add | lock add |
Note that:
mfence
instruction is required to prevent StoreLoad reordering;mfence
instruction is 100 cycles;lock
prefix is effectively a full barrier.Some Memory Architectures, such as IA64, have load and store forms that directly model Acquire and Release semantics.
Simple compiler fences can be expressed using an empty assembly directive:
Java's Unsafe
class provides access to the following fences:
The Java Memory Model (JMM) defines a volatile load as an Acquire operation:
A volatile store is a Release operation followed by a StoreLoad fence:
The StoreLoad fence is required to establish sequential consistency, which ensures that a volatile store cannot be reordered with a volatile load that follows it in program order:
Note that the StoreLoad fence could have instead been issued before each volatile read, but this is likely to be less efficient in practice, because volatile variables are typically read more than they are written.
The following Unsafe
methods provide Release semantics without the overhead of a StoreLoad fence:
Dekker's Algorithm is best understood by the interleavings shown below:
Uncontended Case #1
thread x | thread y |
---|---|
x_wants = 1 | |
y_wants == 0 | |
Acquired | |
y_wants = 1 | |
x_wants == 1 |
Uncontended Case #2
thread x | thread y |
---|---|
y_wants = 1 | |
x_wants == 0 | |
Acquired | |
x_wants = 1 | |
y_wants == 1 |
Contended Case #1
thread x | thread y |
---|---|
x_wants = 1 | |
y_wants = 1 | |
y_wants == 1 | |
x_wants == 1 |
Contended Case #2
thread x | thread y |
---|---|
y_wants = 1 | |
x_wants = 1 | |
x_wants == 1 | |
y_wants == 1 |
Contended Case #3
thread x | thread y |
---|---|
x_wants = 1 | |
y_wants = 1 | |
x_wants == 1 | |
y_wants == 1 |
Contended Case #4
thread x | thread y |
---|---|
y_wants = 1 | |
x_wants = 1 | |
y_wants == 1 | |
x_wants == 1 |