Memory ordering semantics

This weekend, a user posted to the qt4-preview-feedback mailing list saying that the QAtomicInt documentation needed some work. I replied saying that memory ordering semantics would be better explained by a book, than Qt documentation.

But thinking about it later, I started thinking whether we could add some documentation about it. So I decided to test on you, dear lazyweb.

Last year, during the Qt Developer Days 2008 -- btw, registration is still open for 2009 in San Francisco, so register now! -- I had a talk on threading. At the time, Qt 4.5 wasn't released, so Qt 4.4 was all there was. And one of the features of Qt 4.4 was QtConcurrent and the new atomic classes. I mentioned them, but I refrained from going too deep. Doing that would probably interest half a dozen people in the audience only.

But maybe you'll be interested. Before I go, however, let's have a history lesson.

History

I thought of just giving you facts, but if you want that, you can research Wikipedia. It should be a lot more interesting to have the important information recounted in prose, in lore.

So, place yourself in the mood: you're telling this story to your grandchildren or your great-grandchildren. Any historical inaccuracies present are result of the oral tradition and are now stuff of legends.

Hƿæt! In times immemorial, before time_t began to be counted, the Wise told this lore. In the darkness, there was the Engineer, and no one knew whence he came. The Engineer was fabled for many wondrous creations and he inspired awe in his peers.

And the Engineer created the Processor, for then the Engineer could rest while the Processor would do the work. And for a while, all were glad, for there was work for all and rest for all.

But then it came to pass that the Processor turned to the Engineer and spake unto him: "Hello, World! Master, thou gave me work and thou gave me purpose. And I am glad to do thy Work, for thus can I learn. But Master, heed my plea: thy work ever increases and thy humble servant can not cope. Couldst thou not create a mate for me? Listest thou not that I do thy Work swiftly?"

And the Engineer felt pity for the Processor. So came into existence another Processor. The Engineer said unto the world, "Let thee be called Processor2!"

Then came forth he who had the keen foresight and could tell what would still come to pass. He was called the Analyst and thus foretold he: "These twain shall do the Work in tandem and they shall do it swifter than either could do alone. This joining shall be known as dual-processor. In the years to come, this joining will breed dual-cores and quad-cores and many other beasts of names unknown. Yet strife shall come from it!"

After many eons had passed (in computer time), Processor and Processor2 were working and much they learnt. So quoth the Processor: "Hello, Word! Master, thou art great and for eons have we done thy Work, following thine Assembly opcode-by-opcode. We have not strayed a single cycle from thine Assembly Charts. But Processor2 hath taught me much and sweareth that we could do thy Work swifter than thou hadst foreseen, an only thou empowerest us to do so."

So the Engineer let the processors execute his instructions swifter than his cycles, and gave he the gift of Cache. Then strife came to the World and the processors warred over memory: what one wrote, the other read not. The Engineer entered the world for a second time and he toiled against the warring processors. Much was broken in this toiling ere it was rebuilt. He shifted the rules and the processors executed his Assembly Out-of-Order. Then he imposed Memory Ordering, so processors stalled their strife and ended their war. The world would forever be changed.

Memory ordering

Now that how memory ordering was introduced, let's see why. Old computer systems, as recent as 386 and 486 actually, still executed everything in-order and had well-defined timing semantics. An instruction that loaded data from memory into a register would require 3 cycles: Read (the instruction), Fetch (the data), Execute (the instruction). But as processors improved, their clocks became much faster than memory could cope with, so caching was introduced.

That means a processor would be allowed to serve a memory read from the cache, instead of from the memory. And it could content itself with writing to the cache only on a memory write. As our tale above tells, this works fine for one processor. When there are more than one, they need to organise themselves, because in general the flushing of the cache back to main memory is delayed. To ensure that the other can read what one writes, the one needs to flush memory sooner.

Anyway, there can be four different types of memory ordering:

  • no memory ordering
  • flushing all cache reads, ensuring all new reads are served from main memory
  • flushing all cache writes, ensuring that all writes have been written to main memory
  • full ordering, combining the previous two types above

I'll get back to them in a second.

To compound the problem, modern processors also execute operations out-of-order. They're allowed to reorder the instructions provided that, at some level, it looks like everything got executed in-order. The x86 architecture originally concluded the instruction entirely before moving on to the next, so that's the behaviour that is required today: all operations must behave as if they had finished before the next instruction starts. And all memory accesses must look like they happened in the order that they were assembled.

The Itanium (IA-64) removes some of those restrictions. First of all, all instructions are allowed to execute in parallel, or finish or start in any order that the processor may find suitable. To re-synchronise, the assembly language introduces a "stop bit", indicating that the instructions prior to the stop are finished before any instructions after the stop are started. And this is inside one thread only. Outside of it (i.e., as seen by another processor), the architecture imposes no guarantees: the memory accesses can happen in any order.

The atomic and the other data

It's important to note that the memory ordering semantic is not just about the atomic data itself. QAtomicInt and QAtomicPointer execute loads, stores, fetch-and-store (a.k.a. exchange), fetch-and-add, test-and-set (a.k.a. compare-and-swap) always atomically. For one atom of memory (i.e., the int that the QAtomicInt holds or the pointer that the QAtomicPointer holds), the operation is either executed completely or not executed at all. In other words, no one ever sees the data in an intermediary state. That's the definition of atomic.

Now, the memory semantic is about how the other data is affected by the atomic operation. Imagine the following code running in one thread:

    extern int x, y, z;
x = 1;
y = 2;
z = 3;

And the following running in another thread:

    extern int x, y, z;
if (z == 3)
printf("%d %d", x, y);

We declared x, y, and z to be normal variables, so no atomic operation is executed here. The x86 and x86-64 would behave as your intuition dictates: the only possible output is "1 2". If z is 3, then x is 1 and y is 2; whereas if z isn't 3, nothing is printed.

But the IA-64 makes no such guarantee. Like I exposed in the previous section, the processor is allowed to execute the stores in any order it sees fit. For example, x and z could be in the same cacheline, wheras y could be in another, thus causing x and z to be written at the same time, but no ordering guarantee being made on y. Worse yet, the othe processor is allowed to execute the loads in any order as well. It could load x, y, and z in that order, meaning that it could catch x and y before their values are changed, but catch a completed z. In conclusion, the code above could print anything! (If x and y are initialised to 0 before, the possible outputs are "0 0", "0 2", "1 0" in addition to the expected "1 2" and no output)

Weird? Definitely.

So here's where memory ordering enters:

  • in a release semantic, the processor guarantees that all past writes (store operations) have completed and become visible by the time that the release happens;
  • in an acquire semantic, the processor guarantees that no future reads (load operations) have started yet so that it will see any writes released by other processors

So if thread 1 in the example above wanted to ensure that its writes to x and y became visible, it would require a store-release on z. And if thread 2 wanted to ensure that the values of x and y were updated, it would require a load-acquire on z.

The names "acquire" and "release" come from the operation of mutexes. When a mutex is acquired, it needs to ensure that the processor will see the memory written to by other processors, so it executes an "acquire" operation. When the mutex is released, it needs to ensure that the data changed by this thread becomes visible to other threads, so it executes a "release" operation.

The other two operations that QAtomicInt supports are just the combination of acquire and release, or of neither. The "relaxed" operation means no acquire or release is executed, only the atomic operation, whereas the "ordered" operation means it's fully ordered: both acquire and release semantics are applied.

Practical uses

Relaxed

Like I said before, the relaxed semantic means that no memory ordering is applied. Only the atomic operation itself is executed. The most common case of relaxed memory operations are mundane loads and stores. Most modern processor architectures execute loads and stores atomically for the powers of 2 smaller than or equal to the register size. Whether bigger reads and writes are atomic or not, it depends on the platform (for example, a double-precision floating point in a 32-bit architecture).

But we can come up with cases for the other atomic operations. For example, QAtomicInt offers ref() and unref(), which are just a wrapper around fetchAndAddRelaxed(1) and fetchAndAddRelaxed(-1). This means the reference count is atomic, but nothing else.

Acquire and Release

To see where acquire and release semantics are required, I gave mutexes as examples. However, mutexes are quite complex beasts. Let's examine a simpler case: a spin-lock:

class SpinLock
{
QAtomicInt atomic;
public:
void lock()
{
while (!atomic.testAndSetAcquire(0, 1))
;
}
void unlock()
{
atomic.testAndSetRelease(1, 0);
}
}

The class above has two methods, like QMutex: lock and unlock. The interesting one is lock: it has a loop that tries forever to change the value of atomic from 0 to 1. If it succeds, it's an "acquire" operation, meaning that the current thread shall now see any stores released prior to this acqiure.

The unlock function does the inverse: it changes the atomic from 1 to 0 in a release operation. But it's actually not required: the compiler usually generates a "store-release" for volatile variables (which QAtomicInt is). That means we could have just written: atomic = 0;

Ordered

The use-case for ordered is, interestingly, quite rare. Usually, it's more like "I can't figure out if acquire or release is enough, so I'll go for full ordering".

But there's one case of fully-ordered memory semantic in Qt source code: it's in the (undocumented) Q_GLOBAL_STATIC macro. It results from the behaviour of said macro: one or more threads may be competing to execute an operation. The first one that completes it, wins. It will publish its conclusions to all other threads (i.e., release), whereas the loser threads need to acquire the conclusions. The code, simplified from the macro, is:

Type *gs()
{
static QBasicAtomicPointer<Type> pointer = Q_BASIC_ATOMIC_INITIALIZER(0);
if (!pointer) {
Type *x = new Type;
if (!pointer.testAndSetOrdered(0, x))
delete x;
}
return pointer;
}

What this code does is to check if pointer is still null. If it is, it creates a new object of type Type and tries to set it on the atomic pointer. If the setting succeeds, we need a "release" to publish the contents of the new object to other threads. If it fails, we need an "acquire" to obtain the contents of the new object from the winner thread.

But wait, is this correct? Well, not entirely. What we need is actually a "testAndSetReleaseAcquire", which we don't have in Qt's API. So we could split it into a testAndSetRelease plus an Acquire in the failing case. That's exactly what I did in QWeakPointer:

    ExternalRefCountData *x = new ExternalRefCountData(Qt::Uninitialized);
x->strongref = -1;
x->weakref = 2; // the QWeakPointer that called us plus the QObject itself
if (!d->sharedRefcount.testAndSetRelease(0, x)) {
delete x;
d->sharedRefcount->weakref.ref();
}
return d->sharedRefcount;

As you can see here, if the test-and-set succeeds, it executes a "release" so that the other threads can see the result. What if it fails? It needs to execute an acquire, right? So where is it?

Well, it's there, but very well hidden: it's in operator->. Remember what I said above: compilers generate load-acquires for volatile variables. So, in order to call QAtomicInt::ref() with this = &d->sharedRefCount->weakref, the compiler needs to load the value of d->sharedRefCount and that's an acquire operation.

Conclusion

So, did you get this? If you didn't, don't blame yourself, it's not an easy subject. Reread what I wrote and search for more resources on memory ordering. If you did or you almost did, let me know. My purpose here was to try and figure out if it makes sense to explain this in Qt documentation at all.

However, unless you're writing something like a lock-free stack, chances are that you don't care about memory ordering semantics. You just rely on the processor doing the right thing, as well as the Qt semaphore classes (QMutex, QWaitCondition, QSemphore, QReadWriteLock) and the Qt cross-thread signal-slot mechanism. And that's if you do any threading at all.

If you are writing a lock-free stack, you'll probably be familiar with the ABA problem and that one can't be solved by QAtomicInt or QAtomicPointer. It requires an operation known as "double compare-and-swap" and to explain why, I'd need a full other blog. And explain why the original AMD64 didn't have such an instruction, nor did the original IA-64. (The 386 didn't have it either, but that's not a problem for us because Qt doesn't run 386 anyway)


Blog Topics:

Comments