Memory ordering semantics

Published Tuesday October 6th, 2009
14 Comments on Memory ordering semantics
Posted in C++, Performance, Qt

This weekend, a user posted to the qt4-preview-feedback mailing list saying that the QAtomicInt documentation needed some work. I replied saying that memory ordering semantics would be better explained by a book than by Qt documentation.

But thinking about it later, I started wondering whether we could add some documentation about it. So I decided to test it on you, dear lazyweb.

Last year, during the Qt Developer Days 2008 — btw, registration is still open for 2009 in San Francisco, so register now! — I gave a talk on threading. At the time, Qt 4.5 wasn’t released, so Qt 4.4 was all there was. And among the features of Qt 4.4 were QtConcurrent and the new atomic classes. I mentioned them, but I refrained from going too deep: doing that would probably interest only half a dozen people in the audience.

But maybe you’ll be interested. Before I get to that, however, let’s have a history lesson.

History

I thought of just giving you facts, but if you want that, you can research Wikipedia. It should be a lot more interesting to have the important information recounted in prose, in lore.

So, place yourself in the mood: you’re telling this story to your grandchildren or your great-grandchildren. Any historical inaccuracies present are the result of the oral tradition and are now the stuff of legend.

Hƿæt! In times immemorial, before time_t began to be counted, the Wise told this lore. In the darkness, there was the Engineer, and no one knew whence he came. The Engineer was fabled for many wondrous creations and he inspired awe in his peers.

And the Engineer created the Processor, for then the Engineer could rest while the Processor would do the work. And for a while, all were glad, for there was work for all and rest for all.

But then it came to pass that the Processor turned to the Engineer and spake unto him: “Hello, World! Master, thou gave me work and thou gave me purpose. And I am glad to do thy Work, for thus can I learn. But Master, heed my plea: thy work ever increases and thy humble servant can not cope. Couldst thou not create a mate for me? Listest thou not that I do thy Work swiftly?”

And the Engineer felt pity for the Processor. So came into existence another Processor. The Engineer said unto the world, “Let thee be called Processor2!”

Then came forth he who had the keen foresight and could tell what would still come to pass. He was called the Analyst and thus foretold he: “These twain shall do the Work in tandem and they shall do it swifter than either could do alone. This joining shall be known as dual-processor. In the years to come, this joining will breed dual-cores and quad-cores and many other beasts of names unknown. Yet strife shall come from it!”

After many eons had passed (in computer time), Processor and Processor2 were working and much they learnt. So quoth the Processor: “Hello, World! Master, thou art great and for eons have we done thy Work, following thine Assembly opcode-by-opcode. We have not strayed a single cycle from thine Assembly Charts. But Processor2 hath taught me much and sweareth that we could do thy Work swifter than thou hadst foreseen, an only thou empowerest us to do so.”

So the Engineer let the processors execute his instructions swifter than his cycles, and gave he the gift of Cache. Then strife came to the World and the processors warred over memory: what one wrote, the other read not. The Engineer entered the world for a second time and he toiled against the warring processors. Much was broken in this toiling ere it was rebuilt. He shifted the rules and the processors executed his Assembly Out-of-Order. Then he imposed Memory Ordering, so processors stalled their strife and ended their war. The world would forever be changed.

Memory ordering

Now that we’ve seen how memory ordering was introduced, let’s see why. Old computer systems, as recent as the 386 and 486 actually, still executed everything in-order and had well-defined timing semantics. An instruction that loaded data from memory into a register would require three cycles: Fetch (the instruction), Read (the data), Execute (the instruction). But as processors improved, their clocks became much faster than memory could cope with, so caching was introduced.

That means a processor would be allowed to serve a memory read from the cache instead of from main memory, and it could content itself with writing to the cache only on a memory write. As our tale above tells, this works fine for one processor. When there is more than one, they need to coordinate, because in general the flushing of the cache back to main memory is delayed. To ensure that one processor can read what another writes, the writer needs to flush its cache sooner.

Anyway, there can be four different types of memory ordering:

  • no memory ordering
  • flushing all cache reads, ensuring all new reads are served from main memory
  • flushing all cache writes, ensuring that all writes have been written to main memory
  • full ordering, combining the previous two types above

I’ll get back to them in a second.

To compound the problem, modern processors also execute operations out-of-order. They’re allowed to reorder the instructions provided that, at some level, it looks like everything got executed in-order. The x86 architecture originally concluded the instruction entirely before moving on to the next, so that’s the behaviour that is required today: all operations must behave as if they had finished before the next instruction starts. And all memory accesses must look like they happened in the order that they were assembled.

The Itanium (IA-64) removes some of those restrictions. First of all, all instructions are allowed to execute in parallel, or finish or start in any order that the processor may find suitable. To re-synchronise, the assembly language introduces a “stop bit”, indicating that the instructions prior to the stop are finished before any instructions after the stop are started. And this is inside one thread only. Outside of it (i.e., as seen by another processor), the architecture imposes no guarantees: the memory accesses can happen in any order.

The atomic and the other data

It’s important to note that the memory ordering semantics are not just about the atomic data itself. QAtomicInt and QAtomicPointer always execute loads, stores, fetch-and-store (a.k.a. exchange), fetch-and-add, and test-and-set (a.k.a. compare-and-swap) atomically. For one atom of memory (i.e., the int that the QAtomicInt holds or the pointer that the QAtomicPointer holds), the operation is either executed completely or not executed at all. In other words, no one ever sees the data in an intermediate state. That’s the definition of atomic.

Now, the memory semantic is about how the other data is affected by the atomic operation. Imagine the following code running in one thread:

    extern int x, y, z;
    x = 1;
    y = 2;
    z = 3;

And the following running in another thread:

    extern int x, y, z;
    if (z == 3)
        printf("%d %d", x, y);

We declared x, y, and z to be normal variables, so no atomic operation is executed here. The x86 and x86-64 would behave as your intuition dictates: the only possible output is “1 2”. If z is 3, then x is 1 and y is 2; whereas if z isn’t 3, nothing is printed.

But the IA-64 makes no such guarantee. As I explained in the previous section, the processor is allowed to execute the stores in any order it sees fit. For example, x and z could be in the same cacheline, whereas y could be in another, thus causing x and z to be written at the same time, with no ordering guarantee made on y. Worse yet, the other processor is allowed to execute the loads in any order as well. It could load x, y, and z in that order, meaning that it could catch x and y before their values are changed, but catch a completed z. In conclusion, the code above could print anything! (If x and y are initialised to 0 beforehand, the possible outputs are “0 0”, “0 2”, and “1 0”, in addition to the expected “1 2” and no output.)

Weird? Definitely.

So here’s where memory ordering enters:

  • in a release semantic, the processor guarantees that all past writes (store operations) have completed and become visible by the time that the release happens;
  • in an acquire semantic, the processor guarantees that no future reads (load operations) have started yet, so that it will see any writes released by other processors.

So if thread 1 in the example above wanted to ensure that its writes to x and y became visible, it would require a store-release on z. And if thread 2 wanted to ensure that the values of x and y were updated, it would require a load-acquire on z.

The names “acquire” and “release” come from the operation of mutexes. When a mutex is acquired, it needs to ensure that the processor will see the memory written to by other processors, so it executes an “acquire” operation. When the mutex is released, it needs to ensure that the data changed by this thread becomes visible to other threads, so it executes a “release” operation.

The other two operations that QAtomicInt supports are just the combination of acquire and release, or of neither. The “relaxed” operation means no acquire or release is executed, only the atomic operation, whereas the “ordered” operation means it’s fully ordered: both acquire and release semantics are applied.

Practical uses

Relaxed

As I said before, the relaxed semantic means that no memory ordering is applied: only the atomic operation itself is executed. The most common case of relaxed memory operations are mundane loads and stores. Most modern processor architectures execute loads and stores atomically for power-of-2 sizes smaller than or equal to the register size. Whether bigger reads and writes are atomic depends on the platform (for example, a double-precision floating point on a 32-bit architecture).

But we can come up with cases for the other atomic operations. For example, QAtomicInt offers ref() and unref(), which are just wrappers around fetchAndAddRelaxed(1) and fetchAndAddRelaxed(-1). This means the reference count is atomic, but nothing else is ordered.

Acquire and Release

To see where acquire and release semantics are required, I gave mutexes as examples. However, mutexes are quite complex beasts. Let’s examine a simpler case: a spin-lock:

class SpinLock
{
    QAtomicInt atomic;
public:
    void lock()
    {
        // acquire: spin until we change atomic from 0 to 1
        while (!atomic.testAndSetAcquire(0, 1))
            ;
    }
    void unlock()
    {
        // release: change atomic back from 1 to 0
        atomic.testAndSetRelease(1, 0);
    }
};

The class above has two methods, like QMutex: lock and unlock. The interesting one is lock: it has a loop that tries forever to change the value of atomic from 0 to 1. If it succeeds, it’s an “acquire” operation, meaning that the current thread shall now see any stores released prior to this acquire.

The unlock function does the inverse: it changes the atomic from 1 to 0 in a release operation. But the test-and-set is actually not required there: the compiler usually generates a “store-release” for volatile variables (which QAtomicInt is built on). That means we could have just written: atomic = 0;

Ordered

The use-case for ordered is, interestingly, quite rare. Usually, it’s more like “I can’t figure out if acquire or release is enough, so I’ll go for full ordering”.

But there’s one case of fully-ordered memory semantic in Qt source code: it’s in the (undocumented) Q_GLOBAL_STATIC macro. It results from the behaviour of said macro: one or more threads may be competing to execute an operation. The first one that completes it, wins. It will publish its conclusions to all other threads (i.e., release), whereas the loser threads need to acquire the conclusions. The code, simplified from the macro, is:

Type *gs()
{
    static QBasicAtomicPointer<Type> pointer = Q_BASIC_ATOMIC_INITIALIZER(0);
    if (!pointer) {
        Type *x = new Type;
        if (!pointer.testAndSetOrdered(0, x))
            delete x;
    }
    return pointer;
}

What this code does is to check if pointer is still null. If it is, it creates a new object of type Type and tries to set it on the atomic pointer. If the setting succeeds, we need a “release” to publish the contents of the new object to other threads. If it fails, we need an “acquire” to obtain the contents of the new object from the winner thread.

But wait, is this correct? Well, not entirely. What we need is actually a “testAndSetReleaseAcquire”, which we don’t have in Qt’s API. So we could split it into a testAndSetRelease plus an acquire operation in the failing case. That’s exactly what I did in QWeakPointer:

    ExternalRefCountData *x = new ExternalRefCountData(Qt::Uninitialized);
    x->strongref = -1;
    x->weakref = 2;  // the QWeakPointer that called us plus the QObject itself
    if (!d->sharedRefcount.testAndSetRelease(0, x)) {
        delete x;
        d->sharedRefcount->weakref.ref();
    }
    return d->sharedRefcount;

As you can see here, if the test-and-set succeeds, it executes a “release” so that the other threads can see the result. What if it fails? It needs to execute an acquire, right? So where is it?

Well, it’s there, but very well hidden: it’s in operator->. Remember what I said above: compilers generate load-acquires for volatile variables. So, in order to call QAtomicInt::ref() with this = &d->sharedRefcount->weakref, the compiler needs to load the value of d->sharedRefcount, and that’s an acquire operation.

Conclusion

So, did you get this? If you didn’t, don’t blame yourself, it’s not an easy subject. Reread what I wrote and search for more resources on memory ordering. If you did or you almost did, let me know. My purpose here was to try and figure out if it makes sense to explain this in Qt documentation at all.

However, unless you’re writing something like a lock-free stack, chances are that you don’t care about memory ordering semantics. You just rely on the processor doing the right thing, as well as on the Qt synchronisation classes (QMutex, QWaitCondition, QSemaphore, QReadWriteLock) and the Qt cross-thread signal-slot mechanism. And that’s if you do any threading at all.

If you are writing a lock-free stack, you’ll probably be familiar with the ABA problem, and that one can’t be solved by QAtomicInt or QAtomicPointer. It requires an operation known as “double compare-and-swap”, and to explain why, I’d need a whole other blog post — as well as to explain why the original AMD64 didn’t have such an instruction, nor did the original IA-64. (The 386 didn’t have it either, but that’s not a problem for us because Qt doesn’t run on the 386 anyway.)



14 comments

Jesus says:

I tend to agree, even though you’ve written a monster blog post (with some excellent poetry!) you’ve just barely scratched the surface of the topic. Like my OS teacher taught us: Making an OS is really simple – until you start implementing concurrency.

The History part was plain awesome 😉
The actual low-level memory ordering isn’t that interesting for me, though. When I do concurrent stuff, it’s usually at such a high level (like executing database operations in parallel, or simply doing a long-running task in the background so that the GUI does not hang) that queued connections are more than enough for me. And they make my life so much easier, not having to worry about mutexes and locks.

I got it… but I can’t say as to whether this kind of thing needs to be in our docs. I really don’t know… when we put these classes into the public API, we kind of assumed that people would know about the concepts before starting to use the classes (but we all know how dangerous assumptions can be).

Also, another thing to keep in mind is that LL/SC architectures in general don’t need a double-CAS to get around the ABA problem. Would be interesting to see if and how we could get something like that into the API.

Paul says:

Having been tripped up by Boost shared_ptr, it’s useful to have explanations of memory ordering issues.

In my case I’d assumed that shared_ptr construction, destruction, copy and assignment were thread safe. The code worked for years until a highly concurrent load put it through its paces and exposed the bugs.

Boost 1.36 added atomic operations to its shared_ptr class.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2632.html

Good to see similar attention to detail in Qt.

Razvan Petru says:

Interesting post Thiago; I appreciate these small technical essays. I think that the current docs are explicit enough.

Paul: isn’t shared_ptr construction and destruction thread safe though? I assume that those operations just change the refcount, which is atomic.

João says:

@Razvan I assume @Paul is referring to concurrent use of a single shared shared_ptr instance, and not concurrent use of multiple shared_ptr instances managing the same pointer… If that makes any sense 🙂

James Mansion says:

I’m not sure about your explanation of acquire and release wrt mutexes.

I think the acquire is needed to ensure that if we read the indication that says we have got the mutex, and then we read data that is associated with the mutex, then that associated data does not get reordered ahead of the mutex being changed. Similarly, when we release a mutex, we need all the writes of associated data to be forced into memory before the state of the mutex itself is toggled.

But it’s confusing: I could be wrong. 🙂

Dave says:

Thiago,

I love your articles, especially the ones with history lessons. They always give me something new to learn, and doubly so in this case. Thanks!

Not sure if it should be included in the Qt docs (probably not), but it makes great reading nonetheless.

Harald Fernengel says:

Thiago,

make it a Qt Quarterly article, then we can reference it from the docs for people who are interested. I never cared about the concepts until I had to hunt a bug, and suddenly it was all important to understand them.

Daniel says:

Q_GLOBAL_STATIC looks very interesting. Am I correct in guessing that it provides lazy, thread-safe, global initialization and destruction? If so, I think this would be a really nice thing to put in the Qt public API.

panzi says:

This is all very general and very basic stuff you learn in the 2nd or 3rd semester when you study computer science. Well, if you study computer science. I think this is only of concern to people that work at such a low level. Such people hopefully have studied this or at least know it through other means, but in any way know it already before they start coding away (you can do a lot wrong here). I think everyone who does not know what “testAndSet” means should not have any use for it. If you need it, you know it. If you don’t know it, you don’t need it. So just mentioning memory ordering, pipelining and maybe concurrency etc. should be enough. In case you want to know more, you’d have to read a book or study computer science. I don’t think any STL implementations documentation explains the algorithm of AVL trees. It just states std::map is an AVL tree (in case it is).

Thiago Macieira says:

Daniel: Q_GLOBAL_STATIC provides lazy, thread-safe, lock-free, global initialisation and destruction, with a kink: contended creation may create the object more than once.

I have another macro called Q_ONCE that would do lazy, thread-safe, global initialisation and destruction, without the kink, but it’s locked.

I think the examples are good to clear up any misunderstandings. Other than that, the current documentation is perfectly readable even for a mathematician like me who has no formal computer science background. It’s not like this is rocket science, even if it is very easy to make a mistake.

daorus says:

Thiago, when do you think QTest will be the parent object for all the Qt objects as QObject is now?

