Some thoughts on binary compatibility

For the past few months I have been quite quiet in the blogosphere. I have been collecting ideas for a two- or three-part blog series on how Qt rules, which I am still going to write. Until that comes, I decided to dump some thoughts on binary compatibility.

Recently I updated the KDE Techbase article on Binary compatibility with C++ (btw, that's the 3rd page from the top in the Google search for "binary compatibility"). I tried to explain a bit better what the dos and don'ts (mostly the don'ts) are. After I wrote the part about overriding a virtual from a non-primary base, someone on IRC asked me to write some examples.

To write those examples, I had to brush up on name mangling, virtual table layout, etc., and I even had to try to learn the Microsoft Visual Studio ABI. It took me a while, but I did find an article with some information on that (link is in the Techbase page's introduction). I'm also glad I took the time to brush up on my skills, since I found another example of things not to do (the "virtual override with covariant return of different top address" case).

History

Let's start with a bit of history: whereas Unix has always been closely tied to the C language, the DOS market initially had no such relationship. Sure, applications were developed in C even in the early 80s, but the point is that DOS didn't provide a "C library". No, to access DOS services, you'd move some values into registers and cause an interrupt (the Int 21). Implementors of C compilers had to provide their own C library.

Also remember that these were the days before DLLs and shared libraries, so there was no binary compatibility to maintain. The conclusion is that each compiler decided for itself how to implement the calling sequence and the ABI: that is, what the responsibilities of the caller and the callee are, like which processor registers (if any) are used for parameter passing, which ones may be used for scratch values, which ones must be preserved, who cleans up the stack, the size of certain types, the alignment, padding, etc.

And, as you can expect, each compiler implementation did that differently.

In the Unix world, things were a bit more standardised, since a C library had existed for a long while and usually there is a reference compiler for the operating system. In order to use that C library -- and you really want to -- any other compiler must implement the same ABI.

But even then things become exciting when we talk about C++. If on one hand the C calling convention is pretty well standardised on Unix systems, it's not so for C++. C is a very low-level language, to the point that you can almost see the assembly code behind C if you stare long enough at the screen (in my experience, however, when that happens, you're just seeing things and should instead go home and have some rest). C++ introduces several concepts on top of C, like overloads, virtual calls, multiple inheritance, virtual inheritance, polymorphism, covariant returns, templates, references, etc. That means more things for the compilers to differ on.

Now, an interesting thing happened about the year 2000: the Itanium processor. Not because of the processor itself, but for what documents came out of it. It wasn't enough to know the instruction set for the architecture (see the Software Developer's Manual), developers needed more and Intel obliged (apparently they had a lot of time on their hands):

GCC clearly adopted this ABI on Itanium, but since the code was there and it was superior to what GCC had, GCC applied it to other platforms as well. So it's interesting today to see this ABI used in systems that are neither Itanium nor Unix, like Symbian running on ARM devices.

What the ABI needs

It's quite clear that the ABI needs to accommodate any valid C++ program. That is, it should support all features of the language. Starting with the simplest innovation that C++ has on top of C, we can see how things become interesting.

In C, a function is uniquely identified by its name. There can be no other function with the same name with global scope. C++, on the other hand, has overloads: functions with the same name differing from each other only by the argument types they are called with. By that, we come to the conclusion that any ABI must encode the different functions with different names. It has to encode all the differences that are permissible by the C++ language, but it may also choose to encode more information which helps in outputting error messages.
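To make that concrete, here is a minimal sketch of three overloads together with the symbol names that a compiler following the Itanium C++ ABI (GCC or Clang, for instance) would give them. The function name describe, the Kind enum and check_overloads are made up for illustration. Note that the return type of an ordinary function is not part of its mangled name, which is why it doesn't show up in the symbols.

```cpp
#include <cassert>

// Hypothetical names, used only to illustrate overloading.
enum class Kind { Int, Double, CString };

// Three overloads sharing one source-level name. The linker needs three
// distinct symbols, so the compiler encodes the parameter types in each.
// The names in the comments follow the Itanium C++ ABI mangling rules;
// MSVC would generate entirely different ones.
Kind describe(int)         { return Kind::Int; }      // _Z8describei
Kind describe(double)      { return Kind::Double; }   // _Z8described
Kind describe(const char*) { return Kind::CString; }  // _Z8describePKc

// Overload resolution happens at compile time: each call below ends up
// referencing a different symbol in the object file.
void check_overloads() {
    assert(describe(42) == Kind::Int);
    assert(describe(1.5) == Kind::Double);
    assert(describe("text") == Kind::CString);
}
```

If you compile this with GCC or Clang and run nm on the object file, you should find those three names among the symbols.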

Then there are virtual calls. When making a virtual call with a given C++ class, the compiler must somehow generate code that can call any reimplemented virtual, without knowing a priori what those reimplementations are. The only way it can do that is if, somewhere in the class, there's information about where the virtual call is supposed to go. Most (all?) compilers simply add a pointer somewhere in the object, pointing to the "virtual table": that is, a list of function pointers for each virtual call. Each C++ class with virtual functions has a virtual table, listing the virtuals of that class (the ones it inherited and the ones it overrode).
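The mechanism can be imitated by hand in plain structs and function pointers. What follows is only a sketch of the idea, not what any particular compiler emits: all the names (Shape, ShapeVTable, call_sides and so on) are hypothetical, and a real virtual table holds more than function pointers.

```cpp
#include <cassert>

// Hand-written stand-in for what a compiler generates for
//   struct Shape { virtual int sides() const; virtual int id() const; };
// All names here are hypothetical.
struct Shape;

struct ShapeVTable {
    int (*sides)(const Shape*);  // slot 0
    int (*id)(const Shape*);     // slot 1
};

struct Shape {
    const ShapeVTable* vptr;  // the hidden pointer the compiler would add
};

int shape_sides(const Shape*)    { return 0; }
int shape_id(const Shape*)       { return 1; }
int triangle_sides(const Shape*) { return 3; }

// One table per class: "Triangle" overrides sides() and inherits id().
const ShapeVTable shape_vtable    = { shape_sides, shape_id };
const ShapeVTable triangle_vtable = { triangle_sides, shape_id };

// A "virtual call": load the table from the object, then call through the
// slot. The caller never needs to know which class the object really is.
int call_sides(const Shape* s) { return s->vptr->sides(s); }
int call_id(const Shape* s)    { return s->vptr->id(s); }

void check_dispatch() {
    Shape plain    = { &shape_vtable };
    Shape triangle = { &triangle_vtable };
    assert(call_sides(&plain) == 0);
    assert(call_sides(&triangle) == 3);  // overridden slot
    assert(call_id(&triangle) == 1);     // inherited slot
}
```

The order of the slots is part of the contract between caller and callee, since the caller only knows "call slot 0". That is why anything that reorders or inserts virtuals matters for binary compatibility.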

But the virtual tables usually contain more information than just function pointers, like the typeinfo of a C++ class and usually the offsets of virtual bases into the object. The case of a virtual base is illustrated by the typical case of diamond-shaped multiple inheritance: a base "Base", two classes "A" and "B" virtually-deriving from "Base" and a final class "X" deriving from "A" and "B". When taken independently, A and B are similar to each other and the "Base" contents are allocated somewhere inside the "A" structure. However, inside "X", things change, since it must allocate one copy of "A", one copy of "B" and only one copy of "Base".

The compiler must therefore encode somewhere where it placed the "Base" sub-object. One way is to simply have a pointer, as a member of both "A" and "B". Another is to put the offset from the beginning of "A" and "B" in the virtual table -- you save a couple of bytes in each object.
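Here is the diamond written out, as a small sketch. The member names and the helper functions are my own additions for illustration. The helper base_offset measures how far the "Base" sub-object sits from the start of an "A" sub-object. I would expect it to give different results for a standalone "A" and for the "A" inside an "X", but the actual numbers depend on the compiler, so the code doesn't assume any.

```cpp
#include <cassert>
#include <cstddef>

struct Base { int value = 0; };
struct A : virtual Base { int a = 1; };
struct B : virtual Base { int b = 2; };
struct X : A, B { int x = 3; };

// Both paths up the diamond must arrive at the same Base sub-object.
bool single_shared_base(X& obj) {
    A* asA = &obj;
    B* asB = &obj;
    Base* viaA = asA;  // the compiler looks up where Base lives at run time
    Base* viaB = asB;
    return viaA == viaB;
}

// Distance in bytes from the start of an A sub-object to its Base.
// The value is implementation-specific.
std::ptrdiff_t base_offset(const A& a) {
    const Base* base = &a;
    return reinterpret_cast<const char*>(base) -
           reinterpret_cast<const char*>(&a);
}

void check_diamond() {
    X obj;
    assert(single_shared_base(obj));
    // A write through the A side is visible through the B side,
    // because there is only one copy of Base.
    static_cast<A&>(obj).value = 7;
    assert(static_cast<B&>(obj).value == 7);
}
```

Code that only sees an "A &" cannot tell which of the two layouts it was handed, which is exactly why the location of "Base" has to be stored somewhere it can look up.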

If you combine those three concepts (naming of all overloads possible, virtual calls and virtual inheritance), you cover 99% of the needs of the ABI for a typical C++ program.

Today

For our purposes with Qt, we can classify the C++ ABIs in three categories: systems using the Itanium C++ ABI, the Microsoft C++ ABI and "other". That last category groups all the other compilers, like the Sun Studio compiler for Solaris, IBM's Visual Age compiler for AIX and HP's aCC compiler for HP-UX on PA-RISC (note that HP-UXi runs on the Itanium, so aCC uses the Itanium C++ ABI on that platform). We don't actively test Qt's binary compatibility for issues specific to those three compilers for the simple reason that we have no clue what those specific issues are. I don't know of any documents describing the C++ ABI they implement -- and I really don't want to study them, given the value we'd get. After all, most users of those platforms are usually compiling Qt from source anyway.

The Itanium C++ ABI is a modern concept, created after C++ had been standardised and its features well-known. It was created by people who were trying to solve a problem: how to make all of C++ possible, without overdoing it? They came up with an ABI that is quite elegant: classes with virtuals get a hidden pointer to the class's virtual table as their first member, and the table itself gets emitted alongside the first non-inline virtual member function. The virtual table contains, at positive offsets, the function pointers of the virtual member functions, while at negative offsets it has the typeinfo and the offsets required to implement multiple inheritance.
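The hidden pointer can be observed indirectly. The sketch below compares a plain class with a polymorphic one (Plain, Poly and the helper functions are hypothetical names) and checks where the first declared member ends up. The check that "x" starts exactly one pointer past the beginning of the object is what I'd expect on a compiler following the Itanium C++ ABI. It is an assumption about the layout, not something the C++ standard promises.

```cpp
#include <cassert>
#include <cstddef>

struct Plain { int x; int get() const { return x; } };
struct Poly  { int x; virtual int get() const { return x; } };

// The hidden pointer makes the polymorphic class bigger ...
static_assert(sizeof(Poly) > sizeof(Plain),
              "expected room for a virtual table pointer");

// ... and pushes the first declared member away from the start of the object.
std::ptrdiff_t offset_of_x(const Poly& p) {
    return reinterpret_cast<const char*>(&p.x) -
           reinterpret_cast<const char*>(&p);
}

void check_layout() {
    Poly p;
    p.x = 5;
    // Assumption: the virtual table pointer occupies the first pointer-sized
    // slot of the object, so x starts right after it.
    assert(offset_of_x(p) == static_cast<std::ptrdiff_t>(sizeof(void*)));
    assert(p.get() == 5);
}
```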

Even the name mangling is quite readable, for simple types. The ground rule is that it should be something that C shouldn't use, to avoid collision: they chose the "_Z" prefix, since underscore + capital is reserved to the compiler. For example, take _ZN7QString7replaceEiiPK5QChari. If we break it down, we end up with:

_Z N 7QString 7replace E i i PK5QChar i

We read that as:

  • _Z: C++ symbol prefix
  • N...E: composed name:
    • 7QString: name of length 7 "QString"
    • 7replace: name of length 7 "replace"

    That means "QString::replace"

  • i: int
  • P: pointer
  • K5QChar: const name of length 5 "QChar" (i.e., const QChar)

Put everything together and we have "QString::replace(int, int, const QChar *, int)"
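You don't have to decode these by hand. GCC and Clang ship a demangler in their runtime, reachable through abi::__cxa_demangle from the <cxxabi.h> header. The wrapper below is a minimal sketch (demangle and check_demangle are my own names). The exact spelling of the output is up to the demangler, and I'd expect it to write the pointer argument as "QChar const*" rather than "const QChar *".

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <cxxabi.h>  // abi::__cxa_demangle (GCC and Clang runtimes)

// Returns the demangled form of an Itanium-ABI symbol, or an empty string
// if the demangler rejects the input.
std::string demangle(const char* symbol) {
    int status = 0;
    // Passing null for the buffer and length asks the function to allocate.
    char* text = abi::__cxa_demangle(symbol, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr)
        return std::string();
    std::string result(text);
    std::free(text);  // the returned buffer comes from malloc
    return result;
}

void check_demangle() {
    const std::string name = demangle("_ZN7QString7replaceEiiPK5QChari");
    // Only the stable prefix is checked; argument formatting may vary.
    assert(name.find("QString::replace(") == 0);
}
```

On the command line, the c++filt tool does the same job.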

On the other end of the spectrum, the Microsoft compilers chose to encode the function names with every single detail possible, like for example whether a member function is public, protected or private. Moreover, for some obscure reason that probably doesn't make sense anymore, Microsoft mangling is also case-insensitive. That is, if someone flipped a switch tomorrow and -- gasp! -- made C++ case-insensitive, the mangling scheme that they use would work. (GCC of course would be completely lost in a case-insensitive C++ world)

That's quite clearly a legacy from old DOS days. That also shows when you notice that the mangling scheme encodes the pointer size (i.e., near or far), as well as whether the function call -- or, more to the point, the return -- is near or far. Those things are definitely not used today, but the ABI can still encode that.

The same function above gets encoded in MSVC as:

?replace@QString@@QAEAAV0@HHPBVQChar@@H@Z

Which we decode as:

  • ?: C++ symbol prefix
  • replace: rightmost (innermost) name
  • @: separator
  • QString: enclosing class
  • @@: terminates the function name.
    The names are in the reverse order, so we have "QString::replace"
  • Q: public near (i.e., not virtual and not static)
  • A: no cv-qualifiers for the function (i.e., not const or volatile)
  • E: __thiscall (i.e., call of member functions)
  • AA: the first "A" stands for reference (possibly near reference), the second "A" indicates it's an unmodified reference (i.e., not "const X &")
  • V...@: class and delimiter
    • 0: indicates the first class name seen before, i.e., QString
  • H: int
  • H: int
  • P: normal pointer (i.e., not const pointer)
  • B: const type -- PB together makes "const X *", whereas "X * const" would be QA
  • V..@: class and delimiter
    • VQChar@: class QChar, plus delimiter
  • H: int
  • @: end of argument list
  • Z: function, or code/text storage class

That reads: "public: class QString & near __thiscall QString::replace(int, int, const class QChar *, int)". Things to note about this:

  1. The use of ? as prefix, instead of something you can normally type in C
  2. The same letter can mean different things depending on the position
  3. Types are assigned alphabetically from a list (signed char is C, char is D, unsigned char is E, short is F, unsigned short is G, int is H, unsigned int is I, etc.) instead of trying to resemble the type.
  4. "class" is encoded explicitly (V), whereas struct is "U", union is "T" and enum is "W4" (at least, int-backed enums)
  5. encoding of calling sequence (__thiscall) and displacement (near)
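The alphabetical assignment of type codes from point 3 is easy to picture as a lookup table. This is a sketch covering only the codes listed above, with a function name I made up (msvc_basic_type). It is not a complete decoder for the Microsoft scheme, where the same letter can mean something else in a different position.

```cpp
#include <cassert>
#include <cstring>

// Maps a Microsoft basic-type code to the C++ type it stands for.
// Only the codes listed in the text are covered; anything else yields
// nullptr. The function name is hypothetical.
const char* msvc_basic_type(char code) {
    switch (code) {
    case 'C': return "signed char";
    case 'D': return "char";
    case 'E': return "unsigned char";
    case 'F': return "short";
    case 'G': return "unsigned short";
    case 'H': return "int";
    case 'I': return "unsigned int";
    default:  return nullptr;  // not covered by this sketch
    }
}

void check_codes() {
    // The three H's in ?replace@QString@@QAEAAV0@HHPBVQChar@@H@Z
    // are the three int parameters.
    assert(std::strcmp(msvc_basic_type('H'), "int") == 0);
    assert(std::strcmp(msvc_basic_type('D'), "char") == 0);
    assert(msvc_basic_type('?') == nullptr);
}
```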

On one hand, the Microsoft mangling scheme makes it possible to produce much more detailed error messages, and it ensures that functions differing in a type or in the calling sequence never resolve to the same symbol. On the other hand, it also encodes details that make no difference at all to the call, like the difference between "class" and "struct", or whether the member function is private, protected or public.

