On UTF-8, Latin 1 and charsets

Yesterday, I blogged about my experiments trying to determine the feasibility of replacing the default Latin 1 codec in QString with UTF-8. In fact, the text I had for the blog yesterday was much longer, so I concentrated on the actual code and performance and left the background, rationale and details for today.

Let me quote myself from the introduction yesterday:

But I was left wondering: this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you’re just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the “micro” sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). I specifically added the Euro to the list because, unlike the others, it’s not part of Latin 1, so you need to use another encoding to represent it. Besides, this is 2011, the de-facto encoding for text interchange is UTF-8.

Background: the charsets mandated by the C++ standard

The C and C++ standards talk about two charsets: the source input charset and the execution charset. The GCC manual argues that there are actually four: they add the wide-character charset (which is nowadays always UTF-16 or UCS-4, depending on how wide your wide char is) and the charset that the compiler uses internally. For my purposes here, let's stick to the first two.

The source input charset is the one your source file is encoded in. In the early days of C, when charsets were very different from one another, like EBCDIC, it was very important to get this right, or the compiler wouldn't understand which bytes represented even a space or a newline. Today, one could write a compiler that assumed the input charset is ASCII and still get away with it. The input charset is used by the compiler when it loads your file into memory and translates it into a form that it can parse.

The execution charset is the one that your strings are encoded in when the compiler writes the object files. That is, if you write a word imported into English like "Résumé", the compiler needs to find a way to encode those "é". Note that the compiler has loaded the source file into memory and converted it into some internal format before compiling, so we are assuming here that the compiler has understood that those are LATIN SMALL LETTER E WITH ACUTE. How those "é" were encoded on disk has nothing to do with how this blog is encoded.

The GCC manual says that the default for the input charset is the locale's charset, while the default for the execution charset is UTF-8. That's not exactly true: unless you specify otherwise, GCC will output exactly the same bytes as it found in the input. You can verify this easily by trying to compile a Latin 1-encoded file while in a UTF-8 locale. As expected, it works. I guess that changing that would break too many programs, so the GCC developers didn't do it. But if you add either of the -finput-charset= or -fexec-charset= options, even set to the supposedly default values, GCC will bail out if it finds something improperly encoded.
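If you want to see which execution charset your compiler actually used, a quick check (my own sketch, not from yesterday's code) is to dump the bytes it stored for a literal containing a non-ASCII character:

    // Minimal check of the execution charset: print the bytes the compiler
    // stored for a literal containing "é". 0xC3 0xA9 means UTF-8; a single
    // 0xE9 means Latin 1.
    #include <cstdio>

    int main()
    {
        const char s[] = "é";
        for (const char *p = s; *p; ++p)
            std::printf("0x%02X ", static_cast<unsigned char>(*p));
        std::printf("\n");
    }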

About a week or two ago, we had this discussion in the #qt IRC channel on Freenode. This one developer wanted to know why QString used Latin 1 instead of the execution charset to decode the string literals. He also wanted to know why he couldn't simply write "\u00fc" to mean the "ü" letter. Well, the answer is actually simple and two-fold:

  1. Qt and QString don't know what execution charset you chose when you compiled your source code
  2. The execution charset isn't constant: one object file can have a different charset from another object or library

QString today

If you look at how QString really works, you'll see that it has some support for a changeable execution charset. When I say that it defaults to Latin 1, I am implying that it can be changed. In the QString documentation, any function that takes a const char * refers to the QString::fromAscii() function. That name is actually a misnomer: the function doesn't necessarily convert from ASCII -- in fact, the documentation says "Depending on the codec, it may not accept valid US-ASCII (ANSI X3.4-1986) input."

The function is called fromAscii because most source code today is written in ASCII. This function was actually introduced in Qt 3 (see the docs) and that was released in 2002. Back then, UTF-8 wasn't as widespread as it is today -- I remember switching to UTF-8 on my Linux desktop only in 2003. That meant that any file with non-ASCII bytes had a high chance of being misinterpreted when sent to someone across the world, but a low chance if you sent it to a colleague in the same country.

So small teams developing applications sometimes wanted to use those non-ASCII characters that I listed in the introduction: the degree symbol, the copyright symbol, etc. And to accommodate them, QString allows you to change the codec that it uses to decode the string literals.

In other words, the QTextCodec::setCodecForCStrings function allows you to tell Qt what your execution charset is (problem #1 above). There's however nothing to help you with problem #2, so libraries have to stick to telling Qt in each function call what codec their strings are in.
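For the record, here's a minimal sketch of what that looks like in an application (assuming Qt 4 and source files saved as UTF-8; a library should never do this, since the setting is global):

    #include <QString>
    #include <QTextCodec>
    #include <QDebug>

    int main()
    {
        // By default, the bytes below would be decoded as Latin 1 and come out
        // as mojibake. Telling Qt the execution charset fixes that -- for the
        // whole application, which is why libraries can't rely on it.
        QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

        QString s = QString::fromAscii("R\xc3\xa9sum\xc3\xa9"); // UTF-8 bytes for "Résumé"
        qDebug() << s;   // prints "Résumé"
    }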

Enter C++0x with a (partial) solution: Unicode literals

The next standard of the C++ language, still dubbed C++0x even though we're already in 2011, contains a new way of writing strings that ensures they are always encoded in one of the UTF charsets: the new string literals. So you can write code like:

    u8"I'm a UTF-8 string."
u"This is a UTF-16 string."
U"This is a UTF-32 string."

And on the receiving side, QString will know that the encoding is UTF-8, UTF-16 and UTF-32 respectively, without a doubt. I mean, almost: the UTF-8 encoded string results in a const char[], which is no different from the existing string literals, so QString cannot tell one apart from the other. But the other two generate new types, respectively const char16_t[] and const char32_t[], which we can use in overloads to decode the string perfectly.
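As a sketch of what I mean (my own code, not existing Qt API), an overload taking char16_t can convert without any guessing, because QString's internal storage is already UTF-16:

    #include <QString>
    #include <cstddef>

    // Hypothetical overload: decode a UTF-16 string literal. QChar is a
    // 16-bit type, so this is little more than a copy of the code units.
    QString fromUtf16Literal(const char16_t *str, std::size_t len)
    {
        return QString(reinterpret_cast<const QChar *>(str), int(len));
    }

    // Usage (the length would normally be deduced, e.g. by a template
    // taking const char16_t (&)[N]):
    //     QString s = fromUtf16Literal(u"R\u00e9sum\u00e9", 6);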

So the developer from IRC could write u"\u00fc" without fear and be assured that QString would decode it as LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).

My criticism of the C++0x committee is that they solved the problem only partially. I want to write u"Résumé" and send my file to a colleague using a different platform (like Windows). Moreover, I'd like his compiler to interpret my source code exactly as I intended. Of course, that means I'm going to encode my source file as UTF-8, so I'd like every single compiler to use UTF-8 as its source input charset.

The C++0x committee did not mandate that, nor did they include a way for me to mark my source file in such a way. The decoding of the source file really depends on the compiler's settings...

My preferred solution

In the absence of a way to tell the compiler what my source code charset is, I'd settle for an efficient way of creating QStrings. Internally, QStrings store data as UTF-16 and that is not going to change. So we need to get the compiler to convert the source code literal to UTF-16. Using the C++0x new string literals, we can. And since those strings are in read-only memory that can never be unloaded, we can even do:

    QString s = QString::fromRawData(u"Résumé");

Ok, so we can't write "é" because the compiler could misinterpret it, so we might have to settle for:

    QString s = QString::fromRawData(u"Ru00e9sumu00e9");

Which is still a bit too verbose for my taste. Yesterday, in my blog, someone suggested using macros to do the above. But if we use another feature of C++0x, user-defined literals (see also the definition), we could define the following operator:

    QString operator "" q(const char16_t *str, size_t len);

Which would allow me to write:

    QString s = u"Résumé"q;

which looks weird, but is at least very clean. Unfortunately, the latest release of GCC as of today hasn't implemented it yet.
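Still, here is roughly how I'd expect such an operator to be implemented once a compiler supports it (a sketch under the same assumptions as above; note that the final standard reserves suffixes that don't start with an underscore for the implementation, so shipping code would probably have to spell it _q):

    #include <QString>
    #include <cstddef>

    // Sketch of the literal operator: reuse fromRawData (which takes a QChar
    // pointer and a length) so that no copy is made -- the literal lives in
    // read-only memory for the lifetime of the program.
    QString operator "" _q(const char16_t *str, std::size_t len)
    {
        return QString::fromRawData(reinterpret_cast<const QChar *>(str), int(len));
    }

    // Usage:
    //     QString s = u"R\u00e9sum\u00e9"_q;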

Update: A friend reminds me that Herb Sutter has reported in his blog that the March 2011 meeting of the C++ standards committee has approved the Final Draft International Standard for the C++ language. It should be voted on in the summer and become known as C++ 2011.

