Improving string performance with more SIMD (or not)

Inspired by some of the work that Benjamin, Samuel and others were doing with SIMD, and taking the opportunity to learn directly from Intel engineers who know this subject better than just about anybody, I decided to get my hands dirty and do some SIMD work too, after work hours. Since I don't understand the first thing about composition modes, blending or rendering, I decided to try my luck on something I do understand in QtCore.

At the beginning of this year, Benjamin wrote an SSE2- and Neon-improved version of the QString conversion to and from Latin 1 (ISO-8859-1). It's in the Qt 4.7 branch (see commit 60fd302e) and we got a 4x improvement in the fromLatin1 conversion, slightly less in the opposite direction, due to the need to check whether the UTF-16 code point fits in Latin 1 in the first place. Last year I had also improved the QString::operator== function, based on some code by Lars.
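The core trick in the fromLatin1 direction is easy to show: since Latin 1 maps one-to-one onto U+0000..U+00FF, each byte only needs to be zero-extended to 16 bits, and SSE2 can do sixteen of them per iteration by interleaving with a zero vector. Something like this sketch (my simplification, not the code from the commit):

```cpp
#include <emmintrin.h>  // SSE2

// Widen Latin-1 bytes to UTF-16, 16 characters per iteration: each byte
// only needs zero-extending, which the unpack instructions do by
// interleaving the source bytes with a zero vector (little-endian x86).
static void fromLatin1_sse2(unsigned short *dst, const char *src, int len)
{
    const __m128i zero = _mm_setzero_si128();
    int i = 0;
    for ( ; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        __m128i lo = _mm_unpacklo_epi8(chunk, zero);   // first 8 characters
        __m128i hi = _mm_unpackhi_epi8(chunk, zero);   // last 8 characters
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i + 8), hi);
    }
    for ( ; i < len; ++i)                              // scalar tail
        dst[i] = static_cast<unsigned char>(src[i]);
}
```

The opposite direction is the slower one precisely because there is no such shortcut: every UTF-16 code unit has to be checked against 0xFF before it can be narrowed.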

So that's what I decided to improve on.

And just so we're clear and you don't want to punch me in the face when you reach the end of this post: I ended up with no improvements, so my code isn't being added to qstring.cpp. I did learn a lot, though. And you can see all my work in the series of commits ending in 9a468f59472e8978aa18b75e9786718a8823bbec.

So the first thing I tried was to benchmark the qMemEquals function in three different forms: byte-by-byte comparison (a.k.a. "memcmp"), ushort-by-ushort comparison and an adaptive version that tries to do 32-bit comparisons after aligning the data. The latter is what qMemEquals looks like in Qt 4.6 and 4.7.
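In rough shape, the three variants look like this (my simplification, not the actual qstring.cpp code):

```cpp
#include <string.h>
#include <stdint.h>

typedef unsigned short ushort;

// 1. byte-by-byte: just defer to memcmp
static bool equals_memcmp(const ushort *a, const ushort *b, int len)
{
    return memcmp(a, b, len * sizeof(ushort)) == 0;
}

// 2. ushort-by-ushort
static bool equals_ushort(const ushort *a, const ushort *b, int len)
{
    while (len--)
        if (*a++ != *b++)
            return false;
    return true;
}

// 3. adaptive: one scalar comparison to bring 'a' to 32-bit alignment,
//    then compare two code units at a time (roughly the Qt 4.6/4.7 shape;
//    unaligned 32-bit loads through 'b' are fine on x86)
static bool equals_adaptive(const ushort *a, const ushort *b, int len)
{
    if (len && (reinterpret_cast<uintptr_t>(a) & 2)) {
        if (*a++ != *b++)
            return false;
        --len;
    }
    const unsigned *a32 = reinterpret_cast<const unsigned *>(a);
    const unsigned *b32 = reinterpret_cast<const unsigned *>(b);
    for (int i = 0; i < len / 2; ++i)
        if (a32[i] != b32[i])
            return false;
    return (len & 1) == 0 || a[len - 1] == b[len - 1];
}
```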

The next thing I realised was that I needed real-world data, instead of generating random data at the beginning of the test. So I modified qMemEquals itself to dump all the strings that the Qt Demo Browser tried to compare, along with the alignment properties of each string. Then I wrote a Perl script to turn that dump into data for the benchmark.

After that, I spent several afternoons learning the Intel intrinsics and trying out several different optimisation approaches. You can see all of the attempts in the commits to the benchmark tool.

Finally, I wondered whether I was optimising the right function, so I tried to find out how often qMemEquals is called. It turns out that the function most often called in qstring.cpp (and taking the most CPU time) isn't qMemEquals, but ucstrncmp. You can think of testing for equality (qMemEquals) as just a special case of an ordering comparison (ucstrncmp). With this new information in hand, I went on to dump all calls to ucstrncmp and redid the optimisation work.
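The relationship between the two, stripped of all SIMD, is simply this (a sketch; the real functions in qstring.cpp take const QChar pointers):

```cpp
typedef unsigned short ushort;

// Ordering comparison over UTF-16 code units, memcmp-style: <0, 0 or >0.
static int ucstrncmp(const ushort *a, const ushort *b, int len)
{
    for (int i = 0; i < len; ++i)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}

// Equality is just the special case "does the ordering comparison return 0?"
static bool qMemEquals(const ushort *a, const ushort *b, int len)
{
    return ucstrncmp(a, b, len) == 0;
}
```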

I don't have any graphs to show you, because in the end nothing improved on anything. But here's what I learned:

Alignment matters

The SSE2 instructions operate on 16-byte blocks and perform memory loads (and stores) much faster when the data is aligned to 16 bytes. However, the offset of the array component in QString::Data is 18 bytes, which means that 90-99% of all data being compared is misaligned by 2 or 10 bytes (on 32-bit platforms; on x86-64 it would be only by 2 bytes). We can't fix this now in Qt 4, but it's something to think about for Qt 5, so I left a comment in qstring.h pointing out that we should align the data. Even then, this will only help insofar as the system malloc also returns 16-byte-aligned memory, which it doesn't on 32-bit Linux.

This is especially true on Atom processors, where my best results for ucstrncmp are the "aligning" versions (ssse3_aligning2, sse2_aligning, ssse3_aligning).
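To make the distinction concrete: SSE2 has an aligned 16-byte load (MOVDQA, _mm_load_si128), which faults unless the address is 16-byte aligned, and an unaligned one (MOVDQU, _mm_loadu_si128), which accepts any address but is slower, particularly on Atom. Since QString's data is normally misaligned, the straightforward inner comparison has to use the unaligned form, roughly like this sketch of mine:

```cpp
#include <emmintrin.h>  // SSE2

// Compare 8 UTF-16 code units (one 16-byte chunk); true if all are equal.
// With the data usually not on a 16-byte boundary, both loads have to be
// the unaligned MOVDQU; the faster aligned MOVDQA would fault here.
static bool equalChunk(const unsigned short *a, const unsigned short *b)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    __m128i eq = _mm_cmpeq_epi16(va, vb);       // 0xFFFF in each equal lane
    return _mm_movemask_epi8(eq) == 0xFFFF;     // all 16 bytes matched?
}
```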

Complexity matters (or "strings are too short")

When trying to solve the problem of misaligned loads by adding aligning prologs to the code, I discovered that the extra work required in the prolog completely, or almost completely, offset the benefit of having aligned loads. Worse, in most cases the combined prolog + aligned comparison code was slower than the plain unaligned comparison code. Looking at the statistics on string size, it turns out that most comparisons are awfully short (around 15 QChars for qMemEquals and 38 QChars for ucstrncmp), which means a 16-byte comparison loop would run only once or three times. Unlike image data, which can easily run to tens of kilobytes, strings this short make the prolog and epilog cost significant, so their complexity has a measurable effect.

This also includes the code for doing run-time dispatching according to the CPU. Runtime detection requires separate .cpp files per architecture feature level (that is, one for SSE2, one for SSE3, one for SSSE3 and one for SSE4.2), because otherwise the compiler might start using instructions from a higher level in code that is supposed to run on a lower level (don't suggest the gcc 4.4 #pragma and __attribute__ as a solution; see here for why that doesn't work).
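The shape of that dispatching is roughly this (a sketch; cpuHasSSSE3() is a hypothetical stand-in for whatever CPUID-based detection helper the project provides, and the per-level implementations live in their own, separately compiled files):

```cpp
// ucstrncmp_sse2.cpp is compiled with -msse2 only, ucstrncmp_ssse3.cpp with
// -mssse3, and so on. This file only picks one of them at load time.
typedef unsigned short ushort;
typedef int (*UcstrncmpFn)(const ushort *, const ushort *, int);

extern int ucstrncmp_sse2(const ushort *, const ushort *, int);
extern int ucstrncmp_ssse3(const ushort *, const ushort *, int);
extern bool cpuHasSSSE3();   // hypothetical CPU-detection helper

static UcstrncmpFn resolveUcstrncmp()
{
    return cpuHasSSSE3() ? ucstrncmp_ssse3 : ucstrncmp_sse2;
}

// resolved once; every call afterwards pays for one indirect jump
static const UcstrncmpFn ucstrncmp_impl = resolveUcstrncmp();

int ucstrncmp(const ushort *a, const ushort *b, int len)
{
    return ucstrncmp_impl(a, b, len);
}
```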

I think I spent more time optimising the prologs and epilogs, and toying with different approaches to aligning the data, than on the actual data-comparison code. I tried advancing ushort-by-ushort until reaching alignment; I tried doing unaligned loads at the beginning of the strings and then continuing at the first aligned position (which meant re-checking 6 or 14 QChars); and I tried aligning down before the first load, which meant loading some garbage bytes before the beginning of the string (you'd need a Valgrind suppression file if we used this code).
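The first of those approaches, the scalar prolog, looks more or less like this sketch (again mine, not the benchmark code):

```cpp
#include <emmintrin.h>  // SSE2
#include <stdint.h>

// Scalar prolog until 'a' reaches a 16-byte boundary (at most 7 iterations),
// then an SSE2 main loop with an aligned load on 'a'; 'b' generally stays
// misaligned and keeps using the unaligned load.
static bool equals_aligning(const unsigned short *a, const unsigned short *b, int len)
{
    while (len && (reinterpret_cast<uintptr_t>(a) & 15)) {
        if (*a++ != *b++)
            return false;
        --len;
    }
    for ( ; len >= 8; len -= 8, a += 8, b += 8) {
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i *>(a));   // aligned
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));  // unaligned
        if (_mm_movemask_epi8(_mm_cmpeq_epi16(va, vb)) != 0xFFFF)
            return false;
    }
    while (len--)                       // epilog: the trailing 0..7 code units
        if (*a++ != *b++)
            return false;
    return true;
}
```

On 15- or 38-character strings, the prolog and epilog above can easily account for as many comparisons as the vectorised main loop, which is exactly the problem described in this section.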

Atom processors are awfully slow (and don't have enough registers)

I did most of the work on my Core i7 2.66 GHz workstation and ran the benchmarks there. When I turned to the netbook powered by an Atom N450 (that's 1.66 GHz), I found that reality was very different. Not only were the benchmarks much slower for the same data, they were also much more affected by alignment and complexity. Like I said before, the best results came from the aligning code, and the best overall from the SSSE3-enabled code, which was 10% faster than the next contender (sse2_aligning).

I also tried reading the assembly to figure out why they were slow, and in some cases it's clearly register pressure: 32-bit x86 PIC code has only 6 general-purpose registers available (eax, ecx, edx, esi, edi and ebp, and that's only if the frame pointer is omitted; gcc refuses to free up the PIC register even when it isn't needed at all), half of them callee-saved and the other half caller-saved. Half of them are taken by the data itself (the two pointers and the length), leaving only 3 registers for the prolog operations. So the code had to store and reload some intermediate results in memory, which greatly hurts performance.

I envy the IA-64, with its 128 general-purpose registers, or the UltraSPARC with 32. Without resorting to the register rotation mechanisms, a leaf function like these has 9 scratch (caller-saved) registers available on UltraSPARC (%g1, %g4, %g5, %o0-%o5), 9 on x86-64 (rax, rcx, rdx, rsi, rdi, r8-r11) and 27 on Itanium (r2, r3, r8-r11, r14-r31, plus the three input registers in0-in2 [r32-r34]). ARM has only 4 scratch registers, but there are 8 more callee-saved ones.

Intel & AMD: when are we going to get those extra 8 general-purpose registers available on 32-bit?

QString's explicit length makes a difference

While trying to learn how to use the new SSE4.2 string-comparison instructions, I could find no examples that used explicit lengths. Everything I found used implicit lengths (that is, the length is not known a priori and a NUL marks the end). First of all, an explicit length makes a difference because two strings of different lengths cannot be equal, which means qMemEquals is called much less often than the public QString::operator== function. And even for ucstrncmp, we only need to compare up to the length of the shorter of the two strings (and if they are equal over that span, the shorter one orders first).
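To make that concrete, here is roughly how the callers exploit the explicit length (a sketch; stringEquals and stringCompare are hypothetical wrappers mirroring what QString::operator== and QString::compare effectively do, and the two extern declarations stand in for the internal qstring.cpp helpers):

```cpp
#include <QString>

extern bool qMemEquals(const ushort *a, const ushort *b, int len);
extern int ucstrncmp(const ushort *a, const ushort *b, int len);

bool stringEquals(const QString &s1, const QString &s2)
{
    // different lengths can never be equal, so qMemEquals is often skipped
    return s1.length() == s2.length()
        && qMemEquals(s1.utf16(), s2.utf16(), s1.length());
}

int stringCompare(const QString &s1, const QString &s2)
{
    // only the common prefix needs comparing; ties go to the shorter string
    int l = qMin(s1.length(), s2.length());
    int r = ucstrncmp(s1.utf16(), s2.utf16(), l);
    return r ? r : s1.length() - s2.length();
}
```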

But there's apparently another big difference: when dealing with implicit lengths, one needs to check whether a NUL was loaded in the 16-byte chunk read from memory and special-case that scenario. This is where I think the new PCMPISTRI and PCMPISTRM instructions in SSE4.2 must shine, because they do the NUL-checking for you while comparing. I didn't see much improvement from the explicit-length versions of those instructions (PCMPESTRI and PCMPESTRM) compared to the earlier SSE code.
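For the explicit-length case, the usage looks roughly like this (a sketch, assuming both chunks are compared over the same number of code units, at most eight per 16-byte load; it needs -msse4.2):

```cpp
#include <nmmintrin.h>  // SSE4.2

// Explicit-length comparison of one 16-byte chunk: returns the index of the
// first differing UTF-16 code unit among the first 'len' units (len <= 8),
// or 8 if they all match.  PCMPESTRI takes the two lengths in extra registers.
static int firstDifference(const unsigned short *a, const unsigned short *b, int len)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    return _mm_cmpestri(va, len, vb, len,
                        _SIDD_UWORD_OPS |           // 16-bit elements
                        _SIDD_CMP_EQUAL_EACH |      // element-wise equality
                        _SIDD_NEGATIVE_POLARITY);   // report mismatches
}
```

The implicit-length PCMPISTRI variant drops the two length operands and instead stops at the first NUL it finds in either operand, which is exactly the NUL special-casing described above.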

And adding to the previous point: the explicit-length instructions take two extra registers as input, which adds to register pressure.

Need some more help

I think I've spent enough time on this. Given the constraints (strings are short), every little optimisation counts towards a better result, and I don't have the knowledge or the tools to push this any further. So what would you, dear reader, suggest to improve the performance?

