Improving string performance with SIMD: the revenge (fromLatin1)

I had a couple of boring days last week, so to motivate myself I decided to do one of my favourite pastimes: reading disassembly code (no, I don't have a life). I usually do that to see how well the compiler has optimised certain sections of the code. By doing that, I can at the same time learn something new and discover points of potential improvement. Learning something new is what happened last week.

Reading disassembly only makes sense if you can make sense of what's there. So, of course, you need to understand assembly. But more than that, you need to understand what's going on. Reading complex files like the raster painting engine makes little sense, as everything is really mumbled up. Reading something easy like qstring.o however, does yield some results, like I had done last year

Last week I found out a new instruction I hadn't noticed before: PMOVZXBW, from the SSE4.1 extensions (Parallel MOVe with Zero eXtension from Byte to Word, the SIMD equivalent to the old "MOVZX" instruction). This instruction loads eight 8-bit values, zero-extends them to 16-bit and saves them in one of the SSE 128-bit registers. Well, zero-extension from 8 to 16-bit is exactly what QString::fromLatin1 does, so I decided to experiment with it.

If we look at the QString::fromLatin1 code, the generic version is:

  while (len--)
*dst++ = (uchar)*src++;

That's a simple "pull-up" operation, which is easily parallelisable. My first instinct would be to simply 8-byte loads followed by 16-byte stores (in the x86 world, "quadword" and "double-quadword"). That's the simplest case possible. But I remember from reading past processor manuals that they can do 16-byte loads, provided the right conditions are met. What's more, when I did try the above, it turned out to be slower than the current code, which already uses SSE2 on x86.

The current code doesn't do 8-byte loads. Instead, it does a 16-byte load and uses instructions available on SSE2 (as opposed to SSE4.1) to "unpack". Now, the instructions for unpacking were meant to interleave the values of two registers, like illustrated by the following figure, extracted from the Intel 64 and IA-32 Architectures Software Developer's Manual (page 4-281).

Byte-to-word low unpacking illustration

If the "Ys" in the figure above are all zeroes, we achieve a zero-extension like we wanted. Moreover, the processors have a "high unpacking" instruction that does the same for the upper half of the register. So, in effect, we load 16 bytes and store 32 on every loop iteration. The ratio is the same as my new solution, but we do more on each iteration. Benchmarking proves that this is a wiser option, since, like I said, modern processors can do 16-byte transfers in one go (under certain circumstances). Moreover, the overhead of looping is important, so the "unrolling" of the loop results in improved efficiency.

By the way, credit where credit is due: the current SSE2 code isn't mine. Benjamin Poulain did it for Qt 4.7.

So I tried a second approach: instead of doing an 8-byte load, how about load 16 bytes like the current code does? I wanted to avoid the unpacking instructions, as using them for zero-extension seemed like an overkill. So the idea was to load 16 bytes, zero extend the lower 8 bytes to 16, store it, then shift the upper 8 bytes into the lower ones (using another SSE4.1 instruction, PSRLDQ "packed shift right logical double-quad") and repeat the zero-extend+store. When I tested this, the result was approximately the same performance as the code using unpacking. In other words: it's not worth it, since SSE4.1 is only available on Core-i7 processors, whereas SSE2 is quite widespread.

At this point, I decided to try the rule of thumb of SSE processing: use aligned loads and stores. So I tried a series of "prolog" alignments, trying to make sure I was doing aligned stores. I chose the stores because I store issue two stores per iteration, but only one load. But just like I found out last year, the overhead of attempting to align the pointers isn't worth it. Since most strings being converted are quite small (average 17.82 characters long), the complexity of the prologs offset the benefit gained from the aligned store, and we still need to do an unaligned load. To really have an effect, the strings would probably need to be much larger than 16 bytes (8 characters) plus the length of the prolog (which, in QString's case, is 95% of the case 7 or 3 characters).

An interesting thing did come out of this, though. Aligning the prolog (the data before the SIMD operations) didn't result in perceptible gain, but unrolling the epilog certainly did. The code that we currently have in Qt 4.7 simply drops to the generic loop (see line 3782). However, I know that the tail part of the operation is at most 15 characters long, so I just unrolled it manually in a very long function. That plus some other simple things to tell the compiler to do less in each loop, did improve performance.

Finally, I decided to look into the ARM side of things. When Benjamin did the work for Qt 4.7, he concluded that the improvement on ARM wasn't worth the effort. Taking a second look at things, I managed to improve the performance by around 25%, a bit more if I write the assembly code by hand.

The ARM Neon instruction set contains a very interesting combinations of load/store instructions with interleaving, mostly meant for image processing. For us, we're interested in loading non-interleaved data, so we're limited to 8- or 16-byte loads (in the ARM world, "doubleword" and "quadword"). The first code I wrote uses an 8-byte load, a "move long unsigned" (that is, zero-extend 8- to 16-bit), and a 16-byte store. Since on x86 it helped, I tried to do a 16-byte load, two VMOVL.U8 and two stores. However, on the Cortex-A8 processor I have for testing, it didn't improve (it became slightly worse).

Finally, noting that the Neon instructions for loading and storing have a way of updating the address register by adding the amount of bytes transferred (saving me the need to add myself), I wrote the assembly code by hand to see if it improved. It did, but only by about 3%.

This blog wouldn't be complete without a graph showing the numbers I got:

results

All the tests were run in 32-bit, even though all of the x86 CPUs I had could run 64-bit code (even the Atom one). Since these are very different systems, it doesn't make sense to compare one to the other, so the first group of bars is the "baseline" and is 100% for all of them. The only other group of bars that is present in all CPUs is the "SIMD improved" one, which is what I want to introduce to Qt 4.8.

The "SSE2 Qt 4.7" column is what you get if you compile Qt for systems with SSE2 or better. This is not a run-time decision: it's a compile-time one. Most Linux distributions will not have this enabled (I know MeeGo does).

Also avoid making a false conclusion when comparing the ICC results with GCC. GCC had a bigger improvement, but it's incorrect to say that it's better. It just happens that the "baseline" for GCC is higher. The relative values are quite stable, but the absolute ones aren't, so it's difficult to compare the two. But they are within 10% of one another.

The full source code is available, of course. See the functions "fromLatin1_*" and the fromLatin1Alternatives test.

In the next blog, I'll address the fromUtf8 operation. Stay tuned.


Blog Topics:

Comments