Improving string performance with SIMD: the revenge (fromLatin1)

Published Wednesday March 23rd, 2011
Posted in Assembly, Performance, Qt

I had a couple of boring days last week, so to motivate myself I decided to do one of my favourite pastimes: reading disassembly code (no, I don’t have a life). I usually do that to see how well the compiler has optimised certain sections of the code. By doing that, I can at the same time learn something new and discover points of potential improvement. Learning something new is what happened last week.

Reading disassembly only makes sense if you can make sense of what’s there. So, of course, you need to understand assembly. But more than that, you need to understand what’s going on. Reading complex files like the raster painting engine makes little sense, as everything is really jumbled up. Reading something easy like qstring.o, however, does yield some results, like I had done last year.

Last week I found a new instruction I hadn’t noticed before: PMOVZXBW, from the SSE4.1 extensions (Parallel MOVe with Zero eXtension from Byte to Word, the SIMD equivalent of the old “MOVZX” instruction). This instruction loads eight 8-bit values, zero-extends them to 16 bits and saves them in one of the 128-bit SSE registers. Well, zero-extension from 8 to 16 bits is exactly what QString::fromLatin1 does, so I decided to experiment with it.

If we look at the QString::fromLatin1 code, the generic version is:

  while (len--)
    *dst++ = (uchar)*src++;

That’s a simple “pull-up” operation, which is easily parallelisable. My first instinct was to simply do 8-byte loads followed by 16-byte stores (in the x86 world, “quadword” and “double-quadword”). That’s the simplest case possible. But I remember from reading past processor manuals that they can do 16-byte loads, provided the right conditions are met. What’s more, when I did try the above, it turned out to be slower than the current code, which already uses SSE2 on x86.
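For illustration, here is a minimal sketch of that first idea written with intrinsics rather than assembly; the function name and the scalar tail are mine, not the Qt code, and it requires an SSE4.1 build (e.g. GCC’s -msse4.1):

    #include <emmintrin.h>   // SSE2: loads and stores
    #include <smmintrin.h>   // SSE4.1: _mm_cvtepu8_epi16 (PMOVZXBW)

    // Hypothetical helper: zero-extend Latin-1 bytes to UTF-16 code units,
    // eight characters per iteration.
    static void fromLatin1_pmovzxbw(unsigned short *dst, const char *src, int len)
    {
        int i = 0;
        for ( ; i + 8 <= len; i += 8) {
            // 8-byte load; PMOVZXBW reads only the low 64 bits of its source
            __m128i chunk = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(src + i));
            // zero-extend each byte to a 16-bit word (PMOVZXBW)
            __m128i words = _mm_cvtepu8_epi16(chunk);
            // 16-byte store of eight UTF-16 code units
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), words);
        }
        while (i < len) {                    // scalar tail
            dst[i] = (unsigned char)src[i];
            ++i;
        }
    }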

The current code doesn’t do 8-byte loads. Instead, it does a 16-byte load and uses instructions available in SSE2 (as opposed to SSE4.1) to “unpack”. Now, the unpacking instructions were meant to interleave the values of two registers, as illustrated by the following figure, extracted from the Intel 64 and IA-32 Architectures Software Developer’s Manual (page 4-281).

Byte-to-word low unpacking illustration

If the “Ys” in the figure above are all zeroes, we achieve the zero-extension we wanted. Moreover, the processors have a “high unpacking” instruction that does the same for the upper half of the register. So, in effect, we load 16 bytes and store 32 on every loop iteration. The ratio is the same as in my new solution, but we do more on each iteration. Benchmarking shows that this is the wiser option, since, like I said, modern processors can do 16-byte transfers in one go (under certain circumstances). Moreover, the overhead of looping is significant, so this “unrolling” of the loop results in improved efficiency.
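In intrinsics form, the unpacking approach looks roughly like this (a sketch only, not the actual Qt 4.7 code, which handles prologs and trailing characters differently):

    #include <emmintrin.h>   // SSE2

    // Sketch: 16-byte load, zero-extension via unpack-with-zero, two 16-byte stores.
    static void fromLatin1_unpack(unsigned short *dst, const char *src, int len)
    {
        const __m128i zero = _mm_setzero_si128();
        int i = 0;
        for ( ; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
            // interleaving with zero yields the zero-extension we want
            __m128i lo = _mm_unpacklo_epi8(chunk, zero);   // PUNPCKLBW: lower 8 bytes
            __m128i hi = _mm_unpackhi_epi8(chunk, zero);   // PUNPCKHBW: upper 8 bytes
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), lo);
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i + 8), hi);
        }
        while (i < len) {                    // scalar tail
            dst[i] = (unsigned char)src[i];
            ++i;
        }
    }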

By the way, credit where credit is due: the current SSE2 code isn’t mine. Benjamin Poulain did it for Qt 4.7.

So I tried a second approach: instead of doing an 8-byte load, how about loading 16 bytes like the current code does? I wanted to avoid the unpacking instructions, as using them for zero-extension seemed like overkill. So the idea was to load 16 bytes, zero-extend the lower 8 bytes to 16, store that, then shift the upper 8 bytes into the lower ones with PSRLDQ (“packed shift right logical double quadword”) and repeat the zero-extend+store. When I tested this, the result was approximately the same performance as the code using unpacking. In other words: it’s not worth it, since SSE4.1 is only available on Core-i7 processors, whereas SSE2 is quite widespread.
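For completeness, this second approach looks roughly like the sketch below (again with my own naming, not the code I actually benchmarked):

    #include <emmintrin.h>   // SSE2
    #include <smmintrin.h>   // SSE4.1

    // Sketch: one 16-byte load per iteration, zero-extend with PMOVZXBW,
    // shift the upper half down with PSRLDQ and zero-extend it too.
    static void fromLatin1_pmovzxbw_16(unsigned short *dst, const char *src, int len)
    {
        int i = 0;
        for ( ; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
            __m128i lo = _mm_cvtepu8_epi16(chunk);                    // PMOVZXBW on the low 8 bytes
            __m128i hi = _mm_cvtepu8_epi16(_mm_srli_si128(chunk, 8)); // PSRLDQ by 8, then PMOVZXBW
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), lo);
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i + 8), hi);
        }
        while (i < len) {                    // scalar tail
            dst[i] = (unsigned char)src[i];
            ++i;
        }
    }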

At this point, I decided to try the rule of thumb of SSE processing: use aligned loads and stores. So I tried a series of “prolog” alignments, trying to make sure I was doing aligned stores. I chose the stores because I issue two stores per iteration, but only one load. But just like I found out last year, the overhead of attempting to align the pointers isn’t worth it. Since most strings being converted are quite small (17.82 characters long on average), the complexity of the prologs offsets the benefit gained from the aligned stores, and we still need to do an unaligned load. To really have an effect, the strings would probably need to be much larger than 16 bytes (8 characters) plus the length of the prolog (which, in QString’s case, is 7 or 3 characters 95% of the time).

An interesting thing did come out of this, though. Aligning the prolog (the data before the SIMD operations) didn’t result in a perceptible gain, but unrolling the epilog certainly did. The code that we currently have in Qt 4.7 simply drops to the generic loop (see line 3782). However, I know that the tail part of the operation is at most 15 characters long, so I just unrolled it manually in a very long function. That, plus some other simple changes to make the compiler do less work in each loop iteration, did improve performance.
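To give an idea of what “unrolling the epilog” means, here is one way to write such a tail; the real code in my patch differs, this is just a sketch of the idea:

    // Illustrative only: convert a tail of at most 15 remaining characters
    // in straight-line code, with intentional fall-through on every case.
    static void fromLatin1_tail(unsigned short *dst, const char *src, int remaining)
    {
        switch (remaining) {
        case 15: dst[14] = (unsigned char)src[14];
        case 14: dst[13] = (unsigned char)src[13];
        case 13: dst[12] = (unsigned char)src[12];
        case 12: dst[11] = (unsigned char)src[11];
        case 11: dst[10] = (unsigned char)src[10];
        case 10: dst[9]  = (unsigned char)src[9];
        case 9:  dst[8]  = (unsigned char)src[8];
        case 8:  dst[7]  = (unsigned char)src[7];
        case 7:  dst[6]  = (unsigned char)src[6];
        case 6:  dst[5]  = (unsigned char)src[5];
        case 5:  dst[4]  = (unsigned char)src[4];
        case 4:  dst[3]  = (unsigned char)src[3];
        case 3:  dst[2]  = (unsigned char)src[2];
        case 2:  dst[1]  = (unsigned char)src[1];
        case 1:  dst[0]  = (unsigned char)src[0];
        case 0:  break;
        }
    }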

Finally, I decided to look into the ARM side of things. When Benjamin did the work for Qt 4.7, he concluded that the improvement on ARM wasn’t worth the effort. Taking a second look at things, I managed to improve the performance by around 25%, a bit more if I write the assembly code by hand.

The ARM Neon instruction set contains some very interesting combinations of load/store instructions with interleaving, mostly meant for image processing. In our case, we’re interested in loading non-interleaved data, so we’re limited to 8- or 16-byte loads (in the ARM world, “doubleword” and “quadword”). The first code I wrote uses an 8-byte load, a “move long unsigned” (that is, a zero-extension from 8- to 16-bit), and a 16-byte store. Since it helped on x86, I also tried a 16-byte load, two VMOVL.U8 instructions and two stores. However, on the Cortex-A8 processor I have for testing, it didn’t improve things (it actually became slightly worse).
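A sketch of that first Neon variant with intrinsics (my naming, not the actual patch):

    #include <arm_neon.h>
    #include <stdint.h>

    // Sketch: 8-byte load (VLD1.8), zero-extend to 16-bit (VMOVL.U8),
    // 16-byte store (VST1.16); eight characters per iteration.
    static void fromLatin1_neon(uint16_t *dst, const char *src, int len)
    {
        int i = 0;
        for ( ; i + 8 <= len; i += 8) {
            uint8x8_t  chunk = vld1_u8(reinterpret_cast<const uint8_t *>(src + i));
            uint16x8_t words = vmovl_u8(chunk);
            vst1q_u16(dst + i, words);
        }
        while (i < len) {                    // scalar tail
            dst[i] = (unsigned char)src[i];
            ++i;
        }
    }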

Finally, noting that the Neon load and store instructions have a way of updating the address register by the number of bytes transferred (saving me the need to do the addition myself), I wrote the assembly code by hand to see if it improved. It did, but only by about 3%.

This blog wouldn’t be complete without a graph showing the numbers I got:

[Graph: benchmark results]

All the tests were run in 32-bit mode, even though all of the x86 CPUs I had could run 64-bit code (even the Atom one). Since these are very different systems, it doesn’t make sense to compare one to the other, so the first group of bars is the “baseline” and is 100% for all of them. The only other group of bars present for all CPUs is the “SIMD improved” one, which is what I want to introduce in Qt 4.8.

The “SSE2 Qt 4.7” column is what you get if you compile Qt for systems with SSE2 or better. This is not a run-time decision: it’s a compile-time one. Most Linux distributions will not have this enabled (I know MeeGo does).

Also, avoid drawing false conclusions when comparing the ICC results with the GCC ones. GCC showed a bigger improvement, but it’s incorrect to say that it’s better. It just happens that the “baseline” for GCC is higher. The relative values are quite stable, but the absolute ones aren’t, so it’s difficult to compare the two. But they are within 10% of one another.

The full source code is available, of course. See the functions “fromLatin1_*” and the fromLatin1Alternatives test.

In the next blog, I’ll address the fromUtf8 operation. Stay tuned.



17 comments

Matthias Kretz says:

BTW, you can even do one 32-Byte load per cycle now on Sandy Bridge. Also the Family 10h AMD CPUs (Barcelona arch) can execute almost two 16-Byte loads per cycle (I measure ~26 Bytes/cycle in my benchmark).

In this specific case it’s probably more interesting to look at the store throughput, since you have to store twice the amount you read. But for L1, you can make a best-case approximation with 16 Bytes/cycle. Since the load, store, and unpack instructions can run in parallel, I don’t see an apparent reason why (for large strings, but smaller than L1) you wouldn’t be able to reach a throughput of 8 chars/cycle (16 Bytes/cycle of stores as the bottleneck). unpack has a 1-cycle latency IIRC, so you should aim for having ~5 SSE registers in flight for best ILP. But like you said: all this only makes sense for large strings.

Thiago Macieira says:

@Matthias: I am not measuring the cycle throughput, but rather the overall impact on an application. To do that, I dumped every string that went through fromLatin1 in an application and then ran the benchmark over that data.

At 16 bytes per cycle of loading, we’d do a single cycle in the average case, so I don’t see the value of unrolling. Even for 16 bytes/cycle of storing, it might not help as we’d do then two iterations only. In fact, using -funroll-loops with GCC does show a small penalty in performance compared to not having it.

Robin Lobel says:

Seriously, I hope you did not spend too much time for a 20% speed gain on a very specific string conversion routine… :S

Thiago Macieira says:

@Robin: a 25-33% speed gain (80-75% of the previous time) in a routine that is called everywhere

Philippe says:

Great 🙂 Little streams make mighty rivers…

Matthias Kretz says:

@Thiago: I’m sure you’re doing the right thing. I didn’t want to imply anything else. And yes, having real-world test-data for the benchmark sounds like the best thing to do.
You need to unroll to get ILP. Well, that’s oversimplified – but a good rule of thumb. E.g.
pxor %xmm0, %xmm0
loop:
movdqa (%rsi), %xmm1
movdqa %xmm1, %xmm2
psrldq $8, %xmm2
punpcklbw %xmm0, %xmm1
punpcklbw %xmm0, %xmm2
movdqa %xmm1, (%rdi)
movdqa %xmm2, 0x10(%rdi)
jmp loop

On Intel, the shift and unpack instructions have a 1 cycle latency and a throughput of 0.5 on Nehalem and later/ 1 on Core/Core2. The movdqa has 1 cycle latency and a throughput of 0.33 (only reg-reg, for reg-mem the throughput is 1). Thus from load to store of xmm2 the code requires 5 cycles. So you get a conversion rate of 3.2 chars/cycle. If we unroll once:
loop:
1 movdqa (%rsi), %xmm1
2 movdqa %xmm1, %xmm2
2 movdqa 0x10(%rsi), %xmm3
3 psrldq $8, %xmm2
3 punpcklbw %xmm0, %xmm1
4 movdqa %xmm1, (%rdi)
4 movdqa %xmm3, %xmm4
4 psrldq $8, %xmm4
4 punpcklbw %xmm0, %xmm2
5 movdqa %xmm2, 0x10(%rdi)
5 punpcklbw %xmm0, %xmm3
5 punpcklbw %xmm0, %xmm4
6 movdqa %xmm3, 0x20(%rdi)
7 movdqa %xmm4, 0x30(%rdi)
jmp loop

I prefixed the instructions with the cycle in which I expect them to be executed. Thus we do 32 chars in 7 cycles = 4.57 chars/cycle.

Now all this is rather theoretical because your strings are short and you don’t want to have branches that make your code slower again. 🙂

OK, back to real work again…

Chris says:

I’d be interested to know how this compares to doing a memcpy.

Konstantin Tokarev says:

@Thiago:

>itโ€™s not worth it, since SSE4.1 is only available on Core-i7 processors

I’m afraid you are confusing it with SSE 4.2, because SSE 4.1 is available on Core 2 too (except the oldest series)

>Also avoid making a false conclusion when comparing the ICC results with GCC. GCC had a bigger improvement, but itโ€™s incorrect to say that itโ€™s better. It just happens that the โ€œbaselineโ€ for GCC is higher.

I’d rather say it means “ICC uses SSE better out of the box while GCC requires manual assembly-level coding”

Thiago Macieira says:

@Matthias: I’ve just tried that now. It does improve a little in the “application average” with GCC by 1.3%, but with ICC it becomes slightly worse (3.5%). Both ICC results are better than both GCC results in these runs, but within 4% of the time.

Note I’m not writing assembly by hand, but instead I’m using intrinsics and letting the compiler schedule the instructions as it sees fit.

ICC disassembly (%edx = counter; %ebx = src; %ecx = dest; %edi = len):

    e2ec:       movdqu (%edx,%ebx,1),%xmm2
    e2f1:       movdqu 0x10(%edx,%ebx,1),%xmm4
    e2f7:       movdqa %xmm2,%xmm1
    e2fb:       movdqa %xmm4,%xmm3
    e2ff:       punpcklbw %xmm0,%xmm1
    e303:       punpckhbw %xmm0,%xmm2
    e307:       punpcklbw %xmm0,%xmm3
    e30b:       punpckhbw %xmm0,%xmm4
    e30f:       movdqu %xmm1,(%ecx,%edx,2)
    e314:       movdqu %xmm2,0x10(%ecx,%edx,2)
    e31a:       movdqu %xmm3,0x20(%ecx,%edx,2)
    e320:       movdqu %xmm4,0x30(%ecx,%edx,2)
    e326:       add    $0x20,%edx
    e329:       cmp    %edi,%edx
    e32b:       jle    e2ec

GCC dump (%edx = counter; %edi = src; %esi = src + 16; %eax = dest; %ecx = len):

    53a0:       movdqu (%edi,%edx,1),%xmm1
    53a5:       movdqu (%esi,%edx,1),%xmm2
    53aa:       add    $0x20,%edx
    53ad:       movdqa %xmm2,%xmm3
    53b1:       punpckhbw %xmm0,%xmm2
    53b5:       punpcklbw %xmm0,%xmm3
    53b9:       movdqu %xmm2,0x10(%eax)
    53be:       movdqu %xmm3,(%eax)
    53c2:       movdqa %xmm1,%xmm2
    53c6:       punpckhbw %xmm0,%xmm1
    53ca:       punpcklbw %xmm0,%xmm2
    53ce:       movdqu %xmm1,0x30(%eax)
    53d3:       movdqu %xmm2,0x20(%eax)
    53d8:       add    $0x40,%eax
    53db:       cmp    %edx,%ecx
    53dd:       jge    53a0

Thiago Macieira says:

@Chris: memcpy does a 1:1 copy. I need a 1:2 pull-up.

Thiago Macieira says:

@Konstantin: well, my 3-year-old Core2 doesn’t have SSE4.1 or 4.2. Even if you’re right, the point is still the same: SSE4.1 is much less widespread than SSE2 is.

Konstantin Tokarev says:

@Thiago:
Just FYI
model name : Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dts tpr_shadow vnmi flexpriority

dmidecode -s bios-release-date
04/29/2009

Matthias Kretz says:

@Core2: There’s a 65nm version (Merom) and a 45nm version (Penryn). The former doesn’t have SSE4.1 the latter has it. And, of course, AMD doesn’t have it yet; but the upcoming Bulldozer will have it (and AVX, too). To make things even more confusing, there’s Intel Wolfdale and Intel Penryn, which have the same family and model numbers, but Wolfdale doesn’t support SSE4.1 either.

Matthias Kretz says:

@Thiago: ah right, of course you don’t need the psrldq… 🙂

I also only write code with intrinsics and then double check the generated asm. Inline-asm is a last resort that always bites back. But I also learned to always double-check on critical code. There are also rather important differences depending on your GCC version. So far GCC 4.6.0 RC1 comes up best in the few tests I looked into.

ICC generated some nice array-indexing code 🙂 removing one add instruction. The add should be able to run in parallel, I think. I can’t say where a performance difference would come from then. Though I’m confused why ICC doesn’t align the loop beginning to 16 Bytes. This missing alignment might kill performance if you benchmark it on AMD. Dunno whether the LSD of AMD is as good as on the Intels.

Thiago Macieira says:

@Matthias: I’m pretty sure the Intel compiler doesn’t really care for what makes AMD fast… 🙂

By the way, I carefully wrote the C++ code so that the compiler would use the array-indexing as much as possible. GCC seems to do it in other places, but this time didn’t do it.

In any case, I don’t think this inner loop is where the problem is, but instead I think it’s elsewhere in the function. Notable differences are:

  • ICC seems to omit the frame pointer on its own
  • I haven’t tested in a library context yet, but I’m pretty sure ICC will dispense with the use of the PIC register in this function, whereas GCC will load it and not use %ebx
  • ICC saves registers %ebx, %esi and %edi in the prologue, whereas GCC saves %ebp, %esi and %edi. Considering GCC does keep %ebp as the frame pointer, it’s actually using one register less than ICC (and doesn’t seem to need %ebx, so the PIC register wouldn’t be an effect).
  • GCC does some weird shifting left and right to get the src and dest pointers back after the loop, whereas ICC does simple array indexing with the “counter” variable. Probably caused by trying to save on register consumption
  • GCC does reload the “len” variable from the stack just prior to the epilogue, whereas ICC keeps it in a register all the time

In any case, we’re talking about roughly 5% of difference in performance only.

Uning says:

I think the major problem with Qt performance is IO (like library and graphics loading) and too many class wrappers. But you have done great work anyway!

Gordon Schumacher says:

@Uning:
Last time I checked, the hands-down #1 performance hit in Linux for a GUI app was trawling through the lookup tables in Harfbuzz! Mind you, this was a year ago or so, and it was a problem already recognized by the Harfbuzz devs – but I don’t know that anyone was working on it, either.

Someday I will have free time again.

