Improving the rendering performance with more SIMD

Published Tuesday August 24th, 2010
Posted in Assembly, C++, Painting, Performance, WebKit

With the last two versions of Qt, we consistently improved performance. Qt 4.5 introduced pluggable graphics systems and numerous rendering optimizations. Qt 4.6 brought optimizations all over the place, and the performance on embedded improved continuously with each patch release.

A problem with increasing the speed all the time is that we run short of ways to improve in the next iterations. We have to look for new areas of improvement, and once again we are making Qt 4.7 faster than its predecessors.

Single instruction, multiple data

One of the ways we made Qt 4.7 faster than its predecessor is by using processors more effectively. Modern processors can execute one instruction on multiple data elements at a time. This is called single instruction, multiple data: SIMD. In particular, recent x86 processors have the SSE extensions, while ARM Cortex processors have Neon.

The principle is simple. Let’s look at a case where a simple operation is applied to multiple data elements:


quint32 a[256];
quint32 b[256];
quint32 c[256];
// [...]

for (int i = 0; i < 256; ++i) {
    c[i] = a[i] + b[i];
}

On processors supporting SIMD, this code can be improved by applying each instruction to multiple data elements. For example, with SSE2, the following code loads four values at a time, applies the + operation, and stores the result in c:


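#include <emmintrin.h> // SSE2 intrinsics

quint32 a[256];
quint32 b[256];
quint32 c[256];
// [...]

for (int i = 0; i < 256; i += 4) {
    // Load four 32-bit values from each input (unaligned loads)
    __m128i vectorA = _mm_loadu_si128((__m128i*)&a[i]);
    __m128i vectorB = _mm_loadu_si128((__m128i*)&b[i]);
    // Add the four pairs of values with a single instruction
    __m128i vectorC = _mm_add_epi32(vectorA, vectorB);
    // Store the four results in c
    _mm_storeu_si128((__m128i*)&c[i], vectorC);
}

The code above contains intrinsics, which the compiler replaces with SSE2 instructions.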

This example is so simple the compiler can optimize it automatically when passed the right options. But in most real cases, the change is not that obvious, and the algorithm needs to be slightly modified to work with vectors.
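As one small illustration of such a modification, here is a sketch of the same addition for an arbitrary number of elements. The vector loop handles groups of four, and a scalar loop finishes the remaining zero to three elements. The function name addArrays is hypothetical and this is not code from Qt; it only assumes a compiler that defines __SSE2__ when SSE2 is enabled, and it falls back to the plain loop otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical helper: c[i] = a[i] + b[i] for any element count.
void addArrays(const std::uint32_t *a, const std::uint32_t *b,
               std::uint32_t *c, std::size_t count)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    // Main loop: four 32-bit additions per iteration.
    for (; i + 4 <= count; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(c + i), _mm_add_epi32(va, vb));
    }
#endif
    // Scalar tail: the elements that do not fill a whole vector
    // (or all of them when SSE2 is not available).
    for (; i < count; ++i)
        c[i] = a[i] + b[i];
}
```

The tail loop is the price of working on four elements at a time: the count is rarely a multiple of the vector width in real rendering code.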

Qt has used SIMD for a long time, with MMX and 3DNow! for example. In Qt 4.7, we extended our usage of SSE on x86, and of Neon on ARM Cortex processors. By using SIMD in more places, we’ve gained between 2 and 4 times the speed in some use cases.

Improving raster

In Qt 4.7, lots of rendering primitives have been reimplemented using SSE and Neon. This affects the raster graphics system in a very positive way.

The functions rewritten for SIMD are generally 2 to 4 times faster than the generic implementation. Microbenchmarks can be misleading, so to measure the impact on a realistic use case, I’ve used the WebKit benchmark suite.

On the “scrolling” test, we load each of the top 50 most visited web pages and scroll it up and down. For this test I get the following improvement compared to Qt 4.6 compiled without any SIMD:

[Chart: Performance improvement of Qt 4.7]

The tests have been run with the same version of QtWebKit (WebKit trunk) in all cases to remove the influence of the improvements done in the engine.

Compiling with SIMD

You do not have to do anything special to benefit from these improvements. When you build Qt, the configure script detects which features are supported by the compiler. You can see which extensions are supported in the summary printed on the command line.

Supporting the CPU extensions at compile time does not mean they will be used. When an application starts, Qt detects what is available, and sets up the fastest functions available for the current processor.
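The idea of choosing an implementation at startup can be sketched with a function pointer. This is only an illustration, not Qt's actual detection code: the names AddFunction, addGeneric, addSse2 and selectAddFunction are hypothetical, and the runtime check assumes the GCC/Clang builtin __builtin_cpu_supports, which exists only on x86 targets. On any other compiler or architecture the sketch simply returns the generic function.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical signature shared by all implementations.
using AddFunction = void (*)(const std::uint32_t *, const std::uint32_t *,
                             std::uint32_t *, std::size_t);

// Generic version: works on every processor.
static void addGeneric(const std::uint32_t *a, const std::uint32_t *b,
                       std::uint32_t *c, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        c[i] = a[i] + b[i];
}

#if defined(__SSE2__) && defined(__GNUC__)
// SSE2 version: only built when the compiler supports the extension.
static void addSse2(const std::uint32_t *a, const std::uint32_t *b,
                    std::uint32_t *c, std::size_t count)
{
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(c + i), _mm_add_epi32(va, vb));
    }
    for (; i < count; ++i) // scalar tail
        c[i] = a[i] + b[i];
}
#endif

// Called once at startup: pick the fastest function the current CPU can run.
AddFunction selectAddFunction()
{
#if defined(__SSE2__) && defined(__GNUC__)
    if (__builtin_cpu_supports("sse2")) // runtime check (GCC/Clang builtin)
        return addSse2;
#endif
    return addGeneric;
}
```

In a real project the SSE2 function would typically live in its own source file built with SSE2 enabled, so that the rest of the binary still runs on processors without the extension. The sketch keeps everything in one file for brevity.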

With more SSE, we have more code sensitive to alignment. Unfortunately, some compilers have bugs regarding the alignment of vectors. Having a recent compiler is a good idea to get the best performance, while avoiding crashes.
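To make the alignment sensitivity concrete, here is a minimal sketch assuming a C++11 compiler. The aligned intrinsics _mm_load_si128 and _mm_store_si128 require 16-byte aligned addresses and fault otherwise, so the buffers are declared with alignas(16). The names Buffers, isAligned16 and addAligned are hypothetical and do not come from Qt.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical helper: true if the pointer sits on a 16-byte boundary.
inline bool isAligned16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical buffers; alignas asks the compiler for 16-byte alignment.
struct Buffers {
    alignas(16) std::uint32_t a[256];
    alignas(16) std::uint32_t b[256];
    alignas(16) std::uint32_t c[256];
};

void addAligned(Buffers &buf)
{
    // A compiler that mishandles the alignment request would trip here
    // instead of crashing inside the vector loop.
    assert(isAligned16(buf.a) && isAligned16(buf.b) && isAligned16(buf.c));
#if defined(__SSE2__)
    for (int i = 0; i < 256; i += 4) {
        // Aligned loads and store: valid only on 16-byte boundaries.
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i *>(&buf.a[i]));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i *>(&buf.b[i]));
        _mm_store_si128(reinterpret_cast<__m128i *>(&buf.c[i]), _mm_add_epi32(va, vb));
    }
#else
    for (int i = 0; i < 256; ++i)
        buf.c[i] = buf.a[i] + buf.b[i];
#endif
}
```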

Future

We are not done with improvements just yet. The most common functions have been optimized, but lots of less common paths can also be improved. For the last month, every week I think I am almost done, and Andreas pokes me with a new interesting use case. Those improvements are making their way to the 4.7 branch, and you can already expect 4.7.1 to be a little faster than the upcoming Qt 4.7.0.


24 comments

Tsiolkovsky says:

I’m always glad to see all these performance improvements in Qt. Do you also use KDE apps (like games) and the desktop as use cases for further improvement? Oh, and what about the X11/XRender backend? Is this one getting any love recently?

laborer2008 says:

What do you say about platform-specific SIMD (e.g. IWMMX)?

mkretz says:

(Experimenting with this comment system until I get it to display my text…)

Hi, nice to see SIMD getting more use! Working with SSE is my daily bread and I can only confirm that compilers can be a large obstacle… 🙁

Two remarks:
1. In the code you posted I recommend using aligned loads and stores if at all possible:

quint32 a[256] __attribute__((aligned(16)));
quint32 b[256] __attribute__((aligned(16)));
quint32 c[256] __attribute__((aligned(16)));
// […]
for (int i = 0; i
Then instead of three movdqu + one paddd instruction you’ll get two movdqa + one paddd instruction because the paddd can take one memory argument (but only if it’s aligned). Also movdqa is faster than movdqu, especially when you get cacheline splits (which you get on every fourth address: 64 Bytes/16 Bytes).

If you don’t have the memory allocation under your control it is often still better to initially do a few scalar calculations and then execute the loop with aligned memory. Alternatively you can do aligned loads and use palignr to get the unaligned vectors.

2. You might be interested in http://gitorious.org/vc or http://compeng.uni-frankfurt.de/index.php?id=vc
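The scalar prologue described in the first remark above can be sketched as follows. This is an illustrative reading of the comment, not code from Qt or Vc, and the names isAligned16 and addWithAlignedStores are hypothetical. Since three independent pointers cannot all be aligned by skipping the same number of elements, the sketch only aligns the destination and keeps unaligned loads for the two sources.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical helper: true if the pointer sits on a 16-byte boundary.
inline bool isAligned16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical function: c[i] = a[i] + b[i] with aligned stores to c.
void addWithAlignedStores(const std::uint32_t *a, const std::uint32_t *b,
                          std::uint32_t *c, std::size_t count)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    // Scalar prologue: at most three elements until c + i is 16-byte aligned.
    for (; i < count && !isAligned16(c + i); ++i)
        c[i] = a[i] + b[i];
    // Vector loop: unaligned loads for the sources, aligned store for c.
    for (; i + 4 <= count; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_store_si128(reinterpret_cast<__m128i *>(c + i), _mm_add_epi32(va, vb));
    }
#endif
    // Scalar epilogue: whatever is left after the last full vector.
    for (; i < count; ++i)
        c[i] = a[i] + b[i];
}
```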

Benjamin says:

@Tsiolkovsky
I profile KDE apps from time to time actually. I have never looked into games.

@laborer2008
Any extension can be supported by writing the dedicated functions of qdrawhelper. IWMMX is not our main focus.

@mkretz
Yep, aligned loads are important to maximize the performance. I just tried to keep the example simple. In the code of Qt, we align the vectors when that makes sense.

mkretz says:

(This blog doesn’t like my comments)

Since you confirmed already that you’re using aligned access where possible I will leave out the rest of my first remark…

But I still wanted to point you at http://gitorious.org/vc or http://compeng.uni-frankfurt.de/index.php?id=vc . It might be interesting to incorporate some (all?) of the ideas there to make your code nicer. Or even internally use a copy of Vc to make your code more readable. Anyway, keep on vectorizing… 🙂

justin says:

Does the SIMD support include Altivec/VMX for PowerPC?

Will Qt 4.7 make use of the vector capabilities of modern ARM chipsets?

Benjamin says:

I had a quick look at Vc. I am afraid the license is an obstacle, since it will conflict with the commercial offers.

We also had to fight with practical compiler problems. For example, we had to replace the templates and inline functions with macros, due to a bug in a popular compiler. Since we don’t choose which compilers we support, we sometimes have to find nasty workarounds.

We got some help from Intel regarding the performance of our functions. Zvi, an engineer for Intel, helped us to improve our blending function. I invite you to look at the SSE code of Qt, you could also find ways to improve the code.

Benjamin says:

> Does the SIMD support include Altivec/VMX for PowerPC?

No, we don’t have support for Altivec.

> Will Qt 4.7 make use of the vector capabilities of modern ARM chipsets?

The instruction set “Neon”, mentioned in the blog post, is specific to modern ARM. Samuel has continuously implemented new optimized functions for ARM since Qt 4.6.0. Some of the performance updates for ARM are already in Qt 4.6.2 and 4.6.3.

Given the strong position Qt has on phones and embedded systems, the support for ARM is actively improved.

SABROG says:

When I last tried to compile Qt 4.6.0 with MMX/SSE2 instructions on the MinGW compiler, I got a runtime crash. SSE, 3DNow! etc. are good, but how about technologies like CUDA?

bibr says:

Way to go, Benjamin! Any useful references for those more interested in learning how to use SIMD in general?

mkretz says:

Re: Vc license
Since Vc is all inline code the LGPL here becomes much less restrictive (like for Eigen): http://eigen.tuxfamily.org/index.php?title=Licensing_FAQ#But_the_LGPL_is_a_very_complex_license.21
In any case, if you really find this useful just talk to me about it. I own the copyright and can add more licenses if needed. 🙂 And you already incorporated my LGPL Phonon code… 🙂

@bibr:
Look for an icc manual. It includes documentation for all SSE intrinsics. You can find more good information in the Intel Optimization Reference Manual (more than just about vectorization). And then I recommend you take a look at the Vc examples: http://gitorious.org/vc/vc/trees/master/examples . It will give you an idea how SIMD can be applied to a given problem.

Olivier Goffart says:

@Tsiolkovsky No, the X11/XRender paint engine is not getting any love.

Benjamin says:

> but how about technologies like CUDA?

CUDA and OpenCL are a different kind of beast. They can’t really be applied to the same range of problems.

> Any useful references for those more interested in learning how to use SIMD in general?

I think the documentation is quite poor on the subject. Reading functions in their C version next to the Neon versions written by Samuel really helped me understand the tricks for bending certain problems toward vectorization.

SABROG says:

If Nokia is working closely with Intel, then perhaps they will provide experts who can suggest how best to optimize the code for their processors?

Josh says:

Is this code already in 4.7 Beta 2 (or Beta 1 for that matter), or can we expect further speedups in this area when 4.7 final is released? Thanks!

Benjamin says:

@SABROG
We are starting to get some help from Intel. Hopefully we will have more in the future.

> Is this code already in 4.7 Beta 2 (or Beta 1 for that matter), or can we expect further speedups in this area when 4.7 final is released?

For ARM with Neon, Samuel already improved the 32-bit image blending in Qt 4.6.
-beta 2: improve text blending, image blending, transformed image blending and blending of ARGB32_PM over RGB16
-4.7.0: improve memrotate, which speeds up rotation of images by a multiple of 90 degrees. Improve source over with a solid color (for QPainter::fillRect() for example).
-4.7.1: will improve the composition mode “Plus”

For x86 and x86_64, you have the following:
-beta 2: improve the blending of images with the composition mode source over (QPainter::drawImage() and QPainter::drawPixmap())
-4.7.0: improve QPainter::fillRect(), and the general blending with SourceOver
-4.7.1: will improve image decoding and color conversions (including the work of a contributor: John Brooks). Another round of speed improvement of blending by using SSSE3. Improve the composition mode “Plus”.
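The composition mode “Plus” mentioned in these lists is a good example of an operation that maps naturally to SIMD: the source and destination channels are added and clamped. The sketch below assumes 32-bit pixels with four 8-bit channels and no additional opacity factor, and uses the SSE2 saturating add _mm_adds_epu8 to process 16 channels (four pixels) at once. The names plusPixel and compositePlus are hypothetical; this is not Qt's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Scalar reference: add each 8-bit channel, clamping the sum at 255.
static std::uint32_t plusPixel(std::uint32_t s, std::uint32_t d)
{
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        std::uint32_t sum = ((s >> shift) & 0xffu) + ((d >> shift) & 0xffu);
        if (sum > 255u)
            sum = 255u;
        result |= sum << shift;
    }
    return result;
}

// Hypothetical function: dest[i] = saturated(src[i] + dest[i]) per channel.
void compositePlus(std::uint32_t *dest, const std::uint32_t *src, std::size_t count)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    // Four pixels per iteration; the saturation is done by the instruction.
    for (; i + 4 <= count; i += 4) {
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i *>(dest + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dest + i), _mm_adds_epu8(s, d));
    }
#endif
    // Remaining pixels (or all of them without SSE2) use the scalar version.
    for (; i < count; ++i)
        dest[i] = plusPixel(src[i], dest[i]);
}
```

The scalar version needs a loop, a comparison and several shifts per pixel, while the vector version is a single saturating add for four pixels. That difference is typical of the speedups discussed in the post.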

Benjamin says:

To be complete, I should mention the data of the chart have been generated with the current branch of 4.7, which will be 4.7.1, not the branch 4.7.0. The rendering of 4.7.1 is already a bit faster than 4.7.0.

mariuz says:

Why not improve rendering performance on OpenGL, or use OpenCL for raster? Just thinking.
I wonder why rendering is slower with OpenGL in some cases. That would be a nice article.

Benjamin says:

@mariuz
Tom did a very good post on rendering QPainter calls with OpenGL: http://labs.trolltech.com/blogs/2010/01/06/qt-graphics-and-performance-opengl/
It is even worse with OpenCL.

There is research being done on using OpenGL to render very complex scenes where raster does not shine. For example, No’am is implementing support for compositing of WebKit layers with OpenGL. The graphics team is looking into new ways to build scenes to make the best out of OpenGL: http://labs.trolltech.com/blogs/2010/05/18/a-qt-scenegraph/

Romain says:

Excellent work !

I make heavy use of SIMD with QImage painting in 4.6.3, both on x86/Atom and ARM/Cortex, so I can’t wait to see it work even better with 4.7 🙂

Thanks a lot for the “SIMD roadmap” in the comments, very useful piece of info.

One suggestion if you get bored after 4.7.1 : accelerated scrolling for QGraphicsItem. Currently there’s no real scrolling acceleration, it’s just painting the item again. But for cached items in some cases it’s possible to use a memmove()-based scrolling, or maybe a SIMD equivalent.

mariuz says:

I have tested the compositing speed on Google Chrome 7.0.x and it seems to be the fastest thing I have on my PC

http://lug-mures.googlegroups.com/web/chrome7.0.503.1-enable-accelerated-compositing.png

There are some rendering issues, but overall I think that is the path: do all the rendering in OpenGL.
Maybe something is wrong in the way Qt does OpenGL rendering in the current versions.

Vincenzo says:

Ulrich Drepper has written a pretty impressive article about cpu and memory
http://www.akkadia.org/drepper/cpumemory.pdf

Benjamin says:

@mariuz
You are mixing concepts here. The accelerated compositing you mention is what No’am already blogged about, not the CPU rasterization. And as discussed before, doing everything with GL is a bad idea with the current scene, and that is not what Chrome does.

@Vincenzo
Thanks for the article, I will have a look.

Commenting closed.
