Qt Graphics and Performance - The Raster Engine

Today's topic is the raster engine, Qt's software rasterizer. It's the reference implementation and the only paint engine that implements all possible feature combinations that QPainter offers.

History

The story of Qt's software engine started around December 2004, if my memory serves me. My colleague Trond and I had been working for a while on the new painting architecture for Qt 4, codenamed "Arthur". Trond had been working on the X11 and OpenGL 1.x engines and I was focusing on the combined Win32 GDI/GDI+ engine along with QPainter and surrounding APIs. We had introduced a few new features, such as antialiasing, alpha transparency for QColor, full world transformation support and linear gradients. As few of these new features were supported by GDI, it meant that using any of these features implied switching to GDI+, which at the time was insanely slow, at least on all the machines we had in the Oslo office back then. Actually, enabling the GDI advanced graphics mode to do transformations was also not very fast.

Then we came upon this toolkit called Anti-Grain Geometry (AGG) which did everything in software, in plain C++, and we were just amazed at what it could do. Our immediate reaction was to curl up on the floor in agony, thinking that we were going about this all wrong. Using these native APIs was not helping us at all. In fact it was preventing us from getting the feature set we wanted with a performance that was acceptable. Once we settled down again, our first idea was to try to implement a custom AGG paint engine which would just delegate all drawing into the AGG pipeline. But alas, the template nature of the AGG API combined with the extremely generic QPainter API bloated up into a pipeline that didn't perform nearly as well as the demos we had seen.

So we took our Christmas vacation and started over in January of 2005. Still quite depressed over a new feature set that didn't perform, combined with being limited to a minimal subset of the native APIs, I went to Matthias and Lars and asked if I could get three weeks of time to hack together a software-only paint engine as a proof of concept. I got an "OK" and spent the following weeks implementing software pixmap transformation, bi-linear filtering and clipping support in the crudest possible way. Three weeks later I had a running software paint engine and quite proudly announced that I was "just about done". I've reconstructed an image of how I remember it:

[Image: groupboxes]

The system clipping was all over the place, bitmap patterns were broken, but perhaps worst of all, all text was rendered using QPainterPaths, and all drawing was antialiased. Despite it not looking 100% good, the performance of the various features was pretty OK. It was agreed that this was a good start, but that we needed a bit more work. And so started the sprint for the Qt 4.0 beta a few months later.

The initial version that was released with Qt 4.0 worked quite well in terms of features, but in hindsight the performance was far from what our users demanded from Qt. As a result, we drew a lot of criticism over the first year of Qt 4.0. Since then, we've done a lot of optimization work, and I mean a LOT, and my gut feeling is that raster is now the engine that performs the best for average Qt usage, so I think we made a good choice back then in dropping GDI and GDI+. And, as I outlined in my previous post, we are toying with making raster the default across all desktop systems for the sake of speed and consistency.

Overall structure

The overall structure of the engine is that all drawing is decomposed into horizontal bands with a coverage value, called spans. Many spans will together form the "mask" for a shape and each pixel that is inside the mask is filled using a span function.

[Image: antialiasing]

The image highlights one scanline of a polygon which is filled with a linear gradient. There are 4 spans: one which fades in the opacity of the polygon, one for the fully covered interior and two which fade the opacity out again. For each pixel in the polygon, the gradient function is called and we write the pixel to the destination, alpha blending it if the coverage value is less than full opacity or if the pixel we got from the gradient function contains alpha.
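To make the span idea concrete, here is a minimal sketch in plain C++ (no Qt) of a span list being filled with a solid color. The `Span` layout and the names `scalePixel` and `fillSolidSpans` are made up for illustration and are not the engine's internal types. It assumes a premultiplied ARGB32 buffer and the SourceOver rule.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A span: a horizontal run of pixels on one scanline sharing one coverage value.
// Field names are illustrative only.
struct Span {
    int x;                 // first pixel of the run
    int len;               // number of pixels in the run
    int y;                 // scanline
    std::uint8_t coverage; // 0 = outside the shape, 255 = fully inside
};

// Multiply every 8-bit channel of a premultiplied ARGB pixel by a/255, rounded.
inline std::uint32_t scalePixel(std::uint32_t p, std::uint32_t a)
{
    std::uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t c = (p >> shift) & 0xffu;
        out |= ((c * a + 127u) / 255u) << shift;
    }
    return out;
}

// Span function for a solid, premultiplied ARGB32 color using SourceOver.
void fillSolidSpans(std::uint32_t *pixels, int width, int height,
                    const std::vector<Span> &spans, std::uint32_t color)
{
    for (const Span &s : spans) {
        assert(s.y >= 0 && s.y < height && s.x >= 0 && s.x + s.len <= width);
        std::uint32_t *dst = pixels + s.y * width + s.x;
        // Fold the coverage into the source once per span, then blend:
        // result = src + dst * (1 - srcAlpha).
        const std::uint32_t src = scalePixel(color, s.coverage);
        const std::uint32_t invAlpha = 255u - (src >> 24);
        for (int i = 0; i < s.len; ++i)
            dst[i] = src + scalePixel(dst[i], invAlpha);
    }
}
```

A scanline like the one in the image would be passed as a few short spans with partial coverage at the edges and one long span with coverage 255 in the middle. A gradient fill would differ only in computing a new source pixel for each `i` instead of reusing `src`.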

Clipping also uses the same mechanism. The span function for clipping takes the incoming spans, intersects them with the set of spans that defines the clip and calls the actual filling span function.

[Image: clipspans]
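The clip step described above can be sketched as a span function that wraps another span function. This is a simplified illustration in plain C++, not the engine's code: `Span`, `SpanFunction` and `clipSpans` are hypothetical names, and the brute-force inner loop stands in for whatever lookup a real implementation would use.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

struct Span { int x; int len; int y; unsigned char coverage; };

using SpanFunction = std::function<void(const std::vector<Span> &)>;

// Intersect the incoming spans with the clip spans and pass whatever is left
// on to the real filling function.
void clipSpans(const std::vector<Span> &incoming, const std::vector<Span> &clip,
               const SpanFunction &fill)
{
    assert(fill != nullptr);
    std::vector<Span> out;
    for (const Span &s : incoming) {
        for (const Span &c : clip) {
            if (c.y != s.y)
                continue;
            const int left = std::max(s.x, c.x);
            const int right = std::min(s.x + s.len, c.x + c.len);
            if (left >= right)
                continue; // no overlap on this scanline
            // Combine the two coverage values (both are in the range 0..255).
            const int coverage = (s.coverage * c.coverage + 127) / 255;
            out.push_back({left, right - left, s.y,
                           static_cast<unsigned char>(coverage)});
        }
    }
    if (!out.empty())
        fill(out);
}
```

Because the clip is just another stage in front of the fill, the filling function does not need to know whether clipping is active.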

All operations follow this pattern. When a drawRect call comes in, we generate a list of spans for each scan line and set up a span function according to the current brush. A pixmap is similar: we create a list of spans and use a pixmap span function. A polygon is passed to a scan converter which produces a span list, etc. We have two scan converters, one for antialiased and one for aliased drawing. The antialiased one is pretty much a fork of FreeType's grayraster.c, with some minor tweaks. I think we needed to add support for odd-even fills, for instance. Text is also converted into spans.

Lines, Polylines and Path Strokes

These primitives are passed to a separate processor called a stroker. The stroker creates a new path that visually matches the fillable shape that the outline represents. There is a public API for this too, in QPainterPathStroker. This fillable shape is then passed to one of the scan converters, which in turn converts the shape into spans. For dashed outlines, the same process happens, and the resulting fillable shape is a path with a potentially very large number of subpaths. Naturally, such a path is costly to scan convert, which is part of the reason why we explicitly do not put dashed lines on the list of high-performance features. In fact, in many cases, line dashing is one of the slowest operations available in the raster engine, so use it with extreme caution.
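To illustrate what "a fillable shape that matches the outline" means, here is a deliberately tiny sketch that strokes a single line segment with flat caps. It is not how QPainterPathStroker is implemented. `Point` and `strokeSegment` are hypothetical names, and joins, round or square caps and curves are all left out.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Point { double x; double y; };

// Turn the segment a-b, drawn with the given pen width and flat caps, into the
// four corners of the rectangle that a scan converter would then fill.
std::array<Point, 4> strokeSegment(Point a, Point b, double width)
{
    const double dx = b.x - a.x;
    const double dy = b.y - a.y;
    const double length = std::hypot(dx, dy);
    assert(length > 0.0 && width > 0.0);
    // Normal to the segment, scaled to half the pen width.
    const double nx = -dy / length * (width / 2.0);
    const double ny = dx / length * (width / 2.0);
    return {{ {a.x + nx, a.y + ny}, {b.x + nx, b.y + ny},
              {b.x - nx, b.y - ny}, {a.x - nx, a.y - ny} }};
}
```

With a dash pattern, every dash would turn into its own small polygon like this one, which gives a feel for why a long dashed line produces so many subpaths.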

A hacky alternative which performs much better is to set a 2x2 black/white or black/transparent pixmap brush and draw the stroke using a pen with that brush. A bit more to set up, but if that's what it takes to get it running fast, then that's what it takes.

State changes

Any setBrush, setTransform or other state change on QPainter will result in a different set of span functions being set up. Each brush (or fill type, if you like, since pens are essentially just fills at this level) has a special span function associated with it, and we also pass per-brush span data. For solid color fills the span data contains the color; for transformed pixmap drawing it contains the inverse matrix, a source pixel pointer, bytes per line and other required information. For clips it contains the span function to call after the spans have been clipped. The thing to notice about state changes is that each time you switch from one brush to another or from one transformation to another, these structures need to be updated. Up to Qt 4.4, this was in many cases a noticeable performance problem, showing up as 10-15% in profilers when rendering graphics view scenes, but since 4.5 the impact of this is minimal.

Well, perhaps not minimal compared to drawing a 2 pixel long line, but minimal compared to filling a 64x64 rectangle. The point is that though the raster engine is probably the engine that handles state changes best of all our engines, there are some use cases where the cost still shows up, so state changes should still be minimized.
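A rough model of why state changes cost something: each brush change has to pick a span function and rebuild the data that function reads. The sketch below is a toy in plain C++, not the engine's structures. `SpanData`, `ToyPainter` and the two fill functions are invented names, and the "texture" fill ignores position and transformation entirely.

```cpp
#include <cassert>
#include <cstdint>

struct SpanData;
using SpanFunc = void (*)(std::uint32_t *dst, int len, const SpanData &data);

// Per-brush data handed to the span function. Members are illustrative only.
struct SpanData {
    SpanFunc fill = nullptr;
    std::uint32_t solidColor = 0;           // used by solid fills
    const std::uint32_t *texture = nullptr; // used by texture fills
    int textureWidth = 0;
};

void fillSolid(std::uint32_t *dst, int len, const SpanData &d)
{
    for (int i = 0; i < len; ++i)
        dst[i] = d.solidColor;
}

void fillTexture(std::uint32_t *dst, int len, const SpanData &d)
{
    for (int i = 0; i < len; ++i)
        dst[i] = d.texture[i % d.textureWidth];
}

// A toy painter that counts how often the span data has to be rebuilt.
class ToyPainter {
public:
    void setSolidBrush(std::uint32_t color)
    {
        m_data = SpanData{};
        m_data.fill = fillSolid;
        m_data.solidColor = color;
        ++m_rebuilds;
    }
    void setTextureBrush(const std::uint32_t *texture, int width)
    {
        assert(texture && width > 0);
        m_data = SpanData{};
        m_data.fill = fillTexture;
        m_data.texture = texture;
        m_data.textureWidth = width;
        ++m_rebuilds;
    }
    void drawSpan(std::uint32_t *dst, int len) const
    {
        assert(m_data.fill && "set a brush before drawing");
        m_data.fill(dst, len, m_data);
    }
    int rebuilds() const { return m_rebuilds; }

private:
    SpanData m_data;
    int m_rebuilds = 0;
};
```

Drawing a thousand tiny primitives while alternating between two brushes rebuilds the span data a thousand times in this model. Grouping the draws by brush rebuilds it twice.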

Span functions

The task of the span functions is to generate a pixel and combine it with the destination according to the current state of the painter. Though the raster engine supports rendering to any of our image formats except 8-bit indexed, it will internally do all rendering in ARGB32_Premultiplied. Premultiplied alpha has the benefit that we don't have to multiply the alpha into the color channels and it saves us a division in the blending. The reason for doing all rendering in one format is that the alternative simply doesn't scale. Just think of the combination of composition modes multiplied with the number of image formats a source image can have multiplied with what formats the destination can have. To support all combinations we have a generic approach where we for each span do:

  • Get the source pixels, e.g. from a gradient, pixmap, image or solid color, and convert them to ARGB32_Premultiplied.
  • Get the destination pixels and convert them to ARGB32_Premultiplied.
  • Blend the source into the destination using the current composition mode.
  • Convert the result to destination format and write it back.
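The four steps above can be sketched for one concrete combination: a straight-alpha ARGB32 source drawn with SourceOver onto an RGB16 (5-6-5) destination. This is a simplified illustration in plain C++, and all function names are made up for the example. It also shows why premultiplied alpha is convenient: the blend itself is one multiply and one add per channel, with no division by the resulting alpha.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u32 = std::uint32_t;
using u16 = std::uint16_t;

inline u32 mul255(u32 c, u32 a) { return (c * a + 127u) / 255u; }

// ARGB32 (straight alpha) -> ARGB32_Premultiplied.
inline u32 premultiply(u32 p)
{
    const u32 a = p >> 24;
    return (a << 24) | (mul255((p >> 16) & 0xffu, a) << 16)
         | (mul255((p >> 8) & 0xffu, a) << 8) | mul255(p & 0xffu, a);
}

// RGB16 (5-6-5) -> opaque ARGB32_Premultiplied.
inline u32 fromRgb16(u16 p)
{
    const u32 r = (p >> 11) & 0x1fu, g = (p >> 5) & 0x3fu, b = p & 0x1fu;
    // Replicate the high bits into the low bits to expand to 8 bits per channel.
    return 0xff000000u | (((r << 3) | (r >> 2)) << 16)
         | (((g << 2) | (g >> 4)) << 8) | ((b << 3) | (b >> 2));
}

// ARGB32_Premultiplied -> RGB16, dropping alpha and the low color bits.
inline u16 toRgb16(u32 p)
{
    return static_cast<u16>(((p >> 8) & 0xf800u) | ((p >> 5) & 0x07e0u)
                            | ((p >> 3) & 0x001fu));
}

// SourceOver on premultiplied pixels: result = src + dst * (1 - srcAlpha).
inline u32 sourceOver(u32 src, u32 dst)
{
    const u32 inv = 255u - (src >> 24);
    u32 out = 0;
    for (int shift = 0; shift < 32; shift += 8)
        out |= (((src >> shift) & 0xffu) + mul255((dst >> shift) & 0xffu, inv)) << shift;
    return out;
}

void blendSpanGeneric(u16 *dest, const u32 *srcArgb32, int len)
{
    assert(dest && srcArgb32 && len >= 0);
    const std::size_t n = static_cast<std::size_t>(len);
    std::vector<u32> src(n), dst(n);
    for (std::size_t i = 0; i < n; ++i) src[i] = premultiply(srcArgb32[i]);  // step 1
    for (std::size_t i = 0; i < n; ++i) dst[i] = fromRgb16(dest[i]);         // step 2
    for (std::size_t i = 0; i < n; ++i) dst[i] = sourceOver(src[i], dst[i]); // step 3
    for (std::size_t i = 0; i < n; ++i) dest[i] = toRgb16(dst[i]);           // step 4
}
```

Every pixel is touched four times and two temporary buffers are needed, which is the cost of handling any combination of formats through one code path.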

This may seem like a lot of work, so luckily the story doesn't end there.

Special casing and Optimizations

As I outlined in the QPainter documentation patch that I added recently, which was the start of this blog series, it's all about defining which scenarios we want to be fast and which scenarios we just need working. Over the years since the initial release of the raster engine in the summer of 2005, we've added tons of special cases to support what we experience as the functions that are called the most and which have the most impact.

  • First of all, if you look at the things we do for each span above, you see that we convert into ARGB32_Premultiplied. Solid colors are easy to represent, gradients are generated in this format directly, so conversion only happens for images and pixmaps. If the image is ARGB32_Premultiplied, then no conversion is needed, and we just use the scanline pointer directly, without any copying. Our RGB32 format is specified to be 0xffRRGGBB, with the alpha set to 0xff. This means it is pixel-wise compatible with ARGB32_Premultiplied, which again means that it can also be used directly. If the source is ARGB32, you'll get a memcpy for each scanline, where the ARGB32 data is copied into a temporary buffer and converted to ARGB32_Premultiplied. What can you take from that? Do not draw ARGB32 images into the raster engine. Secondly, don't open a painter on an ARGB32 image, as that implies the exact same conversion, but when reading and writing the destination pixels. Now you know why QPixmaps prefer to be in these formats too.
  • Source composition modes are special cased for most operations. For instance, we don't read the destination for source operations because we know there is no blending involved, unless the spans have partial coverage that is. This means that Source is effectively just a memory write.
  • SourceOver is usually special cased to be inlined and merged with the coverage opacity, so it is also usually faster than the other composition modes. As for the other optimizations down below, these only hold for Source and SourceOver, so if you want best performance, make sure that this is what you are using. SourceOver is the default in QPainter, by the way.
  • For gradients and pixmaps, we need to create an array of source data. For solid colors, it's just a single pixel, so this is faster. A solid source color also has the benefit that you only have to traverse memory for the destination, where you write to, so the cache misses are significantly reduced.
  • Rectangle fills are very common, both through QPainter::fillRect and through QPainter::drawRect. In 4.4 both of these implied a state change. Actually, fillRect implied two state changes because it set the brush to what was passed to fillRect and then set it back to what the painter state was. In 4.5, as part of this Falcon project, we introduced a new internal QPaintEngine subclass which supports a state-less fillRect with a color. This matches how applications normally use the painter anyway.
  • In addition to being stateless, the fillRect function is special cased for a number of use cases. For instance, for RGB16, we write two pixels at a time, and for Intel machines there is an SSE/MMX optimized version. The special cased fillRect also has the benefit that it doesn't require spans, it's just a tight 2D for loop, which also saves us quite a bit of work, at least if the spans are short.
  • Duff's Device. I cannot take credit for its addition, but it's used in a lot of different places in the raster engine today. It's all about loop unrolling. If you're not familiar with it yet, read up on it. It's a beautiful abuse of the C++ language to make things potentially faster.
  • Rectangular clipping is also special cased, at least as long as there is no transformation set on the painter. Translate is of course special cased, but scaling and rotating disable this optimization. The benefit we get from doing rectangular clipping is that finding the spans to fill is done on the QRect level, rather than on the per-span level, which makes it significantly faster.
  • So if you have Source or SourceOver, a non-perspective, non-smooth transform and the clip is a rectangular clip, you also get the benefit of our pixmap blend functions. These were added in Qt 4.5 and are the reason why pixmap drawing is quite a bit faster now than in the earlier versions. In Qt 4.5, we had blend functions for scale and translate only, and in Qt 4.6 we added rotations to the list as well. Again, we focus on a selected subset of formats, matching what QPixmap will be using, so we only have these for:
    • ARGB32_Premultiplied on ARGB32_Premultiplied
    • ARGB32_Premultiplied on RGB32
    • ARGB32_Premultiplied on RGB16
    • ARGB8565_Premultiplied on RGB16
    • RGB32 on RGB32
    • RGB16 on RGB16

    I think that was all of them.

  • The outlines are processed via the stroker in the general case. However, there are again a number of special cases where we drop to doing a midpoint-algorithm instead. Lines, polylines and paths that only contain line segments will be rendered using the fast midpoint approach as long as the pen width is equal to or less than 1. We also support dashing line segments for 1 pixel wide lines using this method. For any pen width greater than 1, curved paths or antialiasing, we drop to the stroker approach which works, but is far less optimal. Actually, I think there is a special-case for antialiased dashed lines too, as long as they are thin.
  • When antialiasing is enabled, we often need to fall back to the stroker for outlines, which is quite a bit slower than the plain case. In addition to that, a lot more spans are generated for antialiased content, due to the fade-in, fade-out effect on the edge of the primitive, so expect antialiasing to be a significant cost.
  • Text drawing is since 4.5 highly optimized for most engines, to the point where the major bottleneck these days is doing the actual text layout on the string. We're working on an API to cache this, so text drawing can be made truly fast, but based on the current API, it's as good as it gets. However, if the transformation is a rotate/scale, then we fall back to path drawing. Only the Windows version of the raster engine supports drawing glyphs at rotated angles using the fast paths, so beware of that.
    A lot of details, but it gives an idea of what to consider when you write code for this engine. If all you are drawing is 1024x1024 pixmaps, then none of these things matter, because all the time is spent in the span function that does pixmap blending anyway. But the second you have more content, several lines, several polygons, which are smaller in size, then these things are critical to achieve good performance.
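Two of the items in the list above, the span-free fillRect and Duff's Device, fit into one small sketch. This is an illustration in plain C++ under simplifying assumptions (32-bit pixels, Source composition, no clipping), not the code from the engine. `fillPixels` and `fillRect` here are free functions invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Fill `count` pixels with `value`, unrolled eight times using Duff's Device.
inline void fillPixels(std::uint32_t *dst, std::uint32_t value, int count)
{
    if (count <= 0)
        return;
    int n = (count + 7) / 8; // number of passes through the unrolled body
    switch (count % 8) {     // jump into the middle of the loop for the remainder
    case 0: do { *dst++ = value; // every case falls through to the next on purpose
    case 7:      *dst++ = value;
    case 6:      *dst++ = value;
    case 5:      *dst++ = value;
    case 4:      *dst++ = value;
    case 3:      *dst++ = value;
    case 2:      *dst++ = value;
    case 1:      *dst++ = value;
            } while (--n > 0);
    }
}

// A span-free rectangle fill: a tight loop over rows, one unrolled fill per row.
// `stride` is the number of pixels per scanline of the destination buffer.
void fillRect(std::uint32_t *pixels, int stride, int x, int y, int w, int h,
              std::uint32_t color)
{
    assert(pixels && x >= 0 && y >= 0 && x + w <= stride);
    std::uint32_t *row = pixels + y * stride + x;
    for (int j = 0; j < h; ++j, row += stride)
        fillPixels(row, color, w);
}
```

Whether the unrolling actually helps depends on the compiler and the CPU, which is why the list calls it only "potentially faster". Modern compilers often unroll or vectorize a plain loop on their own.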

    The overall performance of the engine, when used according to how it's outlined above, can be thought of as:

    Overhead + O(pixelsTouched * memoryAndBusCapacity)

    There is nothing scientific about that formula, but when you're hitting the optimal path, all time should be spent in one of the many for loops inside qdrawhelper_xxx.cpp or even better qblendfunctions.cpp. These loops will spend all their time on per pixel processing. If these functions could be made faster by doing the algorithms slightly differently, then great, but if you see in your profiling that all time is spent in for instance qt_blend_argb32_on_argb32, then that means you told us to blend alpha pixmaps together and we're doing that as fast as we can and you have zero loss between your app and actual processing. If all time is spent processing pixels, then that is a good thing. The overhead here is the time spent in state changes, function call overhead, and similar.

    Some numbers

    I got some feedback on one of the previous blogs that a few bar charts would be nice, so I'll post some numbers on what kind of throughput is possible with the raster paint engine. I've timed it on both my Windows desktop machine and on my N900 to get a comparison. The operations range from several million per second to only a few hundred, so the scale is logarithmic. Keep that in mind as you look at them.

    [Image: Raster Results]

    As you can see, the fill-rate is more or less tied to the number of pixels involved. For some operations it takes a little bit longer to do something, like drawPixmap with scaling being somewhat slower than drawPixmap without, but you see that the rough formula I gave above holds quite often. Double the size of the primitive in each direction and you have one quarter the performance. It was also not my intention to trick you by using different numbers for drawPixmap, it's just how the test was set up.

    If you compare the three 4x4 rectangle drawing versions, you see that they differ when the rectangles are small. drawRect without brush change is fastest at around 7.4Mops/sec, followed by fillRect at ~6.1Mops/sec and then drawRect with brush change at 1.8Mops/sec. At 128x128 there is only a small difference between them, which is what I was getting at with the state changes above. It is possible to do them and if you're drawing semi-large areas, it doesn't matter, but if you're plotting pixels, doing loads of small lines here and there or particle effects with 8x8 pixmaps, then you want to do that in a tight loop with nothing else happening.

    You can also see that the speed of non-smooth scaling is holding its own vs non-scaled pixmap drawing.

    Finally, if you compare the N900 to the desktop Windows machine, you see that despite the Windows machine only having a processor about 4 times faster, the N900 is often around 10 times slower. Why? Because the CPU isn't the only limitation, bus/memory capacity is also a limiting factor, and, to be honest, it's not a fair comparison...

    I hope you enjoyed this post and more will come in 2010.

