Braindump, Graphics View and the Joys of off-screen rendering

Published Friday February 27th, 2009
Posted in Graphics, Graphics View, KDE, Qt

Hi all, a short braindump from me here. Sometimes the best way to get things out of your head is to write them down. And there’s a cabin trip for Qt Software this weekend; I don’t want to be thinking about code (yeah right!).

Qt is designed to support both direct (framebuffer, software) rendering (e.g., Raster, our current default engine on Windows), and indirect rendering models such as X11, OpenGL, and perhaps other future engines such as OpenVG. Graphics View encapsulates structured graphics in Qt into a mid-level API that hides some of the painting details (while still giving you full control to do detailed painting). It provides an object model / scene graph, and an abstraction over the rendering pipeline that lets us do many neat tricks to make it easier to run the same application efficiently on several different rendering architectures. Because Qt decides when and how your item is drawn, Graphics View retains useful control (a minimal sketch follows the list below):

  • Qt calls QGraphicsView::paintEvent
  • QGraphicsView::paintEvent calls QGraphicsView::draw{Background,Items,Foreground}
  • each call goes to QGraphicsScene::draw{Background,Items,Foreground}
  • QGraphicsScene::drawItems calls QGraphicsItem::paint
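
To make that chain concrete, here’s a minimal, self-contained sketch (a toy example of mine, not one of the Qt demos): a custom item whose paint() sits at the bottom of the chain.

// Minimal sketch: Qt drives QGraphicsView::paintEvent ->
// QGraphicsScene::drawItems -> CircleItem::paint().
#include <QtGui>

class CircleItem : public QGraphicsItem
{
public:
    QRectF boundingRect() const
    {
        return QRectF(-20, -20, 40, 40);
    }
    void paint(QPainter *painter, const QStyleOptionGraphicsItem *, QWidget *)
    {
        // Graphics View decides when and how this gets called.
        painter->setBrush(Qt::blue);
        painter->drawEllipse(boundingRect());
    }
};

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);
    QGraphicsScene scene;
    scene.addItem(new CircleItem);
    QGraphicsView view(&scene);
    view.show();
    return app.exec();
}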

Speed. Software rendering can be very fast, and pixel perfect, although it doesn’t scale very well. For small devices, we’ve seen that software rendering can even outperform GL and do quite amazing things: fullscreen effects and blending, even blur. Still, the GL chipset can often do things faster. It’s just a matter of knowing what it can do, and how to make use of it. But it’s very rare that hardware acceleration gives you pixel perfection and 100% pixel-by-pixel control. The closest you get with paint engines other than Raster is either by rendering into an intermediate paint buffer (using Raster!) and passing that to the rendering system, or by using a shader/fragment program as is done in OpenGL 2.0 (which, on desktop systems with modern cards, produces very very good results!).

With indirect rendering models there are several classic “problems” that lead to slow, jerky, sometimes ugly-looking graphics if you assume that they work just like software rendering. Rotating a pixmap on X11 requires a client-server network roundtrip (or a Unix domain socket roundtrip). OpenGL has extensions that allow video playback, but the equivalent of a real-time software 2D/3D rendering engine pushing its pixels onto a GL context is absurd; it’s just not how GL works. The basic idea with indirect models is that you should try to store as much state as possible server-side (X11) or on the graphics card (OpenGL). Font glyphs, standard icons and transformable theme elements: push them over! Then, per frame, all you need to do is say where you want the elements drawn, and how.
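
To make “push them over” concrete, here’s the manual version of this idea as a minimal sketch (my own illustration, not from Qt’s examples): paint the expensive content into a QPixmap once (on X11 a pixmap lives server-side), and just reference it each frame.

// Manual caching sketch: render expensive vector content into a
// QPixmap once; every later frame only references the stored pixmap
// instead of re-sending the vector data.
#include <QtGui>

class CachedWidget : public QWidget
{
protected:
    void paintEvent(QPaintEvent *)
    {
        if (cache.isNull()) {
            cache = QPixmap(size());
            cache.fill(Qt::transparent);
            QPainter p(&cache);
            p.setRenderHint(QPainter::Antialiasing);
            p.setBrush(Qt::darkGreen);
            p.drawEllipse(rect().adjusted(10, 10, -10, -10)); // the "expensive" part
        }
        QPainter p(this);
        p.drawPixmap(0, 0, cache); // cheap: references the stored pixmap
    }
private:
    QPixmap cache;
};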

Now, QPainter has an imperative API that’s based on rendering vector and pixmap graphics to a device. If you want to support an indirect rendering model efficiently, you should render your contents into a buffer that can be passed to the graphics system, and then reference it when you need it. In Graphics View, because the rendering abstraction is slightly higher, you can avoid doing this by hand by enabling what we call “cache modes”. Notice that cache modes are implemented on top of QPainter, and QGraphicsView is a regular widget, so all Graphics View does is take the cool stuff in Qt and put it together in a way that’s hopefully useful. 😉


item->setCacheMode(QGraphicsItem::ItemCoordinateCache);

Cache modes configure two different types of off-screen buffers in order to accelerate item rendering:

  • ItemCoordinateCache / “Item cache”
    • Renders the item using an untransformed painter, into a pixmap that is “axis-aligned” with the item’s local coordinate system
    • The pixmap is truncated to the item’s local bounding rect
    • The resolution of the pixmap is configurable (see the snippet after this list)
    • The result is never pixel-perfect
    • The item is repainted only if
      • You call update() on the item
      • The item’s geometry changes (prepareGeometryChange(), or updateGeometry())
    • Examples: the Qt 4.5 “Boxes” demo (written by Kim Kalland) and Samuel Rødal’s WolfenQt use ItemCoordinateCache to allow transformations without repaints
  • DeviceCoordinateCache / “Device cache”
    • Renders the item using a transformed painter, into a pixmap that is “axis-aligned” with the viewport
    • The pixmap is truncated to the item’s _mapped_ bounding rect (mapped to the view)
    • The resolution of the pixmap is fixed and not configurable
    • The result is pixel-perfect (no visual difference from direct rendering)
    • The item is repainted only if
      • You call update() on the item
      • The item’s geometry changes (prepareGeometryChange(), or updateGeometry())
      • The item is rotated, scaled, sheared, or projected
    • Example: Plasmoids in KDE use DeviceCoordinateCache to avoid repainting when moving applets around
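
In code, each mode is a one-liner. The optional second argument to setCacheMode() is what makes the item cache’s resolution configurable:

// Item cache, with an explicit cache resolution (ItemCoordinateCache only):
item->setCacheMode(QGraphicsItem::ItemCoordinateCache, QSize(64, 64));

// Device cache: pixel-perfect, axis-aligned with the viewport:
item->setCacheMode(QGraphicsItem::DeviceCoordinateCache);

// And back to direct rendering:
item->setCacheMode(QGraphicsItem::NoCache);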

ItemCoordinateCache has a visual impact because it requires you to decide in advance what resolution your off-screen pixmap should have; that is the only way to avoid all repaints unless the item itself asks to be repainted. Here’s what the different cache modes end up presenting to the user:

[Figure: the colliding mice mouse rendered three ways: NoCache / DeviceCoordinateCache, ItemCoordinateCache, and low-res ItemCoordinateCache.]

You can compare DeviceCoordinateCache to rendering exactly what you see in the viewport, at that exact resolution, but into a secondary buffer first. This ensures that the item is rendered in a pixel-perfect way. However, it also means the item must be re-rendered whenever it is scaled, rotated, sheared, or projected. If you transform the item in a way that ends up being a pure pixel translation on screen, the pixmap is simply reused and redrawn at a new position.

The purpose of these modes is to avoid having to repaint the item. In any indirect graphics system, repainting the item requires some sort of roundtrip as image or vector data is transferred from one side of the graphics pipeline to the other, which is often expensive. QPainter translates into raw OpenGL calls or shader scripts when you draw onto a QGLWidget, and can (as the Pad Navigator demo shows) produce OpenGL UIs that run fast (on fast hardware!). However, on limited hardware with weak OpenGL chips and a slow graphics bus, as is common for embedded GL chipsets, the best approach may be to push textures to the “other side” and “remote control” them.

The “colliding mice” example uses direct rendering (i.e., no cache). You get the best performance from this example on Windows, or when running Qt 4.5.0 on Linux with “-graphicssystem raster”. If you switch to OpenGL, it will still run fast, but not as fast as you’d expect from an OpenGL-powered application. So maybe a cache mode would help? Well, DeviceCoordinateCache is unsuitable for this demo, as the mice are continuously rotated (invalidating the off-screen buffer at every frame). ItemCoordinateCache is a good match. With ItemCoordinateCache set on the mice, they don’t look as pixel-perfect as without it (especially using Raster), but graphics performance is very high; in fact, rendering the scene drops to taking ~0% of my desktop system’s resources (the bottleneck of the example becomes the collision detection). On embedded systems with no FPU, ItemCoordinateCache can also be faster, by the way, as the example is very floating-point- and path-heavy, and painting rotated images can be faster than doing all those float operations.
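
For reference, the change amounts to one added line per mouse. Here’s a sketch against the example’s main() loop (assuming the stock collidingmice sources):

// In the collidingmice example's main(): one added line per mouse.
for (int i = 0; i < MouseCount; ++i) {
    Mouse *mouse = new Mouse;
    mouse->setPos(::sin((i * 6.28) / MouseCount) * 200,
                  ::cos((i * 6.28) / MouseCount) * 200);
    mouse->setCacheMode(QGraphicsItem::ItemCoordinateCache); // the change
    scene.addItem(mouse);
}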

OT: I just want to mention at this point that some people have requested that cache modes become “implicit”, i.e., Qt should decide for you, and you shouldn’t need to toggle something as low-level as this. For a very high-level API I would agree. But at the abstraction level where Graphics View lives, QGV does not know whether you need pixel perfection or not, and it does not know whether you intend to rotate your items a lot. Only you, dear item author and user, know :-). And that’s why the API is there.

Now what’s happening on the research side with Cache Modes? We are currently researching subtree caching.

Graphics View’s cache modes work fine with simple items. But what about more complex items; items with children? Today, you need to explicitly set the right cache mode on each child. This isn’t an uncommon situation to be in: you stack items inside each other, as you typically do when creating forms or layout stacks. If you want to transform the whole thing, the optimal solution is to collapse the entire item subtree into a single texture, which is then transformed. In contrast, painting several smaller items (sometimes there are many of them! 50-100?) with a transformation can be very costly: either it’s sluggish (no cache) or it exhausts your texture memory (individual caches).

The “Embedded Dialogs” demo shows how a dialog, which represents a potentially complex subtree, is collapsed into a single surface to ensure that there are no repaints as the dialogs are transformed when hovering in and out. This is a “happy accident” though ;-): the QWidget subtree is proxied into a single QGraphicsItem. But it did get me and several others thinking: why doesn’t QGV support this for any kind of item subtree?

[Figure: deepcache.png]
On the left, the item is rendered using no cache (direct rendering). The middle image shows how each element is individually cached, which is the only approach you have today (Qt 4.2, 4.3, 4.5.0). On the right, all items are cached into a single surface allowing “no repaints”, while conserving texture memory.

Collapsing a subtree into a single offscreen buffer is possible. I’ve spent two days this week researching it, wrote some code, and ended up with a prototype that’s so ugly I don’t want to share it _just yet_. 😉 But I’ve confirmed it can be done without messing up QGV’s internals. I dubbed two new cache modes:

  • DeepItemCoordinateCache – caches the item and “all” children; no repaints for “any” child if the parent is transformed
  • DeepDeviceCoordinateCache – the same, but for DeviceCoordinateCache (a sketch follows below)
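
If this lands, usage could look like the following. To be clear, this is a hypothetical sketch: the enum value exists only in my prototype, and createForm() is an imaginary helper that builds a dialog-like subtree.

// Hypothetical sketch: collapse a whole subtree into one off-screen
// surface, so transforming the parent repaints nothing.
// DeepItemCoordinateCache is a prototype-only enum; createForm() is
// an imaginary helper returning the root of a dialog-like subtree.
QGraphicsWidget *form = createForm();
form->setCacheMode(QGraphicsItem::DeepItemCoordinateCache);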

During prototyping I found a few issues that need to be solved, but it’s not a huge job to make this work:

  • I currently have no idea how to handle ItemIgnoresTransformations
  • Window children probably shouldn’t be collapsed into the same cache
  • As children move, transform or update, the changes must be recorded in the caching parent’s offscreen buffer
  • Combining existing cache modes with new deep cache modes should work fine (i.e., the parent sets a deep cache while a child already has ItemCoordinateCache)
    • But what if the child uses DeviceCoordinateCache? 😛

How can deep caching be reused in an unexpected way?

  • In the chip demo, when selecting and moving items around, we can temporarily reparent all selected items into a common parent, enable deep device coordinate caching on that parent, and then remove the parent when the items stop moving. This means that even though you might be moving hundreds of items around, you’re actually only _really_ moving a little pixmap. Hah! (A sketch follows below.)
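
Here’s that trick sketched out. QGraphicsScene::createItemGroup and destroyItemGroup are real API and handle the temporary reparenting; the deep cache mode itself is still hypothetical.

// Group the selection, deep-cache the group while dragging, and
// restore the original parents on drop. DeepDeviceCoordinateCache
// is a prototype-only enum.
QGraphicsItemGroup *group = scene->createItemGroup(scene->selectedItems());
group->setCacheMode(QGraphicsItem::DeepDeviceCoordinateCache);
// ... drag `group`: hundreds of chips move as one little pixmap ...
scene->destroyItemGroup(group); // on drop: items get their old parents back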

So many unanswered questions, but that’s the fun part of research. Since I have seen that it works and produces the result I want (no repaints! the whole dialog lives in graphics memory), I’m certain the answers to the questions above will show up one at a time.

OK that was a long blog, but I felt like writing it all down.

What do you think? Are DeepItemCoordinateCache and DeepDeviceCoordinateCache useful?



19 comments

sangeetha says:

hi..
I’m facing a problem while installing Qt on Linux. The error is:
gmake[3]:***[.obj/debug_shared/qfontengine_x11:0] error1

Could you please tell me what it is, and how do I rectify it?

alexis.menard says:

The Qt-Interest mailing list, the support channels, or Google are made for that. This is not the place, and besides, that error message alone can’t help anyway. Sorry.

notmart says:

fascinating post…
ItemIgnoresTransformations sounds tricky indeed, since the result depends on each view and on each parent transform.
not knowing anything about the internals, what comes to my mind is creating a copy in the cache for each different combined transform of parents+view; it would still save a bit in the case of simple translations, but the amount of cached stuff would grow exponentially.
or, cache the pixmap of the parent and all the children that obey transforms, and use individual Item/DeviceCoordinateCache on the ignoring children…
that said, no idea if it’s possible or if it makes sense eh 😀

Gopala says:

Wow!! That was a good brain dump which helped me understand the caching mechanism. Thanks for explaining 🙂

While it won’t hurt to exploit the new cool caching functionality, I feel that it’s rare for people to use items with several children in a complex hierarchy, which means the app developer can be given the responsibility of representing an item with several child items as a single item.

Andreas says:

Gopala: Not sure – in Plasma the base Plasmoid item is cached with DeviceCoordinateCache, but the children are not. So it’s still slow to move _some_ of the items around. If the base had DeepDeviceCoordinateCache, then it would all be fast. Just like when moving windows around on your desktop (which doesn’t usually trigger repainting).

extropy says:

Just a thought on another caching mode for high-performance graphics cards.

Imagine we cache the GL commands produced from QPainter calls and execute them on each repaint, applying the needed transform. Applying a transform to them shouldn’t be difficult.
It may also be possible to cache these commands on the display server / graphics card and repaint by only sending the transformation data over the wire, similar to pixmaps.
This would give both a pixel-perfect image and no transformation penalty.

If CUDA can run programs on OpenGL, this should be doable.
I don’t know if X11 has any sort of “GL command buffer”, but it sounds like a very reasonable thing to have for accelerated vector graphics.

Adam Higerd says:

Big ++ on these caching modes. I hear the current behavior griped about in #qt occasionally, especially when people have things like a container full of items and want to move the container — it’s more common than you might think.

Also big ++ on the idea of having a caching mode that can exploit hardware availability and fall back to something useful if hardware features aren’t present.

Zack Rusin says:

Hey Andreas, this was not a terribly good blog. The graphics section is pretty bad. So, in no particular order:

“For small devices, we’ve seen that software rendering can even outperform GL”
– Yes of course, but that’s not the point. If you’re working with some of the OMAP chipsets, or really any of them, running a top-of-the-line ARM Cortex or Atom processor at 600+MHz next to a GPU that is ticking between 66MHz and 200MHz (if you’re really lucky), then that will always be the case. That’s not the point of those GPUs in small devices. They’re there because they do the same thing at >90% of the speed of the CPU with less than 10% of the power usage.
Go ahead and ask your users whether they would prefer to see the blur at 33 instead of 30 frames per second while their device has 1/10 of the battery life it would have at 30 frames per second, and see the kind of response you’ll get 🙂 People won’t be happy to find out that their phone suddenly runs for 2 hours instead of 20 because the graphics system is running 3 frames per second faster.

“But it’s very rare that hardware acceleration gives you pixel perfection and 100% pixel-by-pixel control.”
– Modern GPUs always give you pixel-by-pixel control! It’s really what they do. You mention “pixel perfection” but I don’t see any benefit to it. Yes, the results will vary between vendors, but, especially in light of your mobile metaphor, who would expect pixel perfection between two completely different devices from different vendors? And why?

“Rotating a pixmap on X11 requires a client-server network roundtrip”
– Well, that’s not completely true. In the Qt case you’d be transforming the Xrender picture. Xrender does support that, in which case it’s not a roundtrip at all. It is a roundtrip in Qt because Qt doesn’t use server-side transforms for Xrender pictures. So realistically that’s Qt’s bug, not a limitation of the rendering model.

“but the equivalent of a real-time software 2D/3D rendering engine pushing its pixels onto a GL context is absurd”
– That’s also not true. That, in turn, has been a limitation of the poor memory management we’ve been doing in graphics drivers. There are tons of benchmarks that test download/upload speeds with GPUs. It’s simply a matter of making sure the upload is done in one of the native formats of the GPU, so that the colors don’t have to be converted. On average you should be somewhere between 500MB/s and 3GB/s. If you need more, then really the problem is in your software 🙂

“If you want to support an indirect rendering model efficiently, you should render your contents into a buffer which can be passed on to the graphics system, and then referenced when you need it”
– I don’t get it. Why?

So yea… =) I guess the point you were trying to make was that doing graphics on the CPU can make more sense than doing it on the GPU. I (and surely your employer) will violently disagree with that. Though for mobile devices that don’t care about battery life, don’t need advanced effects, 3D, or the other neat things that GPUs provide, and just want very simple 2D vector graphics, this could be right on.

p.s. the submit forms are broken. they just cut the rest of the text if you include a < symbol

maninalift says:

@Zack I’m getting to like your bad-tempered attacks on software rendering 😉

David says:

I was a bit confused as to why you call X11 and OpenGL indirect methods, as they are talking directly to the display. I would think things that do blitting to the screen would be the indirect methods. Maybe it’s just a matter of semantics and perspective.

On another note, concerning round trips to the X server: most displays are local, so the socket call is really just using shared memory. If this is the case (and I am asking), how expensive is this kind of memory access?

It may be my misunderstanding, as I am not a graphics expert by any means, but if you’re making a case that using the CPU is better than the GPU, it seems rather difficult to swallow, as GPUs have advanced so much that even the cheapest would offer benefits for simple transformations and rendering. What concerns me here is that some of the nice Qt animated examples I have tried tend to take a lot of CPU resources.

I enjoyed the article and the responses as they were very thought provoking and will no doubt help educate me. Enjoy the cabin trip.

-David

Adam Higerd says:

@David: They’re indirect as far as you’re relaying the painting commands to a separate component (the X server or the video card), whereas on a framebuffer the pixels of the display are directly available to your memory space (or at least they are as far as your application is concerned).

Shared memory is practically free in terms of performance. It’s just a memcpy at most.

So naturally you’ll get better performance by doing local rendering and transmitting a fully-rendered scene across the pipe, IF the pipe or the other component is slow. The flip side of this is that this same tactic will yield WORSE performance if the pipe doesn’t introduce significant delay and the other component can actually do the rendering faster than you can.

And then Zack made a very insightful comment: Sometimes it’s not about speed. Sometimes there are more important factors to consider — in Zack’s example, the fact that an embedded GPU can do the same work ALMOST as well as the CPU, at a fraction of the power consumption. Other possible trade-offs are CPU consumption (if the GPU can do the work the CPU is free to pay attention to other processes) and/or bandwidth (if the fully-rendered scene uses more than the command stream to render it). Even Andreas made a comment about texture memory.

simong says:

Hierarchical caching makes tremendous sense. We use complex compositions of graphics items and get no benefit from individual item caches. We have experimented with doing exactly what you describe, and it does help. When dragging selected elements, treating them as a group has interesting consequences for Z-ordering (which, by the way, needs work in Graphics View; if not explicitly specified, it can be a consequence of pointer ordering). I would not be unhappy to see a mode in which all selected (or moving) elements are moved to the front, in which case transforming them as a pre-cached image could be done purely by GV without the user doing any reparenting work.

Andreas says:

Zack:

1. That’s a good point, power consumption is incredibly important – what I’m talking about here is the actual FPS. When I say outperform, I mean that one can run faster than the other. For many applications that’s the only thing that counts, but for others it’s not.

2. This point was about pixel perfection. Yes, it’s not important to some applications and some user groups but it’s important to others.

3. Rotating a pixmap on X11 – FWIU remote transforms of pixmaps aren’t supported all over the place. If Qt can support the servers that do, that’s great!

4. Are you referring to modern desktop GPUs? They don’t represent the whole picture. Pipelines on embedded devices aren’t yet close to what you’re describing.

5. I hope you’re not arguing against state objects, which you were advocating so warmly a year ago ;-). No but seriously, the place where performance hurts the most for Qt right now is the embedded space. And the GL chips on these devices can all do one thing really fast: render texture-filled polygons. It’s primitive, yeah, but this blog shows that we can accommodate that.

Finally no, I don’t think graphics should be done in the CPU alone. In fact that’s the whole point; as long as you can reuse elements that already exist on the other side of the bus, you don’t have to. When you cache prerendered objects as GL textures, CPU usage goes down and your framerate goes up.

Andreas says:

simong: With the Z order problem I assume you mean how every second item in the chip demo has an alternating Z value, so if you move items with no “deep caching” and look closely, you’ll see how the items interleave the other chips as they are dragged over them. I don’t think deep caching should try to solve this at all. Rather, if you want to use a deep device cache to speed up moving of chips and do that by reparenting all selected items into a new parent, then all these chips will inherit the stacking order of that parent. You can choose to give that parent item a value that stacks over all chips. Then while you’re dragging, all the items will be on top, and when you drop the items, you can remove the parent, and the original stacking order is restored.

simong says:

Thanks Andreas. I understand how it would work to do it by hand. It’s just kind of a pain to make users reimplement it for a common use case, especially since they’ll also have to take transforms into account when they reparent. Select all (or almost all) and move is a very common operation that performs badly today.

The general z-order problem I was referring to I think you may have addressed in 4.5! In 4.4, the following code appears:

inline bool qt_closestLeaf(const QGraphicsItem *item1, const QGraphicsItem *item2)
{
    qreal z1 = item1->d_ptr->z;
    qreal z2 = item2->d_ptr->z;
    return z1 != z2 ? z1 > z2 : item1 > item2;
}

which means that items with the same parent and the same z-order are ordered by the memory layout of their pointers, which was a very bad idea. From inspection of the code, it looks like 4.5 has introduced a global stacking order, which takes care of this problem. I can’t wait to remove our workaround!

Zack Rusin says:

1) I doubt those “graphics critical” apps will run on mobile systems =) And if they do, they will want to use actual GL rather than Qt vector graphics either way. Not to mention that it’d be weird if a toolkit optimized for that one wacky application that needs to run on a phone at 33 instead of 30, or 17 instead of 15 fps, while using 10x more power.
2) Pixel perfection (aka pixel exactness) is perfectly possible on GPUs. There are tons of blogs/articles about how to achieve it while doing 2D with OpenGL, so I’m not going to go into detail here, but realistically, if it’s as important as you say (which I’m sure you know I disagree with 🙂 ), then it’s a Qt bug.
3) They’re supported on all X servers Qt should care about. It doesn’t make much sense to blame X servers that are 10 or more years old for not supporting something you want to do now 🙂
4) On embedded systems the difference will be even smaller. The only difference is that the video memory will be uncached, but the writes will go at the same speed as the raster engine.
5) Well, you mention download speeds, which constant state objects eliminate, so I think you mean something else. Also, if a GPU you’re looking at is fast at texture mapping, then it will be even faster just rendering quads with shaders that don’t do texturing. I’ve never profiled any embedded GPUs, but on desktops the latency of texture instructions is in the 100-1000 range. You can fit a lot of extra instructions there if you avoid texturing.
Also, there are obvious ways to optimize the GL paint engine, especially on mobile GPUs. I’ll fire off an email about that later.

Andreas says:

simong: Yes, that’s true. Maybe we should look into a way to make it easier. It is possible to use QGraphicsScene::createItemGroup, which does the reparenting for you, but maybe there’s more we can do. And yes, in the latest codeline (4.6 unfortunately!) the items are ordered by insertion instead of by pointer when Z is the same. In Qt 4.5 we’ve introduced a global stacking order internally, to help speed up sorting (see QGraphicsScene::sortCacheEnabled).

Zack: I won’t try to advocate myself as being as much of a hardware graphics expert as you are. But I am running a project which does intend for Qt’s APIs to run on small devices that do or do not have (efficient) GL implementations. Of course, pure GL must still be an option. I also agree that making an app run at 30fps at the cost of slicing the battery lifetime in half when there’s a perfectly fine GL chip inside is a bit stupid. On the note about state objects: yes, caching elements in textures is a way to reduce the need for a fast bus, just like state objects do for scene graphs. Some experiments we have run showed that beyond 16 instructions, shaders suddenly started thrashing. The embedded world isn’t quite there yet. But these chips can still render textures very fast ;-). Btw, coming to Bossaconference?

Thanks for the feedback, people! It’s why we write blogs… 🙂

Marco Bubke says:

What about an implementation of the raster engine in OpenCL? Then you’d have full control over the rasterizer (you bypass the rasterizer in OpenGL) and could use the images computed in OpenCL as textures in OpenGL. This could maybe bypass QPainter too and implement something more buffer-oriented, so the (vertex) buffer can live on the graphics card and doesn’t have to be transferred (and converted) when a repaint is needed. Only the changed data (mostly the transformation matrix) would be transferred. The images can be rendered on the GPU at the native resolution. Best of both worlds. 🙂

Anssi says:

thanks for the interesting blog. i think that sometimes it is not important what something is now, but what it will become. that’s why i don’t want you to close the door on specialized hwa architectures, because the cpu must do many other tasks than just drawing 2d float graphics. and yes, there aren’t really any true 3d killer apps yet that would take advantage of true 3d graphics. imo opengl can be a common standard, implemented in a SoC block, which may help portability; cpu power is not always the way to go.

