Both the public documentation for the scene graph and some of my previous posts on the subject have spoken of a backend or adaptation API which makes it possible to adapt the scene graph to various hardware. This is an undocumented plugin API which will remain undocumented, but I try to go through it here, so others know where to start and what to look for. This post is more about the concepts and the ideas that we have tried to solve than the actual code as I believe that the code and the API will most likely change over time, but the problems we are trying to solve and the ideas on how solve them will remain.
Some of these things will probably make their way into the default Qt Quick 2.0 implementation as the code matures and the APIs stabilize, but for now they have been developed in a separate repo to freely play around with ideas while not destabilizing the overall Qt project.
The code is available in the customcontext directory of ssh://codereview.qt-project.org:29418/playground/scenegraph.git
When we started the scene graph project roughly two years ago, one of the things we wanted to enable was to make sure we could make optimal use of the underlying hardware. For instance, based on how the hardware worked and which features it supports, we would traverse the graph differently and organize the OpenGL draw calls accordingly. The part of the scene graph that is responsible for how the graph gets turned into OpenGL calls is the renderer, so being able to replace it would be crucial.
One idea we had early on was to have a default renderer in the source tree that would be good for most use cases, and which would serve as a baseline for other implementations. Today this is the QSGDefaultRenderer. Other renderers would then copy this code, subclass it or completely replace it (by reimplementing QSGRenderer instead) depending on how the hardware worked.
Example. On my MacBook Pro at the time (Nvidia 8600M GT), I found that if I did the following:
- Clear to transparent,
- render all objects with some opaque pixels front to back with blending disabled, while doing “discard” on any non-opaque pixel in the fragment shader, but writing the stacking order to the z-buffer,
- then render all objects with translucency again with z-testing enabled, this time without the discard,
I got a significant speedup for scenes with a lot of overlapping elements, as the time spent blending was greatly reduced and a wast amount of pixels could be ignored during the fragment processing. Now, in the end, it turned out (perhaps not surprising) that “discard” in the fragment shader on both the Tegra and the SGX is a performance killer, so even though this would have been a good solution for my mac book, it would not have been a good solution for the embedded hardware (which was overall goal at the time).
On other hardware we have seen that the overhead of each individual glDrawXxx call is quite significant, so there the strategy has been to try to find different geometries that should be rendered with the same material and batch them together while still maintaining the visual stacking order. This is the approach taken by the “overlap renderer” in the playground repository. Cudos to Glenn Watson in the Brisbane office for the implementation.
Some other things that the overlap renderer does is that it has some compile-time options that can be used to speed things up:
- Geometry sorting – based on materials, QSGGeometryNodes are sorted and batched together so that state changes during the rendering are minimal and also draw calls are kept low. Take for instance a list with background, icon and text. The list is drawn with 3 draw calls, regardless of how many items there are in it.
- glMapBuffer – By letting the driver allocate the vertex buffer for us, we potentially remove one vertex buffer allocation when we want to move our geometry from the scene graph geometry to the GPU. glVertexAttribPointer (which is all we have on stock OpenGL ES 2.0) mandates that the driver takes a deep copy, which is more costly.
- Half-floats – The renderer does CPU-side vertex transformation and transfers the vertex data to the GPU in half-floats to reduce the memory bandwidth. Since the vertex data is already in device space when transferred, the loss of precision can be neglected.
- Neon assembly – to speed up the CPU-side vertex transformation for ARM.
If you are curious about this, then I would really want to see us being able to detect the parts of a scene graph that is completely unchanged for a longer period of time and store that geometry completely in the GPU as vertex buffer objects (VBO) to remove the vertex transfer all together. I hereby dare you to solve that nut
And if you have hardware with different performance profiles, or if you know how to code directly in the language of the GPU, then the possibility is there to implement a custom renderer to make QML fly even better.
The default implementation of textures in the QtQuick 2.0 library is rather straightforward. It uses an OpenGL texture with the GL_RGBA format. If supported, it tries to use the GL_BGRA format, which saves us one RGBA to BGRA conversion. The GL_BGRA format is available on desktop GL, but is often not available on embedded graphics hardware. In addition to the conversion which takes time, we also make use of the glTexImage2D function to upload the texture data, which again takes a deep copy of the bits which takes time.
Faster pixel transfer
The scene graph adaptation makes it possible to customize how the default textures, used by the Image and BorderImage elements, are created and managed. This opens up for things like:
- On Mac OS X, we can make use of the “GL_APPLE_client_storage” extension which tells the driver that OpenGL does not need to store a CPU-side copy of the pixel data. This effectively makes glTexImage2D a no-op and the copying of pixels from the CPU side to the GPU happens as an asynchronous DMA transfer. The only requirement is that the app (scene graph in this case) needs to retain the pixel bits until the frame is rendered. As the scene graph is already retained this solves itself. The scene graph actually had this implemented some time ago, but as I didn’t want to maintain a lot of stuff while the API was constantly changing, it got removed. I hope to bring it back at some point
- On X11, we can make use of the GLX_EXT_texture_from_pixmap where available and feasible to directly map a QImage to an XPixmap and then map the XPixmap to a texture. On a shared memory architecture, this can (depending on the rest of the graphics stack) result in zero-copy textures. A potential hurdle here is that XPixmap bits need to be in linear form while GPUs tend to prefer a hardware specific non-linear layout of the pixels, so this might result in slower rendering times.
- Use of hardware specific EGLImage based extensions to directly convert pixel bits into textures. This also has the benefit that the EGLImage (as it is thread unaware) can be prepared completely in QML’s image decoding thread. Mapping it to OpenGL later will then have zero impact on the rendering.
- Pixel buffer objects can also be used to speed up the transfer where available
Another thing the texture customization opens up for is the use of texture atlases. The QSGTexture class has some virtual functions which allows it to map to a sub-region of a texture rather than the whole texture and the internal consumers of textures respect these sub-regions. The scene graph adaptation in the playground repo implements a texture atlas so that only one texture id can be used for all icons and image resources. If we combine this with the “overlap renderer” which can batch multiple geometries with identical material state together, it means that most Image and BorderImage elements in QML will point to the same texture and will therefore have the same material state.
Implementation of QML elements
The renderer can tweak and change the geometry it is given, but in some cases, more aggressive changes are needed for a certain hardware. For instance, when we wrote the scene graph, we started out with using vertex coloring for rectangle nodes. This had the benefit that we could represent both gradients, solid fills and the rectangle outline using the same material. However, on the N900 and the N9 (which we used at the time) the performance dropped significantly when we added a “varying lowp vec4″ to the fragment shader. So we figured that for this hardware we would want to use textures for the color tables instead.
When looking at desktops and newer embedded graphics chips, vertex coloring adds no penalty and is the favorable approach, and also what we use in the code today, but the ability to adapt the implementation is there. Also, if we consider batching possibilities in the renderer, then using vertex coloring means we no longer store color information in the material and all rectangles, regardless of fill style or border can be batched together.
The adaptation also allows customization of glyph nodes, and currently has the option of choosing between distance fields based glyph rendering (supports sub pixel positioning, scaling and free transformation) and the traditional bitmap based glyph rendering (similar to what QPainter uses). This can then also be used to hook into system glyph caches, should these exist.
The animation driver is an implementation of QAnimationDriver which hooks into the QAbstractAnimation based system in QtCore. The reason for doing this is to be able to more closely tie animations to the screen’s vertical blank. In Qt 4, the animation system is fully driven by a QTimer which by defaults ticks every 16 milliseconds. Since we know that desktop and mobile displays usually update at 60 Hz these days, this might sound ok, but as has been pointed out before, this is not really the case. The problem with timer based animations is that they will drift compared to the actual vertical blank and the result is either:
- The animation advances faster than the screen updates leading to the animation occasionally running twice before a frame is presented. The visual result is that the animation jumps ahead in time, which is very unpleasant on the eyes.
- The animation advances slower than the screen updates leading to the animation occasionally not running before a frame is presented. The visual result is that the animation stops for a frame, which again is very unpleasant on the eyes.
- One might be extremely lucky and the two could line up perfectly, and if they did that is great. However, if you are constantly animating, you would need very high accuracy for a drift to not occur over time. In addition, the vertical blank delta tends to vary slightly over time depending on factors like temperature, so chances are that even if we get lucky, it will not last.
I try to illustrate:
The scene graph took an alternative approach to this by introducing the animation driver, which instead of using a timer, introduces an explicit QAnimationDriver::advance() which allows exact control over when the animation is advanced. The threaded renderer we currently use on Mac and EGLFS (and other plugins that specify BufferQueueing and ThreadedOpenGL as capabilities), uses the animation driver to tick exactly once, and only once, per frame. For a long time, I was very happy with this approach, but there is one problem still remaining…
Even though animations are advanced once per frame, they are still advanced based to the current clock time, when the animation is run. This leads to very subtle errors, which are in many cases not visible, but if we keep in mind that both QML loaders, event processing and timers are fired on the same thread as the animations it should be easy to see that the clock time can vary greatly from frame to frame. This can result in a that an object that should move 10 pixels per frame could move for instance 8, 12, 7 and 13 pixels over 4 frames. As the frames are still presented to screen at the fixed intervals of the vertical blank, this means that every time we present a new frame, the speed will seem different. Typically this happens in the case of flicking a ListView, where every time a new delegate is created on screen, that animation advance is delayed by a few milliseconds causing the following frame feel like it skips a bit, even though the rendering performance is perfect.
I try to illustrate:
So some time ago, we added a “time” argument to QAnimationDriver::advance(), allowing the driver to predict when the frame would be presented to screen and thus advance it accordingly. The result is that even though the animations are advanced at the wrong clock time, they are calculated for the time they get displayed, resulting in is velvet motion.
A simple solution to the problem of advancing with a fixed time would be to increment the time with a fixed delta regardless, and Qt also implements this option already. This is doable by setting
in the environment. However, the problem with this approach is that there are frames that take more than the vsync delta to render. This can be because it has loads of content to show, because it hooks in some GL underlay that renders a complex scene, because a large texture needed to be uploaded, a shader needed to be compiled or a number of other scenarios. Some applications manage to avoid this, but on the framework level, recovery in this situation needs to be handled in a graceful manner. So in the case of the frame rendering taking too much time, we need to adapt, otherwise we slow down the animation. For most applications on a desktop system, one would get away with skipping a frame and then continuing a little bit delayed, but if every frame takes long to render then animations will simply not work.
So the perfect solution is a hybrid. Something that advances with a fixed interval while at the same time keeps track of the exact time when frames get to screen and adapts when the two are out of sync. This requires a very accurate vsync delta though, which is why it is not implemented in any of our standard plugins, and why this logic is pluggable via the adaptation layer. (The animation driver in the playground repo implements this based on QScreen::refreshRate()). So that on a given hardware, you can get the right values and to do the right thing.
And last, despite all the “right” things that Qt may or may not do, this still requires support from the underlying system. Both the OpenGL driver and the windowing system may impose their own buffering schemes and delays which may turn our velvet into sandpaper. We’ve come to distinguish between:
- Non blocking – This leads to low latency with tearing and uneven animation, but you can render as fast as possible and current time is as good as anything. In fact, since nothing is throttling your rendering, you probably want to drive the animation based on a timer as you would otherwise be spinning at 100% CPU (a problem Qt 5.0 has had on several setups over the last year). Qt 5.0 on linux and windows currently assumes this mode of rendering as it is the most common default setting from the driver side.
- Double buffered w/blocking swap – This leads to fairly good results and for a long time I believe this was the holy grail for driving animations. Event processing typically happens just after we return from the swap and as long as we advance animations once per frame, they end up being advanced with current clock time with an almost fixed delta, which is usually good enough. However, because you need to fully prepare one buffer and present it before you can start the next you have only one vsync interval to do both animations, GL calls AND let the chip render the frame. The threaded renderloop makes it possible to at least do animations while the chip is rendering (CPU blocked inside swapBuffers), but it is still cutting it a bit short.
- 3 or more buffers w/blocking – Combined with predictive animation delta and adaptive catch-up for slow frames, this gives perfect results. This has the added benefit that if the rendering is faster than the vsync delta, we can prepare queue up ready frames. Having a queue of prepared frames means we are much more tolerant towards single frames being slow and we can usually survive a couple of frames that take long to render, as long as the average rendering time is less than the vsync delta. Down side of this approach is that it increases touch latency.
So, we did not manage to come up with a perfect catch-all solution, but the scene graph does offer hooks to make sure that a UI stack on a given platform can make the best possible call and implement the solution that works best there.
The implementation inside the library contains two different render loops, one called QQuickRenderThreadSingleContextWindowManager and another one called QQuickTrivialWindowManager. These rather long and strange names have grown out of the need to support multiple windows using the QtQuick.Window module, and was named window manager for that reason, but what they really are are render loops. They control when and how the scene graph does its rendering, how the OpenGL context is managed and when animations should run.
The QQuickRenderThreadSingleContextWindowManager (what a mouthful) advances animations on the GUI thread while all rendering and other OpenGL related activities happen on the rendering thread. The QQuickTrivialWindowManager does everything on the GUI thread as we did face a number of problems with using a dedicated render thread, particularly on X11. Via the adaptation layer, it is possible to completely rewrite the render loop to fit a given system.
To remedy this, I started playing with the idea that the GUI thread would rather be a slave of the render thread and that some animations could run in the render loop. The render loop in the playground repo implements the enablers for this in the render loop, opening for a potential animation system to run there regardless of how the GUI thread is running.
There are a lot of ideas here and a lot of work still to be done, and much of this does not “just work” as Qt normally does. Partially this is because we have been very focused on the embedded side of things recently, but also because graphics is hard and making the most of some hardware requires tweaks on many levels. The good news is that this API makes it at least possible to harness some of the ability on the lower levels when they are available, and it is all transparent to the application programmer writing QML and C++ using the public APIs.
Thank you for reading!