Have you ever been developing an app, and at some point wanted to know exactly which code was being run, and for how long, when you used it?
On the path to getting Qt inside your Nokia devices, we’ve recently been making use of a special Nokia tool for doing just that, and I thought I would give you a short behind-the-scenes tour of it so that you can see how serious we are about taking Qt everywhere.
If you’ve ever tried to speed up some of your own code before, you probably used some kind of profiling tool. Valgrind is a good example, and one that is heavily used in the development of Desktop Qt.
Desktops normally use x86 processors, but away from the desktop and running on batteries, ARM processors are most widely used. These systems tend to have quite a different architecture from desktop computers, as well as slower clocks and smaller caches. So some effort is needed to make software run fast on them because the performance bottlenecks can also be quite different.
One feature ARM-based chips often have which you don’t find on the desktop is ETM. Using this, the ARM can send trace information about every instruction it executes out through a dedicated debug port in real time, without slowing down execution of the software at all.
As ARMs usually run at hundreds of MHz, ETM creates a large amount of data in a very short time, so you have capture it using an external box with a few Gbytes of RAM. For cost reasons, you won’t ever find an ETM port on a device you buy in the shops, but during development of the devices, special versions are often made with just such a port.
Here’s a picture of a “special device” which is ETM-enabled, along with the Lauterbach PowerTrace II I use to capture the data:
Even with Gigs of RAM, you can only trace for about 10 seconds or so. Although that doesn’t sound like much, it’s actually enough to take a look at almost all interesting use cases.
Ok, so now I’ve got all that data, how do I make some sense of it? This is where FineToothComb comes in.
FineToothComb, or FTC as we usually call it, is a Nokia tool which takes all that ETM data plus some extra information taken from the device at the same time using JTAG, and all the binary code and symbols installed on the device, and then spends some minutes correlating all that together to create a database a few Gigs in size for a each trace.
To see what happened I then use FTC’s “Trace Explorer”, which starts off showing me something like this:
What you can see here is a “MIPS graph”. Time increases to the right, and the number of millions of instructions executed per second goes upwards. The colours represent different threads and processes that ran, with the names and instruction counts for those shown below the graph.
Because of memory latency and caches, you almost never see a CPU running one instruction for every clock cycle. Memory copies especially cause the CPU to spend most of the time waiting, which show up as “low” MIPS values on these graphs.
This first graph gives you the big picture of what happened during your use case. To see more detail you can zoom in on a region of interest. Here’s 165ms:
And again down to just 1.3ms:
Eventually you get down to the level of seeing individual context switches between threads – the operating system in action.
To see what a thread was doing, you simply choose it from the graph and the tool shows you a call stack for that time, along with the time taken (in nanoseconds) and number of instructions run for each call. From that last graph I opened the “lightmaps” stack:
The white lines are blocks of instructions being run from within one function. Red lines are calls from there to other functions, with the total time for that call in nanoseconds show in the second column.
Any call can be opened up to see what it called, in this case I opened the operator>>> call to see that it ended up in a file read which went to the Symbian file server. The tool can even cross over from user space into kernel calls to see what happened there as a result of a user call. Green lines show calls that have been opened up. As you move the mouse over calls, the graph at the bottom highlights the time range for that call so you can see where you are in the trace.
You can also see into any system servers or kernel threads that ran, as well as interrupts. Another feature is statistics of which functions used time and instructions for any block of time. Here are the top 10 functions from the big blue section in the 2nd graph – obviously a PNG decode:
And it can even capture and show what was on the display of the device to help understand the trace later:
Overall it’s really useful for quickly homing in on any code that needs performance optimisation, and because it’s not just a statistical sampling of what happened, it can be used for studying many kinds of use cases, for instance answering a question of why a system is waiting when it should be busy. Most importantly it does all this with the device running at full speed without requiring any special “debug” software build, so you don’t need to guess about how anything is in the “real” case, because this is it.
Using FineToothComb on Qt has already helped us quite a bit with performance of the Qt port to Symbian. Early on in the port, it highlighted some bottleneck functions which were converted from smaller but slower THUMB instructions to faster ARM instructions. It caught when we were not compiling using the ARM’s vector floating point instructions. It also helped with the optimisation of raster rendering functions using ARMv6 and data preloading. In another case it caught the fact that we were using a slow blit operation for the final screen update, when a different parameter would cause a much faster blit to be used.
Even though FTC isn’t a tool that is available outside Nokia, the improvements it can and has found already are, and more of them should be coming your way soon.
I’m going to be at the Qt Developer Days 2009 in Munich, so if you are there too and would like a quick demo of FineToothComb, just track me down and ask!