Fast-Booting Qt Devices, Part 3: Optimizing System Image

It is now time for the third part of the fast-boot blog post series. In the first post we showed a cluster demo booting in 1.56 seconds, and in the second post we explained how the Qt application was optimized. Here, we will concentrate on optimizing the boot loader and kernel for the NXP i.MX6 SABRE development board that was used in the demo.

Before starting the boot time optimization, I measured the unoptimized boot time, which was a whopping 22.8 seconds. After measuring, we set our goal to get the boot time under 2 seconds. With the goal set, I started the optimization from the most obvious place: the root-fs. Our existing root-fs contained a lot of stuff that was not required for the startup demo. I stripped the whole root-fs down from 500 MB to 24 MB, using Buildroot to create a bare minimal root-fs for our device along with a cross-compile toolchain.

After switching to the smaller root-fs, I measured the startup time again, and it was now 15.6 seconds. Of these 15.6 seconds, kernel startup took around 6 seconds, and the U-Boot bootloader and the unmodified application took the rest. Next, I concentrated on the kernel. As I already knew the functionality required by the application, I could easily strip the kernel down from 5.5 MB to 1.6 MB by removing nearly everything that was not required. This brought the boot time to 9.26 seconds, of which the kernel startup took 1.9 seconds.

At this point we still had not touched U-Boot at all, meaning it still had the default 1 second wait time and the integrity check of the kernel in place. So U-Boot was the next obvious target. U-Boot includes a special framework called the Secondary Program Loader (SPL), which is capable of booting another U-Boot or a specially configured kernel. I enabled the SPL mode, modified my kernel to include the command line arguments, and appended my device tree to the kernel. I also stripped the device tree down from 47 KB to 14 KB and disabled the console. The boot time dropped to 3.42 seconds, of which the kernel took 0.61 seconds and U-Boot plus the application the rest.
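The split between kernel time and everything else in the measurements quoted so far can be worked out with a few lines of Python. This is only an illustrative sketch: the figures are the ones given in the text, while the function and variable names are made up for this example.

```python
# Total boot time and kernel startup time quoted in the text, in seconds.
# The labels are shorthand for the optimization step that produced each figure.
MEASUREMENTS = [
    ("smaller root-fs", 15.6, 6.0),  # kernel time is "around 6 seconds"
    ("stripped kernel", 9.26, 1.9),
    ("SPL, appended device tree, console disabled", 3.42, 0.61),
]


def split_boot_time(total_s, kernel_s):
    """Return (time outside the kernel in seconds, kernel share in percent)."""
    rest_s = round(total_s - kernel_s, 2)
    kernel_share = round(100 * kernel_s / total_s, 1)
    return rest_s, kernel_share


if __name__ == "__main__":
    for label, total_s, kernel_s in MEASUREMENTS:
        rest_s, kernel_share = split_boot_time(total_s, kernel_s)
        print(f"{label}: total {total_s} s, kernel {kernel_s} s "
              f"({kernel_share} %), U-Boot + application {rest_s} s")
```

Once the kernel was stripped, it accounted for only about a fifth of the total, so the remaining time was dominated by U-Boot and the application.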

Now that the basic system (U-Boot and kernel) was booting in a decent time, I optimized our cluster application. The application startup was changed to load the cluster frame first, then animate in the gauges, and finally load the 3D car model, as described in our previous post. The boot time was still quite far from the 2 second target, so I did a more detailed analysis of the system. I had been using a class 4 SD card, which I replaced with a class 10 card.

My Qt libraries were still shared libraries, so I compiled Qt as static libraries and rebuilt our cluster demo against the static version of Qt. This also allowed me to remove the shared libraries from the root-fs. Static linking makes application startup faster because the operating system does not need to resolve the symbols of dynamic libraries at startup. With static linking I was able to get the cluster application into a single 19 MB binary. This includes all the assets (3D model, images, fonts) and all the Qt libraries required by the demo. I had actually forgotten to use the proper optimization flags for my Qt build, so I set optimization for size and removed fpic; as a result, the executable size was reduced to 15 MB. I also noticed that having the root-fs on the eMMC was faster than having it on the SD card.

However, having U-Boot and the kernel image on the SD card was faster than having both on the eMMC, so I ended up with a slightly odd combination where the CPU loads U-Boot and the kernel from the SD card and the kernel uses the root-fs from the eMMC. The kernel was still packed with gzip. After testing UPX, LZO and LZ4, I changed the packing algorithm to LZO, which was the fastest on my hardware. Depending on your hardware, you might want to test other algorithms or no packing at all. After changing the packing algorithm and removing the serial console, the kernel image size dropped to 1.3 MB. With these changes the boot time was reduced to 1.94 seconds.
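The trade-off behind choosing a packing algorithm is that a smaller image loads faster from storage but costs time to unpack. The sketch below illustrates that trade-off on a host machine. It makes some loud assumptions: LZO and LZ4 are not in the Python standard library, so zlib, bz2 and lzma are used purely as stand-ins, and the function names and the simple cost model (read time plus unpack time) are hypothetical. Real numbers have to come from measurements on the target hardware.

```python
import bz2
import lzma
import time
import zlib


def benchmark_decompression(payload, repeats=5):
    """Measure packed size and best-case unpack time for a few codecs.

    The codecs are standard-library stand-ins, not the kernel's own
    decompressors. "none" models an image that is not packed at all.
    """
    codecs = {
        "none": (lambda data: data, lambda data: data),
        "zlib": (zlib.compress, zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(payload)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            unpacked = decompress(packed)
            best = min(best, time.perf_counter() - start)
        # Sanity check: unpacking must give back the original bytes.
        assert unpacked == payload
        results[name] = {"packed_bytes": len(packed), "decompress_s": best}
    return results


def estimated_boot_cost(packed_bytes, decompress_s, read_bytes_per_s):
    """Hypothetical cost model: time to read the image plus time to unpack it."""
    return packed_bytes / read_bytes_per_s + decompress_s


if __name__ == "__main__":
    # Synthetic payload; a real test would read the actual kernel image.
    payload = bytes(range(256)) * 4096
    read_speed = 20 * 1024 * 1024  # assumed storage read speed in bytes/s
    for name, row in benchmark_decompression(payload).items():
        cost = estimated_boot_cost(row["packed_bytes"], row["decompress_s"], read_speed)
        print(f"{name}: {row['packed_bytes']} bytes, estimated cost {cost:.4f} s")
```

With fast storage and a slow CPU the unpacked image can win, and with slow storage a stronger compressor can win, which is why the result depends on the hardware.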

If this were production software, there would still be work to do in the area of memory configuration. U-Boot should be debugged to understand why it takes more time to power up and load the kernel image from the eMMC than from the SD card. In general, if quick startup time is a key requirement, the hardware should be designed accordingly. You could have a small, very fast flash containing U-Boot and the kernel, directly accessed by the CPU, and keep the root-fs on a somewhat slower flash such as eMMC.

Even though I had succeeded in getting under 2 seconds, I still wondered if I could make it faster. I stripped the kernel down a little more by removing the network stack, ending up with a 1.2 MB kernel with the appended device tree. I also ran prelinking on my root-fs, because the Vivante drivers come as modules, so I was not able to create a static root-fs. I also stripped the U-Boot SPL part a bit: initially it was 31 KB, and after removing unwanted parts I ended up with a 23 KB boot loader. With these final changes I was able to get the system to boot in 1.56 seconds.

As a wrap-up, here is how the boot time was reduced by the different means.

[Chart: boot time reduction by optimization step]
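The totals quoted throughout this post can also be tabulated to show how much each round of work saved. This is a rough sketch using only the figures from the text. The step labels are my own shorthand, and several changes are grouped into one step wherever the text reports a single measurement for them.

```python
# Total boot time in seconds after each round of optimization, as quoted in the text.
MILESTONES = [
    ("unoptimized image", 22.8),
    ("smaller root-fs built with Buildroot", 15.6),
    ("stripped kernel", 9.26),
    ("U-Boot SPL, appended device tree, console disabled", 3.42),
    ("application startup order, static Qt, storage layout, LZO", 1.94),
    ("no network stack, prelinking, smaller SPL", 1.56),
]


def savings_per_step(milestones):
    """Return (label, total after step, seconds saved by step) for each step."""
    rows = []
    for (_, before), (label, after) in zip(milestones, milestones[1:]):
        rows.append((label, after, round(before - after, 2)))
    return rows


if __name__ == "__main__":
    for label, total_s, saved_s in savings_per_step(MILESTONES):
        print(f"{total_s:6.2f} s  (saved {saved_s:5.2f} s)  {label}")
```

The early steps save seconds each, while the last two rounds together save under two seconds, which matches the advice below about the effort needed for the final milliseconds.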

The last thing that also affects the boot time is hardware selection. Boards differ in how fast they power up, even if they use the exact same CPU. Perhaps we will write more about this later.

Do:

  • Measure and analyze where time is spent
  • Set a target goal as early as possible
  • Try to reach the goal early in development and then maintain that level throughout development
  • When designing your software architecture take into account the startup targets
  • Optimize easy parts first, then continue to the details
  • Leverage static linking if that provides a better result in your SW & HW configuration
  • Take into account your hardware limitations, preferably design the hardware to allow fast boot time

Do not:

  • Overestimate the performance of your selected hardware. i.MX28 will not give you iPad-like performance.
  • Complicate your software architecture. Simpler architecture runs faster.
  • Load things that are not necessary. Pre-built images contain features for many use cases, so optimization is typically needed.
  • Leave optimization to the end of the project
  • Underestimate the effort required for optimizing the very last milliseconds

So, that concludes our fast-boot blog post series. In these three posts, I showed you that Qt really is up for the task: it is possible to make Qt-powered devices boot extremely fast and meet industry criteria. It's actually quite manageable when you know what you're doing, but instead of one silver bullet, it's a combination of multiple things: good architectural SW design, a bunch of Qt Quick tips'n'tricks, suitable hardware and a lot of system image optimization. Thank you for following!

 

