Using OpenCL with Qt

Recently we have been experimenting with OpenCL and Qt, to see what Qt needs to make it easier to use OpenCL and to see what Qt could use it for internally.  In this post we will give an introduction to OpenCL, introduce the QtOpenCL wrapper library, show how to write your first QtOpenCL program, and tell you where to get more information on the project.

What is OpenCL anyway?

For those new to it, OpenCL is an open, royalty-free standard for parallel programming in a heterogeneous computing environment. The most common use you've probably heard of is to run arbitrary C code on your system's GPU. These days GPUs offer far more raw parallel throughput than CPUs, having been designed to pump out hundreds of thousands of textured triangles per second in your favorite shoot-em-up video game.  To do this, the GPU has access to parallel vector processing that far exceeds the capability of x86/SSE or ARM/NEON instructions on your average CPU.

For years, shader languages like GLSL have made the vector capabilities of the GPU available for arbitrary shader effects in OpenGL, but you are basically limited to whatever parameters a "draw triangle" call takes.  It's also quite typical for OpenGL implementations to cut corners by using fixed-point arithmetic and lower precision.  The shader source code may say "float", but it could be as little as 8 bits of actual precision.  While that is great for pumping out triangles where you won't notice an "off-by-0.001" error, it isn't very useful for supercomputing, common mathematical algorithms, and super-precise pixel blending.

Enter OpenCL.  It defines a new C-style language that is much stricter about mathematical precision, and which allows arbitrary arguments to be passed to an OpenCL function - known as a kernel - to do almost anything that C can do. Special vector types like "float4" are provided, as well as an extensive mathematical library.  But its most impressive feature is work sizes - it is very easy to split your task up into small chunks that the GPU can scatter across all of its compute units (the parallel processing elements of whatever CPU and GPU devices OpenCL can find; OpenCL will use everything available).  Unlike regular C, where you can spend a lot of time writing outer loops and launching worker threads for subparts of your problem, OpenCL does it for you. We'll see how that works shortly.

QtOpenCL

The QtOpenCL library wraps the OpenCL 1.0 API in a Qt-style API.  It takes the pain out of OpenCL initialization, program compilation, and kernel execution.  It also provides convenience functions for interfacing to existing Qt facilities such as QImage and QtOpenGL.

The following links should get you started with downloading and using QtOpenCL with either Qt 4.6 or 4.7:

QtOpenCL is still a work in progress, distributed as a standalone module outside of the normal Qt source repositories.  Suggestions and patches are welcome to make it better.

Hello QtOpenCL

We are going to make a simple program that modifies an image by multiplying the grayscale version of an image by a color.  The code is in the QtOpenCL repository under the "examples/opencl/colorize" directory. We'll be using the following member variables in the ColorizeWidget class:

QCLContext context;        // which CPU or GPU computing device to use
QCLProgram program;        // the compiled colorize.cl program
QCLKernel colorize;        // the "colorize" kernel entry point
QImage dstImage;           // destination image in host memory
QCLImage2D srcImageBuffer; // source image in OpenCL memory
QCLImage2D dstImageBuffer; // destination image in OpenCL device memory
QColor color;              // color to combine with the grayscale image

The first thing we need to do is create the QCLContext, which determines which CPU or GPU computing device to use and opens it for our use:

if (!context.create())
    qFatal("Could not create OpenCL context");

In this example we don't really care whether the computing device is a CPU or a GPU, but if we wanted to insist on using a GPU, we could do this instead:

if (!context.create(QCLDevice::GPU))
    qFatal("Could not create OpenCL context");
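
If a GPU is preferred but not strictly required, one possible pattern is to try the GPU first and then fall back to the default device.  This is just a sketch using the two create() overloads shown above:

if (!context.create(QCLDevice::GPU)) {
    qWarning("No OpenCL GPU found, falling back to the default device");
    if (!context.create())
        qFatal("Could not create OpenCL context");
}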

The next thing we need to do is build our OpenCL program from the colorize.cl source file:

program = context.buildProgramFromSourceFile(QLatin1String(":/colorize.cl"));
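
The OpenCL source is compiled at runtime, so it is worth checking that the build actually succeeded.  A minimal check might look like the following, assuming that a failed build is reported as a null QCLProgram object (check the QtOpenCL documentation for the exact error-reporting API):

// Assumption: buildProgramFromSourceFile() returns a null program on failure.
if (program.isNull())
    qFatal("Could not build OpenCL program from colorize.cl");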

Now might be a good time to look at the OpenCL code itself inside colorize.cl:

const sampler_t samp = CLK_NORMALIZED_COORDS_FALSE |
                       CLK_ADDRESS_CLAMP_TO_EDGE |
                       CLK_FILTER_NEAREST;
__kernel void colorize(__read_only image2d_t srcImage,
                       __write_only image2d_t dstImage,
                       float4 color)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 srcColor = read_imagef(srcImage, samp, pos);
    float gray = srcColor.x * 11.0f / 32.0f +
                 srcColor.y * 16.0f / 32.0f +
                 srcColor.z * 5.0f / 32.0f;
    float4 pixel = (float4)(color.xyz * gray, srcColor.w);
    write_imagef(dstImage, pos, clamp(pixel, 0.0f, 1.0f));
}

We'll break it down step by step:

  • The "__kernel" keyword introduces a special entry point function called "colorize" that we will be using later in our C++ code.
  • The "colorize" entry point takes three parameters corresponding to the source image, destination image, and the color to combine with the image.
  • The "pos" variable is set to a 2-dimensional int vector containing the 0th and 1st global identifiers.  What?  Well, in OpenCL, every kernel execution is given an implicit argument that indicates which part of the overall work job is being performed.  In our case, that is the (x, y) co-ordinates of the image pixel we want to process.  So this line is basically computing the position of the current pixel, and the "colorize" function only works on one pixel at a time.
  • A "srcColor" value is read from the source image at "pos".  You can ignore the "samp" sampler for now - it's an OpenCL mechanism for controlling how values are extracted from image objects - here we use a simple nearest-neighbour sampler with unnormalized (integer pixel) co-ordinates, which is what read_imagef() expects when it is given int2 co-ordinates.
  • We convert the "srcColor" into grayscale and then combine it with the "color".
  • Finally, we write the pixel to the destination image.
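
To make the implicit looping concrete, here is a rough serial C++ sketch of the same per-pixel work.  It is a hypothetical helper for comparison only, not code from the example:

// Hypothetical serial equivalent: the explicit x/y loops below are exactly
// what the OpenCL runtime supplies for us via the global work size.
static void colorizeSerial(const QImage &src, QImage &dst, const QColor &color)
{
    for (int y = 0; y < src.height(); ++y) {
        for (int x = 0; x < src.width(); ++x) {
            QRgb s = src.pixel(x, y);
            qreal gray = qRed(s)   * 11.0 / 32.0 +
                         qGreen(s) * 16.0 / 32.0 +
                         qBlue(s)  * 5.0 / 32.0;
            dst.setPixel(x, y, qRgba(qRound(color.redF()   * gray),
                                     qRound(color.greenF() * gray),
                                     qRound(color.blueF()  * gray),
                                     qAlpha(s)));
        }
    }
}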

This is all fairly straight-forward.  The main difference from a regular C function that does the same thing is that we haven't included the for loops to iterate over x and y - OpenCL will be providing them for us.  Back to the C++ code now.  OpenCL kernels running on a compute device cannot directly access host memory, so we need to arrange to copy our source image into an OpenCL image buffer:

QImage img(QLatin1String(":/qtlogo.png"));
srcImageBuffer = context.createImage2DCopy(img, QCLMemoryObject::ReadOnly);

We specify the source image as "ReadOnly" because the kernel will only read from it. The destination image is created in a similar fashion, but as "WriteOnly":

dstImage = QImage(img.size(), QImage::Format_ARGB32);
dstImageBuffer = context.createImage2DDevice(dstImage.format(), dstImage.size(), QCLMemoryObject::WriteOnly);

The createImage2DDevice() function creates an image in the fastest possible OpenCL device memory. The final initialization step is to find the kernel entry point:

colorize = program.createKernel("colorize");
colorize.setGlobalWorkSize(img.size());
colorize.setLocalWorkSize(8, 8);

We set the "global" work size to the dimensions of the image, which causes OpenCL to create the implicit for loops that iterate over the x and y values for us. We set the "local" work size to 8x8, which tells OpenCL to process the data in 8x8 chunks, where every item in a chunk can be processed in parallel. For a 512x512 image, for example, that would mean 64x64 = 4096 work groups of 64 work items each. This is how OpenCL gets its performance boost: by tweaking the local work size, we can tune the parallelism to make efficient use of the computing resources. I've found that 8x8 works quite well for images, so that's what we'll use in this example. Now that we have initialized our OpenCL context and kernel, it is on to the paintEvent():

colorize(srcImageBuffer, dstImageBuffer, color);
dstImageBuffer.read(&dstImage);
QPainter painter(this);
painter.drawImage(0, 0, dstImage);

The first line executes the kernel for us via QCLKernel's operator() overload. The second line then reads the contents of "dstImageBuffer" from the OpenCL device back into "dstImage" in host memory. And then we paint it to the window just as we normally would in Qt. And that's basically it!

Well ... not so fast!  I glossed over one small detail - the kernel call returns to the C++ program immediately and the kernel executes in the background.  So after the first line, the C++ code keeps running even though the kernel may not have finished yet.  The read() call will automatically block until the kernel completes execution, so all is fine in this example.  But if we want to wait explicitly for the kernel to complete, we can use a QCLEvent:

QCLEvent event = colorize(srcImageBuffer, dstImageBuffer, color);
event.waitForFinished();
dstImageBuffer.read(&dstImage);
QPainter painter(this);
painter.drawImage(0, 0, dstImage);

Other Examples

The QtOpenCL repository has a number of examples that you can play with:

  • Vector addition example - another simple introduction to QtOpenCL.
  • Mandelbrot viewer program that demonstrates generating QImage data and GL textures via OpenCL.
  • Gaussian blur example and benchmarks that compare it with Qt's graphics effects.
  • Bezier patch sub-division example to demonstrate using OpenCL like a geometry shader to generate large numbers of vertices.
  • Simple path and image drawing and blending.

And now an obligatory screenshot.  The Mandelbrot viewer zooms into the well-known set, down to this image:

Mandelbrot screenshot

On my Linux desktop's NVIDIA GeForce GTX 275, this can reach 120 frames per second, running across 30 compute units, without breaking a sweat.  To put this into perspective, the same algorithm running on the CPU struggles to achieve 5 frames per second.  Offloading all that work and breaking it up into 8x8 chunks makes a huge difference (initially performance wasn't that great, until I realized that it was using a 1x1 work size).

QtOpenCL and QtConcurrent

There is a little bit of interaction between QtOpenCL and QtConcurrent, as described here. Because QtConcurrent was designed around homogeneous multi-core CPUs, there's a bit of work that needs to be done to truly marry the two worlds, but nothing is impossible. For now, the most useful feature of the interaction is that you can get a QFuture for a kernel execution and pass it to a QFutureWatcher for signal notification:

QCLEvent event = kernel(arg1, arg2);
QFutureWatcher<void> *watcher = new QFutureWatcher<void>(this);
watcher->setFuture(event.toFuture());
connect(watcher, SIGNAL(finished()), this, SLOT(eventFinished()));

Or alternatively, using the implicit conversion between QCLEvent and QFuture:

QFutureWatcher<void> *watcher = new QFutureWatcher<void>(this);
watcher->setFuture(kernel(arg1, arg2));
connect(watcher, SIGNAL(finished()), this, SLOT(eventFinished()));
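
If this were wired into the colorize example above, the slot on the receiving end might look something like this sketch (hypothetical code, not from the repository):

// Hypothetical slot: the kernel has finished, so reading the result back
// will not block, and we can schedule a repaint.
void ColorizeWidget::eventFinished()
{
    dstImageBuffer.read(&dstImage);
    update();
}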

Embedded Devices

Right now, QtOpenCL works very well with desktop OpenCL implementations, like that from NVIDIA (we've tested it under Linux, Mac, and Windows). Embedded devices are currently another matter - OpenCL implementations are still very basic in that space.  The performance on embedded CPUs, for example, is only slightly better than using ARM/NEON instructions directly.  And embedded GPUs are usually hard-wired for GLSL/ES, lacking many of the features that make OpenCL really sing.  But like everything in the embedded space, things are likely to change very quickly. By releasing QtOpenCL, we hope to stimulate the embedded vendors to accelerate their OpenCL development by giving them something to test with.  Be the first embedded device on the block to get the Mandelbrot demo running at 10fps, or 20fps, or 60fps!

Future Work

A lot of stuff remains to be done, particularly with respect to how we can use OpenCL inside Qt itself.  There are many places where it could be useful:

  • Accelerating image blending and path drawing in the raster paint engine.
  • Fast on-the-fly decompression and scaling of JPEG images.
  • Graphics effects: blur, colorize, bloom, etc, etc, etc.
  • Particle effects and other physics simulations.
  • Mesh subdivision and morphing algorithms in Qt/3D.

The possibilities are endless.  We look forward to your patch! :-)

