Laszlo Agocs

Qt Weekly #28: Qt and CUDA on the Jetson TK1

Published Tuesday March 3rd, 2015
7 Comments on Qt Weekly #28: Qt and CUDA on the Jetson TK1
Posted in Compilers, Embedded, Graphics, OpenGL, QPA

NVIDIA’s Jetson TK1 is a powerful development board based on the Tegra K1 chip. It comes with a GPU capable of OpenGL 4.4, OpenGL ES 3.1 and CUDA 6.5. From Qt’s perspective this is a somewhat unorthodox embedded device because its customized Linux system is based on Ubuntu 14.04 and runs the regular X11 environment. Therefore the approach that is typical for low and medium-end embedded hardware, running OpenGL-accelerated Qt apps directly on the framebuffer using the eglfs platform plugin, will not be suitable.

In addition, the ability to do hardware-accelerated computing using CUDA is very interesting, especially when it comes to interoperating with OpenGL. Let’s take a look at how CUDA code can be integrated with a Qt-based application.

Jetson TK1

The board

Building Qt

This board is powerful enough to build everything on its own without any cross-compilation. Configuring and building Qt is no different than in any desktop Linux environment. One option that needs special consideration however is -opengl es2 because Qt can be built either in a GLX + OpenGL or EGL + OpenGL ES configuration.

For example, the following configures Qt to use GLX and OpenGL:

configure -release -nomake examples -nomake tests

while adding -opengl es2 requests the usage of EGL and OpenGL ES:

configure -release -opengl es2 -nomake examples -nomake tests

If you are planning to run applications relying on modern, non-ES OpenGL features, or use CUDA, then go for the first. If you however have some existing code from the mobile or embedded world relying on EGL or OpenGL ES then it may be useful to go for #2.

The default platform plugin will be xcb, so running Qt apps without specifying the platform plugin will work just fine. This is the exact same plugin that is used on any ordinary X11-based Linux desktop system.

Vsync gotchas

Once the build is done, you will most likely run some OpenGL-based Qt apps. And then comes the first surprise: applications are not synchronized to the vertical refresh rate of the screen.

When running for instance the example from qtbase/examples/opengl/qopenglwindow, we expect a nice and smooth 60 FPS animation with the rendering thread throttled appropriately. This unfortunately isn’t the case. Unless the application is fullscreen. Therefore many apps will want to replace calls like show() or showMaximized() with showFullScreen(). This way the thread is throttled as expected.

A further surprise may come in QWidget-based applications when opening a popup or a dialog. Unfortunately this also disables synchronization, even though the main window still covers the entire screen. In general we can conclude that the standard embedded recommendation of sticking to a single fullscreen window is very valid for this board too, even when using xcb, although for completely different reasons.


After installing CUDA, the first and in fact the only challenge is to tackle the integration of nvcc with our Qt projects.

Unsurprisingly, this has been tackled by others before. Building on this excellent article, the most basic integration in our .pro file could look like this:

... # QT, SOURCES, HEADERS, the usual stuff 


CUDA_DIR = /usr/local/cuda
CUDA_ARCH = sm_32 # as supported by the Tegra K1

LIBS += -L $$CUDA_DIR/lib -lcudart -lcuda
osx: LIBS += -F/Library/Frameworks -framework CUDA

cuda.commands = $$CUDA_DIR/bin/nvcc -c -arch=$$CUDA_ARCH -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
cuda.dependency_type = TYPE_C
cuda.depend_command = $$CUDA_DIR/bin/nvcc -M ${QMAKE_FILE_NAME}
cuda.input = CUDA_SOURCES
cuda.output = ${QMAKE_FILE_BASE}_cuda.o

In addition to Linux this will also work out of the box on OS X. Adapting it to Windows should be easy. For advanced features like reformatting nvcc’s error messages to be more of Creator’s liking, see the article mentioned above.

A QOpenGLWindow-based application that updates an image via CUDA on every frame could now look something like the following. The approach is the same regardless of the OpenGL enabler in use: QOpenGLWidget or a custom Qt Quick item would operate along the same principles: call cudaGLSetGLDevice when the OpenGL context is available, register the OpenGL resources to CUDA, and then do map – invoke CUDA kernel – unmap – draw on every frame.

Note that in this example we are using a single pixel buffer object. There are other ways to do interop, for example we could have registered the GL texture, got a CUDA array out of it and bound that either to a CUDA texture or surface.

// functions from
extern void CUDA_init();
extern void *CUDA_registerBuffer(GLuint buf);
extern void CUDA_unregisterBuffer(void *res);
extern void *CUDA_map(void *res);
extern void CUDA_unmap(void *res);
extern void CUDA_do_something(void *devPtr, int w, int h);

class Window : public QOpenGLWindow, protected QOpenGLFunctions
    void initializeGL();
    void paintGL();

    QSize m_imgSize;
    GLuint m_buf;
    GLuint m_texture;
    void *m_cudaBufHandle;


void Window::initializeGL()

    QImage img("some_image.png");
    m_imgSize = img.size();
    img = img.convertToFormat(QImage::Format_RGB32); // BGRA on little endian
    glGenBuffers(1, &m_buf);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_buf);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, m_imgSize.width() * m_imgSize.height() * 4, img.constBits(), GL_DYNAMIC_COPY);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    m_cudaBufHandle = CUDA_registerBuffer(m_buf);

    glGenTextures(1, &m_texture);
    glBindTexture(GL_TEXTURE_2D, m_texture);

    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, m_imgSize.width(), m_imgSize.height(), 0, GL_BGRA, GL_UNSIGNED_BYTE, 0);


void Window::paintGL()

    void *devPtr = CUDA_map(m_cudaBufHandle);
    CUDA_do_something(devPtr, m_imgSize.width(), m_imgSize.height());

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_buf);
    glBindTexture(GL_TEXTURE_2D, m_texture);
    // Fast path due to BGRA
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, m_imgSize.width(), m_imgSize.height(), GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    ... // do something with the texture

    update(); // request the next frame

The corresponding

#include <stdio.h>
#ifdef Q_OS_MAC
#include <OpenGL/gl.h>
#include <GL/gl.h>
#include <cuda.h>
#include <cuda_gl_interop.h>

void CUDA_init()
    cudaDeviceProp prop;
    int dev;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 3;
    prop.minor = 2;
    if (cudaChooseDevice(&dev, &prop) != cudaSuccess)
        puts("failed to choose device");
    if (cudaGLSetGLDevice(dev) != cudaSuccess)
        puts("failed to set gl device");

void *CUDA_registerBuffer(GLuint buf)
    cudaGraphicsResource *res = 0;
    if (cudaGraphicsGLRegisterBuffer(&res, buf, cudaGraphicsRegisterFlagsNone) != cudaSuccess)
        printf("Failed to register buffer %u\n", buf);
    return res;

void CUDA_unregisterBuffer(void *res)
    if (cudaGraphicsUnregisterResource((cudaGraphicsResource *) res) != cudaSuccess)
        puts("Failed to unregister resource for buffer");

void *CUDA_map(void *res)
    if (cudaGraphicsMapResources(1, (cudaGraphicsResource **) &res) != cudaSuccess) {
        puts("Failed to map resource");
        return 0;
    void *devPtr = 0;
    size_t size;
    if (cudaGraphicsResourceGetMappedPointer(&devPtr, &size, (cudaGraphicsResource *) res) != cudaSuccess) {
        puts("Failed to get device pointer");
        return 0;
    return devPtr;

void CUDA_unmap(void *res)
    if (cudaGraphicsUnmapResources(1,(cudaGraphicsResource **) &res) != cudaSuccess)
        puts("Failed to unmap resource");

__global__ void run(uchar4 *ptr)
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;


void CUDA_do_something(void *devPtr, int w, int h)
    const int blockSize = 16; // 256 threads per block
    run<<<dim3(w / blockSize, h / blockSize), dim3(blockSize, blockSize)>>>((uchar4 *) devPtr);

This is all that’s needed to integrate the power of Qt, OpenGL and CUDA. Happy hacking!

Do you like this? Share it
Share on LinkedInGoogle+Share on FacebookTweet about this on Twitter

Posted in Compilers, Embedded, Graphics, OpenGL, QPA


Scorp1us says:

I don’t get why Qt is going CUDA. You should be young OpenCL. Murray hardware supports 1.2 these days if they don’t support 2.0

With Qt’s fundamental x-platform focus it just seems illogical to target CUDA specifically .

Laszlo Agocs Laszlo Agocs says:

Qt isn’t going CUDA, this is simply showing that nothing is blocking people from using CUDA in their Qt apps, if they wish to.

The hardware’s ability to support an API and having supported drivers and tools for that API are two different things. For the time being NVIDIA supports CUDA only for this board.

Scorp1us says:

Oh I see. I was hoping that Qt would release a “QOpenCL” library that would feature integration with QImage, QTransform, etc.

Michal says:

Can you please provide an example which uses QQuickItem instead of QOpenGLWindow?

Laszlo Agocs says:

Use QQuickFramebufferObject. The example code from initializeGL and paintGL is almost completely the same. See e.g. or for examples on QQuickFramebufferObject usage.

Michal says:

I am using QOpenGLFramebufferObject. I set size and CombinedDepthStencil. Then I call glGetFramebufferAttachmentParameteriv to get depth attachment pointer which I use in cudaGraphicsGLRegisterBuffer which returns success. Now in render function I call cudaGraphicsMapResources and then cudaGraphicsResourceGetMappedPointer which returns success. However, the CUDA device memory size returned is wrong. It is 92 bytes, but my FBO is created with 320×200. What am I doing wrong?

Laszlo Agocs Laszlo Agocs says:

First of all you should use cudaGraphicsGLRegisterImage to register a renderbuffer. After that you will probably run into issues with the depth attachment because CUDA does not seem to support GL textures and renderbuffers of format GL_DEPTH_COMPONENT. (see cudaGraphicsGLRegisterImage docs)
You may need to use a regular color buffer, get the depth values into that, and register that renderbuffer to CUDA instead.

Commenting closed.

Get started today with Qt Download now