Fixing a flaky test - how hard can it be?

Flaky tests are a pain point for any reasonable continuous integration system. And by flaky I refer to tests that sometimes pass and sometimes fail. Lately I've been bothered by a lack of data and metrics when it comes to discussions around the topic of flaky tests (it pops up regularly on the Qt development mailing list). It's a hot topic because it influences our productivity, many randomly failing tests slow everyone down.

Metrics

While we worked a lot on our testing infrastructure, a good way of gathering metrics was lacking. For a while we tried building a custom website, but it wasn't really maintained, so it slowly died.
For a completely unrelated project, I started running influxdb and grafana on a server. Yes, that's Qt's new and shiny benchmark metrics infrastructure. I recently gained some experience with these tools, therefore I decided to log data from Coin.

As a first step logging all failing tests seemed like a good idea. I have a feeling that it has already paid off big time, we now have top lists of flaky and failing tests.

Grafana showing a list of flaky tests

I haven't done as much C++ work lately as I'd have liked to. And I really felt like spending a bit of time on fixing one or two issues. Where to start? I picked the top candidate off of the list of failing tests. Most often failing and blocking integrations was a test called TouchMouse (written by yours truly, in a past life, and enhanced by others). After looking at the data, it was clear that it was only failing on macOS. Not great for me, my more powerful machine is running Linux. Could I avoid booting up my old, old Mac Mini? My medium old Mac Pro had just died and replacement is still on its way...

Picking up the TouchMouse Test

Let's open the test's cpp file to get a feel for the test:

...
// ------------------------- Mouse move to mouseArea1
QTest::mouseMove(window.data(), p1);

QVERIFY(enterSpy1.count() == 1); QVERIFY(mouseArea1->hovered()); QVERIFY(!mouseArea2->hovered());

// ------------------------- Touch click on mouseArea1 QTest::touchEvent(window.data(), device).press(0, p1, window.data());

QCOMPARE(enterSpy1.count(), 1); QCOMPARE(enterSpy2.count(), 0); QVERIFY(mouseArea1->pressed()); QVERIFY(mouseArea1->hovered()); QVERIFY(!mouseArea2->hovered());

QTest::touchEvent(window.data(), device).release(0, p1, window.data()); QVERIFY(clickSpy1.count() == 1); QVERIFY(mouseArea1->hovered()); QVERIFY(!mouseArea2->hovered());

// ------------------------- Touch click on mouseArea2 if (QGuiApplication::platformName().compare(QLatin1String("xcb"), Qt::CaseInsensitive) == 0) QSKIP("hover can be momentarily inconsistent on X11, depending on timing of flushFrameSynchronousEvents with touch and mouse movements (QTBUG-55350)");

QTest::touchEvent(window.data(), device).press(0, p2, window.data());

QVERIFY(mouseArea1->hovered()); QVERIFY(mouseArea2->hovered()); QVERIFY(mouseArea2->pressed()); QCOMPARE(enterSpy1.count(), 1); QCOMPARE(enterSpy2.count(), 1);

QTest::touchEvent(window.data(), device).release(0, p2, window.data());

QVERIFY(clickSpy2.count() == 1); QVERIFY(mouseArea1->hovered()); QVERIFY(!mouseArea2->hovered()); QCOMPARE(exitSpy1.count(), 0); QCOMPARE(exitSpy2.count(), 1); ...

Did you spot it too? Yeah, we skip the test on xcb... now what happens if I comment that out? Let's run it on Linux for the fun of it.

In the past year I switched to fish as my main shell, that allows me to write while and for loops, without having to read the manual. I never managed to just write them when using bash, but now we're really getting side-tracked...

set i 0; while
./tst_touchmouse hoverEnabled
set i (math $i+1)
echo $i
end

is how I run the test in a loop with a nice counter. Good luck getting out of there, pressing ctrl-c at random intervals helps eventually. And wow, the issue can be reproduced immediately (after removing the QSKIP which simply ends the test otherwise).

Understanding the issue

Glancing at the code in QQuickMouseArea, I saw that we have logging code in there, so I opened ~/.config/QtProject/qtlogging.ini and added:

[Rules]
qt.quick.hover.trace=true

Logging for the win! There were three scenarios (see the bug report for details). Sometimes the first MouseArea would get "unhovered", sometimes not.
I got a nice backtrace by cheating a bit: the extra "unhover" event was in mouse area 1... if only I could break when that happens... I'm weak at writing conditional breakpoints. Let me add a small snippet that will help breaking in the right place in QQuickMouseArea's code instead:

void QQuickMouseArea::setHovered(bool h)
{
...
    if (objectName() == "mouseArea1" && !h) {
        // break point here
        qDebug() << "mouseArea1 got unhovered";
    }
...
}

Do not forget to set the objectName in QML, otherwise this won't work. And observe that flushFrameSynchronousEvents runs at regular but indeterministic intervals. Uhhh... that could explain the flakyness. Re-run and see different traces. And for good measure, add one more debug line into QQuickWindowPrivate::flushFrameSynchronousEvents.

...
if (!q->mouseGrabberItem() && !lastMousePosition.isNull() && dirtyItemList) {
    bool accepted = false;
    bool delivered = deliverHoverEvent(contentItem, lastMousePosition, lastMousePosition, QGuiApplication::keyboardModifiers(), 0, accepted);
    if (!delivered)
        clearHover(); // take care of any exits
    qDebug() << Q_FUNC_INFO << "delivered artificial hover: " << delivered << lastMousePosition;
}

We see who's making this test non-deterministic. Let's check if there's a bug report for this... JIRA instantly shows QTBUG-55350 as first hit when I copy "hoverEnabled" into the search field. Suddenly the day was over and I had other business to mind.

Fixing the flaky test

Scene cut; next morning. After breakfast and walking to work while it was snowing (on the 10th of May), it's back to debugging.
I had the idea that I could comment out some of the test verification and see if everything else is in order. To my dismay, I quickly saw, that something was still off. The second mouse area in the test would suddenly receive hover events after a touch release. This didn't make sense because we never move the mouse there. Therefore the hover event must be triggered by a touch event. But after a touch release?

At this point, it was fairly easy to spot. The code that sends "pretend" hover events was doing something funny: sending a hover event after a touch release (through flushFrameSynchronousEvents, thus randomly timing dependent). Resetting the variable that keeps track of the last mouse position fixed the problem. That way no more fake events for touch releases are delivered. Suddenly the test completely worked and also on Linux/XCB where it was skipped before. More test coverage enabled and one bug down left a happy developer.
The patch went in and we should have the worst offender removed. I can't wait to see how the metrics develop over time. I'm sure there'll be something else coming up again. But now we have metrics to battle it ;-)


Blog Topics:

Comments