Analyzing benchmark results with BM

Benchmarking is an important technique for evaluating the performance of computer systems. It can essentially be considered a three-step process:

Step 1
Define the benchmarks.
Step 2
Execute the benchmarks and produce numbers.
Step 3
Analyze the benchmark numbers.

Each step has its own challenges: Step 1 is about writing programs that test sufficiently interesting and relevant parts of the target product. In Step 2 one needs to decide which resource to measure and ensure that the measurements are as reliable as possible in various software/hardware contexts (note that QTestLib provides some help here through the -median option and by supporting callgrind). Step 3 is concerned with "higher-level" analysis problems, such as how to detect performance regressions as reliably as possible, how to compare different implementations of a particular algorithm, and so on.

This article introduces BM, a tool that attempts to solve some of the problems that may arise in Step 3. In particular we focus on how potential performance regressions can be detected. Currently the idea is to evaluate this tool for our internal QA needs, but we hope that others may find it interesting too and give us some feedback.

Benchmark results and execution context

BM is based on QTestLib concepts: First, a benchmark is identified by the combination of a test case, a test function, and an optional data tag. The benchmark should execute some part(s) of the target product. Second, a benchmark result (i.e. the result of executing a particular benchmark) is essentially a single, non-negative number.
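For readers who haven't used QTestLib's benchmarking support, a minimal benchmark might look like the sketch below (the class, function, and data tag names are made up for illustration). The test case is the class, the test function is the private slot, and the data tags are defined in the corresponding _data() function:

    #include <QtTest/QtTest>
    #include <QString>

    // Hypothetical test case: each test function plus data tag identifies one benchmark.
    class tst_StringBenchmarks : public QObject
    {
        Q_OBJECT

    private slots:
        void localeAwareCompare_data()
        {
            QTest::addColumn<QString>("a");
            QTest::addColumn<QString>("b");
            // The data tags ("identical", "different") become part of the benchmark identity.
            QTest::newRow("identical") << QString("hello") << QString("hello");
            QTest::newRow("different") << QString("hello") << QString("world");
        }

        void localeAwareCompare()
        {
            QFETCH(QString, a);
            QFETCH(QString, b);
            QBENCHMARK {
                // The measured code; QTestLib reports a single number per benchmark run.
                QString::localeAwareCompare(a, b);
            }
        }
    };

    QTEST_MAIN(tst_StringBenchmarks)
    #include "tst_stringbenchmarks.moc" // name must match the source file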

A benchmark result is produced within a context made up of the following components:

Metric
The resource used for measuring the benchmark.
Platform
The compiler and OS used for building and executing the benchmark.
Host
The physical computer on which the benchmark was executed.
Branch
The development branch (of the target product) tested by the executable benchmark program.
Snapshot
The timestamp and ID of the current branch revision.

Note that the host may identify a virtual machine, which in turn may run on different physical hosts. In that case the physical hosts should be as identical as possible with respect to both hardware and software.

BM currently assumes that git is used for revision control. The ID in the snapshot is the SHA-1 of the head commit at the time the benchmark was built, and the timestamp is the committer date.
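Conceptually, each uploaded result can be thought of as a single measured value tagged with this context. The following sketch merely illustrates that idea; the type and field names are ours, not BM's actual data model:

    #include <QDateTime>
    #include <QString>

    // Illustrative only: one benchmark result together with its execution context.
    // The field names are invented for this sketch and do not reflect BM's real schema.
    struct BenchmarkResult
    {
        // Benchmark identity (QTestLib terms)
        QString testCase;       // e.g. "tst_qstring"
        QString testFunction;   // e.g. "localeAwareCompare"
        QString dataTag;        // optional, e.g. "identical"

        // Execution context
        QString metric;         // resource used for measuring, e.g. walltime or instruction count
        QString platform;       // compiler + OS used to build and run the benchmark
        QString host;           // (possibly virtual) machine that ran the benchmark
        QString branch;         // development branch of the target product
        QString snapshotSha1;   // SHA-1 of the head commit when the benchmark was built
        QDateTime snapshotTime; // committer date of that commit

        double value;           // the single, non-negative measured number
    };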

Time series as snapshots of revision history

By varying the snapshot component only, successive runs of a particular benchmark form a time series of benchmark results. The notion of a performance regression is defined with respect to this time series. The question to ask is essentially:

How likely is it that a modification to the product itself is affecting the performance of the benchmark in a negative way?

(It would of course also be interesting to know if the benchmark performance is affected in a positive way — BM makes no distinction between the two cases.)

Ideally, new benchmark results should be produced each time the revision history is advanced. In practice this may be hard to achieve if the time it takes to produce a new set of benchmark results exceeds the product update interval.

Note that any modification to the host or platform of a particular time series (e.g. reconfigurations or upgrades) will in principle invalidate the history, since results from before and after the modification may no longer be comparable. The BM tool currently has no support for this scenario.

Change and stability

For our purpose, the only really interesting property of a sequence of benchmark results is how the results change along the revision history. Moreover, it is normally the current (i.e. most recent) change that we're most interested in. Who cares about how the benchmark performance changed one year ago if some code that was submitted yesterday is likely to cause a bad performance regression?

Ideally, Step 2 (defined earlier) will produce reliable numbers completely free of "statistical effects", but for various reasons this cannot be guaranteed in a real-life scenario. Since nobody wants to waste their time chasing false positives, we need some mechanism to help us reason about how much confidence to place in a reported change. BM takes a heuristic approach to this problem by letting the user define a difference tolerance (DT) and a stability tolerance (ST):

  • Two values are considered significantly equal if their relative difference does not exceed DT (i.e. if |(v1 - v2) / v1| ≤ DT)
  • If vn is the last value in the time series, the current change (if any) is the latest value vc before vn that differs significantly from vn.
  • A value v is considered stable if it is preceded consecutively by at least ST values that are significantly equal to v.
  • The current change is stable if both vc and vn are stable.
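To make these rules concrete, here is a minimal sketch of how they could be implemented. This is our own illustration of the heuristic rather than BM's actual code; dt and st correspond to DT and ST, and the series is assumed to be ordered by snapshot timestamp:

    #include <QVector>
    #include <cmath>

    // Two values are "significantly equal" if their relative difference is within dt.
    // Mirrors the formula above; assumes v1 is non-zero.
    static bool significantlyEqual(double v1, double v2, double dt)
    {
        return std::fabs((v1 - v2) / v1) <= dt;
    }

    // A value at position i is "stable" if it is preceded consecutively by at least
    // st values that are significantly equal to it.
    static bool isStable(const QVector<double> &series, int i, double dt, int st)
    {
        if (i < st)
            return false;
        for (int j = i - st; j < i; ++j) {
            if (!significantlyEqual(series.at(i), series.at(j), dt))
                return false;
        }
        return true;
    }

    // Returns the index of the "current change" vc: the latest value before the last
    // value vn that differs significantly from vn, or -1 if there is no such value.
    static int currentChangeIndex(const QVector<double> &series, double dt)
    {
        const int n = series.size() - 1; // index of vn
        for (int i = n - 1; i >= 0; --i) {
            if (!significantlyEqual(series.at(n), series.at(i), dt))
                return i;
        }
        return -1;
    }

    // The current change is "stable" if both vc and vn are stable.
    static bool currentChangeIsStable(const QVector<double> &series, double dt, int st)
    {
        const int c = currentChangeIndex(series, dt);
        if (c < 0)
            return false; // no current change at all
        return isStable(series, c, dt, st)
            && isStable(series, series.size() - 1, dt, st);
    }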

The following figures show examples of stable and non-stable current changes (ST is 4 in all cases):

[Figure: Stable current change]
[Figure: Non-stable current change 1]
[Figure: Non-stable current change 2]

By sorting the benchmarks on their current changes (considering both sign and magnitude) and on stability, we hope to get a useful indication of likely candidates for performance regressions. Once a potential regression is detected, it may need additional manual verification before eventually pinpointing the code modification(s) that caused it.
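As a rough illustration of such a ranking (again a sketch of ours, not BM's code), one could record the relative size and stability of each benchmark's current change and sort the stable, largest increases to the top:

    #include <QList>
    #include <QString>
    #include <algorithm>

    // Illustrative summary of one benchmark's time series; the names are invented for this sketch.
    struct BenchmarkSummary
    {
        QString name;
        double relativeChange; // (vn - vc) / vc; for time-like metrics, positive means slower
        bool stable;           // true if both vc and vn are stable
    };

    // Rank likely regressions first: stable changes before unstable ones,
    // then the largest increases of the metric before smaller ones.
    static bool moreInteresting(const BenchmarkSummary &a, const BenchmarkSummary &b)
    {
        if (a.stable != b.stable)
            return a.stable;
        return a.relativeChange > b.relativeChange;
    }

    static void rankByCurrentChange(QList<BenchmarkSummary> &summaries)
    {
        std::sort(summaries.begin(), summaries.end(), moreInteresting);
    }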

Architecture

The BM system consists of the following main components:

  • a bmserver (written in Qt) for managing the database of benchmark results,
  • a bmclient (written in Qt) for uploading data to and retrieving data from the bmserver, and
  • a web client (written in JavaScript) for presenting and exploring the most interesting benchmark results.

New benchmark results are uploaded to the bmserver by invoking the bmclient with the benchmark results (in a QTestLib XML output file) along with the execution context. Benchmark results already in the database may be retrieved directly through the bmclient (which can be useful for debugging), but normally it is better to use the web client. In the latter case the bmclient is invoked as a CGI program by a web server.

The following figure shows the client/server architecture with arrow directions indicating the main data flow:

[Figure: Client/server architecture]

The bmserver and bmclient communicate over an XML-based protocol, while communication between the web server and web client is based on Ajax and JSON.

For simplicity, the bmserver currently uses SQLite, but we plan to allow for other database backends (like PostgreSQL) if necessary.
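Qt's SQL module should make such a switch relatively painless, since only the driver name and connection settings are backend-specific. The sketch below illustrates this with an invented table layout (it is not BM's real schema); moving to PostgreSQL would mainly mean requesting the "QPSQL" driver instead of "QSQLITE":

    #include <QSqlDatabase>
    #include <QSqlQuery>

    // Open the results database and ensure the (hypothetical) result table exists.
    static bool openDatabase()
    {
        QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE"); // e.g. "QPSQL" for PostgreSQL
        db.setDatabaseName("bm.sqlite");
        if (!db.open())
            return false;

        QSqlQuery query;
        return query.exec(
            "CREATE TABLE IF NOT EXISTS result ("
            " benchmark TEXT, metric TEXT, platform TEXT, host TEXT,"
            " branch TEXT, sha1 TEXT, timestamp INTEGER, value REAL)");
    }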

Web interface

The web interface is organized into three main sections:

  • A platform summary section that shows overall statistics for each platform/branch combination.
  • A benchmark summary section that shows more detailed statistics for the most interesting benchmarks of a particular platform/branch combination (the ranking criteria may be defined by the user). Other features in this section include detailed views of individual time series, comparing the latest results of two branches, and comparing results in a local file against results in the database.
  • An overall settings section.

Here are some screenshots:

[Screenshot: Platform summary section]
[Screenshot: Benchmark summary section]
[Screenshot: Table]
[Screenshot: Graph]
[Screenshot: Graph 2]

Note: BM is currently known to work properly with Firefox only (partly due to the use of the HTML canvas element for graphics).

Status and plans

It goes without saying that a proper assessment of the usefulness of BM depends on the database containing a fair amount of production data. As we speak, the Qt benchmarks (under the /tests/benchmarks/ directory) are being run at regular intervals on a Linux platform.

We would like to invite anyone interested to give us feedback on the BM tool. Does this approach make sense? How can it be improved? Etc. The repository is here (and there's a README to get you started).

