Wide Finding

Published Friday June 27th, 2008
6 Comments on Wide Finding
Posted in Qt Concurrent, Threads

The Wide Finder Project is an informal parallel programming competition where the task is to compute web site statistcs from a 218-million line access log. Each entry will be benchmarked on a Sun T2000 with support for 32 hardware threads, giving lots of opportunities for parallel processing.

What makes this really interesting is that the project is not only about performance, but rather about writing code that scales to many CPU cores with as little extra programmer effort as possible. Some results are already in, with OCaml currently in the lead performance-wise.

Each log line looks something like this:

www.example.com – – [17/Jun/2007:21:37:17 -0700] “GET /ongoing/ongoing.atom HTTP/1.1” 304 – – “-” “Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4

Eager to bring the performance lead back to C++ where it belongs I started out writing my own implementation using QtConcurrent and the other Qt APIs. Briefly explained, the code uses QtConcurrent::mappedReduced to multi-thread the code, and then QByteArray::split() twice to iterate over each word in each line. The current version computes the number of hits for each page.

Results: (parsing 100K lines on an 8-core 2.8 GHz Mac Pro)

1 Thread
real 0m2.283s
user 0m1.141s
sys 0m0.249s

2 Threads:
real 0m1.446s
user 0m1.853s
sys 0m0.271s

1.6X speedup.. not too bad.

4 Threads:
real 0m3.186s
user 0m10.643s
sys 0m0.407s

1 second slower that the single-threaded version.. this does not bode well.

8 Threads:
real 0m7.000s
user 0m46.922s
sys 0m0.724s

Seven seconds! We get a nice linear scaling of the run-time as we increase the number of threads, but unfortunately in the wrong direction. The program us spending a lot of user time doing something though, so let’s run it through Shark and see what’s going on:

spinlock.png

80% in a spin-lock used by malloc/free. But who is calling malloc that much?

spinlock-split.png

Aha.. QByteArray::split(). While being a very convenient API, split() was clearly not designed for heavy parsing like this. Still, I’d like a less catastrophic impact on the run-time when adding threads, even if the program really is calling malloc/free to often. Let’s try with the ptmalloc memory allocator instead:

8 Threads:
real 0m0.908s
user 0m3.784s
sys 0m0.533s

ptmalloc is used in GNU/Linux though the GNU C library and scales much better on multicore systems. The program itself still does not scale beyond 4 threads, but it does not get significantly worse either when adding threads. I guess it’s debatable whether or not this is qualifies as a bug in the Darwin memory allocator, but at least ptmalloc shows that it is possible to do better.

That’s all for now ๐Ÿ™‚ For the next installment I’ll try to get better scaling, at the expense of increasing the developer effort.

Do you like this? Share it
Share on LinkedInGoogle+Share on FacebookTweet about this on Twitter

Posted in Qt Concurrent, Threads

6 comments

Adriaan de Groot says:

Well, at this point you really should be using Studio 12 + Qt on Solaris to get the most out of it; you can do a lot more parallelism testing with the Studio 12 tools. I just saw a very effective demo of determining bottlenecks in parallelism on exactly the test setup you describe. One thing I see is indeed the *huge* number of small allocations going on in Qt apps. You might want to look at libumem (I tihnk that’s the ‘interesting’ allocator under Solaris).

babali says:

Hello,

With a garbage collector you can do slices without mallocing.
See http://dotnot.org/blog/index.php for example.

Alex says:

Where is a link to the sample data file or do we have to create it? This is a poorly documented project with no clear rules/data/submit details, feels like I’m in freshman college without directions to the computer lab.

Adam Higerd says:

Actually, babali has a good point, although I wouldn’t necessarily use a garbage collector for it… but it should be feasible to deal in implicitly-shared substrings; I can’t see why it wouldn’t work. (Doesn’t QString even offer some explicitly-shared substring functions?) It wouldn’t even break API/ABI to switch QByteArray::split to return implicitly-shared substrings — might be a fun little experiment.

In other news, I’ve heard jemalloc is designed for this kind of parallelism; you might check it out.

Morten says:

Alex: The instructions for getting the data files can be found here: http://wikis.sun.com/display/WideFinder/Infrastructure

Adam: QStringRef is probably the class you’re thinking about. Modifying QByteArray:split could be possible, but unfortunately you still have to allocate a d-pointer for each returned bytearray (QStringRef shares a d-pointer with the referenced string, making it allocation free.) jemalloc looks promising, it would be interesting to benchmark it against ptmalloc and nedmalloc.

Adam Higerd says:

Yeah, QStringRef is exactly what I was talking about, but that is a point I hadn’t considered; QByteArray::fromData does still have to construct a QByteArray object, just not a buffer for it, and while QByteArrayRef could probably work it wouldn’t be a drop-in replacement — you’d need a new API. *sigh* Well, it was an idea. (And it’ll be fewer and smaller allocations than the current implementation anyway — it would still be a performance benefit for applications using split() even if it isn’t quite as parallelizable.)

Commenting closed.

Get started today with Qt Download now