The Wide Finder Project is an informal parallel programming competition where the task is to compute web site statistics from a 218-million-line access log. Each entry will be benchmarked on a Sun T2000 with support for 32 hardware threads, giving lots of opportunities for parallel processing.
What makes this really interesting is that the project is not only about performance, but also about writing code that scales to many CPU cores with as little extra programmer effort as possible. Some results are already in, with OCaml currently in the lead performance-wise.
Each log line looks something like this:
www.example.com - - [17/Jun/2007:21:37:17 -0700] "GET /ongoing/ongoing.atom HTTP/1.1" 304 - - "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:18.104.22.168) Gecko/20070515 Firefox/22.214.171.124"
Eager to bring the performance lead back to C++ where it belongs, I started out writing my own implementation using QtConcurrent and the other Qt APIs. Briefly explained, the code uses QtConcurrent::mappedReduced to spread the work across threads, and then calls QByteArray::split() twice, once to break the input into lines and once to break each line into words. The current version computes the number of hits for each page.
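To make the map/reduce structure concrete, here is a minimal sketch of the same idea in plain standard C++. It is not the QtConcurrent code from this post: std::async stands in for QtConcurrent::mappedReduced, and the names count_chunk, merge_counts and count_hits are made up for illustration. It also assumes the requested page is the seventh whitespace-separated word of a log line, as in the sample above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using HitCounts = std::map<std::string, long>;

// "Map" step: count hits per requested page for lines[begin, end).
HitCounts count_chunk(const std::vector<std::string>& lines,
                      std::size_t begin, std::size_t end) {
    HitCounts counts;
    for (std::size_t i = begin; i < end; ++i) {
        std::istringstream words(lines[i]);
        std::string word;
        // Assumption: the request path is word index 6, right after "GET.
        for (int field = 0; field <= 6 && (words >> word); ++field) {
            if (field == 6) ++counts[word];
        }
    }
    return counts;
}

// "Reduce" step: fold one partial result into the running total.
void merge_counts(HitCounts& total, const HitCounts& partial) {
    for (const auto& entry : partial) total[entry.first] += entry.second;
}

// Split the input into roughly equal chunks, map each chunk on its own
// task, then reduce the partial results sequentially on the caller's thread.
HitCounts count_hits(const std::vector<std::string>& lines,
                     std::size_t num_tasks) {
    assert(num_tasks > 0);
    const std::size_t chunk = (lines.size() + num_tasks - 1) / num_tasks;
    std::vector<std::future<HitCounts>> partials;
    for (std::size_t begin = 0; begin < lines.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, lines.size());
        partials.push_back(std::async(std::launch::async, [&lines, begin, end] {
            return count_chunk(lines, begin, end);
        }));
    }
    HitCounts total;
    for (auto& partial : partials) merge_counts(total, partial.get());
    return total;
}
```

Calling count_hits(lines, 8) would run up to eight map tasks. Note that each task still builds many short-lived std::string objects, so this sketch leans on the memory allocator in much the same way the split()-based version does.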
Results: (parsing 100K lines on an 8-core 2.8 GHz Mac Pro)
1.6X speedup.. not too bad.
1 second slower than the single-threaded version.. this does not bode well.
Seven seconds! We get a nice linear scaling of the run-time as we increase the number of threads, but unfortunately in the wrong direction. The program is spending a lot of user time doing something though, so let’s run it through Shark and see what’s going on:
80% in a spin-lock used by malloc/free. But who is calling malloc that much?
Aha.. QByteArray::split(). While being a very convenient API, split() was clearly not designed for heavy parsing like this. Still, I’d like a less catastrophic impact on the run-time when adding threads, even if the program really is calling malloc/free too often. Let’s try with the ptmalloc memory allocator instead:
ptmalloc is used in GNU/Linux through the GNU C library and scales much better on multicore systems. The program itself still does not scale beyond 4 threads, but it does not get significantly worse either when adding threads. I guess it’s debatable whether or not this qualifies as a bug in the Darwin memory allocator, but at least ptmalloc shows that it is possible to do better.
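For comparison, here is a rough sketch of what allocation-free tokenizing could look like. It is not the code used for the measurements in this post: it uses C++17 std::string_view instead of QByteArray, and nth_field is a hypothetical helper name. The point is only that a field can be located by scanning the existing buffer, without creating a new byte array per word the way split() does.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Return the n-th space-separated field (0-based) of `line` as a view into
// the original buffer, or an empty view if the line has fewer fields.
// Nothing is copied or allocated: the result points into `line`.
std::string_view nth_field(std::string_view line, std::size_t n) {
    std::size_t pos = 0;
    while (true) {
        pos = line.find_first_not_of(' ', pos);       // skip separators
        if (pos == std::string_view::npos) return {}; // ran out of fields
        std::size_t end = line.find(' ', pos);        // end of this field
        if (end == std::string_view::npos) end = line.size();
        assert(pos < end && end <= line.size());      // field is non-empty
        if (n == 0) return line.substr(pos, end - pos);
        --n;
        pos = end;
    }
}
```

With the sample log line above, nth_field(line, 6) would yield the requested page. The returned view is only valid as long as the underlying buffer is, so a page name has to be copied once if it is stored as a key in a hit-count table.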
That’s all for now 🙂 For the next installment I’ll try to get better scaling, at the expense of increasing the developer effort.