I wrote previously about my attempts to engineer a concurrency-friendly memory allocation algorithm for Windows platforms that would be efficient for the most demanding multi-threaded and multi-core use.

I made available in the previous post an implementation, OctAlloc, which performs way better than the default heap allocator in Windows XP or 2003, and fares well against the Vista allocator in highly concurrent situations, but is less performant than the Vista allocator for single-threaded multi-core use.

I have since made two optimizations to OctAlloc which improve its performance beyond the Vista allocator in all respects. The optimizations are as follows:
  • The previously posted version of OctAlloc uses thread-local storage to store the most likely CPU number that the current thread is running on. The TlsGetValue() and TlsSetValue() operations this requires are however somewhat expensive, and do not pay off when there's only a single thread. The new version uses a less reliable, more heuristic, but also highly more efficient strategy by using GetCurrentThreadId(), an inexpensive call, to choose the allocation buckets for the executing thread. This improves performance by some 25%.
  • The previous version of OctAlloc always pops a memory chunk and returns it back to an atomic list after allocating a unit for it. This requires two synchronized operations for every call. The new version keeps handy an additional list of preallocated units which do not require popping and returning a memory chunk. This requires only one synchronized operation for most allocations.
My measurements for the new version follow at the bottom of this post below. According to my measurements, OctAlloc outperforms the Vista allocator in performance as well as tightness in managing memory.

Note that the performance figures are not comparable directly to the measurements I posted in the previous OctAlloc article. Previously, the test program kept the load-per-thread constant regardless of the number of test threads, causing 8- and 16-thread measurements to run very long. The new version of the test program varies the load-per-thread depending on the number of threads and the number of available CPUs. An algorithm with no contention should now complete in the same time regardless of the number of threads; if it takes longer to complete with multiple threads than it does with one, the difference is attributable directly to contention.

Download the new test binaries and the source code:

OctAlloc2.zip (51 kB)