I made available in the previous post an implementation, OctAlloc, which performs way better than the default heap allocator in Windows XP or 2003, and fares well against the Vista allocator in highly concurrent situations, but is less performant than the Vista allocator for single-threaded multi-core use.
I have since made two optimizations to OctAlloc which improve its performance beyond the Vista allocator in all respects. The optimizations are as follows:
- The previously posted version of OctAlloc uses thread-local storage to store the most likely CPU number that the current thread is running on. The TlsGetValue() and TlsSetValue() operations this requires are however somewhat expensive, and do not pay off when there's only a single thread. The new version uses a less reliable, more heuristic, but also highly more efficient strategy by using GetCurrentThreadId(), an inexpensive call, to choose the allocation buckets for the executing thread. This improves performance by some 25%.
- The previous version of OctAlloc always pops a memory chunk and returns it back to an atomic list after allocating a unit for it. This requires two synchronized operations for every call. The new version keeps handy an additional list of preallocated units which do not require popping and returning a memory chunk. This requires only one synchronized operation for most allocations.
Note that the performance figures are not comparable directly to the measurements I posted in the previous OctAlloc article. Previously, the test program kept the load-per-thread constant regardless of the number of test threads, causing 8- and 16-thread measurements to run very long. The new version of the test program varies the load-per-thread depending on the number of threads and the number of available CPUs. An algorithm with no contention should now complete in the same time regardless of the number of threads; if it takes longer to complete with multiple threads than it does with one, the difference is attributable directly to contention.
Download the new test binaries and the source code:
OctAlloc2.zip (51 kB)
Showing 7 out of 7 comments, oldest first:
Comment on Dec 17, 2007 at 20:29 by Anonymous
void* __fastcall OctAllocFlex(OctSizeT* size)
{
if (*size > c_bucketSizes[0])
{
*size = (*size + 0xFFFF) / 0x10000;
return VirtualAlloc(0, *size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
}
else
{
Is surely wrong, and indeed crashes my code when the VirtualAlloc is invoked as the size is divided by 0x10000. Sureley the code should be:
void* __fastcall OctAllocFlex(OctSizeT* size)
{
if (*size > c_bucketSizes[0])
{
*size = ((*size + 0xFFFF) / 0x10000)* 0x10000;
return VirtualAlloc(0, *size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
}
else
{
So the size is restored in the correct block size.
Is this right or I am going mad?
Thanks,
Gareth.
Comment on Dec 17, 2007 at 21:07 by denisbider
yes, there's that obvious bug. Your fix looks right.
I stumbled upon this bug after publishing OctAlloc, but I didn't get around to updating the published archive because it's somewhat trivial and I wasn't sure that anyone is even interested.
I will upload a fixed version now though. Thanks for reminding me.
Comment on Dec 17, 2007 at 21:17 by denisbider
Comment on Mar 29, 2010 at 23:01 by Anonymous
I'm a researcher from a Federal University from Brazil and I'm studing about memory allocators in general. In this moment, I'm researching about the Windows memory allocator: HeapAlloc.
Do you know how does it work? What kind of data structure is used (say, double linked list, queue, ...)? Where can I find more details about the source of the function HeapAlloc?
Thank you!
Comment on Mar 30, 2010 at 14:54 by denisbider
Comment on Sep 8, 2010 at 20:22 by Anonymous
Comment on Sep 13, 2010 at 22:50 by denisbider
there is a better than 99% probability that this article is completely unrelated to why you're having trouble with your computer.