The array-of-freelists that the buddy system uses is great for allocations. If there's already a free item of the right size, it can be returned in constant time.
In a "bucket" allocator, we shoot for that O(1) allocation performance by completely separating our free lists. When an allocation request of a given size comes in we allocate a new block of memory specifically to put on that size-specific free list (or "bucket").
If we allowed any arbitrary allocation size, this could result in up to 4096 free lists just for sizes up to one page. That's feasible, but wasteful: each size class in use ties up at least one page of its own. Instead, a bucket allocator only handles a set list of sizes - commonly powers of two and the halfway points between powers of two. So for sizes up to a page, the list might be:
4, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256,
384, 512, 768, 1024, 1536, 2048, 3072, 4096
That's 20 sizes, so allocating a page for each size might cost 80k.
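To make the size classes concrete, here's a minimal sketch in C (the names are made up, and a real allocator would replace the linear scan with bit tricks on the size):

```c
#include <stddef.h>

/* The size classes from the list above. */
static const size_t bucket_sizes[] = {
    4, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256,
    384, 512, 768, 1024, 1536, 2048, 3072, 4096
};
#define NUM_BUCKETS (sizeof bucket_sizes / sizeof bucket_sizes[0])

/* Return the index of the smallest bucket that fits `size`,
 * or NUM_BUCKETS for requests bigger than a page. */
static size_t size_to_bucket(size_t size)
{
    for (size_t i = 0; i < NUM_BUCKETS; i++)
        if (size <= bucket_sizes[i])
            return i;
    return NUM_BUCKETS; /* large allocation: handled separately below */
}
```

Because there's a class at the halfway point between each pair of powers of two, rounding up never wastes much more than a third of the returned block (e.g. a 2049-byte request lands in the 3072 bucket).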
Problem: Where can we put the size field? A header on each allocation would make an exact power-of-two request (say, 64 bytes) overflow into the next bucket up.
Solution: At the beginning of the page. All the allocations on a page are the same size, and we can find the beginning of the page with pointer arithmetic, so one size field serves every allocation on it. This means that our free list becomes a list of blocks (pages) with at least one free slot.
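A sketch of that layout in C; the header fields here are illustrative, but the pointer arithmetic is the whole trick:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical per-page header: one size field serves every slot on the
 * page, and `next` links pages that still have a free slot. */
struct page_header {
    size_t slot_size;         /* bucket size of every allocation here */
    struct page_header *next; /* next page in this bucket's free list */
};

/* Round any interior pointer down to the page boundary to find it. */
static struct page_header *header_of(void *p)
{
    return (struct page_header *)((uintptr_t)p & ~(uintptr_t)(PAGE_SIZE - 1));
}
```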
Problem: How do we coalesce?
Solution: Since a 64-byte block is always a 64-byte block, we only need to worry about the case where an entire page is free, so we can unmap it. We can store "free" bits as a bitmap at the beginning of the page. Even for 4-byte allocations there are only 4096/4 = 1024 slots, so the bitmap needs just 128 bytes = 16 words = 2 cache lines.
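Continuing the hypothetical layout above, the bitmap and the whole-page-free check might look like this (a real layout would also reserve the bits covering the header region itself):

```c
#include <stdint.h>

/* One free bit per slot, stored right after the page header.  Sixteen
 * 64-bit words cover the worst case of 4096/4 = 1024 slots. */
struct page {
    struct page_header hdr;
    uint64_t free_bits[16];
    /* ... the slots themselves fill the rest of the page ... */
};

static void mark_slot_free(struct page *pg, size_t slot)
{
    pg->free_bits[slot / 64] |= (uint64_t)1 << (slot % 64);
}

/* "Coalescing" reduces to: is every slot on this page free?
 * If so, the whole page can be unmapped. */
static int page_is_empty(const struct page *pg)
{
    size_t nslots = PAGE_SIZE / pg->hdr.slot_size;
    for (size_t i = 0; i < nslots; i++)
        if (!(pg->free_bits[i / 64] & ((uint64_t)1 << (i % 64))))
            return 0;
    return 1;
}
```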
Problem: How do we handle allocations of a page or more? Rounding the pointer down no longer finds the beginning of the block if a block can hold multiple such allocations.
Solution: Give each large allocation its own block; then rounding down works again, because the pointer we hand out always sits on the block's first page. It may be more efficient to use a block size larger than one page.
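A hedged sketch of that large-allocation path, using an anonymous mmap so each request gets a fresh block. Since the returned pointer is on the block's first page, header_of() from above still finds the metadata:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Sketch: one allocation per block.  `slot_size` records the whole
 * mapping length so free() knows how much to munmap. */
static void *alloc_large(size_t size)
{
    size_t total = (sizeof(struct page_header) + size + PAGE_SIZE - 1)
                   & ~(size_t)(PAGE_SIZE - 1);
    struct page_header *h = mmap(NULL, total, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (h == MAP_FAILED)
        return NULL;
    h->slot_size = total;   /* the block's one "slot" is the whole thing */
    h->next = NULL;
    return h + 1;           /* user memory starts right after the header */
}
```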
Bucket allocator advantages:
- O(1) allocation and free: no searching, no splitting, and almost no coalescing.
- Tiny per-allocation overhead: the size is stored once per page, not once per allocation.
Bucket allocator disadvantages:
- Internal fragmentation: every request is rounded up to the nearest bucket size (up to roughly a third wasted between classes).
- A page can only be reclaimed when every slot on it is free, so a few long-lived allocations can pin mostly-empty pages.
Any allocator that's based on a shared global data structure (a freelist or array of freelists) will cause trouble with multiple threads. That global structure is shared data, and if we protect it with a lock then we can't malloc or free in multiple threads in parallel.
This story makes the most sense if we start from a single global traditional freelist.
Advantages:
- Simple and obviously correct: one lock protects one structure, and any thread can free memory allocated by any other.
Disadvantages:
- Every malloc and free in every thread contends for that one lock, so allocation serializes and becomes a scalability bottleneck.
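For concreteness, a minimal sketch of that baseline: one global list behind one pthread mutex, where every thread funnels through the same lock:

```c
#include <pthread.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static struct free_node { struct free_node *next; } *free_list;

/* Pop one node off the global list.  Correct in any number of
 * threads, but they all serialize on heap_lock. */
static void *global_alloc(void)
{
    pthread_mutex_lock(&heap_lock);
    struct free_node *n = free_list;
    if (n)
        free_list = n->next;
    pthread_mutex_unlock(&heap_lock);
    return n;
}
```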
The word "arena" in memory allocators is used to refer to the data structure associated with a complete self-sufficient allocator: the free list or array of free lists.
By having several arenas, we can allow each arena to have its own mutex. With enough arenas (e.g. 1 per thread), this can avoid lock contention entirely except on objects that are allocated in one thread and freed in another.
This mechanism works pretty well with buckets. The block metadata mechanism allows freeing to work independently of which arena the block is associated with, as long as the block metadata includes (or points to) a mutex to hold during the free.
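A sketch of what per-arena locking might look like (the struct names are illustrative):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct arena {
    pthread_mutex_t lock;
    /* per-bucket lists of pages with free slots live here */
};

/* Block metadata now records which arena owns the block. */
struct block_meta {
    struct arena *owner;
    size_t slot_size;
};

/* Lock the block's own arena, not the calling thread's, so memory
 * malloc'd in one thread can safely be freed in another. */
static void arena_free(void *p)
{
    struct block_meta *m = (struct block_meta *)
        ((uintptr_t)p & ~(uintptr_t)(PAGE_SIZE - 1));
    pthread_mutex_lock(&m->owner->lock);
    /* ... mark p's slot free in the block's bitmap ... */
    pthread_mutex_unlock(&m->owner->lock);
}
```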
ptmalloc: The GNU libc allocator
tcmalloc: Google's Thread-Caching Malloc
jemalloc: Facebook's Multi-Arena Allocator
Here's the jemalloc video: https://www.youtube.com/watch?v=RcWp5vwGlYU
One of the key tricks that makes jemalloc good is the madvise syscall with the MADV_DONTNEED flag. This tells the kernel that the program wants to keep one or more virtual pages but no longer needs the data in them. Future accesses to those pages may be served by newly allocated physical pages of zeros.
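In code, purging a run of free pages is a single call (assuming addr is page-aligned and the mapping is anonymous):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Keep the virtual range but let the kernel reclaim the physical
 * pages; the next touch of these addresses reads as zeros. */
static int purge_pages(void *addr, size_t len)
{
    return madvise(addr, len, MADV_DONTNEED);
}
```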
Another neat tool is MADV_HUGEPAGE. This requests that the kernel use 2MB pages rather than 4k pages. This means each TLB entry handles 2MB rather than 4k, which can speed things up significantly.
There is also support for 1GB pages, but those are too big for most use cases currently. Maybe for machines with TB of RAM...
Unfortunately, MADV_DONTNEED doesn't work with huge pages.
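For completeness, a sketch of asking for huge pages on a fresh mapping; madvise is advisory, so the kernel may ignore the request:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Map a large region and ask for 2MB transparent huge pages.  Per the
 * note above, don't expect MADV_DONTNEED-style purging to mix well
 * with huge-page-backed memory. */
static void *map_huge(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
        madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```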