Each partition in PartitionAlloc uses a single, central,
slab-based allocator to conserve memory, with a minimal per-thread cache in front for scaling to multi-threaded workloads. This simplicity also pays performance dividends: we’ve extensively profiled and aggressively trimmed the allocator’s fast path, speeding up thread-local storage access, streamlining locking, reducing cache line fetches, and removing branches.
PartitionAlloc pre-reserves slabs of virtual address space, which are gradually backed by physical memory as allocation requests arrive. Small and medium-sized allocations are grouped in geometrically-spaced, size-segregated buckets, e.g. [241; 256], [257; 288]. Each slab is split into regions (called “slot spans”) that satisfy allocations (“slots”) from only one particular bucket, which increases cache locality while lowering fragmentation. Larger allocations, in contrast, bypass the bucket logic and are fulfilled using the operating system’s primitives directly (mmap() on POSIX systems, and VirtualAlloc() on Windows).
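To make the bucketing concrete, here is a minimal sketch of how geometrically-spaced bucket boundaries can be computed. It assumes 8 sub-buckets per power-of-two interval, a parameter chosen so that the sketch reproduces the [241; 256] and [257; 288] boundaries above; it is an illustration of the idea, not PartitionAlloc’s actual code.

```cpp
#include <cstddef>
#include <cstdio>

// Assumption for illustration: each [2^n, 2^(n+1)) interval is split into
// 8 sub-buckets, so bucket boundaries grow by roughly 12.5% per step.
constexpr size_t kSubBucketsPerDoubling = 8;

// Round a requested size up to the slot size of its bucket.
size_t BucketSlotSize(size_t size) {
  if (size <= kSubBucketsPerDoubling)
    return kSubBucketsPerDoubling;  // clamp tiny sizes, for simplicity
  // Highest power of two strictly below `size` (GCC/Clang builtin).
  int log2 = 63 - __builtin_clzll(size - 1);
  size_t step = (size_t{1} << log2) / kSubBucketsPerDoubling;
  return (size + step - 1) / step * step;  // round up to a bucket boundary
}

int main() {
  // Sizes 241..256 land in the [241; 256] bucket, 257..288 in [257; 288].
  for (size_t s : {241, 256, 257, 288})
    std::printf("request %3zu -> slot size %zu\n", s, BucketSlotSize(s));
}
```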
This central allocator is protected by a single per-partition lock. To mitigate the scalability problem arising from contention, we add a small, per-thread cache of small slots in front, yielding a three-tiered architecture:
The first layer (Per-thread cache) holds a small number of slots belonging to smaller, more commonly used buckets. Because these slots are stored per-thread, they can be allocated without taking a lock, requiring only a fast thread-local storage lookup and improving cache locality in the process. The per-thread cache has been tailored to satisfy the majority of requests by allocating from and releasing memory to the second layer in batches, amortizing lock acquisition and further improving locality while not trapping excess memory.
The second layer (Slot span free-lists) is invoked upon a per-thread cache miss. For each bucket, PartitionAlloc keeps track of a slot span with free slots associated with that size, and takes a slot from that span’s free-list. This is still a fast path, but slower than the per-thread cache as it requires acquiring a lock. However, this layer is only hit for larger allocations not supported by the per-thread cache, or in batches to fill the per-thread cache.
Finally, if there are no free slots in the bucket, the third layer (Slot span management) either carves out space from a slab for a new slot span, or allocates an entirely new slab from the operating system, which is a slow but very infrequent operation.
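Putting the three layers together, here is a simplified, runnable sketch of the allocation path. All names and constants (ThreadCache, Bucket, kBatchSize, and so on) are assumptions for illustration covering a single bucket; in particular, the third layer is simulated with malloc(), whereas the real allocator carves slot spans out of pre-reserved slabs.

```cpp
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <vector>

constexpr size_t kSlotSize = 256;     // one bucket only, for illustration
constexpr size_t kSlotsPerSpan = 64;  // assumed slot span capacity
constexpr size_t kBatchSize = 8;      // assumed thread-cache refill batch

// Layers 2 and 3: shared state, guarded by one mutex standing in for the
// per-partition lock.
struct Bucket {
  std::mutex lock;
  std::vector<void*> free_slots;  // slot span free-list, flattened

  // Layer 3: provision a new slot span. This toy mallocs a block and chops
  // it into slots (and never frees it); the real allocator commits memory
  // from a pre-reserved slab instead.
  void ProvisionSlotSpan() {
    char* span = static_cast<char*>(std::malloc(kSlotSize * kSlotsPerSpan));
    for (size_t i = 0; i < kSlotsPerSpan; ++i)
      free_slots.push_back(span + i * kSlotSize);
  }
};

// Layer 1: per-thread cache; lock-free because it is only ever touched by
// its owning thread.
struct ThreadCache {
  std::vector<void*> slots;
};

void* Alloc(ThreadCache& tcache, Bucket& bucket) {
  if (tcache.slots.empty()) {
    // Cache miss: refill from layer 2 in one batch, amortizing the lock
    // acquisition over kBatchSize future allocations.
    std::lock_guard<std::mutex> guard(bucket.lock);
    if (bucket.free_slots.size() < kBatchSize)
      bucket.ProvisionSlotSpan();  // layer 3: slow but infrequent
    for (size_t i = 0; i < kBatchSize; ++i) {
      tcache.slots.push_back(bucket.free_slots.back());
      bucket.free_slots.pop_back();
    }
  }
  // Fast path: pop a slot from the per-thread cache, no lock taken.
  void* slot = tcache.slots.back();
  tcache.slots.pop_back();
  return slot;
}

int main() {
  Bucket bucket;
  ThreadCache tcache;
  void* p = Alloc(tcache, bucket);
  std::printf("slot at %p; next %zu allocations skip the lock\n",
              p, kBatchSize - 1);
}
```

Freeing works symmetrically in this scheme: slots return to the per-thread cache first, and are flushed back to the slot span free-list in batches once the cache grows past its limit, so excess memory isn’t trapped per-thread.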
The overall performance and space-efficiency of the allocator hinge on the many tradeoffs across its layers, such as how much to cache, how many buckets to use, and the memory reclaiming policy. Please refer to PartitionAlloc to learn more about the design.
All in all, we hope you will enjoy the additional memory savings and performance improvements brought by PartitionAlloc, ensuring a safer, leaner, and faster Chrome for users on Earth and in outer space alike. Stay tuned for further improvements and support for more platforms in the near future.
Posted by Benoît Lizé and Bartek Nowierski, Chrome Software Engineers