[RFC PATCH 00/11] Thread-local heap implementation.

Wed May 6 08:42:48 CDT 2020

On 5/6/20 2:38 PM, Dmitry Timoshkov wrote:
> Rémi Bernon <rbernon at codeweavers.com> wrote:
> 
>> This is a heap implementation based on thread-local structures, that I
>> have been keeping locally for quite some time. The goal was to improve
>> Wine's heap performance in multithreaded scenarios and see if it could
>> help performance in some games.
>>
>> The good news is that this implementation is performing well, according
>> to third-party heap micro benchmarks. The bad news is that it doesn't
>> change performance much in general, as allocations are usually scarce
>> during gameplay. I could still see improvements for loading times, and
>> less stalling as well.
> 
> Have you looked at Sebastian's heap improvement patches in the staging
> tree? According to Sebastian's and Michael's testing "The new heap allocator
> uses (inspired by the way how it works on Windows) various fixed-size free
> lists, and a tree data structure for large elements. With this implementation,
> I get up to [b]60%[/b] improvement for apps with the "bad allocation pattern",
> and up to [b]15%[/b] improvement in the "good case"."
> 

I believe these patches are also shipped in Proton, and although they
perform better than the upstream heap, there is still a lot of
contention when multiple threads try to (de)allocate at the same time.

For reference I used https://github.com/mjansson/rpmalloc-benchmark as a
raw performance measurement. It starts a given number of threads, with
each thread doing a fixed number of iterations. On every iteration the
thread allocates and frees a certain amount of memory, possibly with
cross-thread allocation every other iteration, then does a given amount
of computation using the allocated buffers as storage. It then measures
the time it took to do all these operations.
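
As a rough illustration of that loop (this is not the rpmalloc-benchmark
code, only a minimal sketch under my own assumptions), each worker below
allocates a random-sized buffer, touches it, and frees it, counting two
memory ops per iteration. The names run_benchmark, worker_main and
next_random are hypothetical, the sketch uses plain malloc/free with
pthreads, and it leaves out the cross-thread case and the timing:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64

/* Per-thread benchmark state (hypothetical layout, for illustration). */
struct worker {
    int iterations;
    size_t min_size, max_size;
    uint32_t seed;
    uint64_t ops;          /* number of malloc + free calls performed */
};

/* Small linear congruential generator, so each thread has its own state. */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    for (int i = 0; i < w->iterations; i++) {
        size_t range = w->max_size - w->min_size + 1;
        size_t size = w->min_size + next_random(&w->seed) % range;
        unsigned char *buf = malloc(size);
        if (!buf) break;
        memset(buf, i & 0xff, size);   /* stand-in for the computation step */
        free(buf);
        w->ops += 2;                   /* one allocation, one deallocation */
    }
    return NULL;
}

/* Runs num_threads workers concurrently and returns the total op count. */
uint64_t run_benchmark(int num_threads, int iterations,
                       size_t min_size, size_t max_size)
{
    pthread_t threads[MAX_THREADS];
    struct worker workers[MAX_THREADS];
    uint64_t total = 0;

    assert(num_threads > 0 && num_threads <= MAX_THREADS);
    assert(min_size <= max_size);

    for (int i = 0; i < num_threads; i++) {
        workers[i] = (struct worker){ iterations, min_size, max_size,
                                      12345u + (uint32_t)i, 0 };
        int err = pthread_create(&threads[i], NULL, worker_main, &workers[i]);
        assert(err == 0);
    }
    for (int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL);
        total += workers[i].ops;
    }
    return total;
}
```

Dividing the returned count by the CPU time spent around the call would
give a "memory ops/CPU second" style figure, although the real benchmark
computes its numbers in its own way.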

For instance, with these benchmark parameters as indicated on their 
sample result page[1]:

   <num threads> 0 0 2 20000 50000 5000 16 1000

I have the following results with the various implementations and using 
two concurrent threads (the higher the number of threads, the worse it 
gets, especially for the default Wine heap):

* linux crt:      5675754 memory ops/CPU second, 53% overhead
* wine  rpmalloc: 19700003 memory ops/CPU second, 131% overhead
* wine  upstream: 248333 memory ops/CPU second, 62% overhead
* wine  staging:  914004 memory ops/CPU second, 61% overhead
* wine  lfh:      10651300 memory ops/CPU second, 114% overhead

(linux crt is for running the benchmark natively, wine rpmalloc is their 
rpmalloc benchmark cross compiled and executed with wine, wine lfh is 
this patch series)

So the staging patches perform much better than the default heap, but
the result is still pretty bad when multiple threads are involved.

My opinion is that, with the heap critical section being held on every
(de)allocation, it is going to be very hard to see big improvements,
regardless of the inner allocation algorithm.
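
To make the point concrete, here is a minimal sketch of the general
thread-local idea. It is not the code from this patch series, and every
name in it (tl_alloc, tl_free, tl_flush, heap_lock, BLOCK_SIZE) is
hypothetical. A per-thread free list serves the common case without
touching the shared lock, which is only taken when the local list is
empty or when a thread hands its blocks back:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 64   /* single fixed block size, to keep the sketch short */

struct block { struct block *next; };

/* Shared state, protected by heap_lock (plays the role of the heap CS). */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static struct block *shared_free_list;
static unsigned long lock_acquisitions;

/* Per-thread cache of free blocks, accessed without any locking. */
static _Thread_local struct block *local_free_list;

/* Slow path: take the lock and try the shared list, else ask malloc. */
static void *shared_alloc(void)
{
    struct block *b;
    pthread_mutex_lock(&heap_lock);
    lock_acquisitions++;
    b = shared_free_list;
    if (b) shared_free_list = b->next;
    pthread_mutex_unlock(&heap_lock);
    return b ? (void *)b : malloc(BLOCK_SIZE);
}

/* Fast path: pop from the calling thread's own list, no lock involved. */
void *tl_alloc(void)
{
    struct block *b = local_free_list;
    if (b) {
        local_free_list = b->next;
        return b;
    }
    return shared_alloc();
}

/* Freed blocks go to the freeing thread's list, again without the lock. */
void tl_free(void *ptr)
{
    struct block *b = ptr;
    if (!b) return;
    b->next = local_free_list;
    local_free_list = b;
}

/* Hands the whole local list back to the shared list under one lock. */
void tl_flush(void)
{
    struct block *head = local_free_list, *tail = head;
    if (!head) return;
    while (tail->next) tail = tail->next;
    local_free_list = NULL;
    pthread_mutex_lock(&heap_lock);
    lock_acquisitions++;
    tail->next = shared_free_list;
    shared_free_list = head;
    pthread_mutex_unlock(&heap_lock);
}

/* Number of times the shared lock has been taken so far. */
unsigned long tl_lock_acquisitions(void)
{
    unsigned long n;
    pthread_mutex_lock(&heap_lock);
    n = lock_acquisitions;
    pthread_mutex_unlock(&heap_lock);
    return n;
}
```

In this sketch a thread that repeatedly allocates and frees only takes
the lock for its first allocation, whereas a heap that enters the
critical section on every call would take it each time. A block freed by
another thread simply migrates to that thread's list here; a real
implementation needs a more careful answer for cross-thread frees.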

[1] 
https://github.com/mjansson/rpmalloc-benchmark/blob/master/BENCHMARKS.md#random-size-in-16-1000-range

2 0 0 2 20000 50000 5000 16 1000
-- 
Rémi Bernon <rbernon at codeweavers.com>