[RFC PATCH 00/11] Thread-local heap implementation.

Wed May 6 11:09:13 CDT 2020

On 5/6/20 5:32 PM, Dmitry Timoshkov wrote:
> Rémi Bernon <rbernon at codeweavers.com> wrote:
> 
>>>> This is a heap implementation based on thread-local structures, that I
>>>> have been keeping locally for quite some time. The goal was to improve
>>>> Wine's heap performance in multithreaded scenarios and see if it could
>>>> help performance in some games.
>>>>
>>>> The good news is that this implementation is performing well, according
>>>> to third-party heap micro benchmarks. The bad news is that it doesn't
>>>> change performance much in general, as allocations are usually scarce
>>>> during gameplay. I could still see improvements for loading times, and
>>>> less stalling as well.
>>>
>>> Have you looked at Sebastian's heap improvement patches in the staging
>>> tree? According to Sebastian's and Michael's testing "The new heap allocator
>>> uses (inspired by the way how it works on Windows) various fixed-size free
>>> lists, and a tree data structure for large elements. With this implementation,
>>> I get up to 60% improvement for apps with the "bad allocation pattern",
>>> and up to 15% improvement in the "good case"."
>>>
>>
>> I believe these patches are also shipped in Proton, and although they
>> perform better than the upstream heap, there's still a lot of
>> contention when multiple threads try to (de)allocate at the same time.
>>
>> For reference I used https://github.com/mjansson/rpmalloc-benchmark as
>> a raw performance measurement. It starts a given number of threads, with
>> each thread doing a fixed number of iterations. On every iteration the
>> thread allocates and frees a certain amount of memory, possibly with
>> cross-thread allocation every other iteration, then does a given amount
>> of computation using the allocated buffers as storage. Then it measures
>> the time it took to do all these operations.
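
To make the description above concrete, here is a rough structural sketch
in Python. It is not the rpmalloc-benchmark code: every name in it
(run_benchmark, worker, drain and the parameters) is made up for
illustration, and it assumes that "cross-thread" means a batch of buffers
allocated by one thread gets released by another. Python has its own
allocator and a global interpreter lock, so the figures it produces say
nothing about Wine's heap; only the shape of the loop is the point.

```python
import queue
import threading
import time


def drain(inbox):
    """Release every batch another thread handed over; return blocks freed."""
    freed = 0
    while True:
        try:
            foreign = inbox.get_nowait()
        except queue.Empty:
            return freed
        freed += len(foreign)
        del foreign  # drop this thread's reference to the batch


def worker(index, inboxes, iterations, blocks_per_iter, block_size,
           compute_rounds, op_counts):
    """One benchmark thread: allocate, compute, free, repeat."""
    inbox = inboxes[index]
    outbox = inboxes[(index + 1) % len(inboxes)]  # the neighbour's inbox
    ops = 0
    for i in range(iterations):
        # Allocate a batch of buffers (one memory op per buffer).
        blocks = [bytearray(block_size) for _ in range(blocks_per_iter)]
        ops += len(blocks)
        # Use the buffers as storage for some computation.
        for _ in range(compute_rounds):
            for block in blocks:
                block[0] = (block[0] + 1) & 0xFF
        if i % 2 == 1:
            # Assumption: every other iteration the batch is passed to
            # another thread, which counts and releases it.
            outbox.put(blocks)
        else:
            ops += len(blocks)  # freed locally
        blocks = None
        ops += drain(inbox)
    op_counts[index] = ops


def run_benchmark(num_threads, iterations, blocks_per_iter, block_size,
                  compute_rounds):
    """Return (total memory ops, process CPU seconds) for one run."""
    inboxes = [queue.Queue() for _ in range(num_threads)]
    op_counts = [0] * num_threads
    threads = [
        threading.Thread(target=worker,
                         args=(i, inboxes, iterations, blocks_per_iter,
                               block_size, compute_rounds, op_counts))
        for i in range(num_threads)
    ]
    start = time.process_time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Batches still queued when the workers exited are released here.
    leftovers = sum(drain(q) for q in inboxes)
    cpu_seconds = time.process_time() - start
    return sum(op_counts) + leftovers, cpu_seconds
```

Dividing the first returned value by the second gives a figure in the
same unit as the results below (memory ops per CPU second), for example
with run_benchmark(2, 1000, 16, 64, 10).
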
>>
>> For instance, with these benchmark parameters as indicated on their
>> sample result page[1]:
>>
>>     <num threads> 0 0 2 20000 50000 5000 16 1000
>>
>> I have the following results with the various implementations and using
>> two concurrent threads (the higher the number of threads, the worse it
>> gets, especially for the default Wine heap):
>>
>> * linux crt:       5675754 memory ops/CPU second,  53% overhead
>> * wine  rpmalloc: 19700003 memory ops/CPU second, 131% overhead
>> * wine  upstream:   248333 memory ops/CPU second,  62% overhead
>> * wine  staging:    914004 memory ops/CPU second,  61% overhead
>> * wine  lfh:      10651300 memory ops/CPU second, 114% overhead
> 
> Do you have the numbers for various Windows flavours on the same hardware?
> 

I only have Windows 10 physically installed. With the same set of
parameters, the results are roughly equivalent to what I get with these
patches:

   11977625 memory ops/CPU second, 106% overhead
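
To put the throughput figures from this thread on one scale, a small
Python sketch like the following expresses each one as a multiple of the
upstream Wine heap. The numbers are copied from the results quoted above;
the dictionary keys and the relative_to helper are just names chosen for
this example. By this arithmetic the "wine lfh" entry comes out at roughly
43 times the upstream figure.

```python
# Throughput figures quoted in this thread, in memory ops per CPU second.
results = {
    "linux crt": 5675754,
    "wine rpmalloc": 19700003,
    "wine upstream": 248333,
    "wine staging": 914004,
    "wine lfh": 10651300,
    "windows 10": 11977625,
}


def relative_to(baseline, table):
    """Express every entry as a multiple of the chosen baseline entry."""
    base = table[baseline]
    return {name: value / base for name, value in table.items()}


ratios = relative_to("wine upstream", results)
# Print from slowest to fastest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:14s} {ratio:6.1f}x")
```
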

-- 
Rémi Bernon <rbernon at codeweavers.com>