[RFC PATCH 00/11] Thread-local heap implementation.

Wed May 6 10:32:30 CDT 2020

Rémi Bernon <rbernon at codeweavers.com> wrote:

> >> This is a heap implementation based on thread-local structures, that I
> >> have been keeping locally for quite some time. The goal was to improve
> >> Wine's heap performance in multithreaded scenarios and see if it could
> >> help performance in some games.
> >>
> >> The good news is that this implementation is performing well, according
> >> to third-party heap micro benchmarks. The bad news is that it doesn't
> >> change performance much in general, as allocations are usually scarce
> >> during gameplay. I could still see improvements for loading times, and
> >> less stalling as well.
> > 
> > Have you looked at Sebastian's heap improvement patches in the staging
> > tree? According to Sebastian's and Michael's testing: "The new heap allocator
> > uses (inspired by the way how it works on Windows) various fixed-size free
> > lists, and a tree data structure for large elements. With this implementation,
> > I get up to 60% improvement for apps with the "bad allocation pattern",
> > and up to 15% improvement in the "good case"."
> > 
> 
> I believe these patches are also shipped in Proton, and although they
> perform better than the upstream heap there's still a lot of
> contention when multiple threads try to (de)allocate at the same time.
> 
> For reference I used https://github.com/mjansson/rpmalloc-benchmark as a
> raw performance measurement. It starts a given number of threads, with
> each thread doing a fixed number of iterations. In every iteration the
> thread allocates and frees a certain amount of memory, optionally with
> cross-thread allocation every other iteration, then does a given amount
> of computation using the allocated buffers as storage. Then it measures
> the time it took to do all these operations.
> 
> For instance, with these benchmark parameters as indicated on their 
> sample result page[1]:
> 
>    <num threads> 0 0 2 20000 50000 5000 16 1000
> 
> I have the following results with the various implementations and using 
> two concurrent threads (the higher the number of threads, the worse it 
> gets, especially for the default Wine heap):
> 
> * linux crt:      5675754 memory ops/CPU second, 53% overhead
> * wine  rpmalloc: 19700003 memory ops/CPU second, 131% overhead
> * wine  upstream: 248333 memory ops/CPU second, 62% overhead
> * wine  staging:  914004 memory ops/CPU second, 61% overhead
> * wine  lfh:      10651300 memory ops/CPU second, 114% overhead
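
For readers who have not used the benchmark, the per-thread loop described
above can be sketched roughly as follows. This is only an illustration and
not the rpmalloc-benchmark source: the names struct worker, next_random,
worker_main and run_benchmark are hypothetical, the cross-thread case is
left out, and timing is left to the caller. The size range of 16 to 1000
bytes used in the note below is an assumption for the example, not a
statement about what the benchmark parameters mean.

```c
/* Illustrative sketch only: NOT the rpmalloc-benchmark source. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64

struct worker {                 /* hypothetical per-thread state */
    pthread_t thread;
    unsigned int seed;          /* private PRNG state, no shared rand() */
    size_t iterations;
    size_t min_size, max_size;  /* allocation size range in bytes */
    uint64_t ops;               /* malloc + free calls performed */
    uint64_t checksum;          /* result of the dummy computation */
};

/* Small linear congruential generator, one state per thread. */
static unsigned int next_random(unsigned int *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fff;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    size_t i, j;

    for (i = 0; i < w->iterations; i++) {
        size_t range = w->max_size - w->min_size + 1;
        size_t size = w->min_size + next_random(&w->seed) % range;
        unsigned char *buf = malloc(size);

        if (!buf) break;
        w->ops++;

        /* Use the buffer as storage for some trivial computation. */
        memset(buf, (int)(i & 0xff), size);
        for (j = 0; j < size; j++) w->checksum += buf[j];

        free(buf);
        w->ops++;
    }
    return NULL;
}

/* Runs the loop on num_threads threads and returns the total number of
 * memory operations, or 0 if the arguments are invalid. */
uint64_t run_benchmark(size_t num_threads, size_t iterations,
                       size_t min_size, size_t max_size)
{
    struct worker workers[MAX_THREADS];
    uint64_t total = 0;
    size_t i, started = 0;

    if (num_threads == 0 || num_threads > MAX_THREADS) return 0;
    if (min_size == 0 || max_size < min_size) return 0;

    for (i = 0; i < num_threads; i++) {
        workers[i] = (struct worker){ .seed = (unsigned int)i + 1,
                                      .iterations = iterations,
                                      .min_size = min_size,
                                      .max_size = max_size };
        if (pthread_create(&workers[i].thread, NULL, worker_main, &workers[i]))
            break;
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(workers[i].thread, NULL);
        total += workers[i].ops;
    }
    return total;
}
```

Calling run_benchmark(2, 20000, 16, 1000) and dividing the returned count by
the CPU time spent (measured by the caller, for example with clock_gettime)
would give a figure comparable in kind to the "memory ops/CPU second" values
quoted above, though not directly comparable in magnitude.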

Do you have the numbers for various Windows flavours on the same hardware?

-- 
Dmitry.