[RFC PATCH 00/11] Thread-local heap implementation.
Rémi Bernon
rbernon at codeweavers.com
Wed May 6 08:42:48 CDT 2020
On 5/6/20 2:38 PM, Dmitry Timoshkov wrote:
> Rémi Bernon <rbernon at codeweavers.com> wrote:
>
>> This is a heap implementation based on thread-local structures, that I
>> have been keeping locally for quite some time. The goal was to improve
>> Wine's heap performance in multithreaded scenarios and see if it could
>> help performance in some games.
>>
>> The good news is that this implementation is performing well, according
>> to third-party heap micro benchmarks. The bad news is that it doesn't
>> change performance much in general, as allocations are usually scarce
>> during gameplay. I could still see improvements for loading times, and
>> less stalling as well.
>
> Have you looked at Sebastian's heap improvement patches in the staging
> tree? According to Sebastian's and Michael's testing "The new heap allocator
> uses (inspired by the way how it works on Windows) various fixed-size free
> lists, and a tree data structure for large elements. With this implementation,
> I get up to 60% improvement for apps with the "bad allocation pattern",
> and up to 15% improvement in the "good case"."
>
I believe these patches are also shipped in Proton, and although they
perform better than the upstream heap, there is still a lot of
contention when multiple threads try to (de)allocate at the same time.
For reference, I used https://github.com/mjansson/rpmalloc-benchmark as
a raw performance measurement. It starts a given number of threads, with
each thread doing a fixed number of iterations. In every iteration the
thread allocates and frees a certain amount of memory, optionally with
cross-thread allocation every other iteration, then performs a given
number of computations using the allocated buffers as storage. The
benchmark then measures the time it took to do all these operations.
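To make the shape of such a benchmark concrete, here is a minimal sketch in Python. It is not the rpmalloc-benchmark code: the names `worker` and `run_benchmark` and their parameters are made up for illustration, it leaves out the cross-thread step and the computation step, and it measures wall-clock time rather than CPU time.

```python
import random
import threading
import time


def worker(loops, allocs_per_loop, min_size, max_size, seed, counts, index):
    """Run the allocate/free loop for one thread and record its op count."""
    rng = random.Random(seed)
    ops = 0
    for _ in range(loops):
        # Allocate a batch of randomly sized buffers.
        buffers = [bytearray(rng.randint(min_size, max_size))
                   for _ in range(allocs_per_loop)]
        ops += len(buffers)
        # Release the whole batch again; each release counts as one op.
        ops += len(buffers)
        buffers.clear()
    counts[index] = ops


def run_benchmark(num_threads, loops, allocs_per_loop, min_size=16, max_size=1000):
    """Start the threads, wait for them, return (total ops, elapsed seconds)."""
    counts = [0] * num_threads
    threads = [
        threading.Thread(
            target=worker,
            args=(loops, allocs_per_loop, min_size, max_size, i, counts, i),
        )
        for i in range(num_threads)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    return sum(counts), elapsed


if __name__ == "__main__":
    total_ops, seconds = run_benchmark(num_threads=2, loops=100, allocs_per_loop=500)
    print(f"{total_ops} memory ops in {seconds:.3f} s")
```

The reported figure is then the total number of memory operations divided by the time spent, which is what the "memory ops/CPU second" numbers express.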
For instance, with these benchmark parameters as indicated on their
sample result page[1]:
<num threads> 0 0 2 20000 50000 5000 16 1000
I have the following results with the various implementations and using
two concurrent threads (the higher the number of threads, the worse it
gets, especially for the default Wine heap):
* linux crt: 5675754 memory ops/CPU second, 53% overhead
* wine rpmalloc: 19700003 memory ops/CPU second, 131% overhead
* wine upstream: 248333 memory ops/CPU second, 62% overhead
* wine staging: 914004 memory ops/CPU second, 61% overhead
* wine lfh: 10651300 memory ops/CPU second, 114% overhead
(linux crt is the benchmark running natively, wine rpmalloc is their
rpmalloc benchmark cross-compiled and executed with Wine, and wine lfh
is this patch series)
So the staging patches perform much better than the default heap, but
throughput is still very poor when multiple threads are involved.
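As a quick way to read the figures above, the ratios between implementations can be computed directly. This is only arithmetic on the numbers already listed; the `speedup` helper is a made-up name.

```python
# Throughput figures copied from the list above (memory ops per CPU second).
results = {
    "linux crt": 5675754,
    "wine rpmalloc": 19700003,
    "wine upstream": 248333,
    "wine staging": 914004,
    "wine lfh": 10651300,
}


def speedup(candidate, baseline):
    """Return how many times higher the throughput of candidate is."""
    return results[candidate] / results[baseline]


for candidate, baseline in [
    ("wine staging", "wine upstream"),
    ("wine lfh", "wine staging"),
    ("linux crt", "wine staging"),
]:
    print(f"{candidate} vs {baseline}: {speedup(candidate, baseline):.1f}x")
```

With two threads, staging comes out at roughly 3.7 times the upstream throughput, while the native run is still about 6.2 times faster than staging.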
My opinion is that, with the heap critical section being held on every
(de)allocation, it is going to be very hard to see big improvements,
regardless of the inner allocation algorithm.
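The difference between the two approaches can be sketched as follows. This is a toy illustration in Python, not the design of the patch series or of the Wine heap: `LockedPool` and `ThreadLocalPool` are hypothetical names, and both handle a single fixed block size only.

```python
import threading


class LockedPool:
    """Every alloc and free takes the same lock, like a heap critical section."""

    def __init__(self, block_size):
        self._block_size = block_size
        self._lock = threading.Lock()
        self._free = []

    def alloc(self):
        with self._lock:  # all threads serialize here
            return self._free.pop() if self._free else bytearray(self._block_size)

    def free(self, block):
        with self._lock:  # and here again
            self._free.append(block)


class ThreadLocalPool:
    """Each thread keeps its own free list, so the fast path takes no lock."""

    def __init__(self, block_size):
        self._block_size = block_size
        self._local = threading.local()

    def _cache(self):
        # Lazily create the calling thread's private free list.
        cache = getattr(self._local, "free", None)
        if cache is None:
            cache = self._local.free = []
        return cache

    def alloc(self):
        cache = self._cache()
        return cache.pop() if cache else bytearray(self._block_size)

    def free(self, block):
        # Simplification: the block goes to the freeing thread's list,
        # even if another thread allocated it.
        self._cache().append(block)
```

In the locked variant the allocation algorithm inside the `with` block can be made as fast as you like, but threads still queue on the lock. The thread-local variant avoids that on the common path, at the cost of having to decide what to do with blocks freed by a different thread than the one that allocated them.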
[1]
https://github.com/mjansson/rpmalloc-benchmark/blob/master/BENCHMARKS.md#random-size-in-16-1000-range
2 0 0 2 20000 50000 5000 16 1000
--
Rémi Bernon <rbernon at codeweavers.com>