[Bug 49113] Wine heap performs badly when multiple threads are concurrently allocating or freeing memory

WineHQ Bugzilla wine-bugs at winehq.org
Thu May 7 14:18:50 CDT 2020


https://bugs.winehq.org/show_bug.cgi?id=49113

--- Comment #9 from Rémi Bernon <rbernon at codeweavers.com> ---
Created attachment 67097
  --> https://bugs.winehq.org/attachment.cgi?id=67097
Rbtree heap optimization

(In reply to Zebediah Figura from comment #6)
> Rémi, that graph is detailed but kind of difficult to get an overall picture
> from. I was instead hoping for an absolute measurement of how long the
> loading time is, with both heap managers. Adding Staging also might be good...

Well, the graph shows that on the x axis. Each point is a frame time, and
there's a fixed number of frames until the menu opens, which corresponds to
the last spike on the right. Both graphs show the exact same activity, but the
LFH version is slightly more compressed -- thus a shorter loading time.

That said, to be completely honest, it's pretty hard to measure precisely, and
it varies a lot. This graph used the shortest loading time I could get with
each heap manager, and every run was different.

(In reply to Dmitry Timoshkov from comment #3)
> This seems to be going in the wrong direction (is the actual problem due to
> locking primitives being inefficient?) since the whole effort has been driven
> by artificial tests, and as a result there's no visible improvement for
> real world applications. On the contrary, Sebastian's patchset in the
> staging tree was based on research and proper heap manager design, and as a
> result provided huge performance improvements for real world applications.
> 
> ==============================================================================
> https://dev.wine-staging.com/patches/submission/145/
> 
> New comments by Sebastian Lackner (slackner):
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Sorry for the delay, but such a complicated patchset took a bit more time to
> evaluate. During the last few weeks, Michael Müller wrote tools to evaluate
> heap allocation performance. These tools make it possible to record the heap
> allocation pattern of various applications, and to replay it with different
> heap implementations. The result (tested with various applications and games)
> confirms what I already suspected:
> 

So, in order to be fair, I spent some time doing the same thing today. I don't
know of any application that actually suffers from bad allocation patterns, so
I recorded the heap allocations while running Steam for Windows, then while
starting a game, then quitting everything after a bit.

Then I replayed the allocations as quickly as possible, but on a single thread
of execution -- just to take threading out of the equation, and because the
allocations were not replayed faithfully in time anyway. That could be another
experiment to do.
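
For reference, the replay loop conceptually looks like the following. This is
only a minimal sketch: the trace format, record_t and load_trace() are
hypothetical stand-ins, not the actual recording tool:

    #include <stdlib.h>
    #include <windows.h>

    typedef struct
    {
        enum { OP_ALLOC, OP_FREE } op;
        size_t size;  /* allocation size, for OP_ALLOC */
        size_t id;    /* trace-local identifier of the block */
    } record_t;

    /* hypothetical: parse a recorded trace from disk */
    extern record_t *load_trace(const char *path, size_t *count);

    int main(void)
    {
        size_t i, count;
        record_t *trace = load_trace("steam.trace", &count);
        void **blocks = calloc(count, sizeof(*blocks));  /* ids assumed < count */
        HANDLE heap = HeapCreate(0, 0, 0);

        /* single thread: replay every recorded operation back to back */
        for (i = 0; i < count; i++)
        {
            if (trace[i].op == OP_ALLOC)
                blocks[trace[i].id] = HeapAlloc(heap, 0, trace[i].size);
            else
                HeapFree(heap, 0, blocks[trace[i].id]);
        }

        HeapDestroy(heap);
        return 0;
    }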

I also spent some time studying the staging patches to determine what exactly
the optimizations were, in order to eventually try to clean up the patches and
upstream them. My understanding is that there are four different optimizations
in the patchset:

* The number of free list buckets is increased to 128 (it's somewhere around
16 or 32 in mainline, depending on the pointer size).
* The largest buckets are replaced with an rbtree using the block size as key.
* The empty status of the list buckets is cached in a bitmap array (which
makes little sense on its own but helps mitigate the impact of increasing the
number of free buckets; see the sketch after this list).
* The list buckets are unlinked from each other and directly use a struct list
instead of a full arena header.
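
To illustrate the bucket count increase and the bitmap cache together, looking
up a non-empty bucket conceptually works like this. This is only a sketch with
made-up names, not the actual staging code:

    #include "wine/list.h"

    #define NB_FREE_LISTS 128

    static struct list free_lists[NB_FREE_LISTS];         /* one head per bucket */
    static unsigned int free_bitmap[NB_FREE_LISTS / 32];  /* bit set => non-empty */

    /* Find the first non-empty bucket at or above 'index', scanning the
     * bitmap one word at a time instead of touching all 128 list heads. */
    static int find_free_bucket(unsigned int index)
    {
        unsigned int word = index / 32, bit = index % 32;
        unsigned int mask = free_bitmap[word] & (~0u << bit);

        while (!mask)
        {
            if (++word >= NB_FREE_LISTS / 32) return -1;  /* fall back to the rbtree */
            mask = free_bitmap[word];
        }
        return word * 32 + __builtin_ctz(mask);  /* lowest set bit (gcc/clang builtin) */
    }

The bitmap is what keeps the 128 buckets affordable: a miss costs a few word
loads rather than 128 list head checks.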

Then I replayed my recorded allocations with each individual optimization
split out and applied separately, as well as with the other versions. The
results are as follows:

Wine:                        ~13.5s
 + increase freelist count:  ~14.5s
 + rbtree for large blocks:  ~12s
 + cache freelist state:     ~13.5s
 + struct list direct use:   ~13.5s
Wine + staging patch:        ~12s
Low fragmentation heap:      ~10s

I'm not going to conclude much based on just this small experiment, but I
think the optimizations in staging aren't that sophisticated. They optimize
the worst cases by making the lookup of large blocks faster thanks to the
rbtree, and by moving some smaller sizes to individual free lists. The rbtree
optimization can make sense, but it's also not very CPU-cache friendly (though
neither are the linked lists).
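
For reference, the large-block lookup with the rbtree amounts to a best-fit
descent. This is a sketch against Wine's wine/rbtree.h; the free_block
structure and its field names are made up:

    #include <stddef.h>
    #include "wine/rbtree.h"

    struct free_block
    {
        struct wine_rb_entry entry;  /* node in the size-keyed rbtree */
        size_t size;
    };

    /* Find the smallest free block with size >= 'size': walk down the tree,
     * remembering the best candidate so far. O(log n) instead of walking a
     * linked list of all large free blocks. */
    static struct free_block *find_best_fit(struct wine_rb_tree *tree, size_t size)
    {
        struct wine_rb_entry *entry = tree->root;
        struct free_block *best = NULL;

        while (entry)
        {
            struct free_block *block = WINE_RB_ENTRY_VALUE(entry, struct free_block, entry);
            if (block->size < size) entry = entry->right;
            else
            {
                best = block;
                entry = entry->left;
            }
        }
        return best;
    }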

The free list count increase, on the other hand, makes little sense to me.
It's hardcoding some value without trying to optimize the distribution.
Mainline doesn't do much, but there is at least a bit of categorization, with
a few larger-size buckets. The patch drops all of this and simply increases
the number of small-size buckets, relying on the rbtree and the state cache to
handle the size categories and to mitigate the induced CPU cache load. I mean,
it's possible that it gave the best results for /some/ other experiment, but
unless there's some data to back it up, for now I think it's just tweaking
numbers.
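
Concretely, the difference in size-to-bucket mapping is something like the
following. The thresholds and the alignment are illustrative only, not the
real values from either tree:

    #include <stddef.h>

    #define ALIGNMENT 16  /* illustrative block alignment */

    /* Mainline-style mapping: fine-grained buckets for small sizes, then a
     * few coarser categories for larger ones (illustrative thresholds). */
    static unsigned int mainline_bucket(size_t size)
    {
        if (size <= 0x100) return size / ALIGNMENT;             /* linear small buckets */
        if (size <= 0x400) return 16 + (size - 0x100) / 0x100;  /* a few medium buckets */
        return 20;                                              /* one catch-all bucket */
    }

    /* Staging-style mapping: one linear bucket per aligned size up to 128,
     * everything above goes to the rbtree. */
    static unsigned int staging_bucket(size_t size)
    {
        unsigned int index = size / ALIGNMENT;
        return index < 128 ? index : 128;  /* 128 means "look in the rbtree" */
    }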

I'm attaching the extracted rbtree optimization for reference, because it
seems useful. I'm also very interested to know whether there are specific
applications suffering from bad allocation patterns.
