Optimizing synchronization primitives, again

Wed Mar 4 22:26:13 CST 2020

(Or: "There's a limit to how many constraints you can add to a problem
before it really is impossible.")

Hello all,

I'm once again going to bring up the topic of optimizing synchronization
primitives. This is partly questions for Alexandre, partly questions for
those people interested in the futex-based backend ("fsync"), partly an
attempt to give a better and more accessible summary than I gave at
WineConf last year, and partly a call for ideas, because I'm fast
running out of them.

I'd like to first try to spell out the constraints I'm working with,
make sure I have some of them right, and request clarification on
others, because a misunderstanding of those constraints has bitten me
before. I've divided them into three broad categories: correctness,
security/robustness, and performance.

------------
Correctness:
------------

1. All basic sync operations work as documented and are atomic.

2. An operation need not be perfectly serialized so long as the results
of that operation interleaved with any other are indistinguishable from
serialized execution in either order.

3. If a thread is waiting on an event, PulseEvent() should always wake
up that thread. This constraint is documented to be not always true on
Windows [1], but whether said race condition ever happens in practice is
not clear. It certainly must be rare at least. Both the "esync" and
"fsync" patch sets violate this constraint, and in practice many
applications are affected. Most use timeSetEvent(), presumably so that
some procedure may run regularly every N ms, and so missing a wakeup on
rare occasion would likely not cause severe damage.

4. If a thread waits repeatedly on a manual-reset event, PulseEvent() on
that event should not wake up the thread more than once. Both "esync"
and "fsync" currently violate this constraint.

5. WaitForSingleObject() should act as a fence. In particular, if a
thread executes the following code in a loop:

    WaitForSingleObject(event, INFINITE);
    ResetEvent(event);

  then a single call to SetEvent() must only allow one iteration of the
loop. I mention this constraint because a certain optimization to the
"esync" patch set violated it, and caused a game (RiME) to misbehave. I
have not verified that this constraint in particular caused the
misbehaviour.

6. Any operation that signals an object *must* wake up a sleeping thread
if that thread can be woken up. For example, if thread A is waiting on
event E and thread B executes SetEvent(E) followed by ResetEvent(E),
thread A must wake up. Both "esync" and "fsync" currently violate this
constraint. At least one game (I think it was Bioshock) used SetEvent()
followed by ResetEvent() in similar fashion to PulseEvent(), and behaved
poorly under "esync".

7. [clarification needed] Similarly to the above, if thread A is waiting
on all of events E and F, F is signaled, and thread B executes
SetEvent(E) followed by ResetEvent(F), thread A must wake up. Both
"esync" and "fsync" currently violate this constraint. From what I
understand of the Windows kernel (mostly derived from [2]), I'm not sure
Windows actually satisfies this constraint (it would be very difficult,
if at all possible, to do this without one single lock around all sync
operations, which recent versions of Windows don't do.) I haven't yet
encountered any games that break due to this with "esync" or "fsync",
but it may also be very hard to trigger. Is this a constraint we have to
follow?

----------------------
Security / robustness:
----------------------

I would really appreciate a detailed description of what our
security-related constraints are, because they're not obvious to me at
all. I previously was told that shared memory was absolutely forbidden
and therefore tried to work within that constraint, and was surprised to
learn at WineConf that this was not the case. The obvious constraint to
me is that a user should be prevented from making anyone else's life
worse, but is free to make their own life worse; however, I think
previous writing by Alexandre implies more strict constraints than this,
and I don't see an obvious Schelling fence between these two extremes.)

8. An object belonging to a given process belonging to a given POSIX
user can only be manipulated (i.e. have its state changed or wake up the
sleeping process) by processes belonging to other POSIX users if the
Win32 ACL permits it. This constraint is roughly meaningless as long as
Wine doesn't support multiple users, but any solution we ultimately come
up with should account for it.

9. [clarification requested] In particular, it also must not be possible
to change the state of an object to an invalid state (e.g. a semaphore
whose count is too high), or to a state that would not be possible with
pure Win32 functions (e.g. change the owner of a mutex to a process
owned by another user).

10. [clarification needed] As constraint 8, but in regard to processes
belonging to the same user. Is this a constraint we have to follow? If
so, why? It will always be possible for a sufficiently determined
process to alter the state, including to invalid states, if by no other
means than using ptrace or injecting code.

11. [clarification needed] As constraint 9, but in regard to processes
belonging to the same POSIX user. Is this a constraint we have to follow?

12. [clarification requested] It should not be possible for a process to
"accidentally" disrupt the functionality of another process if such
disruption would not be possible purely through NT functions of "known"
objects. This is a lesser version (a proper subset) of constraint 8 or
10. This is vague, and hence clarification is appreciated. But for
example, supposing that the process never uses host syscalls, Win32
debugger functions, or duplicates a handle outside of normal
functionality, then any crash, memory corruption, or accidental misuse
of Win32 functions should not affect processes which have never shared a
handle, or processes which share a handle but do not use it, etc. This
constraint is mentioned largely because it significantly restricts how
we can use shared memory.

13. It should not be possible for a process to "accidentally" disrupt
the functionality of the wineserver, i.e. it would take intentional use
of host functions (interrupts, etc.) to do so.

------------
Performance:
------------

This is a bit weird, because, as far as Wine's concerned, any solution
is good. But as far as I'm concerned, if it's not at least as good as
the solutions we have, I still need to maintain the solutions we have.
Maintaining those trees outside of Wine takes up more time than I really
want it to.

14. SetEvent(), ResetEvent(), ReleaseMutex(), ReleaseSemaphore(), and
wait-any for mutexes, semaphores, and events should all be fast.
PulseEvent(), NtQuery*(), wait-all, and waiting on other objects
(threads, processes, named pipes, ...) can be slow. I'm not sure where
to put timers. "slow" here means at least as slow as they currently are.
"fast" means at least as fast as they are with "fsync", which is
currently the fastest solution.

15. Cross-process objects must be fast. Pierre-Loup A. Griffais informed
me during my presentation last WineConf that processes are increasingly
relying on this.

16. We should have a solution for MacOS that's at least as good as the
"esync"-based pipe backend. (This patch set is currently used in some
builds of CrossOver. It's not part of Staging or in a particularly
accessible place—though of course it's distributed as part of
CrossOver's source—and I apologize for that. I can provide it
independently on request; thus far it hasn't been a high priority).

=========
SOLUTIONS
=========

* I won't go into detail on esync. Its function and its limitations have
been discussed before and are documented along with the patch set. fsync
mostly works the same way, but on top of futexes instead of eventfds.

* Broadly speaking, do scheduling in user space, using shared memory to
satisfy constraint 15 (cross-process performance). This allows us to
satisfy every correctness constraint. However, robustness is a lot
harder to satisfy. For example:

  - Any such scheduler would almost certainly have to take locks around
a wait list. However, that means the server would also have to take
locks to abandon mutices. If a process crashes with a lock held, the
server hangs, violating constraint 13.

  - Any process has access to any other process's wait list and can
"accidentally" break it quite badly, probably violating constraint 12.

  - An object's state would presumably be in shared memory. We can't
selectively prevent *which* values can be written, so this violates
constraint 11.

Additionally, Pierre-Loup A. Griffais has expressed doubts that such an
approach can be as performant as "fsync", because (to oversimplify?) the
overhead of a user-mode scheduler is combined with the overhead of
kernel-mode scheduling.

* Extend an existing kernel-level interface (probably eventfd) or
introduce a new set of kernel-level synchronization primitives which
will essentially look exactly like the current wineserver interface.
This allows us to satisfy every correctness *and* robustness constraint.
In theory it will be exactly as performant as Windows. It is more
performant than "esync" in some places, equally performant in others,
and worse in still others (actually, in one specific place—waiting for
an already signaled manual-reset event with esync makes no system
calls.) It is not as performant as "fsync", and it is almost certain
that the difference is something that people will still care about,
which means that I'd still have to maintain fsync forever. Also, this
won't help for Mac, so I'd have to maintain esync at least until Apple
decides to break Wine in a way we can't fix. (I don't think it'll take
very long.)

* That's... kind of it. I'm out of ideas. I'd appreciate any others.

ἔρρωσθε,
Zeb

[1]
https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-pulseevent
[2]
https://channel9.msdn.com/Shows/Going+Deep/Arun-Kishan-Farewell-to-the-Windows-Kernel-Dispatcher-Lock