[Bug 50680] Detroit: Become Human crashes at launch with ntdll-NtAlertThreadByThreadId

WineHQ Bugzilla wine-bugs at winehq.org
Tue Nov 16 19:42:08 CST 2021


https://bugs.winehq.org/show_bug.cgi?id=50680

Zebediah Figura <z.figura12 at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Fixed by SHA1|                            |45230b51db703933dc6108c526a
                   |                            |9eb872416981a
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #24 from Zebediah Figura <z.figura12 at gmail.com> ---
Yeah.

Okay, so here is my working theory:

The game misuses condition variables. I've disassembled three or four different
call sites and I see code like this:

EnterCriticalSection(&apple->cs);
if (apple->x || SleepConditionVariableCS(&apple->cv, &apple->cs, INFINITE) !=
ERROR_TIMEOUT)
    --apple->x;
LeaveCriticalSection(&apple->cs);

Which is wrong for multiple reasons, but the most important one is that it
doesn't call SleepConditionVariableCS() in a loop. Which means that a stolen or
spurious wakeup will result in incorrect behaviour.

(Granted, it's possible that I'm wrong, and that the relevant functions are
being called in a loop somehow, or that "x" is used in a way that is ultimately
safe against spurious wakeups. I don't think this is likely, especially given
the other errors made around condition variables [e.g. in one place they call
SleepConditionVariableCS() without holding the CS], but it's always possible.)

I suspect that it works consistently on Windows because the Windows
implementation of condition variables does not cause spurious wakeups (at
least, not if there's only one waiting thread). This fact is hard to
corroborate. I ended up writing a test that I think roughly checks whether
spurious wakeups matter in the way that Detroit cares about, which I'll attach.
I get spurious wakeups every 10 seconds or so on Windows. On the other hand,
they show up pretty constantly on Linux, with the tid alert patches.

The reason for the behaviour we've seen, then, is as follows:

(1) With current upstream wine, we use the condition variable word as a futex
directly. As far as I can tell, the Linux implementation of futexes doesn't
actually cause any spurious wakeups under "normal" circumstances. I'm not sure
of this, but I can't find any reason in the code that it does.

(2) With current wine-staging, i.e. with all of the tid alert patches applied,
we implement condvars on top of win32 futexes on top of tid alerts (on top of
linux futexes, but that ends up not mattering). There are basically three ways
we can get a wakeup without going all the way into the scheduler:

  (a) Return early from RtlSleepConditionVariable*() because the condition
didn't match. This isn't a problem, because we're not queued, and thus a
subsequent wake will wake nothing.

  (b) Return early from RtlWaitOnAddress() because the address didn't match.
This can be a problem, because we *are* queued, and if a subsequent wake
happens before we unqueue ourselves, we get wake type (c), which is...

  (c) A "buffered" wake from a previous RtlWaitOnAddress() call. In the above
scenario, the waking thread calls NtAlertThreadByThreadId() after the waiter is
woken up (either from a type (b) wake or a timeout), but before it unqueues
itself. TID alerts are basically just fancy events, so the wake is buffered
until the next call to NtWaitForAlertByThreadId().
      In theory Windows suffers from this as well. In practice my attempts to
reproduce it have been met with failure, and I'm not sure why, although I have
at least one guess.

(3) With patch 0010 applied by itself, we don't get the problem from (2). What
happens instead is that we get spurious wakes due to hash collision. In order
to verify this, I tried increasing the hash table size, to avoid collision,
hence the patch from comment 21. And this indeed made the spurious wakes go
away, and let the game start working.

Because we currently use spinlocks in the PE side anyway, the easiest way to
avoid buffered wakes is actually to do the comparison inside the spinlock (much
like we currently do upstream on the Unix side when futexes aren't available).
This is what was committed in wine-staging as
45230b51db703933dc6108c526a9eb872416981a, and it seems to consistently fix this
bug.

-- 
Do not reply to this email, post in Bugzilla using the
above URL to reply.
You are receiving this mail because:
You are watching all bug changes.



More information about the wine-bugs mailing list