TestBot: A dive into the Windows timers
Francois Gouget
fgouget at codeweavers.com
Tue Mar 24 22:48:05 CDT 2020
TL;DR;
* Windows is bad at hitting precise timing targets.
* The 15.6 ms Windows timer granularity can have an impact on the tests
reliability.
* But some unknown factor is causing large delays which are the source
of most timing related failures.
* The same delays probably also cause the wait timed out failures.
* Avoiding these failures in the timer tests is tricky.
-> Find the source of these delays and figure out how to avoid them.
My test subject for this dive was test_tp_timer() in
ntdll:threadpool, and more specifically the loop which we expect to run
in 500 to 750 ms:
https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/ntdll/tests/threadpool.c#l1401
* Recurring timers are nice because one knows exactly how often they
should generate an event which makes it easy to see how badly and how
often they miss.
This is unlike most WaitForSingleObject() calls where one does not
know how much time will elapse and all one gets is the unpredictable
one-off timeout.
* Wine on Linux is waaay better than Windows at ensuring the timer
events arrive precisely on time. Even in a QEmu VM on an otherwise
busy machine, every event arrives within 1 ms of the expected time
(so in this case, in Wine we go through the loop once every 200 ms and
rarely 199 or 201 ms).
* On Windows the default timer resolution is 15.625 ms. So asking for a
200 ms timer interval as in this test is asking for trouble: Windows
will round that to 13 * 15.625 ms = 203 ms and then occasionally make
the interval shorter so the average stays around 200 ms. So we
typically get a pattern of 203 203 203 203 188 ms and again 203...
* Even on real hardware (Windows 8.1 on cw-rx460), Windows can sometimes
get into a rut, alternating between 172 ms and 219 ms for instance. So
if I tweak the test to go through more loops I'd sometimes get only 4
out of 14 events arriving within 190-210 ms of the previous one!
Whether this happens is heavily influenced by the other processes
running on the machine, such as playing a video in IE.
* This means that based on just a few events one cannot expect the
interval between most events to fall within a narrow range. So here
for instance if the acceptable interval is 190-210 ms and the first
interval is instead 237 ms, then the next one will necessarily be out
of range too, and likely the one after that too. So expecting 2 out
of 3 intervals to be within the range is no more reliable than
checking just one interval.
* The timer resolution can be changed with NtSetTimerResolution() and
increased to 0.5 ms. But that does not change the results: one still
gets 203, 203, 203, 203, 188 ms. My understanding is that increasing
the timer resolution really only impacts timers that request a higher
resolution. But that means testing a different API.
* At least increasing the timer resolution seems to prevent Windows from
getting stuck alternating between 172 and 219 ms.
* Interestingly if one starts a Disk Cleanup and then runs "Clean up
system files", the analysis of the system files drops the timer
interval to 0.5 ms! So trying to see the impact of disk activity can
modify the results in unexpected ways.
* One can check the current timer resolution with ClockRes command line
tool:
https://docs.microsoft.com/en-us/sysinternals/downloads/clockres
* When the delay is greater than the timer interval, e.g. 437 ms, the
timer does not try to compensate for the lost events. But it does try
to compensate for the fractional interval remainder. So one sees:
Expected Actual
Loop Time Time Interval Lateness
1 200 437 437 200+37
2 400 609 172 200+9
3 600 812 203 200+12
4 800 1000 188 200
* In QEmu, when the timer misses it often misses big: 437 ms, 687 ms,
even 1469 ms. So most of the time expecting three events to take about
3 intervals does not help with reliability because the timer does not
try to compensate the missed events. So at the end it will still be
off by one interval (200 ms) or more.
* I could not reproduce these big misses on the Windows 8.1 on
cw-rx460 machine (i.e. real hardware).
* Conclusion:
- Because Windows constantly adjusts the timer interval so that they
are only accurate *on average*, testing the total time for N events
will be more reliable than testing the events individually.
- But the occasional big delays mean any timing tolerances below
~1700 ms will be somewhat unreliable :-(
- For timers one could check if a big delay happened in one of the
loops and either account for it or reset the timer and retry in the
hope no delay will occur then. But that's overly complex.
- The same big delays are probably also the cause of the
WaitForSingleObject() failures when the timeout is < 2000 ms.
- The timing in Wine in Debian VMs is very reliable. Yet it is likely
that QEmu is somehow responsible for the timing issues in the
Windows VMs.
- The best lead at this point is to check whether some VM clock
setting is causing this. But some are also need to prevent reverts
from taking forever[1].
(w8)
<clock offset='localtime'>
<timer name='rtc' tickpolicy='delay'/>
<timer name='pit' tickpolicy='delay'/>
<timer name='hpet' present='no'/>
</clock>
vs. (w1064)
<hyperv>
<relaxed state='on'/>
<vapic state='on'/>
<spinlocks state='on' retries='8191'/>
</hyperv>
<clock offset='localtime'>
<timer name='rtc' tickpolicy='catchup' track='guest'/>
<timer name='pit' tickpolicy='delay'/>
<timer name='hpet' present='no'/>
<timer name='hypervclock' present='yes'/>
</clock>
vs. (fgtbdebian10)
<clock offset='utc'/>
- Otherwise try CPU pinning. But Linux seems to do really well without
it...
[1] https://bugs.launchpad.net/qemu/+bug/1505041
https://bugs.launchpad.net/qemu/+bug/1174654
--
Francois Gouget <fgouget at codeweavers.com>
More information about the wine-devel
mailing list