TestBot: A dive into the Windows timers
Zebediah Figura
z.figura12 at gmail.com
Tue Mar 24 23:11:55 CDT 2020
On 3/24/20 10:48 PM, Francois Gouget wrote:
>
> TL;DR;
> * Windows is bad at hitting precise timing targets.
> * The 15.6 ms Windows timer granularity can have an impact on the tests
> reliability.
> * But some unknown factor is causing large delays which are the source
> of most timing related failures.
> * The same delays probably also cause the wait timed out failures.
> * Avoiding these failures in the timer tests is tricky.
> -> Find the source of these delays and figure out how to avoid them.
>
>
> My test subject for this dive was test_tp_timer() in
> ntdll:threadpool, and more specifically the loop which we expect to run
> in 500 to 750 ms:
>
> https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/ntdll/tests/threadpool.c#l1401
>
>
> * Recurring timers are nice because one knows exactly how often they
> should generate an event which makes it easy to see how badly and how
> often they miss.
> This is unlike most WaitForSingleObject() calls where one does not
> know how much time will elapse and all one gets is the unpredictable
> one-off timeout.
>
> * Wine on Linux is waaay better than Windows at ensuring the timer
> events arrive precisely on time. Even in a QEmu VM on an otherwise
> busy machine, every event arrives within 1 ms of the expected time
> (so in this case, in Wine we go through the loop once every 200 ms and
> rarely 199 or 201 ms).
>
> * On Windows the default timer resolution is 15.625 ms. So asking for a
> 200 ms timer interval as in this test is asking for trouble: Windows
> will round that to 13 * 15.625 ms = 203 ms and then occasionally make
> the interval shorter so the average stays around 200 ms. So we
> typically get a pattern of 203 203 203 203 188 ms and again 203...
>
> * Even on real hardware (Windows 8.1 on cw-rx460), Windows can sometimes
> get into a rut, alternating between 172 ms and 219 ms for instance. So
> if I tweak the test to go through more loops I'd sometimes get only 4
> out of 14 events arriving within 190-210 ms of the previous one!
> Whether this happens is heavily influenced by the other processes
> running on the machine, such as playing a video in IE.
>
> * This means that based on just a few events one cannot expect the
> interval between most events to fall within a narrow range. So here
> for instance if the acceptable interval is 190-210 ms and the first
> interval is instead 237 ms, then the next one will necessarily be out
> of range too, and likely the one after that too. So expecting 2 out
> of 3 intervals to be within the range is no more reliable than
> checking just one interval.
Allowing for more error than 10ms seems reasonable to me, even by an
order of magnitude.
>
> * The timer resolution can be changed with NtSetTimerResolution() and
> increased to 0.5 ms. But that does not change the results: one still
> gets 203, 203, 203, 203, 188 ms. My understanding is that increasing
> the timer resolution really only impacts timers that request a higher
> resolution. But that means testing a different API.
>
> * At least increasing the timer resolution seems to prevent Windows from
> getting stuck alternating between 172 and 219 ms.
>
> * Interestingly if one starts a Disk Cleanup and then runs "Clean up
> system files", the analysis of the system files drops the timer
> interval to 0.5 ms! So trying to see the impact of disk activity can
> modify the results in unexpected ways.
>
> * One can check the current timer resolution with ClockRes command line
> tool:
> https://docs.microsoft.com/en-us/sysinternals/downloads/clockres
>
> * When the delay is greater than the timer interval, e.g. 437 ms, the
> timer does not try to compensate for the lost events. But it does try
> to compensate for the fractional interval remainder. So one sees:
>
> Expected Actual
> Loop Time Time Interval Lateness
> 1 200 437 437 200+37
> 2 400 609 172 200+9
> 3 600 812 203 200+12
> 4 800 1000 188 200
>
> * In QEmu, when the timer misses it often misses big: 437 ms, 687 ms,
> even 1469 ms. So most of the time expecting three events to take about
> 3 intervals does not help with reliability because the timer does not
> try to compensate the missed events. So at the end it will still be
> off by one interval (200 ms) or more.
>
> * I could not reproduce these big misses on the Windows 8.1 on
> cw-rx460 machine (i.e. real hardware).
This is the real problem, I guess. I mean, the operating system makes no
guarantees about timers firing on time, of course, but when we try to
wait for events to happen and they're frequently late by over a second,
that makes things very difficult to test.
Is it possible the CPU is under heavy load?
>
> * Conclusion:
> - Because Windows constantly adjusts the timer interval so that they
> are only accurate *on average*, testing the total time for N events
> will be more reliable than testing the events individually.
>
> - But the occasional big delays mean any timing tolerances below
> ~1700 ms will be somewhat unreliable :-(
>
> - For timers one could check if a big delay happened in one of the
> loops and either account for it or reset the timer and retry in the
> hope no delay will occur then. But that's overly complex.
>
> - The same big delays are probably also the cause of the
> WaitForSingleObject() failures when the timeout is < 2000 ms.
>
> - The timing in Wine in Debian VMs is very reliable. Yet it is likely
> that QEmu is somehow responsible for the timing issues in the
> Windows VMs.
>
> - The best lead at this point is to check whether some VM clock
> setting is causing this. But some are also need to prevent reverts
> from taking forever[1].
> (w8)
> <clock offset='localtime'>
> <timer name='rtc' tickpolicy='delay'/>
> <timer name='pit' tickpolicy='delay'/>
> <timer name='hpet' present='no'/>
> </clock>
> vs. (w1064)
> <hyperv>
> <relaxed state='on'/>
> <vapic state='on'/>
> <spinlocks state='on' retries='8191'/>
> </hyperv>
> <clock offset='localtime'>
> <timer name='rtc' tickpolicy='catchup' track='guest'/>
> <timer name='pit' tickpolicy='delay'/>
> <timer name='hpet' present='no'/>
> <timer name='hypervclock' present='yes'/>
> </clock>
> vs. (fgtbdebian10)
> <clock offset='utc'/>
>
> - Otherwise try CPU pinning. But Linux seems to do really well without
> it...
>
>
>
> [1] https://bugs.launchpad.net/qemu/+bug/1505041
> https://bugs.launchpad.net/qemu/+bug/1174654
>
More information about the wine-devel
mailing list