TestBot: A dive into the Windows timers

Tue Mar 24 23:11:55 CDT 2020

On 3/24/20 10:48 PM, Francois Gouget wrote:
> 
> TL;DR;
> * Windows is bad at hitting precise timing targets.
> * The 15.6 ms Windows timer granularity can have an impact on the tests 
>   reliability.
> * But some unknown factor is causing large delays which are the source 
>   of most timing related failures.
> * The same delays probably also cause the wait timed out failures.
> * Avoiding these failures in the timer tests is tricky.
>   -> Find the source of these delays and figure out how to avoid them.
> 
> 
> My test subject for this dive was test_tp_timer() in 
> ntdll:threadpool, and more specifically the loop which we expect to run 
> in 500 to 750 ms:
> 
> https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/ntdll/tests/threadpool.c#l1401
> 
> 
> * Recurring timers are nice because one knows exactly how often they 
>   should generate an event which makes it easy to see how badly and how 
>   often they miss.
>   This is unlike most WaitForSingleObject() calls where one does not 
>   know how much time will elapse and all one gets is the unpredictable
>   one-off timeout.
> 
> * Wine on Linux is waaay better than Windows at ensuring the timer 
>   events arrive precisely on time. Even in a QEmu VM on an otherwise 
>   busy machine, every event arrives within 1 ms of the expected time 
>   (so in this case, in Wine we go through the loop once every 200 ms and 
>   rarely 199 or 201 ms).
> 
> * On Windows the default timer resolution is 15.625 ms. So asking for a 
>   200 ms timer interval as in this test is asking for trouble: Windows 
>   will round that to 13 * 15.625 ms = 203 ms and then occasionally make 
>   the interval shorter so the average stays around 200 ms. So we 
>   typically get a pattern of 203 203 203 203 188 ms and again 203...
> 
> * Even on real hardware (Windows 8.1 on cw-rx460), Windows can sometimes
>   get into a rut, alternating between 172 ms and 219 ms for instance. So 
>   if I tweak the test to go through more loops I'd sometimes get only 4 
>   out of 14 events arriving within 190-210 ms of the previous one! 
>   Whether this happens is heavily influenced by the other processes 
>   running on the machine, such as playing a video in IE.
> 
> * This means that based on just a few events one cannot expect the
>   interval between most events to fall within a narrow range. So here 
>   for instance if the acceptable interval is 190-210 ms and the first 
>   interval is instead 237 ms, then the next one will necessarily be out 
>   of range too, and likely the one after that too. So expecting 2 out 
>   of 3 intervals to be within the range is no more reliable than 
>   checking just one interval.

Allowing for more error than 10ms seems reasonable to me, even by an
order of magnitude.

> 
> * The timer resolution can be changed with NtSetTimerResolution() and 
>   increased to 0.5 ms. But that does not change the results: one still 
>   gets 203, 203, 203, 203, 188 ms. My understanding is that increasing 
>   the timer resolution really only impacts timers that request a higher 
>   resolution. But that means testing a different API.
> 
> * At least increasing the timer resolution seems to prevent Windows from 
>   getting stuck alternating between 172 and 219 ms.
> 
> * Interestingly if one starts a Disk Cleanup and then runs "Clean up 
>   system files", the analysis of the system files drops the timer 
>   interval to 0.5 ms! So trying to see the impact of disk activity can 
>   modify the results in unexpected ways.
> 
> * One can check the current timer resolution with ClockRes command line 
>   tool:
>   https://docs.microsoft.com/en-us/sysinternals/downloads/clockres
> 
> * When the delay is greater than the timer interval, e.g. 437 ms, the 
>   timer does not try to compensate for the lost events. But it does try 
>   to compensate for the fractional interval remainder. So one sees:
> 
>         Expected   Actual
>    Loop   Time      Time   Interval  Lateness
>     1      200       437     437     200+37
>     2      400       609     172     200+9
>     3      600       812     203     200+12
>     4      800      1000     188     200
> 
> * In QEmu, when the timer misses it often misses big: 437 ms, 687 ms,
>   even 1469 ms. So most of the time expecting three events to take about 
>   3 intervals does not help with reliability because the timer does not 
>   try to compensate the missed events. So at the end it will still be 
>   off by one interval (200 ms) or more.
> 
> * I could not reproduce these big misses on the Windows 8.1 on 
>   cw-rx460 machine (i.e. real hardware).

This is the real problem, I guess. I mean, the operating system makes no
guarantees about timers firing on time, of course, but when we try to
wait for events to happen and they're frequently late by over a second,
that makes things very difficult to test.

Is it possible the CPU is under heavy load?

> 
> * Conclusion:
>   - Because Windows constantly adjusts the timer interval so that they 
>     are only accurate *on average*, testing the total time for N events 
>     will be more reliable than testing the events individually.
> 
>   - But the occasional big delays mean any timing tolerances below 
>     ~1700 ms will be somewhat unreliable :-(
> 
>   - For timers one could check if a big delay happened in one of the 
>     loops and either account for it or reset the timer and retry in the 
>     hope no delay will occur then. But that's overly complex.
> 
>   - The same big delays are probably also the cause of the 
>     WaitForSingleObject() failures when the timeout is < 2000 ms.
> 
>   - The timing in Wine in Debian VMs is very reliable. Yet it is likely
>     that QEmu is somehow responsible for the timing issues in the 
>     Windows VMs.
> 
>   - The best lead at this point is to check whether some VM clock 
>     setting is causing this. But some are also need to prevent reverts 
>     from taking forever[1].
>     (w8)
>       <clock offset='localtime'>
>         <timer name='rtc' tickpolicy='delay'/>
>         <timer name='pit' tickpolicy='delay'/>
>         <timer name='hpet' present='no'/>
>       </clock>
>     vs. (w1064)
>       <hyperv>
>         <relaxed state='on'/>
>         <vapic state='on'/>
>         <spinlocks state='on' retries='8191'/>
>       </hyperv>
>       <clock offset='localtime'>
>         <timer name='rtc' tickpolicy='catchup' track='guest'/>
>         <timer name='pit' tickpolicy='delay'/>
>         <timer name='hpet' present='no'/>
>         <timer name='hypervclock' present='yes'/>
>       </clock>
>     vs. (fgtbdebian10)
>       <clock offset='utc'/>
> 
>   - Otherwise try CPU pinning. But Linux seems to do really well without 
>     it...
> 
> 
> 
> [1] https://bugs.launchpad.net/qemu/+bug/1505041
>     https://bugs.launchpad.net/qemu/+bug/1174654
>