TestBot: A dive into the Windows timers

Tue Mar 24 22:48:05 CDT 2020

TL;DR;
* Windows is bad at hitting precise timing targets.
* The 15.6 ms Windows timer granularity can have an impact on the tests 
  reliability.
* But some unknown factor is causing large delays which are the source 
  of most timing related failures.
* The same delays probably also cause the wait timed out failures.
* Avoiding these failures in the timer tests is tricky.
  -> Find the source of these delays and figure out how to avoid them.

My test subject for this dive was test_tp_timer() in 
ntdll:threadpool, and more specifically the loop which we expect to run 
in 500 to 750 ms:

https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/ntdll/tests/threadpool.c#l1401

* Recurring timers are nice because one knows exactly how often they 
  should generate an event which makes it easy to see how badly and how 
  often they miss.
  This is unlike most WaitForSingleObject() calls where one does not 
  know how much time will elapse and all one gets is the unpredictable
  one-off timeout.

* Wine on Linux is waaay better than Windows at ensuring the timer 
  events arrive precisely on time. Even in a QEmu VM on an otherwise 
  busy machine, every event arrives within 1 ms of the expected time 
  (so in this case, in Wine we go through the loop once every 200 ms and 
  rarely 199 or 201 ms).

* On Windows the default timer resolution is 15.625 ms. So asking for a 
  200 ms timer interval as in this test is asking for trouble: Windows 
  will round that to 13 * 15.625 ms = 203 ms and then occasionally make 
  the interval shorter so the average stays around 200 ms. So we 
  typically get a pattern of 203 203 203 203 188 ms and again 203...

* Even on real hardware (Windows 8.1 on cw-rx460), Windows can sometimes
  get into a rut, alternating between 172 ms and 219 ms for instance. So 
  if I tweak the test to go through more loops I'd sometimes get only 4 
  out of 14 events arriving within 190-210 ms of the previous one! 
  Whether this happens is heavily influenced by the other processes 
  running on the machine, such as playing a video in IE.

* This means that based on just a few events one cannot expect the
  interval between most events to fall within a narrow range. So here 
  for instance if the acceptable interval is 190-210 ms and the first 
  interval is instead 237 ms, then the next one will necessarily be out 
  of range too, and likely the one after that too. So expecting 2 out 
  of 3 intervals to be within the range is no more reliable than 
  checking just one interval.

* The timer resolution can be changed with NtSetTimerResolution() and 
  increased to 0.5 ms. But that does not change the results: one still 
  gets 203, 203, 203, 203, 188 ms. My understanding is that increasing 
  the timer resolution really only impacts timers that request a higher 
  resolution. But that means testing a different API.

* At least increasing the timer resolution seems to prevent Windows from 
  getting stuck alternating between 172 and 219 ms.

* Interestingly if one starts a Disk Cleanup and then runs "Clean up 
  system files", the analysis of the system files drops the timer 
  interval to 0.5 ms! So trying to see the impact of disk activity can 
  modify the results in unexpected ways.

* One can check the current timer resolution with ClockRes command line 
  tool:
  https://docs.microsoft.com/en-us/sysinternals/downloads/clockres

* When the delay is greater than the timer interval, e.g. 437 ms, the 
  timer does not try to compensate for the lost events. But it does try 
  to compensate for the fractional interval remainder. So one sees:

        Expected   Actual
   Loop   Time      Time   Interval  Lateness
    1      200       437     437     200+37
    2      400       609     172     200+9
    3      600       812     203     200+12
    4      800      1000     188     200

* In QEmu, when the timer misses it often misses big: 437 ms, 687 ms,
  even 1469 ms. So most of the time expecting three events to take about 
  3 intervals does not help with reliability because the timer does not 
  try to compensate the missed events. So at the end it will still be 
  off by one interval (200 ms) or more.

* I could not reproduce these big misses on the Windows 8.1 on 
  cw-rx460 machine (i.e. real hardware).

* Conclusion:
  - Because Windows constantly adjusts the timer interval so that they 
    are only accurate *on average*, testing the total time for N events 
    will be more reliable than testing the events individually.

  - But the occasional big delays mean any timing tolerances below 
    ~1700 ms will be somewhat unreliable :-(

  - For timers one could check if a big delay happened in one of the 
    loops and either account for it or reset the timer and retry in the 
    hope no delay will occur then. But that's overly complex.

  - The same big delays are probably also the cause of the 
    WaitForSingleObject() failures when the timeout is < 2000 ms.

  - The timing in Wine in Debian VMs is very reliable. Yet it is likely
    that QEmu is somehow responsible for the timing issues in the 
    Windows VMs.

  - The best lead at this point is to check whether some VM clock 
    setting is causing this. But some are also need to prevent reverts 
    from taking forever[1].
    (w8)
      <clock offset='localtime'>
        <timer name='rtc' tickpolicy='delay'/>
        <timer name='pit' tickpolicy='delay'/>
        <timer name='hpet' present='no'/>
      </clock>
    vs. (w1064)
      <hyperv>
        <relaxed state='on'/>
        <vapic state='on'/>
        <spinlocks state='on' retries='8191'/>
      </hyperv>
      <clock offset='localtime'>
        <timer name='rtc' tickpolicy='catchup' track='guest'/>
        <timer name='pit' tickpolicy='delay'/>
        <timer name='hpet' present='no'/>
        <timer name='hypervclock' present='yes'/>
      </clock>
    vs. (fgtbdebian10)
      <clock offset='utc'/>

  - Otherwise try CPU pinning. But Linux seems to do really well without 
    it...

[1] https://bugs.launchpad.net/qemu/+bug/1505041
    https://bugs.launchpad.net/qemu/+bug/1174654

-- 
Francois Gouget <fgouget at codeweavers.com>