TestBot: A dive into the Windows timers

Francois Gouget fgouget at codeweavers.com
Wed Mar 25 11:53:59 CDT 2020


On Tue, 24 Mar 2020, Zebediah Figura wrote:
[...]
> > * This means that based on just a few events one cannot expect the
> >   interval between most events to fall within a narrow range. So here 
> >   for instance if the acceptable interval is 190-210 ms and the first 
> >   interval is instead 237 ms, then the next one will necessarily be out 
> >   of range too, and likely the one after that too. So expecting 2 out 
> >   of 3 intervals to be within the range is no more reliable than 
> >   checking just one interval.
> 
> Allowing for more error than 10ms seems reasonable to me, even by an
> order of magnitude.

The test tolerances are not that tight, as far as I know, and certainly 
not for this threadpool timer test. That was just me testing an 
alternative approach and finding it not to be viable. As I said, in this 
specific case the allowed range is 500-750 ms for an expected 600 ms 
(3 * 200 ms).
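
For reference, here is a minimal sketch of the kind of check being 
discussed (not the actual Wine test, just an illustration): a 200 ms 
periodic threadpool timer, where the total time for three callbacks is 
compared against the 500-750 ms window quoted above.

#include <windows.h>
#include <stdio.h>

static HANDLE done_event;      /* signaled after the third callback */
static LONG callback_count;

static void CALLBACK timer_callback(PTP_CALLBACK_INSTANCE instance,
                                    void *context, PTP_TIMER timer)
{
    if (InterlockedIncrement(&callback_count) == 3)
        SetEvent(done_event);
}

int main(void)
{
    PTP_TIMER timer;
    LARGE_INTEGER due;
    FILETIME ft;
    DWORD start, elapsed;

    done_event = CreateEventW(NULL, TRUE, FALSE, NULL);
    timer = CreateThreadpoolTimer(timer_callback, NULL, NULL);

    /* first expiration in 200 ms (relative due times are negative,
     * in 100 ns units), then every 200 ms */
    due.QuadPart = -200 * 10000LL;
    ft.dwLowDateTime = due.LowPart;
    ft.dwHighDateTime = due.HighPart;

    start = GetTickCount();
    SetThreadpoolTimer(timer, &ft, 200, 0);

    WaitForSingleObject(done_event, 5000);
    elapsed = GetTickCount() - start;

    /* three 200 ms periods: nominally 600 ms, accept 500-750 ms */
    printf("3 callbacks took %lu ms (expected 500-750 ms)\n",
           (unsigned long)elapsed);

    CloseThreadpoolTimer(timer);
    CloseHandle(done_event);
    return 0;
}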

But there are cases in other tests where we do a TerminateProcess() or 
similar and expect the WaitForSingleObject() call to return within 
100 ms. I don't think those are correct; even 1 s feels short. The 
recent kernel32:process helper functions replaced a bunch of them with 
wait_child_process() calls, so now the timeout is 30 s. I may align the 
remaining timeouts with that... though 30 s feels a bit large. Surely 
10 s should be enough?
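
For illustration, this is roughly the pattern in question (a standalone 
sketch, not any specific existing test; the child command line is just a 
placeholder), with the timeout bumped to 10 s:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    WCHAR cmd[] = L"cmd.exe /c pause";  /* placeholder child that stays alive */
    DWORD ret;

    if (!CreateProcessW(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
        return 1;

    TerminateProcess(pi.hProcess, 1);

    /* 100 ms is too tight on a loaded VM; 10 s is forgiving but still
     * returns as soon as the process is really gone */
    ret = WaitForSingleObject(pi.hProcess, 10000);
    printf("WaitForSingleObject() returned %lu\n", (unsigned long)ret);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return ret == WAIT_OBJECT_0 ? 0 : 1;
}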


[...]
> > * In QEMU, when the timer misses it often misses big: 437 ms, 687 ms,
> >   even 1469 ms. So most of the time expecting three events to take about
> >   3 intervals does not help with reliability because the timer does not
> >   try to compensate for the missed events. So in the end it will still
> >   be off by one interval (200 ms) or more.
> >
> > * I could not reproduce these big misses on the Windows 8.1
> >   cw-rx460 machine (i.e. real hardware).
> 
> This is the real problem, I guess. I mean, the operating system makes no
> guarantees about timers firing on time, of course, but when we try to
> wait for events to happen and they're frequently late by over a second,
> that makes things very difficult to test.
> 
> Is it possible the CPU is under heavy load?

Not really, no. There's really not much running on the VM hosts:

* VMs
  We run at most one VM at a time per host, precisely to make sure the 
  activity in one VM does not interfere with the tests running in the 
  other VM(s). Of course this makes the TestBot pretty inefficient, and 
  it also does not prevent these delays :-(

* Unattended upgrades
  Once a day apt will check for security updates and install them. But 
  on Debian stable that should not amount to much.

* Acts of administrator
  Mostly VM backup/restore, debugging, reconfiguring. But these are too 
  infrequent to explain all the delays we get.

Also, I'm not convinced that CPU load on the host is the cause of these 
delays.

-- 
Francois Gouget <fgouget at codeweavers.com>


