The (broken) state of the WineTestBot

Francois Gouget fgouget at codeweavers.com
Wed Apr 9 05:18:59 CDT 2014


My apologies for this last bout of WineTestBot brokenness.

It all started when the VM host froze late last Friday and could not be 
remotely rebooted. So Newman power-cycled it Monday morning. He also 
suggested upgrading the kernel in the hope that this would avoid further 
crashes, which I agreed to. Then I decided to upgrade to QEMU 1.7 in the 
hope of fixing the Dr6 ntdll:exception failures, or at least of being in 
a better position to do so. Of course that entails redoing all the live VM 
snapshots (1) but I was prepared to do so. Then things went south.


The issues
----------

I initially ran into some SELinux incompatibilities and then into 
QEMU/Libvirt incompatibilities (2). Then the VMs started suspending after 
a few seconds, which turned out to be because the disk was full. My 
fault: I kept too many VM backups, so the new VM backups I created 
finished filling it. That corrupted some VMs which, ironically, I was 
able to restore from backup after deleting older backups. The host now 
has a sane amount of free disk space and I monitor it closely.
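
As an illustration of the kind of free-space monitoring mentioned above, 
here is a minimal sketch. It is not what the VM host actually runs: the 
function names and the 10% threshold are assumptions picked for the 
example.

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem
    that holds 'path'."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path, min_free=0.10):
    """Return True if less than 'min_free' of the filesystem is free.

    The 10% default is an arbitrary value chosen for this sketch."""
    return free_fraction(path) < min_free
```

A cron job could call is_low_on_space() on the directory holding the VM 
images and backups, and send a mail before creating new backups becomes 
dangerous.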


But the real problem is that now the VMs get corrupted after a bit of 
use. This manifests through a couple of symptoms:
 * Sometimes the build VM will detect EXT4 filesystem corruption, 
   remount '/' as read-only and obviously stop working properly.
 * Sometimes no filesystem corruption is detected but the content of 
   files gets corrupted. For instance this is what caused all the 
   'Missing build status line' errors when half of the WineTestBot 
   Build.pl script got lost. I suspect this also caused a round of 
   build failures related to memcmp() being missing.
 * Sometimes the wtbbuild VM ends up at the grub prompt and complains 
   that it cannot find the filesystem. Since these are live snapshots 
   this is presumably preceded by a crash+reboot of the guest.
 * Sometimes, after qemu has been properly stopped, a 'qemu-img check' 
   finds a lot of errors. Sometimes it finds none despite the VM being 
   broken.
 * The Windows 2000 VM also sometimes goes south: it resets and fails to 
   reboot complaining some checksum is corrupted (probably that of the 
   Windows boot loader).
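
The 'qemu-img check' symptom above can be scripted so that every disk 
image gets checked after its VM has been stopped. The sketch below is 
only an illustration: the helper names are made up, and the exit code 
meanings are those listed in current qemu-img man pages, which older 
versions may not follow exactly.

```python
import subprocess

# Exit codes of 'qemu-img check' as listed in the qemu-img man page.
CHECK_RESULTS = {
    0: "no errors found",
    1: "check not completed (internal error)",
    2: "image is corrupted",
    3: "leaked clusters, but image is not corrupted",
    63: "checks not supported by this image format",
}


def describe_check(returncode):
    """Translate a 'qemu-img check' exit code into a short message."""
    return CHECK_RESULTS.get(returncode,
                             "unknown exit code %d" % returncode)


def check_image(path):
    """Run 'qemu-img check' on a disk image.

    The image must not be in use by a running VM.
    Returns (exit code, short message, raw qemu-img output)."""
    proc = subprocess.run(["qemu-img", "check", path],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          universal_newlines=True)
    return proc.returncode, describe_check(proc.returncode), proc.stdout
```

As noted above, a clean result does not prove the VM is healthy: the 
image format can be consistent while the guest filesystem inside it is 
corrupted.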

Of course QEMU behaves while I'm updating a VM's live snapshot. It's 
only after I've spent time doing so and creating a backup that it breaks 
the VM (thus also casting doubt on the trustworthiness of the new 
backup).

Further compounding the problems, once the WineTestBot is on a tear 
compiling a bunch of patches on the lone build VM it's unstoppable (3). 
It also tends to get stuck whenever I restart libvirt or the VM host (a 
known issue I was working on before this episode).

The strange thing is that QEMU 1.7 seems to work fine in my test
environment. However it now appears that QEMU has at least three very 
different codepaths:
 * Pure user-space emulation (no KVM kernel module), which is slow and 
   should not be used.
 * kvm_intel, uses the Intel VMX instructions, is used on my (Intel) 
   test environment, has the icebp bug, but seems to otherwise be 
   reliable.
 * kvm_amd, uses the AMD SVM instructions, is used by the WineTestBot's 
   (AMD) VM host, has the Dr6 bug, and has been unreliable since last 
   weekend.

As kvm_amd and kvm_intel are kernel modules, the kernel version might 
actually be more important than the QEMU one. Indeed my latest tests 
seem to indicate that reverting the kernel from 3.13 to 3.2.0 fixes the 
VM corruption issues (I also tested 3.14rc7 which did not help).
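
To tell which of these codepaths a given Linux host would use, one can 
look at the CPU flags ('vmx' for Intel, 'svm' for AMD) and at the loaded 
kernel modules. The following is a rough sketch with made-up function 
names. It only parses text in the format of /proc/cpuinfo and 
/proc/modules and makes no attempt to detect nested or disabled 
virtualization.

```python
def kvm_module_for(cpuinfo_text):
    """Guess which KVM module matches the CPU, based on the 'flags'
    line of /proc/cpuinfo. Returns 'kvm_intel', 'kvm_amd' or None."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                return "kvm_intel"
            if "svm" in flags:
                return "kvm_amd"
    return None


def loaded_kvm_modules(modules_text):
    """Return the names of the kvm* modules found in /proc/modules
    content, in the order they are listed."""
    names = [line.split()[0]
             for line in modules_text.splitlines() if line.strip()]
    return [name for name in names if name.startswith("kvm")]
```

On a real host the two inputs would come from reading /proc/cpuinfo and 
/proc/modules, and platform.release() from the standard library gives 
the running kernel version to record alongside them.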

A corollary is that any QEMU tests I can do in my home environment are 
likely to be poor predictors for what will happen on the production VM 
host. Indeed it's because QEMU 1.7.0 seems to work fine here that I 
decided to upgrade the VM host.


Short term goal
---------------

Restore the WineTestBot to a working state!

The current hope is that just reverting to a pre-3.13 kernel, maybe 
3.2.0, will do the trick. That would make it possible to stick with QEMU 
1.7.0 (Debian 7.0 (Stable/Wheezy) has 1.1.2 which is really too old but 
Wheezy-Backports has moved on to 1.7.0 so there are no easily accessible 
1.6.0 packages to go back to).


Longer term goals
-----------------

* Solve the VM host crashes. The memory has been tested quite 
  extensively already, and I did a badblocks pass on the hard-drive. 
  Neither found anything. I could then test the processor using PrimeNet, 
  but the MRTG graphs did not indicate a tendency to overheat or other 
  such problems.
  It now seems the crashes may be caused by the 3.2.0 kernel, and 
  probably specifically by the kvm/kvm_amd modules. So finding a more 
  recent kernel that actually works might help.

* While I hope the VM host will never crash again, it would be nice to 
  be able to remotely power-cycle it.

* It will also be necessary to fix the stability issues in the 3.13 and 
  3.14 kvm_amd modules. However given that I don't have an AMD box at 
  hand and already have way too many other things to do, I don't see how 
  that's going to happen. Maybe through a bug report once I get a better 
  handle on this. Still the VM host cannot remain stuck on 3.2.0 
  indefinitely.

* Given that on AMD the Dr6 bug still seems to be present in QEMU 
  2.0/Linux 3.13, it still needs to be fixed. (And fixing the icebp one 
  on Intel would be nice too.)

* Fix the mysterious 'network timeout' errors we get while waiting for a 
  WineTest task to complete. Unfortunately the first set of patches to 
  tackle them was not really conclusive. So maybe switch to plan B, 
  i.e. blindly reconnect to work around them. That would be quite 
  unsatisfying though.

* Then resume work on making the WineTestBot Engine (more) resilient to 
  network outages and VM host crashes. I started patches for that but 
  while working on them it became clear that this was entangled with 
  proper handling and diagnostics after 'network timeout' errors.

* Then resume work on all the other features and bugs of the 
  WineTestBot.



(1)  QEmu 1.7.0 cannot restore a 1.6.0 live snapshot made in qemu-system-x86_64 
    https://bugs.launchpad.net/qemu/+bug/1259499

(2) A known issue caused by QEMU changing the 'qemu-system-x86_64 -cpu 
    help' output format which is parsed by Libvirt to figure out which 
    kinds of CPUs can be emulated.

(3) Bug 35946 - Cannot mark a VM for maintenance if it is running a task
    http://bugs.winehq.org/show_bug.cgi?id=35946

-- 
Francois Gouget <fgouget at codeweavers.com>


