The (broken) state of the WineTestBot
Francois Gouget
fgouget at codeweavers.com
Wed Apr 9 05:18:59 CDT 2014
My apologies for this last bout of WineTestBot brokenness.
It all started when the VM host froze late last Friday and could not be
remotely rebooted. So Newman power-cycled it Monday morning. He also
suggested upgrading the kernel in the hope that this would avoid further
crashes, which I agreed to. Then I decided to upgrade to QEMU 1.7 in the
hope of fixing the Dr6 ntdll:exception failures, or at least of being in a
better position to do so. Of course that entails redoing all the live VM
snapshots (1) but I was prepared to do so. Then things went south.
The issues
----------
I initially ran into some SELinux incompatibilities and then into
QEMU/Libvirt incompatibilities (2). Then the VMs were suspending after a
few seconds which turned out to be because the disk was full. My fault:
I keep too many VM backups so the new VM backups I created finished
filling it. That corrupted some VMs which, ironically, I was able to
restore from backup after deleting older backups. The host now has a
sane amount of free disk space and I monitor it closely.
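To illustrate the kind of free-space check meant here, the following is a
minimal sketch in Python. It is not the monitoring actually running on the
VM host: the function names, the "/" path and the 15% threshold are all
assumptions made for the example.

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.15):
    """Return True if less than min_free_fraction of the disk is free.

    The 15% default is an arbitrary assumption, not a value used on the VM host.
    """
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # "/" is a placeholder; point this at wherever the VM images and backups live.
    path = "/"
    print(f"{path}: {free_fraction(path):.1%} free, low={is_low_on_space(path)}")
```

Run from cron before creating a new VM backup, a check like this would
refuse to start the backup rather than let it fill the disk.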
But the real problem is that now the VMs get corrupted after a bit of
use. This manifests through a couple of symptoms:
* Sometimes the build VM will detect EXT4 filesystem corruption,
remount '/' as read-only and obviously stop working properly.
* Sometimes no filesystem corruption is detected but the content of
files gets corrupted. For instance this is what caused all the
'Missing build status line' errors when half of the WineTestBot
Build.pl script got lost. I suspect this also caused a round of
build failures related to memcmp() being missing.
* Sometimes the wtbbuild ends up at the grub prompt and complains that
it cannot find the filesystem. Since these are live snapshots this is
presumably preceded by a crash+reboot of the guest.
* Sometimes, after qemu has been properly stopped, a 'qemu-img check'
finds a lot of errors. Sometimes not despite the VM being broken.
* The Windows 2000 VM also sometimes goes south: it resets and fails to
reboot complaining some checksum is corrupted (probably that of the
Windows boot loader).
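For the 'qemu-img check' symptom above, here is a small Python sketch of
how the result could be interpreted automatically once a VM is stopped. The
exit code meanings are those listed in the qemu-img man page of recent QEMU
releases (I am assuming they apply to the version in use); the helper names
are hypothetical. As noted above, a clean result does not prove the guest
filesystem is intact.

```python
import subprocess

# Exit codes of 'qemu-img check' as listed in the qemu-img man page.
CHECK_RESULTS = {
    0: "consistent",
    1: "check not completed (internal error)",
    2: "corrupted",
    3: "leaked clusters, not corrupted",
    63: "check not supported by this image format",
}


def describe_check(returncode):
    """Translate a 'qemu-img check' exit code into a short description."""
    return CHECK_RESULTS.get(returncode, f"unknown exit code {returncode}")


def check_image(image_path):
    """Run 'qemu-img check' on the disk image of a stopped VM.

    Returns (exit code, description). Requires qemu-img to be installed.
    """
    proc = subprocess.run(
        ["qemu-img", "check", image_path],
        capture_output=True,
        text=True,
    )
    return proc.returncode, describe_check(proc.returncode)
```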
Of course QEMU behaves while I'm updating a VM's live snapshot. It's
only after I've spent time doing so and creating a backup that it breaks
the VM (thus also casting doubt on the trustworthiness of the new
backup).
Further compounding the problems, once the WineTestBot is on a tear
compiling a bunch of patches on the lone build VM it's unstoppable (3).
It also tends to get stuck whenever I restart libvirt or the VM host (a
known issue I was working on before this episode).
The strange thing is that QEMU 1.7 seems to work fine on my test
environment. However it now appears that QEMU has at least three very
different codepaths:
* User-mode emulation which is slow and should not be used.
* kvm_intel, uses the Intel VMX instructions, is used on my (Intel)
test environment, has the icebp bug, but seems to otherwise be
reliable.
* kvm_amd, uses the AMD SVM instructions, is used by the WineTestBot's
(AMD) VM host, has the Dr6 bug, and has been unreliable since last
weekend.
As kvm_amd and kvm_intel are kernel modules, the kernel version might
actually be more important than the QEMU one. Indeed my latest tests
seem to indicate that reverting the kernel from 3.13 to 3.2.0 fixes
the VM corruption issues (I also tested 3.14rc7 which did not help).
A corollary is that any QEMU tests I can do in my home environment are
likely to be poor predictors for what will happen on the production VM
host. Indeed it's because QEMU 1.7.0 seems to work fine here that I
decided to upgrade the VM host.
Short term goal
---------------
Restore the WineTestBot to a working state!
The current hope is that just reverting to a pre-3.13 kernel, maybe
3.2.0, will do the trick. That would make it possible to stick with QEMU
1.7.0 (Debian 7.0 (Stable/Wheezy) has 1.1.2 which is really too old but
Wheezy-Backports has moved on to 1.7.0 so there are no easily accessible
1.6.0 packages to go back to).
Longer term goals
-----------------
* Solve the VM host crashes. The memory has been tested quite
extensively already, and I did a badblocks pass on the hard-drive.
Neither found anything. I could then test the processor using PrimeNet,
but the MRTG graphs did not indicate a tendency to overheat or other
such problems.
It now seems the crashes may be caused by the 3.2.0 kernel, and
probably specifically by the kvm/kvm_amd modules. So finding a more
recent kernel that actually works might help.
* While I hope the VM host will never crash again, it would be nice to
be able to remotely power-cycle it.
* It will also be necessary to fix the stability issues in the 3.13 and
3.14 kvm_amd module. However, given that I don't have an AMD box at
hand and already have way too many other things to do, I don't see how
that's going to happen. Maybe through a bug report once I get a better
handle on this. Still the VM host cannot remain stuck on 3.2.0
indefinitely.
* Given that on AMD the Dr6 bug still seems to be present in QEMU
2.0/Linux 3.13, it still needs to be fixed. (And the icebp one on
Intel would be nice too).
* Fix the mysterious 'network timeout' errors we get while waiting for a
WineTest task to complete. Unfortunately the first set of patches to
tackle them were not really conclusive. So maybe switch to plan B,
i.e. blindly reconnect to work around them. That would be quite
unsatisfying though.
* Then resume work on making the WineTestBot Engine (more) resilient to
network outages and VM host crashes. I started patches for that but
working on them it became clear that this was entangled with proper
handling and diagnostics after 'network timeout' errors.
* Then resume work on all the other features and bugs of the
WineTestBot.
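To make 'plan B' above concrete, here is a rough Python sketch of what
blindly reconnecting could look like. It is only an illustration and not the
WineTestBot's code: call_with_reconnect(), operation and reconnect are
hypothetical names, and the attempt count and delay are arbitrary.

```python
import time


def call_with_reconnect(operation, reconnect, attempts=3, delay=0.0):
    """Call operation(); on a timeout, reconnect and try again.

    operation and reconnect are caller-supplied callables (hypothetical
    stand-ins for "wait for the task" and "reopen the connection").
    The last TimeoutError is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError as error:
            last_error = error
            if attempt + 1 < attempts:
                # Blindly reconnect without diagnosing the cause.
                time.sleep(delay)
                reconnect()
    raise last_error
```

This also shows why the approach is unsatisfying: the cause of each timeout
is discarded, so a real problem on the VM looks the same as a network
glitch.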
(1) QEmu 1.7.0 cannot restore a 1.6.0 live snapshot made in qemu-system-x86_64
https://bugs.launchpad.net/qemu/+bug/1259499
(2) A known issue caused by QEMU changing the 'qemu-system-x86_64 -cpu
help' output format which is parsed by Libvirt to figure out which
kinds of CPUs can be emulated.
(3) Bug 35946 - Cannot mark a VM for maintenance if it is running a task
http://bugs.winehq.org/show_bug.cgi?id=35946
--
Francois Gouget <fgouget at codeweavers.com>