Your weekly winetest update

Fri Mar 9 16:45:49 CST 2018

On Thu, 1 Feb 2018, Zebediah Figura wrote:
[...]
> Also along these lines: presumably setting up the testbot to run linux
> tests is going to require a fair amount of work, and that work is going
> to be implied to be François' responsibility. I'd just like to say that
> I'm willing to help as much as I can, though.

Here is a description of how we could get a very basic implementation 
together:
(base series)

b1. We need a new VM type so the TestBot knows which VMs to run the Wine 
    tests on (see VMs.pm, ddl/winetestbot.sql and ddl/update*.sql). 
    We could call it something like 'unix' or 'wine'.

b2. There are multiple types of Tasks and each one is handled by a 
    WineRunXxx.pl script. The script being used depends on the Step->Type 
    field so we will need new Step types (see Steps.pm and the same ddl 
    files).

b3. CheckForWinetestUpdate will need to be updated to create a job with a 
    Step of the right type, for instance unixreconfig, to update Wine on 
    the Unix machine(s) when there is a Git commit in Wine.

b4. To process the tasks in the new unixreconfig step type we will need 
    two new scripts: WineRunUnixReconfig.pl and build/UnixReconfig.pl. 
    Note that they should *not* run WineTest: WineRunUnixReconfig.pl will
    have to update the VM snapshot so later compilations start from the 
    new Wine baseline. But running WineTest could ruin the VM's test 
    environment (crash the X server, etc) which we would not want in the 
    snapshot we'll use to run later tests.

    Now that we have proper dependency support between Steps we could also 
    add the UnixReconfig Step as an extra Step in the usual "Wine Update" 
    job and just make sure it does not depend on the classic Reconfig step 
    so that a failure in one does not cause the other to be skipped. The 
    choice is a matter of taste.

b5. We will need a new script to run the tests in the Unix VMs. Maybe 
    call it WineRunUnixTask.pl. Unlike WineRunTask.pl which runs the tests 
    on Windows, WineRunUnixTask.pl will need to deal with both patches and 
    executables.

b6. Modify Patches::Submit() to create tasks for the unix/wine VMs.
    Currently it only creates jobs for the patches that modify the tests. 
    If a patch series contains test and non-test parts it combines them
    into one job, which suits our purpose just fine. So for a basic
    implementation we could keep that as is.

    Then Patches::Submit() creates one job per dll that it needs to run 
    tests on. So if a patch touches the tests of d3d8 and d3d9 that's 2 
    jobs. In fact this should be changed because it does not mesh well 
    with https://source.winehq.org/patches/ which expects precisely one 
    job per patch.

    So assuming the above is fixed, if we get a patch that touches the 
    device and visual unit tests in d3d8 and d3d9 we will currently get a 
    job that looks something like this (here indentation represents the 
    dependencies between steps):
      1 Build d3d8_test.exe and d3d8_test64.exe
         2 d3d8:device - 32 bit exe - 1 task per 32/64 bit Windows VM 
         3 d3d8:device - 64 bit exe - 1 task per 64 bit Windows VM
         4 d3d8:visual - 32 bit exe - 1 task per 32/64 bit Windows VM 
         5 d3d8:visual - 64 bit exe - 1 task per 64 bit Windows VM
      6 Build d3d9_test.exe and d3d9_test64.exe
         7 d3d9:device - 32 bit exe - 1 task per 32/64 bit Windows VM 
         8 d3d9:device - 64 bit exe - 1 task per 64 bit Windows VM
         9 d3d9:visual - 32 bit exe - 1 task per 32/64 bit Windows VM 
        10 d3d9:visual - 64 bit exe - 1 task per 64 bit Windows VM

    The simplest approach would be to add the unix/wine tests as a single 
    extra step that does the build and runs all the test units.
      1 Build d3d8_test.exe and d3d8_test64.exe
         2 d3d8:device - 32 bit exe - 1 task per 32/64 bit Windows VM 
         3 d3d8:device - 64 bit exe - 1 task per 64 bit Windows VM
         4 d3d8:visual - 32 bit exe - 1 task per 32/64 bit Windows VM 
         5 d3d8:visual - 64 bit exe - 1 task per 64 bit Windows VM
      6 Build d3d9_test.exe and d3d9_test64.exe
         7 d3d9:device - 32 bit exe - 1 task per 32/64 bit Windows VM 
         8 d3d9:device - 64 bit exe - 1 task per 64 bit Windows VM
         9 d3d9:visual - 32 bit exe - 1 task per 32/64 bit Windows VM 
        10 d3d9:visual - 64 bit exe - 1 task per 64 bit Windows VM
     11 All test units - all bitness - 1 task per Unix VM
         `-> run d3d8:device d3d8:visual d3d9:device d3d9:visual

    For the unix step the TestBot would either send the patch or test 
    executable and provide the list of test units to run. Of course that 
    means only testing the patches that modify the tests. Also you'll 
    notice there's no mention of the 32 bit vs WoW wineprefix distinction. 
    We'd just run all the relevant tests, 32 bit, 32 bit WoW, and 64 bit
    WoW in the same task.

    As I said it's the simplest approach but probably not what we want. 
    I'll discuss alternatives below.

b7. In addition to the WineRunUnixReconfig.pl changes, 
    CheckForWinetestUpdate needs to be updated to create 
    WineRunUnixTask.pl tasks to run the official WineTest executables 
    just like we currently do on Windows. These could get tacked on the 
    existing jobs or go into a separate job like the 'Other VMs' job.

b8. Last but not least, create one or more Unix VMs to run the tests on,
    with all the development packages and the right window manager and
    settings.

    Note that this would be separate from the standard build VM in part 
    because both would need different Type fields. But also the current 
    build VM that generates the Windows executables for the Windows tests 
    uses MinGW and does not need any of the native Unix development 
    libraries. This means it rarely breaks or needs updates when the Wine 
    build dependencies change, unlike the new unix test VMs which will 
    more likely need regular updates.

b9. We normally give 2 minutes for the Windows tasks to run. However they 
    only run a single test unit each whereas the unix tests will need to 
    run many test units so that 2 minutes will be too short.

    - The 2 minutes is in part to make sure the tests don't take too long 
      to run (that limit matches the WineTest.exe limit), and in part to 
      make sure nasty patches sent to wine-devel don't get to use our
      infrastructure for too long. So there is some value to keep it as 
      short as possible.

    - For the full test suite we currently have a 30 minute timeout. So
      in theory we could simply add 2 minutes per test unit until we reach
      the 30 minute limit. Past 3 test units that seems overly generous
      though. So we could do something a bit more sophisticated like 2 
      minutes for the first 3 test units and 30 seconds after that up to 
      the time limit.

    - The exact algorithm probably does not matter much as long as we 
      don't get spurious timeouts. It should also be easy to adjust 
      independently of the rest of the code.
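
As an illustration of the "2 minutes for the first 3 test units, then 30
seconds each" idea, here is a minimal Python sketch. The function and
parameter names are made up for this example and are not existing TestBot
code. It also assumes the build time is budgeted separately.

```python
def unix_task_timeout(unit_count, full_units=3, full_secs=120,
                      extra_secs=30, cap_secs=30 * 60):
    """Return the test-run timeout in seconds for unit_count test units.

    The first full_units test units get full_secs each, the others get
    extra_secs each, and the total never exceeds cap_secs.
    """
    if unit_count <= 0:
        return 0  # build-only task: no test time to budget
    full = min(unit_count, full_units) * full_secs
    extra = max(unit_count - full_units, 0) * extra_secs
    return min(full + extra, cap_secs)


if __name__ == "__main__":
    for count in (1, 3, 4, 21, 500):
        print(count, unix_task_timeout(count))
```

With these values the 30 minute cap is reached at 51 test units. Keeping
the numbers as parameters is what would make the algorithm easy to adjust
independently of the rest of the code.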

Now the above is quite limited and not really what we want so let's 
see what's missing and what the impact of adding it has.

First one of the things we want is to check that the test patches sent to 
wine-devel compile. This can be built on top of the above.
(compilation series)

c1. The first change is to modify the code that creates the jobs,
    Patches::Submit(), so it does not ignore patches that don't touch the 
    tests.

c2. When a patch touches a test it can work exactly like above, but 
    in addition it should create jobs for the other patches. These new 
    jobs would only contain a single unix step of the same form as the 
    tests above but with an empty list of test units to run.

c3. When given an empty list of test units to run the UnixTask.pl 
    script should still perform the build but skip running the tests.

c4. The WineRunUnixTask.pl script should not complain if there is only a
    build log and no test log. A missing log file may be hard to 
    distinguish from a bug so this may require putting some special code 
    like "Test log intentionally left blank" in the log so the script 
    knows everything is fine.
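
A rough Python sketch of the c4 check, assuming the marker is written into
the test log itself. The function name and the returned status strings are
hypothetical, only the marker text comes from the description above.

```python
BUILD_ONLY_MARKER = "Test log intentionally left blank"


def check_test_log(log_text, test_units):
    """Classify a task's test log as 'ok', 'build-only' or 'error'.

    log_text is the test log content, or None if there is no log file.
    test_units is the list of test units the task was asked to run.
    """
    if log_text is None:
        return "error"  # no log at all: most likely a bug somewhere
    has_marker = BUILD_ONLY_MARKER in log_text
    if not test_units:
        # Build-only task: the marker proves the tests were skipped on purpose
        return "build-only" if has_marker else "error"
    # Tests were requested so the marker should not be there
    return "error" if has_marker or not log_text.strip() else "ok"
```

The point is that an empty or missing test log is never silently accepted:
a build-only task is only considered fine if the marker is present.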

But what we really want is to also test non-test Wine patches sent to
wine-devel so we can make sure they don't break the tests. Here's a 
starting point for that:
(dll tests series)

d1. Whenever a patch touches a dll, rerun all the tests in that dll.
    This is a simple extension of the compilation series where instead of
    passing an empty test unit list for non-test patches we either pass a
    list of all that dll's test units (like we currently do for the non-C
    patches), or an empty list if there is no test for that dll.

    This will make checking some patches a bit slow, particularly for some 
    dlls that have a lot of test units like mshtml for instance. But the 
    TestBot should be able to handle the extra load.

d2. For patches that don't modify a specific dll (or program) we'd just 
    pass an empty list.
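
To make d1 and d2 more concrete, here is a small Python sketch of how the
test unit list could be derived from the paths a non-test patch modifies.
It relies on the dlls/<name>/ and programs/<name>/ layout of the Wine
source tree. The function name and the units_by_module table are
assumptions made for the example, the TestBot has no such table today.
Patches that touch the tests themselves would keep using the existing
per-unit logic.

```python
def units_for_patch(paths, units_by_module):
    """Return the 'module:unit' test units to run for the modified paths.

    paths is the list of files modified by the patch.
    units_by_module maps a dll or program name to its test unit names.
    """
    modules = []
    for path in paths:
        parts = path.split("/")
        # Only files inside dlls/<name>/ or programs/<name>/ count (d1)
        if len(parts) >= 3 and parts[0] in ("dlls", "programs"):
            if parts[1] not in modules:
                modules.append(parts[1])
    # Modules without tests, and other paths, contribute nothing (d2)
    return [f"{module}:{unit}"
            for module in modules
            for unit in units_by_module.get(module, [])]
```

For instance with {"d3d8": ["device", "visual"]} a patch to
dlls/d3d8/device.c yields d3d8:device and d3d8:visual, while a patch to
include/winbase.h yields an empty list, that is a build-only task.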

However a patch that modifies a dll such as ntdll.dll for instance could 
essentially break any test. So what we really, really want is to run a 
much more complete range of tests for these patches.
(all tests series)

a1. The simplest approach would be to tweak the above code to 
    systematically rerun every test (or WineTest) for every patch. As 
    before this would include the 32 bit, 32 bit WoW and 64 bit WoW set of
    tests.

a2. In theory this task's timeout should be 3 * ($ReconfigTimeout +
    $SuiteTimeout). With the default values that would be 3 hours which 
    seems way too large.

a3. A Wine rebuild takes 3.5 minutes on average and running the full 
    WineTest suite takes about 19 minutes (see the TestBot statistics). 
    Even using these averages, a job would take at least 1 hour
    (3 * (3.5 + 19) = 67.5 minutes). This has to be compared to the rate
    at which jobs arrive which, excluding every non-test patch, currently
    stands at about 1.3 jobs per hour.

    So this is very unlikely to be sustainable.
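
The arithmetic behind a3 can be written down as a small Python sketch.
The function name is made up, the default values are the averages quoted
above, and it assumes a single Unix VM processes the jobs one at a time.

```python
def job_load(build_min=3.5, suite_min=19.0, bitnesses=3, jobs_per_hour=1.3):
    """Return (minutes per job, jobs/hour capacity, load factor).

    A load factor above 1 means jobs arrive faster than a single VM can
    process them, so the queue would grow without bound.
    """
    job_min = bitnesses * (build_min + suite_min)
    capacity = 60.0 / job_min
    return job_min, capacity, jobs_per_hour / capacity


if __name__ == "__main__":
    print(job_load())             # all three bitnesses
    print(job_load(bitnesses=1))  # a single bitness
```

With three bitnesses a job takes 67.5 minutes, so the VM handles less than
one job per hour while 1.3 arrive. And that arrival rate excludes the
non-test patches, so the real load would be higher still.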

How to make it possible is not entirely clear and below I'll investigate 
various options.
(options)

o1. The simplest option for reducing the load would be to reduce the 
    number of bitnesses we test. For instance we could drop the 32 bit 
    tests in favor of the 32 bit WoW ones. Or, in a more radical move, 
    drop everything but the basic 32 bit tests. This means we would miss 
    some failures but if that's rare enough it may be acceptable. This 
    would still leave us with tasks that take about 22 minutes so this may 
    not be sufficient either.

o2. The approach that BuildBot takes is to test multiple patches together.
    If multiple patches arrive in a 5 minute window (or before the
    previous testing round is finished) it can put them all together and 
    only start one new test round.

    - We could conceivably do something like this for the TestBot but it 
      would require pretty big changes to the way it operates. For 
      instance the 'patches' website assumes there is one job per patch. 
      But here this would not be the case.

    - When a job fails we would not know which patch caused the failure. 
      We could solve that by doing a 'bisect' of sorts but then this 
      compounds the complexity so it's probably not the right approach.

    - It also would not work for manually submitted jobs.

    - And then there's the question of how it would mesh with the 
      existing Windows tests: bunch them up too? Keep them separate and 
      end up with multiple jobs per patch?

    - So overall this really does not seem like an approach we could use 
      in the TestBot.

o3. Another option would be to split the work so it can be parallelized:
    have one step that only does the 32 bit tests on one of the VMs (1 
    build + 1 WineTest run so 22 minutes expected run time), and another 
    step that does the 32 and 64 bit WoW tests in parallel on another VM 
    (running on another host). But that second VM would still have an 
    expected run time of about 45 minutes so that would likely be 
    insufficient.

o4. We could instead separate the build step from the test one(s). This 
    would allow us to run the 32 bit, 32 bit WoW and 64 bit WoW tests in
    parallel in separate VMs, meaning we should be able to handle about 3 
    jobs per hour.

    - This option can be tempting if we have multiple test environments. 
      For instance it could allow us to build once and then run the tests
      in GNOME, KDE and LXDE environments (although none of
      these have window managers that are up to the task so maybe a better
      example would be to run the same binaries in English, French and
      Hebrew locales).

    - Also while this would work fine for running the tests multiple times 
      on the same Linux distribution, we would probably not want to build 
      on Debian and then run the tests on Red Hat. So this may be of 
      limited usefulness.

    - Furthermore if we have really different types of Unix systems such 
      as Linux and FreeBSD (or Mac assuming it can fit in this framework 
      at all), we would need a way to make sure that a binary built on 
      Linux is then not sent to a FreeBSD test machine. One approach could 
      be to create more VM types (have linux and freebsd instead of just 
      unix, and handle both with the same scripts), but that seems to lead 
      to an explosion in the number of VM types. Especially if we then go 
      with Debian, Red Hat, Arch, etc. It's certainly possible to find a 
      solution though (VM subtypes?).

    - Another drawback is that this requires transferring the Wine and 
      test binaries from the build VM to the TestBot server and then to 
      each of the test VMs. This would be about 20MB compressed per 
      bitness for every job. This adds to the time spent running each 
      Task, to the WineHQ.org bandwidth consumption and the TestBot disk 
      usage.

    - An optimisation would be to let the 'Wine update' job catch up all 
      the test VMs to the latest Wine binaries to establish a new 
      baseline, and to then only send the binary changes.

    - The simplest form of diff would be to only send the modified files 
      (based on the modified timestamp), but there are probably a lot of 
      binary diff tools we could also use.

    - The main issue with all these 'diff' approaches is keeping the 
      baseline of the build VM and the test VMs in sync. We run the risk 
      of having a sequence such as:
      1 Generate a binary diff for job 1.
      2 Update the build VM.
      3 Synchronize the test VM with the new binary baseline.
      4 Try to apply the diff generated in 1 to the test VM's new 
        binaries.
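
One possible guard against that sequence, sketched in Python: tag each diff
with the baseline it was generated against and have the test VM refuse
diffs made for another baseline. All the names here are hypothetical, and
using something like the Wine commit id as the baseline identifier is an
assumption, not something the TestBot does today.

```python
from dataclasses import dataclass


class StaleBaselineError(Exception):
    """The diff was generated against another baseline than the VM's."""


@dataclass(frozen=True)
class BinaryDiff:
    job_id: int
    baseline: str  # identifies the build VM baseline, e.g. a commit id
    files: tuple   # paths of the modified binaries


def apply_diff(diff, vm_baseline):
    """Return the files to install, or raise if the baselines differ."""
    if diff.baseline != vm_baseline:
        raise StaleBaselineError(
            f"job {diff.job_id}: diff is for {diff.baseline}, "
            f"VM is at {vm_baseline}")
    return list(diff.files)
```

On a mismatch the TestBot could then regenerate the diff against the new
baseline, or fall back to sending the full set of binaries.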

o5. Instead we could replicate the Unix VM(s). So instead of having the 
    host running the single Unix VM be the bottleneck, we could throw more 
    hardware at the issue to increase our job processing throughput.

    - This assumes that the test VM really behaves exactly the same way no 
      matter which host it is on otherwise the results could be pretty 
      confusing. That should already be the case but our hosts are not 
      entirely identical (3 Intel processors, 1 AMD but most VMs use a 
      neutral 'kvm32' processor) and this has never really been thoroughly
      verified.

    - Note that although the job throughput would be increased, the job 
      latency would still remain at the usual

    - See bug 39412 for (upcoming) details on how the TestBot could 
      implement load balancing between the hosts.
      https://bugs.winehq.org/show_bug.cgi?id=39412

    - The further benefit is that failover would likely come for free. 
      This means that with enough VM duplication, when one host freezes 
      the tasks would automatically be handled by the other hosts which 
      would mean less (or no) downtime.

o6. A more sophisticated approach would be to analyze the dlls the tests 
    depend on and only rerun those that can be impacted.

    - For instance it looks like modifying the ws2_32 source would only 
      impact the secur32, webservices, winhttp, ws2_32 and wsdapi tests, 
      which means running 21 test units instead of the 500+ of the full 
      suite. Also changing the source in a dll with no test and not used 
      anywhere else (e.g. hal.dll) means we could skip running the tests 
      entirely.

    - This analysis could be done on the VM right after rebuilding Wine: 
      see which binaries changed, then look for them in the Makefile 
      IMPORTS lines. But this would miss tests and dlls that do a 
      LoadLibrary(). So a more sophisticated analysis may be called for, 
      or we may need to provide the extra dependency information manually 
      somehow (either in the Wine source or in some TestBot table).

    - This would mean the unix tasks would determine which test units to 
      run on their own rather than having the TestBot tell them as 
      proposed in the dll tests series. But the change should be minor.

    - It's hard to predict how much this would reduce processing time and 
      thus whether it would be sufficient. This really depends on the 
      ratio of low-level header / dll patches versus high level dll with 
      no tests patches. So the only way to see if that works is probably to
      try it out.

    - If not sufficient on its own it can be combined with other options, 
      particularly the load balancing one (o5). It may also be easier to 
      implement than o5, depending on how hard the dependency analysis 
      turns out to be.
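
Here is a minimal Python sketch of the IMPORTS-based analysis, limited to
direct importers. The function names are made up and the makefiles
argument (module name mapped to the content of its Makefile.in) is an
assumption about how the data would be collected. As noted above it would
miss anything loaded through LoadLibrary().

```python
import re


def parse_imports(makefile_text):
    """Return the names on the IMPORTS line, joining '\\' continuations."""
    joined = makefile_text.replace("\\\n", " ")
    for line in joined.splitlines():
        # Anchored at the start so DELAYIMPORTS does not match
        match = re.match(r"\s*IMPORTS\s*=\s*(.*)", line)
        if match:
            return match.group(1).split()
    return []


def impacted_modules(changed, makefiles):
    """Return the modules that changed or that directly import a changed one.

    changed is a set of module names whose binaries changed.
    makefiles maps each module name to the text of its Makefile.in.
    """
    impacted = set()
    for module, text in makefiles.items():
        if module in changed or changed & set(parse_imports(text)):
            impacted.add(module)
    return sorted(impacted)
```

The same function could be applied to the tests' own Makefile.in files to
get the list of test directories to rerun. Whether direct imports are
enough, or the transitive closure is needed, is an open question.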

Given the above I think it makes sense to start implementing things
progressively from the base series to the compilation one, to the dll one
to the all series. Each step will be able to build on the previous one and 
provide us with new information about the extra TestBot load, how many 
jobs we really get per hour, and also whether the test results are 
reliable, etc.

Should we approach the TestBot limits at any point we'll be able to stop 
expanding the tests and still have more than what we currently have while 
we figure out how to proceed further. I also think that no matter what 
happens and where we stop, the work done will be reused when proceeding 
further.

Further bells and whistles:
(whistles)

w1. The current tasks only do one thing and produce a single log. For 
    some approaches a single task may do a build, run 32 bit tests, then 64
    bit tests. Having all this go into a single log may be confusing when 
    looking at it on the web site. So it could make sense to create a 
    separate log for each 'subtask'.

    - This would require modifying the WineRunXxx.pl and associated build 
      scripts of course, as well as the JobDetails.pl web page so it
      can show each log; but also the code canceling and restarting Tasks 
      so they clean the old logs correctly.

    - The JobDetails.pl page could have "Show full 32 bit log" and "Show 
      full 64 bit log" links for instance.

    - Note also that we already support showing the results of multiple 
      test units from a single log for WineTest. So that aspect should not 
      require any change.

w2. The job submission page lets developers tick a checkbox to run
    the 64 bit Windows tests in addition to the 32 bit ones. If we do have 
    3 types of tests on Unix we may want to expand that. This may also 
    further require analyzing whether the user picked a Unix VM to
    present the right option.

w3. The current Windows tests reset the test environment entirely 
    between test units. The Unix tests would not. It's not clear that we 
    should. After all WineTest needs to be able to run all the tests with 
    no cleanup between them. Still it may make sense to do a cleanup 
    between bitnesses, at least resetting the WINEPREFIX. But doing a
    more thorough cleanup would involve at least restarting the X server 
    between tests. That could be bothersome.

-- 
Francois Gouget <fgouget at codeweavers.com>