RFC How to get rid of "always new" TestBot false positives?

Tue Mar 31 07:25:41 CDT 2020

On Fri, 27 Mar 2020, Henri Verbeet wrote:
[...]
> If the main goal is to stop the testbot from being ignored, and to
> limit the number of new failures sneaking in, would it make sense to
> start with something fairly blunt, like ignoring failures for tests on
> unreliable configurations? E.g., suppose ddraw:ddraw7 reliably passed
> on w1064v1507, but not w1064v1809, you'd then blacklist all of
> ddraw:ddraw7 on w1064v1809. That means you potentially ignore some
> ddraw:ddraw7 tests that are reliable, but it would still be an
> improvement over effectively ignoring everything.

So that would mean maintaining a set of (test:unit, testbot-vm) tuples 
where the TestBot should ignore new failures.
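
For illustration, that coarse approach could look something like the 
Python sketch below. The names IGNORED and should_ignore_new_failures 
are made up for this example and are not part of the TestBot; the only 
entry is the ddraw:ddraw7 case from the quoted text.

```python
# Hypothetical sketch: a coarse blacklist keyed on (test:unit, testbot-vm).
# Nothing here is existing TestBot code.
IGNORED = {
    ("ddraw:ddraw7", "w1064v1809"),  # example taken from the quoted message
}


def should_ignore_new_failures(test_unit, vm_name):
    """Return True if new failures of this test unit on this VM are ignored."""
    return (test_unit, vm_name) in IGNORED
```

With this, every new ddraw:ddraw7 failure on w1064v1809 is ignored, 
while the same test unit on w1064v1507 is still checked.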

I'm not very fond of the blacklist approach. Once it's in place it may 
be very tempting to just put every flaky test into it rather than fixing 
it. This will lead to a long list of exceptions which will have to be 
maintained. In particular knowing when to remove an entry will be very 
important.

I also worry that once the test failures are papered over there won't be 
much incentive to fix them. To be fair that risk is not really different 
from what could happen with my patch but the scale would be larger.

But it could work with the rare intermittent failures too, which would be 
valuable. And it could be useful when introducing new test 
configurations that have new intermittent / variable issues. So there 
could be value in doing this anyway.

Maybe with some safeguards it can be made to work.

* I think I'd want a Wine bug describing the issue to be associated with 
  each blacklist entry. That bug should provide some minimal diagnosis: 
  whether it's a new Windows behavior, a race condition or some issue 
  that was reported to QEMU. That would ensure we know why the blacklist 
  entry was added. One could also check the status of the bug when 
  reviewing the blacklist entries. A closed bug would be a strong hint 
  that the blacklist entry is no longer needed.

* And I think it would be better to have a regexp that matches only 
  the troublesome failures rather than to blacklist the whole test unit. 
  Besides being finer grained this would be useful for cases like 
  user32:win which has different issues depending on the locale and 
  where each should be associated with a different bug (bugs 48815, 48819 
  and 48820).

* I think I'd also want to record the time when the blacklist entry was 
  last used. This relies on having the above regular expression since 
  without it the TestBot would not know anything beyond 'the test unit 
  was run and had failures'. Also the regular expression would only be 
  used against *new* failures. So this would really record the last time 
  the blacklist entry was actually useful.

  An entry that was unused for a long time would be a prime candidate 
  for reviewing the corresponding bug and for removal. (Note: The 
  blacklist would also be used on WineTest reports so it would get a 
  chance of matching its target at least 5 days / week).

* I'd want a page listing the blacklisted entries so developers have a 
  good starting point to work on them.

* Ideally the blacklist page would also point to the tasks where the 
  blacklist was last used. I think this would also be useful for 
  developers trying to fix the issues, particularly for the rare 
  intermittent kind.

  Note that Wine VMs often test in multiple configurations per task 
  (e.g. wow32 and wow64, different locales), each producing its own test 
  report. So pointing at just the task would leave the developer 
  guessing which report should be looked at. But that's probably ok.

  More importantly, (test:unit, testbot-vm) tuples make it impossible 
  to blacklist a specific Wine test configuration such as a specific 
  locale since they all run on the same VM. Similarly it would make 
  blacklisting bitness-blind on Windows VMs.

  If necessary the tuple could maybe be extended with the specific 
  mission the blacklist applies to. But I'm not sure of the specific 
  impacts and it may not be worth it.

* Pseudo database schema and sample use:

  FailureBlacklists
  -----------------

  PK Bug             48815
  PK TestModule      user32
  PK TestUnit        win
     Name            0x738 message
     FailureRegExp   Test failed: hwnd [0-9A-F]{8,16} message 0738
     LastUse         2020-03-27

  FailureBlacklistVMs
  -------------------

  PK Bug             48815
  PK TestModule      user32
  PK TestUnit        win
  PK VMName          Entries for w1064v1709 w1064v1809 etc.

  (48815, user32, win, w1064v1709)
  (48815, user32, win, w1064v1809)
  (48815, user32, win, w1064v1809_2scr)
  ...

  FailureBlacklistUses (optionally)
  ---------------------------------

  PK Bug
  PK TestModule
  PK TestUnit
  PK JobId
  PK StepNo
  PK TaskNo

  (48815, user32, win, 68507, 1, 7)
  (48815, user32, win, 68508, 1, 7)
  ...
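
To make the intended use of the regular expression and of LastUse more 
concrete, here is a rough Python sketch of the safeguards listed above. 
It is only an illustration under assumptions: the class and function 
names, the 90-day staleness threshold and the sample failure lines are 
all invented for this example, and the TestBot does not contain this 
code. Only the field names and the sample entry come from the pseudo 
schema.

```python
import re
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class BlacklistEntry:
    """One FailureBlacklists row plus its FailureBlacklistVMs rows (hypothetical)."""
    bug: int
    test_module: str
    test_unit: str
    name: str
    failure_regexp: str
    vm_names: set = field(default_factory=set)
    last_use: Optional[date] = None


def filter_new_failures(entries, test_module, test_unit, vm_name,
                        new_failures, today):
    """Return the new failure lines that no blacklist entry covers.

    Only *new* failures are passed in, so last_use records the last time
    an entry actually hid a failure.
    """
    remaining = []
    for line in new_failures:
        matched = False
        for entry in entries:
            if (entry.test_module, entry.test_unit) != (test_module, test_unit):
                continue
            if vm_name not in entry.vm_names:
                continue
            if re.search(entry.failure_regexp, line):
                entry.last_use = today
                matched = True
        if not matched:
            remaining.append(line)
    return remaining


def stale_entries(entries, today, max_age_days=90):
    """Entries never used, or unused for more than max_age_days (assumed value)."""
    return [e for e in entries
            if e.last_use is None or (today - e.last_use).days > max_age_days]


entry = BlacklistEntry(
    bug=48815, test_module="user32", test_unit="win",
    name="0x738 message",
    failure_regexp=r"Test failed: hwnd [0-9A-F]{8,16} message 0738",
    vm_names={"w1064v1709", "w1064v1809", "w1064v1809_2scr"},
)

# Invented failure lines, for illustration only.
new_failures = [
    "win.c:1234: Test failed: hwnd 0001002A message 0738",
    "win.c:5678: Test failed: some unrelated failure",
]
reported = filter_new_failures([entry], "user32", "win", "w1064v1809",
                               new_failures, date(2020, 3, 27))
```

Here only the second line would still be reported as a new failure, and 
the entry's last_use becomes 2020-03-27. An entry showing up in 
stale_entries() would be a candidate for reviewing its bug and for 
removal.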

-- 
Francois Gouget <fgouget at codeweavers.com>