RFC How to get rid of "always new" TestBot false positives?

Tue Mar 31 09:44:57 CDT 2020

On 3/31/20 7:25 AM, Francois Gouget wrote:
> On Fri, 27 Mar 2020, Henri Verbeet wrote:
> [...]
>> If the main goal is to stop the testbot from being ignored, and to
>> limit the number of new failures sneaking in, would it make sense to
>> start with something fairly blunt, like ignoring failures for tests on
>> unreliable configurations? E.g., suppose ddraw:ddraw7 reliably passed
>> on w1064v1507, but not w1064v1809, you'd then blacklist all of
>> ddraw:ddraw7 on w1064v1809. That means you potentially ignore some
>> ddraw:ddraw7 tests that are reliable, but it would still be an
>> improvement over effectively ignoring everything.
> 
> So that would mean maintaining a set of (test:unit, testbot-vm) tuples
> where the TestBot should ignore new failures.
> 
> I'm not very fond of the blacklist approach. Once it's in place it may
> be very tempting to just put every flaky test into it rather than fixing
> it. This will lead to a long list of exceptions which will have to be
> maintained. In particular knowing when to remove an entry will be very
> important.
> 
> I also worry that once the test failures are papered over there won't be
> much incentive to fix them. To be fair that risk is not really different
> from what could happen with my patch but the scale would be larger.

I'm inclined to think this will happen with any approach that silences
failing tests, including your original proposal.

We wouldn't do this for test.winehq.org, presumably (and as I
understand it, the two systems are different enough that it's not difficult).

> 
> But it could work with the rare intermittent failures too which would be
> valuable. And it could be useful when introducing new test
> configurations that have new intermittent / variable issues. So there
> could be value in doing this anyway.
> 
> Maybe with some safeguards it can be made to work.
> 
> * I think I'd want a Wine bug describing the issue to be associated with
>    each blacklist entry. That bug should provide some minimal diagnosis:
>    whether it's a new Windows behavior, a race condition or some issue
>    that was reported to QEMU. That would ensure we know why the blacklist
>    entry was added. One could also check the status of the bug when
>    reviewing the blacklist entries. A closed bug would be a strong hint
>    that the blacklist entry is no longer needed.

That seems reasonable regardless of what approach we take.

> * And I think it would be better to have a regexp that matches only
>    the troublesome failures rather than to blacklist the whole test unit.
>    Besides being finer grained this would be useful for cases like
>    user32:win which has different issues depending on the locale and
>    where each should be associated with a different bug (bugs 48815, 48819
>    and 48820).
> 
> * I think I'd also want to record the time when the blacklist entry was
>    last used. This relies on having the above regular expression since
>    without it the TestBot would not know anything beyond 'the test unit
>    was run and had failures'. Also the regular expression would only be
>    used against *new* failures. So this would really record the last time
>    the blacklist entry was actually useful.
> 
>    An entry that was unused for a long time would be a prime candidate
>    for reviewing the corresponding bug and for removal. (Note: The
>    blacklist would also be used on WineTest reports so it would get a
>    chance of matching its target at least 5 days / week).
> 
> * I'd want a page listing the blacklisted entries so developers have a
>    good starting point to work on them.
> 
> * Ideally the blacklist page would also point to the tasks where the
>    blacklist was last used. I think this would also be useful for
>    developers trying to fix the issues, particularly for the rare
>    intermittent kind.
> 
>    Note that Wine VMs often test in multiple configurations per task
>    (e.g. wow32 and wow64, different locales), each producing its own test
>    report. So pointing at just the task would leave the developer
>    guessing which report should be looked at. But that's probably ok.
> 
>    More importantly, (test:unit, testbot-vm) tuples make it impossible
>    to blacklist a specific Wine test configuration such as a specific
>    locale since they all run on the same VM. Similarly it would make
>    blacklisting bitness-blind on Windows VMs.
> 
>    If necessary the tuple could maybe be extended with the specific
>    mission the blacklist applies to. But I'm not sure on the specific
>    impacts and it may not be worth it.
> 
> 
> * Pseudo database schema and sample use:
> 
>    FailureBlacklists
>    -----------------
> 
>    PK Bug             48815
>    PK TestModule      user32
>    PK TestUnit        win
>       Name            0x738 message
>       FailureRegExp   Test failed: hwnd [0-9A-F]{8,16} message 0738
>       LastUse         2020-03-27
> 
> 
>    FailureBlacklistVMs
>    -------------------
> 
>    PK Bug             48815
>    PK TestModule      user32
>    PK TestUnit        win
>    PK VMName          Entries for w1064v1709 w1064v1809 etc.
> 
>    (48815, user32, win, w1064v1709)
>    (48815, user32, win, w1064v1809)
>    (48815, user32, win, w1064v1809_2scr)
>    ...
> 
> 
>    FailureBlacklistUses (optionally)
>    ---------------------------------
> 
>    PK Bug
>    PK TestModule
>    PK TestUnit
>    PK JobId
>    PK StepNo
>    PK TaskNo
> 
>    (48815, user32, win, 68507, 1, 7)
>    (48815, user32, win, 68508, 1, 7)
>    ...
> 

I think all of this looks reasonable to me as well.
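
For what it's worth, here is a rough sketch of how the matching and the
LastUse bookkeeping could fit together, just to check that the schema above
holds up. It is Python with SQLite purely for illustration; the TestBot's
actual code and database layer are different. The table and column names
follow the pseudo schema quoted above, while the column types, the
filter_new_failures() helper and the shape of the failure lines are my own
assumptions.

```python
import re
import sqlite3
from datetime import date

# Hypothetical SQL rendering of the quoted pseudo schema (types are assumed).
SCHEMA = """
CREATE TABLE FailureBlacklists (
    Bug           INTEGER NOT NULL,
    TestModule    TEXT NOT NULL,
    TestUnit      TEXT NOT NULL,
    Name          TEXT,
    FailureRegExp TEXT NOT NULL,
    LastUse       TEXT,
    PRIMARY KEY (Bug, TestModule, TestUnit)
);
CREATE TABLE FailureBlacklistVMs (
    Bug        INTEGER NOT NULL,
    TestModule TEXT NOT NULL,
    TestUnit   TEXT NOT NULL,
    VMName     TEXT NOT NULL,
    PRIMARY KEY (Bug, TestModule, TestUnit, VMName)
);
"""


def filter_new_failures(db, module, unit, vm, new_failures, today):
    """Return the new failure lines that no blacklist entry covers.

    Only entries registered for this (module, unit, vm) are considered.
    LastUse is updated for every entry that matched at least one line.
    """
    rows = db.execute(
        "SELECT b.Bug, b.FailureRegExp"
        " FROM FailureBlacklists b"
        " JOIN FailureBlacklistVMs v"
        "   ON v.Bug = b.Bug AND v.TestModule = b.TestModule"
        "  AND v.TestUnit = b.TestUnit"
        " WHERE b.TestModule = ? AND b.TestUnit = ? AND v.VMName = ?",
        (module, unit, vm),
    ).fetchall()
    entries = [(bug, re.compile(regexp)) for bug, regexp in rows]

    remaining = []
    used = set()
    for line in new_failures:
        matched = [bug for bug, regexp in entries if regexp.search(line)]
        if matched:
            used.update(matched)
        else:
            remaining.append(line)

    # Record that these entries were actually useful today.
    for bug in used:
        db.execute(
            "UPDATE FailureBlacklists SET LastUse = ?"
            " WHERE Bug = ? AND TestModule = ? AND TestUnit = ?",
            (today.isoformat(), bug, module, unit),
        )
    return remaining
```

Since the regular expression is only run against new failures, LastUse ends
up recording the last time the entry actually hid something, which is what
you would want when deciding whether an entry is stale.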