RFC How to get rid of "always new" TestBot false positives?

Fri Mar 27 07:59:27 CDT 2020

On 3/27/20 12:53 PM, Francois Gouget wrote:
> 
> So I'm resubmitting this patch hoping for some feedback since the last
> time it just dropped off the mailing list [1].
> 
> The issue is that some test failures include values that change on every
> run which prevents the TestBot from knowing whether they are new or not.
> In turn this means the TestBot systematically blames the author of
> whichever patch is being tested for these failures.
> 
> Here are some examples:
> 
> * user32:msg - Test failed: hwnd 0028050C message 0738
> 
>    The message value is important: an unexpected 0738 message has a
>    completely different cause than an unexpected WM_MOUSEFIRST message.
>    So the TestBot should not consider a failure about message 0738 to be
>    identical to a failure about message 0200.
> 
>    But the value of the window handle does not change the nature of the
>    message. So a failure about hwnd 0028050C is identical to one about
>    hwnd 0043050E.
> 
> 
> * kernel32:comm - Test failed: WaitCommEvent used 1594 ms for waiting
> 
>    WaitCommEvent() took longer than expected and knowing by how much can
>    inform on the right approach to deal with it. But a failure with a
>    1594 ms overrun is the same as one with a 869 ms overrun.
> 
> 
> * comctl32:datetime.c - Test failed: Expected 24/03/2020, got 2020/03/24
> 
>    The failure message obviously changes every day (because this test
>    can only be about the current date), and yet is the same.
> 
> 
> Because a basic string comparison of past failures to the current one
> always show they are different, the TestBot systematically claims that
> any patch involving these test units is bad. As anyone who cries wolf
> all the time, it ends up being ignored and thus is not serving its
> purpose.
> 
> So the solution I propose is to add some delimiter around the variable
> parts so the TestBot can identify and ignore them when comparing
> failures. This way it will be able to identify the really meaningful
> changes.
> 
> For instance one could enclose the irrelevant parts in
> double-parentheses [2] as follows:
> 
>      Test failed: hwnd ((0028050C)) message 0738
>      Test failed: WaitCommEvent used ((1594)) ms for waiting
>      Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24))     [3]
> 
> 
> Pros:
> * This provides a quick way to significantly improve the TestBot
>    results. Where fixing each and every one of these failures would
>    likely take weeks if not months, adding parentheses where relevant
>    could likely be done by a single developer (probably me) in a week or
>    so.
> 
> * With the "always new" failures out of the picture the Wine developers
>    can focus on the rare intermittent ones, for which there is no quick
>    fix. (Rare intermittent failures are those that happen less than 5% of
>    the time and which are thus not present in a given test
>    configuration's WineTest results history, but which still have a
>    siginificant chance of happening when a patch is tested against 12 or
>    more configurations: 3% < 5% but 12 * 3% = 36%);
> 
> * If the only way to prevent these failures from showing up as "always
>    new" is to fix the failure, it can be tempting to just remove the test
>    or skip it in the cases where it fails (e.g. specific locales). This
>    could mean losing valuable information about how Windows behaves.
> 
> * We will get a steady stream of "always new" failures as new tests are
>    added and the test configurations change (e.g. new Windows versions).
>    Delimiting the variable parts provides a timely way to deal with these
>    new issues.
> 
> * The TestBot patch is really simple (single-liner, 4 lines with
>    comments).
> 
> 
> Cons:
> * Once developers no longer get blamed all the time for these failures
>    they may have less incentive to fix them.
> 
>    -> I don't really buy that argument. Wine developers have shown they
>       can very well ignore these test failures for years and there is no
>       consequence to doing so anyway.
> 
> * There is a risk of failure messages containing the chosen delimiters
>    for unrelated reasons. So using double-quotes as the delimiters as in
>    "0028050C" would be a bad idea.
> 
>    -> But a carefully chosen delimiter greatly reduces that risk and in
>       the worst case it will only cause a few failure messages to not be
>       detected as new. [2]
> 
> * This requires adding 'tagging' to (many) failure messages.
> 
>    -> While failure messages can be tagged preventively, only those
>       that fail really must be tagged. Also I feel the 'tagging' is
>       pretty minimal.
> 
> 
> 
> [1] https://www.winehq.org/pipermail/wine-devel/2019-December/157168.html
> 
> [2] I'm not dead set on double parentheses. I'm fine with anything that
>      has a low false-positive probability. Note that this eliminates
>      double-quotes, single parentheses "(360, 200)", curly brackets
>      "{360, 200}", single angle brackets "exp<0000000000>", tildes (short
>      pathnames), etc. But pretty much double-anything would work. Maybe
>      double-angle brackets <<0028050C>>?
> 
> [3] In this last example I feel that comparing the failure based on just
>      "Expected, got" makes it too generic.
> 
> 
> diff --git a/testbot/lib/WineTestBot/LogUtils.pm b/testbot/lib/WineTestBot/LogUtils.pm
> index 27d38567ea..842ebc8808 100644
> --- a/testbot/lib/WineTestBot/LogUtils.pm
> +++ b/testbot/lib/WineTestBot/LogUtils.pm
> @@ -1086,6 +1086,10 @@ sub _GetLineKey($)
>     my ($Line) = @_;
>     return undef if (!defined $Line);
>   
> +  # Remove variable parts.
> +  # Use a non-greedy match to ensure the ignored part does not contain '))'.
> +  $Line =~ s/\(\(.+?\)\)/(())/g;
> +
>     # Remove the line number
>     $Line =~ s/^([_a-z0-9]+\.c:)\d+:( Test (?:failed|succeeded inside todo block): )/$1$2/
> 
> 

IIUC the fix also involves updating the failing tests message to mark 
some information as irrelevant to the failure itself. In this case, why 
not just remove that information from the message?

Because the test failure messages are used to identify individual tests 
and track their results in time they should probably only include 
meaningful information and if some additional information is still 
useful for debugging, add a trace right before the test to print it.

I think the main problem is that it's difficult to uniquely identify a 
given test, while keeping it easy to modify existing or add new ones.

Maybe using the surrounding function name and the test line number 
relative to it would be a good unique identifier instead. I think it 
would reduce the possible false positives to when tests within a 
function are added or reordered, which would be a good occasion to fix 
the failing tests.

That would require some wrapper around test functions as well to track 
relative line numbers, but it's maybe not as bad as going over every 
test message to check and remove irrelevant information.

A quick and dirty PoC: https://gcc.godbolt.org/z/3TELVD

(I haven't thought a lot about it and there's maybe some obvious traps I 
missed.)

Cheers,
-- 
Rémi Bernon <rbernon at codeweavers.com>