RFC How to get rid of "always new" TestBot false positives?

Fri Mar 27 06:53:53 CDT 2020

So I'm resubmitting this patch hoping for some feedback since the last 
time it just dropped off the mailing list [1].

The issue is that some test failures include values that change on every 
run which prevents the TestBot from knowing whether they are new or not. 
In turn this means the TestBot systematically blames the author of 
whichever patch is being tested for these failures.

Here are some examples:

* user32:msg - Test failed: hwnd 0028050C message 0738

  The message value is important: an unexpected 0738 message has a 
  completely different cause than an unexpected WM_MOUSEFIRST message. 
  So the TestBot should not consider a failure about message 0738 to be 
  identical to a failure about message 0200.

  But the value of the window handle does not change the nature of the 
  message. So a failure about hwnd 0028050C is identical to one about 
  hwnd 0043050E.


* kernel32:comm - Test failed: WaitCommEvent used 1594 ms for waiting

  WaitCommEvent() took longer than expected and knowing by how much can 
  inform on the right approach to deal with it. But a failure with a 
  1594 ms overrun is the same as one with a 869 ms overrun.


* comctl32:datetime.c - Test failed: Expected 24/03/2020, got 2020/03/24

  The failure message obviously changes every day (because this test 
  can only be about the current date), and yet is the same.


Because a basic string comparison of past failures to the current one 
always show they are different, the TestBot systematically claims that 
any patch involving these test units is bad. As anyone who cries wolf 
all the time, it ends up being ignored and thus is not serving its 
purpose.

So the solution I propose is to add some delimiter around the variable 
parts so the TestBot can identify and ignore them when comparing 
failures. This way it will be able to identify the really meaningful 
changes.

For instance one could enclose the irrelevant parts in 
double-parentheses [2] as follows:

    Test failed: hwnd ((0028050C)) message 0738
    Test failed: WaitCommEvent used ((1594)) ms for waiting
    Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24))     [3]


Pros:
* This provides a quick way to significantly improve the TestBot 
  results. Where fixing each and every one of these failures would 
  likely take weeks if not months, adding parentheses where relevant 
  could likely be done by a single developer (probably me) in a week or 
  so.

* With the "always new" failures out of the picture the Wine developers 
  can focus on the rare intermittent ones, for which there is no quick 
  fix. (Rare intermittent failures are those that happen less than 5% of 
  the time and which are thus not present in a given test 
  configuration's WineTest results history, but which still have a 
  siginificant chance of happening when a patch is tested against 12 or 
  more configurations: 3% < 5% but 12 * 3% = 36%);

* If the only way to prevent these failures from showing up as "always 
  new" is to fix the failure, it can be tempting to just remove the test 
  or skip it in the cases where it fails (e.g. specific locales). This 
  could mean losing valuable information about how Windows behaves.

* We will get a steady stream of "always new" failures as new tests are 
  added and the test configurations change (e.g. new Windows versions). 
  Delimiting the variable parts provides a timely way to deal with these 
  new issues.

* The TestBot patch is really simple (single-liner, 4 lines with 
  comments).


Cons:
* Once developers no longer get blamed all the time for these failures 
  they may have less incentive to fix them.

  -> I don't really buy that argument. Wine developers have shown they 
     can very well ignore these test failures for years and there is no 
     consequence to doing so anyway.

* There is a risk of failure messages containing the chosen delimiters 
  for unrelated reasons. So using double-quotes as the delimiters as in 
  "0028050C" would be a bad idea.

  -> But a carefully chosen delimiter greatly reduces that risk and in 
     the worst case it will only cause a few failure messages to not be 
     detected as new. [2]

* This requires adding 'tagging' to (many) failure messages.

  -> While failure messages can be tagged preventively, only those 
     that fail really must be tagged. Also I feel the 'tagging' is 
     pretty minimal.



[1] https://www.winehq.org/pipermail/wine-devel/2019-December/157168.html

[2] I'm not dead set on double parentheses. I'm fine with anything that 
    has a low false-positive probability. Note that this eliminates 
    double-quotes, single parentheses "(360, 200)", curly brackets 
    "{360, 200}", single angle brackets "exp<0000000000>", tildes (short 
    pathnames), etc. But pretty much double-anything would work. Maybe 
    double-angle brackets <<0028050C>>?

[3] In this last example I feel that comparing the failure based on just
    "Expected, got" makes it too generic.

diff --git a/testbot/lib/WineTestBot/LogUtils.pm b/testbot/lib/WineTestBot/LogUtils.pm
index 27d38567ea..842ebc8808 100644
--- a/testbot/lib/WineTestBot/LogUtils.pm
+++ b/testbot/lib/WineTestBot/LogUtils.pm
@@ -1086,6 +1086,10 @@ sub _GetLineKey($)
   my ($Line) = @_;
   return undef if (!defined $Line);
 
+  # Remove variable parts.
+  # Use a non-greedy match to ensure the ignored part does not contain '))'.
+  $Line =~ s/\(\(.+?\)\)/(())/g;
+
   # Remove the line number
   $Line =~ s/^([_a-z0-9]+\.c:)\d+:( Test (?:failed|succeeded inside todo block): )/$1$2/


-- 
Francois Gouget <fgouget at codeweavers.com>