It's been about a year since I started collecting data, and also since
the GitLab CI was introduced. So here's an update on the false
positive rates of the merge request and nightly Wine test runs.
Reminder:
A false positive (FP) is when the TestBot or GitLab CI says a failure
is new when it is not.
* TestBot
The FP rate stayed around 10% until the end of August, when the GitLab
bridge to the mailing list broke (see graphs). Looking at it
differently, except for June, on a given day there was a better than
40% chance that fewer than 10% of the MRs would get a false positive
(and a >70% chance for fewer than 25%).
But with the bridge gone the TestBot failures are no longer relayed to
the MRs, so collecting this data has become impractical, and rather
moot too.
* GitLab CI
The GitLab CI's FP rate stayed below 30% until mid-May but it has
remained clearly above that since then. The 5-week average even peaked
at 60% in early August and is not really getting better.
Changing perspective, since March fewer than 20% of the days had a
false positive rate below 10%. And in August and September every
single day had more than 10% false positives.
Also, before August the chances of having an FP rate lower than 25%
were much greater, usually 40% or more. But that rate has plummeted
and is now below 10%.
The 50% FP line shows great swings, which I think are caused by
periods where one or more tests have a 100% failure rate and do not
get fixed for weeks. Still, in early 2023 it was at 85% or more, and
since then there has been a clear downward trend where both the peaks
and troughs keep getting lower.
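For reference, here is a minimal sketch (with made-up sample numbers,
not the actual MR data) of how a daily FP rate and its trailing 5-week
average can be derived:

```python
# Sketch of computing the daily false positive rate and its 5-week
# (35-day) trailing average. The sample data below is hypothetical;
# the real input comes from the per-day MR results.
from collections import deque

def daily_fp_rate(results):
    """results: list of (total_mrs, mrs_with_false_positive) per day."""
    return [fp / total if total else 0.0 for total, fp in results]

def moving_average(rates, window=35):
    """Trailing moving average over the last `window` days (5 weeks)."""
    out, buf = [], deque(maxlen=window)
    for r in rates:
        buf.append(r)
        out.append(sum(buf) / len(buf))
    return out

days = [(20, 2), (25, 10), (18, 12), (22, 13)]  # made-up sample
rates = daily_fp_rate(days)                     # 0.10, 0.40, 0.67, 0.59
avg = moving_average(rates, window=35)
```

The 35-day window is what smooths out the day-to-day swings enough for
the longer-term trend to show.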
Conclusions:
* I hoped the TestBot FP rate would improve but it has only held steady.
It may be that this 10% failure rate is irreducible because of the
delay between when a new failure pops up and when the TestBot knows
how to identify it (i.e. when I add it to the known failures page:
https://testbot.winehq.org/FailuresList.pl).
Stemming the flow of new failures introduced by bad MRs may help lower
that rate. But new failures can also happen when a certificate
expires, when a test server goes down, or when the build platform
changes, for instance. So there will likely always be a residual FP
rate.
* The GitLab CI seemed to make progress at first but since mid-March
it has been drifting away from the goal of having no false
positives.
Notes:
* Comparing the TestBot and GitLab CI failure rates is akin to comparing
apples and oranges.
The GitLab CI performs a single run of the full test suite in Wine
(except for a handful of tests), plus a single 64-bit test.
The TestBot does:
* 1 full 64-bit run in Wine (no exceptions),
* 1 run of modified tests in a Windows-on-Windows Wine environment,
* 1 run of all tests of modified modules in Wine,
* 7 plain 32-bit Wine runs in various locales,
* 24 tests in various Windows, locale, GPU and screen layout
configurations.
And it still has one half to one third of the GitLab CI's false
positive rate.
* Improving the false positive rate does not mean that the Wine tests
have fewer failures. But getting reliable results from the CI was
deemed a necessary step for developers to trust it and know they
need to rework their MR when the results are bad.
It also means less work for the maintainer to discriminate between MRs
that introduce new failures and those that don't. And less chance of
making mistakes too.
* Conversely, improving the tests does not necessarily improve the false
positive rate. We have 230 failing test units so one can fix 229 of
them but if the last one fails systematically the false positive rate
will stay pegged at 100%.
Reducing the number of false positives requires either focusing on the
tests that cause them, or having countermeasures built into the CI...
as is the case for the TestBot.
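That last point can be illustrated with a toy simulation (hypothetical
numbers, not actual Wine data): as long as a single test unit fails
spuriously on every run, every day sees at least one false positive,
no matter how many of the other tests get fixed.

```python
# Toy simulation: 229 test units fixed (never fail spuriously) plus
# one that fails on every run. The per-day FP rate stays at 100%.
import random

def day_has_false_positive(fail_probs, runs=10):
    """True if any run of the day hits at least one spurious failure.
    fail_probs: per-test-unit probability of a spurious failure."""
    for _ in range(runs):
        if any(random.random() < p for p in fail_probs):
            return True
    return False

random.seed(0)
probs = [0.0] * 229 + [1.0]  # 229 fixed tests + 1 systematic failure
days = 100
fp_days = sum(day_has_false_positive(probs) for _ in range(days))
print(f"FP rate: {fp_days / days:.0%}")  # 100%: one bad test taints every day
```

Dropping that last probability to 0 (or teaching the CI to recognize
the failure) is the only way this simulated rate ever goes down.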
--
Francois Gouget <fgouget(a)codeweavers.com>