[00/25] JobDetails performance and new failures patchset

Tue Jan 14 09:40:36 CST 2020

I'm sending this patchset here for review and to provide an overview of 
what's to come. It's quite large so it shouldn't be committed in one go 
on the TestBot production system (hence the lack of the 'PATCH' prefix 
and signing off, those will go with the actual submission of the final 
patches).

The main goal of this patchset is to reduce the "new error" false 
positives caused by intermittent failures (bug 47998). But fixing this 
has a number of impacts which makes the patchset larger.

Some context: the TestBot detects new failures introduced by a patch by 
comparing its result to that of the last WineTest run. Any failure not 
present in the WineTest run is considered to be caused by the patch. 
This breaks down when a failure is intermittent and just happens on the 
patch's test run.

To avoid this the pathset modifies the TestBot to use all past WineTest 
runs as the comparison point. Only failures not present in any of them 
are considered new and attributed to the patch.

But the TestBot does the new failures detection when generating the 
JobDetails page (and also when sending the emails). This means parsing 
the WineTest results (600KB to 1.5MB) to extract failures, the same for 
the test results, and diffing them, for each test configuration 
(typically about 20), every time the page is loaded or refreshed. This 
results in typical load times between 0.2 and 2 seconds on my test 
system (up to 5 seconds on the official TestBot) (bug 48035). Annoying 
but not a major issue. But these can put quite a load on the web server 
when a rogue spider decides to crawl the TestBot site as happened a few 
times in december. Using all WineTest results on record instead of just 
the latest would increase the average load times by a factor of 20. This 
would make the page load time unacceptable and put too much load on the 
web server.

This means the JobDetails performance issues must be tackled first. 
Initially my idea was to just cache the errors lists. But in fact it's 
just as easy to also cache whether each error is new and thus do the 
detection only once. So the patchset introduces .errors files which 
contain only the errors extracted from a given task's log, whether each 
error is new, and its line number in the full report which allows for 
some future enhancements. With this change, generating the JobDetails 
page takes 0.01 to 0.15 seconds on my test system.

The structures used to manipulate the error lists were not practical and 
hardly extensible. In particular there was not a single list with a flag 
denoting new errors, but two lists, one with all the errors and another 
with only the new errors (see TagNewErrors()). Furthermore the error 
groups and errors in each group were returned separately (see 
GetLogErrors()), making them annoying to pass around and even harder to 
extend with more information such as line numbers. So the early patches 
rework the way errors are stored in memory to simplify usage and make 
their representation more extensible.

Before the patchset the WineRun* scripts would parse the test reports 
and store the extra validation errors in .err files (e.g. the too much 
output errors). But it's just as easy to store those in the .errors 
files, making the .err ones redundant. So they are removed on upgrade.

The web server is the wrong place to generate these 'cache' .errors 
files, in part because it does not have write access to the jobs/ 
directory. It's more logical to have each WineRun* script generate those 
when it completes. Then the .errors files are available when WineSendLog 
and JobDetails need them.

Because the new failures detection was performed each time JobDetails 
was run, it could happen long after the job completed. By that time 
there could be new WineTest results so simply using the latest ones 
would have changed the list of new failures of past jobs which would 
have made no sense. With the .errors files the new failures detection is 
performed only once when a task completes. So keeping a copy of the 
WineTest reference files is not really necessary anymore. But it's still 
useful to have them to diagnose issues.

The WineRunBuild script used to copy the latest WineTest reports to each 
of the job's step directories when the build completed. Instead it's 
much more logical to have each WineRun* script capture the WineTest 
reports it needs when it completes its task and to store them in the 
task's directory.

The reference WineTest results are stored in the latest/ directory so 
each needs a unique name. That name used to be of the form 
'<vmname>_<reportname>', for instance debian10_win32_fr_FR.report'. So 
this naming scheme was updated to allow for more than one report per 
test configuration. Specifically it now includes the job id since 
there's only one report per job for a given test configuration and this 
makes it easy to know where the reference report is coming from. Also 
underscores are allowed in both the VM names and report names which 
makes it hard to split the filename into its components. So the 
underscore was replaced by dashes. So that gives 
'debian10-job63174-win32_fr_FR.report'.

The Engine used to perform the renaming / upgrade operations on startup. 
But generating the .errors files for every single job takes several 
minutes which makes it inappropriate as an Engine startup task. Also in 
order to work on the patchset it is necessary to be able to test code 
both before and after files were moved around and .errors files added.

So a new tool, UpdateTaskLogs was added that can rebuild all generated 
files from scratch and allows dealing with upgrades and downgrades. It's 
introduced before any changes to the way the logs are handled and 
updated to perform the required renames and to rebuild the generated 
files. This allows upgrading the content of the latest/ and jobs/ 
directories when needed. Also when reverting to an earlier version of 
the code it can mostly restore the content of these directories to their 
original state, thus making it possible to test the old code. And should 
things go wrong on the production server this will also provide a way to 
roll back the state of the TestBot.

The TestBot engine redirects the WineRun* scripts stderr to log.err in 
order to capture Perl errors. This makes log.err different from all 
other .err files since it cannot be rebuilt from the content of the 
corresponding log file. So the patchset renames log and log.err to 
task.log and testbot.log respectively. Of course UpdateTaskLogs deals 
with the needed renames on upgrade. This is also the reason for the 
'mostly' in the previous section: UpdateTaskLogs does not revert these 
renames on downgrades (but that's easy to deal with with a find -exec 
command).

Another issue is that the WineRun* scripts obviously cannot generate the 
.errors file for testbot.log. So instead this file is generated by 
WineSendLog which is called as soon as the overall job is complete. 
Normally testbot.log is empty and thus ignored anyway. Also the 
JobDetails page needs to deal with cases where the .errors file is 
missing in any case. So when that happens it simply points the user to 
the full log.

Finally, if we have two WineTest reports, one with errors A and B, and 
an older one with only error B, then the second report is useless as far 
as detecting new failures go. So one can speed up new failure detection 
by only keeping the useful reports. With the new/old status being stored 
in the .errors file this is not really crucial since it only slows down 
the WineRun* scripts. The patchset still includes a patch to prune the 
redundant WineTest reports. However it's not very effective because some 
tests produce errors that change with every run. Those will clearly need 
to be fixed.

Francois Gouget (25):
  testbot: Standardize building the reference report filenames.
  testbot/LogUtils: Don't rename the old logs on Engine startup.
  testbot/UpdateTaskLogs: Delete, rebuild or add reference reports.
  testbot/LogUtils: Tag new errors instead of filtering them.
  testbot/LogUtils: Return the log errors as a single object.
  testbot/LogUtils: Add _AddLogError() and _AddLogGroup().
  testbot/LogUtils: Rename the $Summary variables to $LogInfo.
  testbot/LogUtils: Tweak ParseTaskLog() to reduce the indentation
    level.
  testbot/LogUtils: Return the ParseWineTestReport() results as a single
    object.
  testbot/LogUtils: Add BadRef/NoRef to TagNewErrors().
  testbot/LogUtils: Rename the ParseTaskLog() NoLog error field to
    BadLog.
  testbot/JobDetails: Highlight new errors in the full log too.
  testbot/LogUtils: Keep track of the first line of each error group.
  testbot/LogUtils: Store the filename in the $LogInfo structures.
  testbot: Share code to access the latest reference WineTest reports.
  testbot/UpdateTaskLogs: Delete reference reports that are not supposed
    to exist.
  testbot: Move reference logs to the task and prepare for having more
    than one.
  testbot: Separate the per-task task and testbot logs.
  testbot/LogUtils: Reorder the functions to get a more logical
    grouping.
  testbot: Cache the log error lists in .errors files.
  testbot/JobDetails: Improve the error reporting.
  testbot/LogUtils: Share code between _SaveLogErrors() and
    _DumpErrors().
  testbot/LogUtils: Return the errors in the Parse*() $LogInfo
    structure.
  testbot: Take into account multiple WineTest reports.
  testbot/LogUtils: Deduplicate the latest WineTest reports.

 testbot/bin/Engine.pl               |   14 -
 testbot/bin/Janitor.pl              |    9 +-
 testbot/bin/UpdateTaskLogs          |  559 +++++++++++++
 testbot/bin/WineRunBuild.pl         |   49 +-
 testbot/bin/WineRunReconfig.pl      |   24 +-
 testbot/bin/WineRunTask.pl          |   78 +-
 testbot/bin/WineRunWineTest.pl      |  118 +--
 testbot/bin/WineSendLog.pl          |  143 ++--
 testbot/lib/WineTestBot/LogUtils.pm | 1161 ++++++++++++++++++---------
 testbot/lib/WineTestBot/Tasks.pm    |   50 +-
 testbot/web/JobDetails.pl           |  146 +++-
 testbot/web/WineTestBot.css         |    2 +
 12 files changed, 1700 insertions(+), 653 deletions(-)
 create mode 100755 testbot/bin/UpdateTaskLogs

-- 
2.20.1