[PATCH 1/2] testbot: Make the job scheduler more extensible.

Francois Gouget <fgouget@codeweavers.com>
Sun Feb 18 21:25:04 CST 2018


- The new scheduler splits the work into smaller steps: assessing the
  current situation, starting tasks on idle VMs and building a list of
  needed VMs, reverting the VMs, powering off the remaining VMs,
  updating the activity records. Each part can be analyzed
  independently. It uses the $Sched structure to pass information
  between these functions.
- Sometimes a VM that no Task needs must be powered off so that a VM
  which is needed can be prepared. Picking which VM to sacrifice is now
  delegated to _SacrificeVM().
- The scheduler used to handle each VM host independently. As a result
  it was unable to prepare a VM for the 'next step' if that VM was on
  another VM host. The lack of a global picture also made many other
  extensions impossible. The new scheduler handles scheduling on all
  VM hosts at the same time, thus solving this issue.
- This and other improvements mean the scheduler no longer needs to loop
  over the jobs and tasks multiple times.
- In order to respect the per-VM-host limits the scheduler stores the
  host-related counters and limits in the $Sched->{hosts} table.
- The scheduler also used to build multiple lists of VMs to revert
  depending on whether they were needed now, for the next step, or for
  future jobs. The new scheduler builds a single prioritised list of
  VMs to revert which can be handled in one go. It also keeps more
  information so it can better decide which VM to prepare next.
- The scheduler can now also prepare VMs for the 'next step' earlier,
  thus making it more likely they will be ready in time.
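
The per-VM-host bookkeeping described above can be sketched as follows.
This is a stand-alone illustration, not code from the patch: the host key,
the limit values and the un-prefixed helper names are made up for the
example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the WineTestBot configuration settings
my $MaxRevertingVMs = 3;
my $MaxRevertsWhileRunningVMs = 1;

# Minimal sketch of the per-host entry in $Sched->{hosts}: counters for
# the VMs in each state plus the host's own copies of the limits.
sub GetSchedHost
{
  my ($Sched, $HostKey) = @_;
  $Sched->{hosts}->{$HostKey} ||= {
    active => 0, idle => 0, reverting => 0,
    sleeping => 0, running => 0, dirty => 0,
    MaxRevertingVMs => $MaxRevertingVMs,
    MaxRevertsWhileRunningVMs => $MaxRevertsWhileRunningVMs,
  };
  return $Sched->{hosts}->{$HostKey};
}

# Mirrors _GetMaxReverts(): a stricter limit applies while tests run so
# the reverts don't steal resources from the running tasks.
sub GetMaxReverts
{
  my ($Host) = @_;
  return $Host->{running} > 0 ? $Host->{MaxRevertsWhileRunningVMs}
                              : $Host->{MaxRevertingVMs};
}

my $Sched = { hosts => {} };
my $Host = GetSchedHost($Sched, "vm1.example.org");
print GetMaxReverts($Host), "\n";   # 3: no test is running
$Host->{running}++;
print GetMaxReverts($Host), "\n";   # 1: be gentle while tests run
```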

Signed-off-by: Francois Gouget <fgouget@codeweavers.com>
---
 testbot/lib/WineTestBot/Jobs.pm | 914 ++++++++++++++++++++++++++++------------
 1 file changed, 637 insertions(+), 277 deletions(-)
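
To illustrate the single prioritised revert list, here is a stand-alone
sketch of the ordering that _CompareNeededVMs() applies to the
[niceness, hot, task count] records built by _AddNeededVM(). The VM names
and values are invented for the example; the dependency slot is omitted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical $NeededVMs records: [niceness, hot, task count]
my %NeededVMs = (
  "build" => [1, 1, 3],  # job rank 1, powered on, needed by 3 tasks
  "win7"  => [1, 0, 1],  # job rank 1, powered off
  "win10" => [2, 1, 1],  # job rank 2, powered on
);

# Mirrors _CompareNeededVMs(): lower niceness jobs first, then hot VMs,
# then the VMs needed by the most tasks.
sub CompareNeededVMs
{
  my ($Data1, $Data2) = @_;
  return $Data1->[0] <=> $Data2->[0] ||
         $Data2->[1] <=> $Data1->[1] ||
         $Data2->[2] <=> $Data1->[2];
}

my @Sorted = sort { CompareNeededVMs($NeededVMs{$a}, $NeededVMs{$b}) }
             keys %NeededVMs;
print "@Sorted\n";  # build win7 win10
```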

diff --git a/testbot/lib/WineTestBot/Jobs.pm b/testbot/lib/WineTestBot/Jobs.pm
index 4ed6e03ff..036222110 100644
--- a/testbot/lib/WineTestBot/Jobs.pm
+++ b/testbot/lib/WineTestBot/Jobs.pm
@@ -429,6 +429,26 @@ sub CompareJobPriority
   return $a->Priority <=> $b->Priority || $a->Id <=> $b->Id;
 }
 
+=pod
+=over 12
+
+=item C<CheckJobs()>
+
+Goes through the list of Jobs and updates their status. As a side-effect this
+detects failed builds, dead child processes, etc.
+
+=back
+=cut
+
+sub CheckJobs()
+{
+  my $Jobs = CreateJobs();
+  $Jobs->AddFilter("Status", ["queued", "running"]);
+  $_->UpdateStatus() for (@{$Jobs->GetItems()});
+
+  return undef;
+}
+
 sub min(@)
 {
   my $m = shift @_;
@@ -436,304 +456,658 @@ sub min(@)
   return $m;
 }
 
+sub _GetSchedHost($$)
+{
+  my ($Sched, $VM) = @_;
+
+  my $HostKey = $VM->GetHost();
+  if (!$Sched->{hosts}->{$HostKey})
+  {
+    $Sched->{hosts}->{$HostKey} = {
+      active => 0,
+      idle => 0,
+      reverting => 0,
+      sleeping => 0,
+      running => 0,
+      dirty => 0,
+      MaxRevertingVMs => $MaxRevertingVMs,
+      MaxRevertsWhileRunningVMs => $MaxRevertsWhileRunningVMs,
+      MaxActiveVMs => $MaxActiveVMs,
+      MaxVMsWhenIdle => $MaxVMsWhenIdle,
+    };
+  }
+  return $Sched->{hosts}->{$HostKey};
+}
+
+sub _GetMaxReverts($)
+{
+  my ($Host) = @_;
+  return ($Host->{running} > 0) ?
+         $Host->{MaxRevertsWhileRunningVMs} :
+         $Host->{MaxRevertingVMs};
+}
+
 =pod
 =over 12
 
-=item C<ScheduleOnHost()>
+=item C<_CanScheduleOnVM()>
 
-This manages the VMs and WineTestBot::Task objects corresponding to the
-hypervisors of a given host. To stay within the host's resource limits the
-scheduler must take the following constraints into account:
-=over
+Checks if a task or VM operation can be performed on the specified VM.
 
-=item *
+=back
+=cut
 
-Jobs should be run in decreasing order of priority.
+sub _CanScheduleOnVM($$)
+{
+  my ($Sched, $VM) = @_;
 
-=item *
+  return 1 if ($VM->Status eq "off");
 
-A Job's Steps must be run in sequential order.
+  # If the VM is busy it cannot be taken over for a new task
+  my $VMKey = $VM->GetKey();
+  return 0 if ($Sched->{busyvms}->{$VMKey});
+
+  # A process may be working on the VM even though it is not busy (e.g. if it
+  # is sleeping). In that case just wait.
+  return !$VM->ChildPid;
+}
+
+=pod
+=over 12
+
+=item C<_CheckAndClassifyVMs()>
+
+Checks each VM's state for consistency, counts the VMs in each state and
+classifies them.
+
+=over
 
 =item *
 
-A Step's tasks can be run in parallel but only one task can be running in a VM
-at a given time. Also a VM must be prepared before it can run its task, see the
-VM Statuses.
+Checks that each VM's state is consistent and fixes the VM state if not. For
+instance, if Status == running then the VM should have a child process. If
+there is no such process, or if it died, then the VM should be brought back
+to a coherent state, typically by marking it dirty so it is either powered off
+or reverted.
 
 =item *
 
-The number of active VMs on the host must be kept under $MaxActiveVMs. This
-includes any VM using resources, including those that are being reverted. The
-rational behind this limit is that the host may not be able to run more VMs
-simultaneously, typically due to memory or CPU constraints. Also note that
-this limit must be respected even if there is more than one hypervisor running
-on the host.
+Counts the VMs in each state so the scheduler can respect the limits put on the
+number of simultaneous active VMs, reverting VMs, and so on.
 
 =item *
 
-The number of VMs being reverted on the host at a given time must be kept under
-$MaxRevertingVMs, or $MaxRevertsWhileRunningVMs if some VMs are currently
-running tests. This may be set to 1 in case the hypervisor gets confused when
-reverting too many VMs at once.
+Puts the VMs in one of three sets:
+- The set of busyvms.
+  This is the set of VMs that are doing something important, for instance
+  running a Task, and should not be messed with.
+- The set of lambvms.
+  This is the set of VMs that use resources (they are powered on), but are
+  not doing anything important (idle, sleeping and dirty VMs). If the scheduler
+  is hitting the limits but still needs to power on one more VM, it can power
+  off one of these to make room.
+- The set of powered off VMs.
+  These are the VMs which are in neither the busyvms nor the lambvms set. Since
+  they are powered off they are not using resources.
 
 =item *
 
-Once there are no jobs to run anymore the scheduler can prepare up to
-$MaxVMsWhenIdle VMs (or $MaxActiveVMs if not set) for future jobs.
-This can be set to 0 to minimize the TestBot resource usage when idle.
-This can also be set to a value greater than $MaxActiveVMs. Then only
-$MaxActiveVMs tasks will be run simultaneously but the extra idle VMs will be
-kept on standby so they are ready when their turn comes.
+Each VM is given a priority describing the likelihood that it will be needed
+by a future job. When no other VM is running this can be used to decide which
+VMs to start in advance.
 
-=cut
+=back
 
 =back
 =cut
 
-sub ScheduleOnHost($$$$)
+sub _CheckAndClassifyVMs()
 {
-  my ($ScopeObject, $SortedJobs, $Hypervisors, $Records) = @_;
-
-  my $HostVMs = CreateVMs($ScopeObject);
-  $HostVMs->FilterEnabledRole();
-  $HostVMs->FilterHypervisors($Hypervisors);
+  my $Sched = {
+    VMs => CreateVMs(),
+    hosts => {},
+    busyvms => {},
+    lambvms => {},
+    nicefuture => {},
+    runnable => 0,
+    queued => 0,
+    recordgroups => CreateRecordGroups(),
+  };
+  $Sched->{recordgroup} = $Sched->{recordgroups}->Add();
+  $Sched->{records} = $Sched->{recordgroup}->Records;
+  # Save the new RecordGroup now so its Id is lower than those of the groups
+  # created by the scripts called from the scheduler.
+  $Sched->{recordgroups}->Save();
 
+  my $FoundVMErrors;
   # Count the VMs that are 'active', that is, that use resources on the host,
   # and those that are reverting. Also build a prioritized list of those that
   # are ready to run tests: the idle ones.
-  my ($RevertingCount, $RunningCount, $IdleCount) = (0, 0, 0);
-  my (%VMPriorities, %IdleVMs, @DirtyVMs);
-  foreach my $VM (@{$HostVMs->GetItems()})
+  foreach my $VM (@{$Sched->{VMs}->GetItems()})
   {
     my $VMKey = $VM->GetKey();
-    my $VMStatus = $VM->Status;
-    if ($VMStatus eq "reverting")
+    if (!$VM->HasEnabledRole())
     {
-      $RevertingCount++;
+      # Don't schedule anything on this VM and otherwise ignore it
+      $Sched->{busyvms}->{$VMKey} = 1;
+      next;
     }
-    elsif ($VMStatus eq "running")
-    {
-      $RunningCount++;
-    }
-    elsif ($VMStatus eq "offline")
+
+    my $Host = _GetSchedHost($Sched, $VM);
+    if ($VM->HasRunningChild())
     {
-      if (!$VM->HasRunningChild())
+      if ($VM->Status =~ /^(?:dirty|running|reverting)$/)
       {
-        $VM->RecordResult($Records, "boterror process died");
-        my $ErrMessage = $VM->RunMonitor();
-        return $ErrMessage if (defined $ErrMessage);
+        $Sched->{busyvms}->{$VMKey} = 1;
+        $Host->{$VM->Status}++;
+        $Host->{active}++;
+      }
+      elsif ($VM->Status eq "sleeping")
+      {
+        # Note that when the VM snapshots are taken with the VM powered off,
+        # a sleeping VM is in fact booting up, thus using CPU and I/O
+        # resources. So don't count it as idle.
+        $Sched->{lambvms}->{$VMKey} = 1;
+        $Host->{sleeping}++;
+        $Host->{active}++;
+      }
+      elsif ($VM->Status eq "offline")
+      {
+        # The VM cannot be used until it comes back online
+        $Sched->{busyvms}->{$VMKey} = 1;
+      }
+      elsif ($VM->Status eq "maintenance")
+      {
+        # Maintenance VMs should not have a child process!
+        $FoundVMErrors = 1;
+        $VM->KillChild();
+        $VM->Save();
+        $VM->RecordResult($Sched->{records}, "boterror unexpected process");
+        # And the scheduler should not touch them
+        $Sched->{busyvms}->{$VMKey} = 1;
+      }
+      elsif ($VM->Status =~ /^(?:idle|off)$/)
+      {
+        # idle and off VMs should not have a child process!
+        # Mark the VM dirty so a poweroff or revert brings it to a known state.
+        $FoundVMErrors = 1;
+        $VM->KillChild();
+        $VM->Status("dirty");
+        $VM->Save();
+        $VM->RecordResult($Sched->{records}, "boterror unexpected process");
+        $Sched->{lambvms}->{$VMKey} = 1;
+        $Host->{dirty}++;
+        $Host->{active}++;
+      }
+      else
+      {
+        require WineTestBot::Log;
+        WineTestBot::Log::LogMsg("Unexpected $VMKey status ". $VM->Status ."\n");
+        $FoundVMErrors = 1;
+        # Don't interfere with this VM
+        $Sched->{busyvms}->{$VMKey} = 1;
       }
     }
     else
     {
-      my $Priority = $VM->Type eq "build" ? 10 :
-                     $VM->Role ne "base" ? 0 :
-                     $VM->Type eq "win32" ? 1 : 2;
-      $VMPriorities{$VMKey} = $Priority;
-
-      # Consider sleeping VMs to be 'almost idle'. We will check their real
-      # status before starting a job on them anyway. But if there is no such
-      # job, then they are expandable just like idle VMs.
-      if ($VMStatus eq "idle" || $VMStatus eq "sleeping")
+      if (defined $VM->ChildPid or
+          $VM->Status =~ /^(?:running|reverting|sleeping)$/)
       {
-        $IdleCount++;
-        $IdleVMs{$VMKey} = 1;
+        # The VM is missing its child process or it died unexpectedly. Mark
+        # the VM dirty so a revert or shutdown brings it back to a known state.
+        $FoundVMErrors = 1;
+        $VM->ChildPid(undef);
+        $VM->Status("dirty");
+        $VM->Save();
+        $VM->RecordResult($Sched->{records}, "boterror process died");
+        $Sched->{lambvms}->{$VMKey} = 1;
+        $Host->{dirty}++;
+        $Host->{active}++;
       }
-      elsif ($VMStatus eq "dirty")
+      elsif ($VM->Status =~ /^(?:dirty|idle)$/)
       {
-        push @DirtyVMs, $VMKey;
+        $Sched->{lambvms}->{$VMKey} = 1;
+        $Host->{$VM->Status}++;
+        $Host->{active}++;
       }
+      elsif ($VM->Status eq "offline")
+      {
+        my $ErrMessage = $VM->RunMonitor();
+        return ($ErrMessage, undef) if (defined $ErrMessage);
+        # Ignore the VM for this round since we cannot use it
+        $Sched->{busyvms}->{$VMKey} = 1;
+      }
+      elsif ($VM->Status eq "maintenance")
+      {
+        # Don't touch the VM while the administrator is working on it
+        $Sched->{busyvms}->{$VMKey} = 1;
+      }
+      elsif ($VM->Status ne "off")
+      {
+        require WineTestBot::Log;
+        WineTestBot::Log::LogMsg("Unexpected $VMKey status ". $VM->Status ."\n");
+        $FoundVMErrors = 1;
+        # Don't interfere with this VM
+        $Sched->{busyvms}->{$VMKey} = 1;
+      }
+      # Note that off VMs are neither in busyvms nor lambvms
     }
+
+    $Sched->{nicefuture}->{$VMKey} =
+        ($VM->Role eq "base" ? 0 :
+         $VM->Role eq "winetest" ? 10 :
+         20) + # extra
+        ($VM->Type eq "build" ? 0 :
+         $VM->Type eq "win64" ? 1 :
+         2); # win32
+  }
+
+  # If a VM was in an inconsistent state, update the jobs status fields before
+  # continuing with the scheduling.
+  CheckJobs() if ($FoundVMErrors);
+
+  return (undef, $Sched);
+}
+
+=pod
+=over 12
+
+=item C<_AddNeededVM()>
+
+Adds the specified VM to the list of VMs needed by queued tasks, together with
+priority information. The priority information is stored in an array which
+contains:
+
+=over
+
+=item [0]
+
+The VM's position in the Jobs list: older jobs take precedence over newer ones.
+Note that the position within a job ($Step->No and $Task->No) does not matter.
+What counts is getting the job results to the developer.
+
+=item [1]
+
+The VM Status: dirty VMs are given a small priority boost since they are
+likely to already be in the host's memory.
+
+=item [2]
+
+The number of Tasks that need the VM. Give priority to VMs that are needed by
+more Tasks so we don't end up in a situation where all the tasks need the same
+VM, which cannot be parallelized.
+
+=item [3]
+
+If the VM is needed for a 'next step', then this lists its dependencies.
+The dependencies are the VMs that are still needed by a task in the current
+step. If any VM in the dependencies list is not yet being prepared to run
+a task, then it is too early to start preparing this VM for the next step.
+
+=back
+
+=back
+=cut
+
+sub _AddNeededVM($$$;$)
+{
+  my ($NeededVMs, $VM, $Niceness, $Dependencies) = @_;
+
+  my $VMKey = $VM->GetKey();
+  if (!$NeededVMs->{$VMKey})
+  {
+    my $Hot = ($VM->Status ne "off") ? 1 : 0;
+    my $TaskCount = 1; # so far a single task needs this VM (see [2])
+    $NeededVMs->{$VMKey} = [$Niceness, $Hot, $TaskCount, $Dependencies];
+    return 1;
   }
 
-  # It usually takes longer to revert a VM than to run a test. So readyness
-  # (idleness) trumps the Job priority and thus we start jobs on the idle VMs
-  # right away. Then we build a prioritized list of VMs to revert.
-  my (%VMsToRevert, @VMsNext);
-  my ($RevertNiceness, $SleepingCount) = (0, 0);
-  foreach my $Job (@$SortedJobs)
+  # One more task needs this VM
+  $NeededVMs->{$VMKey}->[2]++;
+
+  # Although we process the jobs in decreasing priority order, the VM may
+  # have been added for a 'next step' task and thus with a much increased
+  # niceness and dependencies compared to the jobs that follow.
+  if ($Niceness < $NeededVMs->{$VMKey}->[0])
+  {
+    $NeededVMs->{$VMKey}->[0] = $Niceness;
+    $NeededVMs->{$VMKey}->[3] = $Dependencies;
+    return 1;
+  }
+
+  return 0;
+}
+
+sub _GetNiceness($$)
+{
+  my ($NeededVMs, $VMKey) = @_;
+  return $NeededVMs->{$VMKey}->[0];
+}
+
+sub _CompareNeededVMs($$$)
+{
+  my ($NeededVMs, $VMKey1, $VMKey2) = @_;
+
+  my $Data1 = $NeededVMs->{$VMKey1};
+  my $Data2 = $NeededVMs->{$VMKey2};
+  return $Data1->[0] <=> $Data2->[0] || # Lower niceness jobs first
+         $Data2->[1] <=> $Data1->[1] || # Hot VMs first
+         $Data2->[2] <=> $Data1->[2];   # Needed by more tasks first
+}
+
+sub _HasMissingDependencies($$$)
+{
+  my ($Sched, $NeededVMs, $VMKey) = @_;
+
+  my $Data = $NeededVMs->{$VMKey};
+  return undef if (!$Data->[3]);
+
+  foreach my $DepVM (@{$Data->[3]})
   {
+    return 1 if ($DepVM->Status !~ /^(?:reverting|sleeping|running)$/);
+  }
+  return undef;
+}
+
+my $NEXT_BASE = 1000;
+my $FUTURE_BASE = 2000;
+
+=pod
+=over 12
+
+=item C<_ScheduleTasks()>
+
+Runs the tasks on idle VMs, and builds a list of the VMs that will be needed
+next.
+
+=back
+=cut
+
+sub _ScheduleTasks($)
+{
+  my ($Sched) = @_;
+
+  # The set of VMs needed by the runnable, 'next step' and future tasks
+  my $NeededVMs = {};
+
+  # Process the jobs in decreasing priority order
+  my $JobRank;
+  my $Jobs = CreateJobs($Sched->{VMs});
+  $Jobs->AddFilter("Status", ["queued", "running"]);
+  foreach my $Job (sort CompareJobPriority @{$Jobs->GetItems()})
+  {
+    $JobRank++;
+
+    # The list of VMs that should be getting ready to run
+    # before we prepare the next step
+    my $PreviousVMs = [];
+
+    my $StepRank = 0;
     my $Steps = $Job->Steps;
     $Steps->AddFilter("Status", ["queued", "running"]);
-    my @SortedSteps = sort { $a->No <=> $b->No } @{$Steps->GetItems()};
-    if (@SortedSteps != 0)
+    foreach my $Step (sort { $a->No <=> $b->No } @{$Steps->GetItems()})
     {
-      my $Step = $SortedSteps[0];
-      $Step->HandleStaging();
-      my $PrepareNextStep;
       my $Tasks = $Step->Tasks;
       $Tasks->AddFilter("Status", ["queued"]);
-      my @SortedTasks = sort { $a->No <=> $b->No } @{$Tasks->GetItems()};
-      foreach my $Task (@SortedTasks)
+      $Sched->{queued} += $Tasks->GetItemsCount();
+
+      # StepRank 0 holds the runnable tasks and StepRank 1 the 'may soon be
+      # runnable' ones; we don't care about StepRank 2 and greater yet
+      next if ($StepRank >= 2);
+      if ($StepRank == 0)
+      {
+        $Step->HandleStaging() if ($Step->Status eq "queued");
+        $Sched->{runnable} += $Tasks->GetItemsCount();
+      }
+      elsif (!$PreviousVMs)
+      {
+        # The previous step is nowhere near done so skip this one for now
+        next;
+      }
+
+      my $StepVMs = [];
+      foreach my $Task (@{$Tasks->GetItems()})
       {
         my $VM = $Task->VM;
-        my $VMKey = $VM->GetKey();
-        next if (!$HostVMs->ItemExists($VMKey) || exists $VMsToRevert{$VMKey});
+        next if (!$VM->HasEnabledRole() or !$VM->HasEnabledStatus());
 
-        my $VMStatus = $VM->Status;
-        if ($VMStatus eq "idle")
+        if ($StepRank == 1)
         {
+          # Passing $PreviousVMs ensures this VM will be reverted if and only
+          # if all of the previous step's tasks are about to run.
+          # See _HasMissingDependencies().
+          _AddNeededVM($NeededVMs, $VM, $NEXT_BASE + $JobRank, $PreviousVMs);
+          next;
+        }
+
+        if (!_AddNeededVM($NeededVMs, $VM, $JobRank))
+        {
+          # This VM is in $NeededVMs already which means it is already
+          # scheduled to be reverted for a task with a higher priority.
+          # So this task won't be run before a while and thus there is
+          # no point in preparing the next step.
+          $StepVMs = undef;
+          next;
+        }
+
+        # It's not worth preparing the next step for tasks that take so long
+        $StepVMs = undef if ($Task->Timeout > $BuildTimeout);
+
+        my $VMKey = $VM->GetKey();
+        if ($VM->Status eq "idle")
+        {
+          # Most of the time reverting a VM takes longer than running a task.
+          # So if a VM is ready (i.e. idle) we can start the first task we
+          # find for it, even if we could revert another VM to run a higher
+          # priority job.
           # Even if we cannot start the task right away this VM is not a
           # candidate for shutdown since it will be needed next.
-          $IdleVMs{$VMKey} = 0;
+          delete $Sched->{lambvms}->{$VMKey};
 
-          if ($RunningCount < $MaxActiveVMs and
-              ($RevertingCount == 0 || $RevertingCount < $MaxRevertsWhileRunningVMs))
+          my $Host = _GetSchedHost($Sched, $VM);
+          if ($Host->{active} - $Host->{idle} < $Host->{MaxActiveVMs} and
+              ($Host->{reverting} == 0 or
+               $Host->{reverting} <= $Host->{MaxRevertsWhileRunningVMs}))
           {
+            $Sched->{busyvms}->{$VMKey} = 1;
+            $VM->RecordStatus($Sched->{records}, join(" ", "running", $Job->Id, $Step->No, $Task->No));
             my $ErrMessage = $Task->Run($Step);
-            return $ErrMessage if (defined $ErrMessage);
-            $VM->RecordStatus($Records, join(" ", "running", $Job->Id, $Step->No, $Task->No));
+            return ($ErrMessage, undef) if (defined $ErrMessage);
+
             $Job->UpdateStatus();
-            $IdleCount--;
-            $RunningCount++;
-            $PrepareNextStep = 1;
+            $Host->{idle}--;
+            $Host->{running}++;
           }
         }
-        elsif ($VMStatus eq "sleeping" and $IdleVMs{$VMKey})
+        elsif ($VM->Status =~ /^(?:reverting|sleeping)$/)
         {
-          # It's not running jobs yet but soon will be
-          # so it's not a candidate for shutdown or revert.
-          $IdleVMs{$VMKey} = 0;
-          $IdleCount--;
-          $SleepingCount++;
-          $PrepareNextStep = 1;
+          # The VM is not running jobs yet but soon will be so it is not a
+          # candidate for shutdown or sacrifices.
+          delete $Sched->{lambvms}->{$VMKey};
         }
-        elsif (($VMStatus eq "off" or $VMStatus eq "dirty") and
-               !$VM->HasRunningChild())
+        elsif ($VM->Status ne "off" and !$Sched->{lambvms}->{$VMKey})
         {
-          $RevertNiceness++;
-          $VMsToRevert{$VMKey} = $RevertNiceness;
-        }
-      }
-      if ($PrepareNextStep && @SortedSteps >= 2)
-      {
-        # Build a list of VMs we will need next
-        my $Step = $SortedSteps[1];
-        $Tasks = $Step->Tasks;
-        $Tasks->AddFilter("Status", ["queued"]);
-        @SortedTasks = sort { $a->No <=> $b->No } @{$Tasks->GetItems()};
-        foreach my $Task (@SortedTasks)
-        {
-          my $VM = $Task->VM;
-          my $VMKey = $VM->GetKey();
-          push @VMsNext, $VMKey;
-          # If idle already this is not a candidate for shutdown
-          $IdleVMs{$VMKey} = 0;
+          # We cannot use the VM because it is busy (running another task,
+          # offline, etc.). So it is too early to prepare the next step.
+          $StepVMs = undef;
         }
+        push @$StepVMs, $VM if ($StepVMs);
       }
+      $PreviousVMs = $StepVMs;
+      $StepRank++;
     }
   }
 
-  # Figure out how many VMs we will actually be able to revert now and only
-  # keep the highest priority ones.
-  my @SortedVMsToRevert = sort { $VMsToRevert{$a} <=> $VMsToRevert{$b} } keys %VMsToRevert;
-  my $MaxReverts = ($RunningCount > 0) ?
-                   $MaxRevertsWhileRunningVMs : $MaxRevertingVMs;
-  my $ActiveCount = $IdleCount + $RunningCount + $RevertingCount + $SleepingCount + @DirtyVMs;
-  # This is the number of VMs we would revert if idle and dirty VMs did not
-  # stand in the way. And those that do will be shut down.
-  my $RevertableCount = min(scalar(@SortedVMsToRevert),
-                            $MaxReverts - $RevertingCount,
-                            $MaxActiveVMs - ($ActiveCount - $IdleCount - @DirtyVMs));
-  if ($RevertableCount < @SortedVMsToRevert)
+  # Finally add some VMs with a very low priority for future jobs.
+  foreach my $VM (@{$Sched->{VMs}->GetItems()})
   {
-    $RevertableCount = 0 if ($RevertableCount < 0);
-    for (my $i = $RevertableCount; $i < @SortedVMsToRevert; $i++)
-    {
-      my $VMKey = $SortedVMsToRevert[$i];
-      delete $VMsToRevert{$VMKey};
-    }
-    splice @SortedVMsToRevert, $RevertableCount;
+    next if (!$VM->HasEnabledRole() or !$VM->HasEnabledStatus());
+    my $VMKey = $VM->GetKey();
+    my $Niceness = $FUTURE_BASE + $Sched->{nicefuture}->{$VMKey};
+    _AddNeededVM($NeededVMs, $VM, $Niceness);
   }
 
-  # Power off all the VMs that we won't be reverting now so they don't waste
-  # resources while waiting for their turn.
-  foreach my $VMKey (@DirtyVMs)
-  {
-    next if (exists $VMsToRevert{$VMKey});
+  return (undef, $NeededVMs);
+}
 
-    my $VM = $HostVMs->GetItem($VMKey);
-    next if ($VM->Status ne "dirty" or $VM->HasRunningChild());
+=pod
+=over 12
 
-    my $ErrMessage = $VM->RunPowerOff();
-    return $ErrMessage if (defined $ErrMessage);
-    $VM->RecordStatus($Records, "dirty poweroff");
-  }
+=item C<_SacrificeVM()>
 
-  # Power off some idle VMs we don't need immediately so we can revert more
-  # of the VMs we need now.
-  my $PlannedActiveCount = $ActiveCount - @DirtyVMs + @SortedVMsToRevert;
-  if ($IdleCount > 0 && @SortedVMsToRevert > 0 &&
-      $PlannedActiveCount > $MaxActiveVMs)
-  {
-    # Sort from least important to most important
-    my @SortedIdleVMs = sort { $VMPriorities{$a} <=> $VMPriorities{$b} } keys %IdleVMs;
-    foreach my $VMKey (@SortedIdleVMs)
-    {
-      my $VM = $HostVMs->GetItem($VMKey);
-      next if (!$IdleVMs{$VMKey});
+Looks for and powers off a VM we don't need now in order to free resources
+for one we do need now.
 
-      my $ErrMessage = $VM->RunPowerOff();
-      return $ErrMessage if (defined $ErrMessage);
-      $VM->RecordStatus($Records, "dirty poweroff");
-      $PlannedActiveCount--;
-      last if ($PlannedActiveCount <= $MaxActiveVMs);
-    }
-    # The scheduler will be run again when these VMs have been powered off and
-    # then we will do the reverts. In the meantime don't change $ActiveCount.
-  }
+This is a helper for _RevertVMs().
 
-  # Revert the VMs that are blocking jobs
-  foreach my $VMKey (@SortedVMsToRevert)
+=back
+=cut
+
+sub _SacrificeVM($$$)
+{
+  my ($Sched, $NeededVMs, $VM) = @_;
+  my $VMKey = $VM->GetKey();
+  my $Host = _GetSchedHost($Sched, $VM);
+
+  # Grab the lowest priority lamb and sacrifice it
+  my $ForFutureVM = (_GetNiceness($NeededVMs, $VMKey) >= $FUTURE_BASE);
+  my $NiceFuture = $Sched->{nicefuture};
+  my ($Victim, $VictimKey, $VictimStatusPrio);
+  foreach my $CandidateKey (keys %{$Sched->{lambvms}})
   {
-    last if ($RevertingCount == $MaxReverts);
+    my $Candidate = $Sched->{VMs}->GetItem($CandidateKey);
 
-    my $VM = $HostVMs->GetItem($VMKey);
-    next if ($VM->Status eq "off" and $ActiveCount >= $MaxActiveVMs);
+    # Check that the candidate is on the right host
+    my $CandidateHost = _GetSchedHost($Sched, $Candidate);
+    next if ($CandidateHost != $Host);
 
-    delete $VMPriorities{$VMKey};
-    my $ErrMessage = $VM->RunRevert();
-    return $ErrMessage if (defined $ErrMessage);
+    # Don't sacrifice idle / sleeping VMs for future tasks
+    next if ($ForFutureVM and $Candidate->Status =~ /^(?:idle|sleeping)$/);
 
-    $RevertingCount++;
-    $ActiveCount++ if ($VM->Status eq "off");
+    # Don't sacrifice more important VMs
+    next if (_CompareNeededVMs($NeededVMs, $CandidateKey, $VMKey) <= 0);
+
+    my $CandidateStatusPrio = $Candidate->Status eq "idle" ? 2 :
+                              $Candidate->Status eq "sleeping" ? 1 :
+                              0; # Status eq dirty
+    if ($Victim)
+    {
+      my $Cmp = $VictimStatusPrio <=> $CandidateStatusPrio ||
+                $NiceFuture->{$CandidateKey} <=> $NiceFuture->{$VictimKey};
+      next if ($Cmp <= 0);
+    }
+
+    $Victim = $Candidate;
+    $VictimKey = $CandidateKey;
+    $VictimStatusPrio = $CandidateStatusPrio;
   }
+  return undef if (!$Victim);
+
+  delete $Sched->{lambvms}->{$VictimKey};
+  $Sched->{busyvms}->{$VictimKey} = 1;
+  $Host->{$Victim->Status}--;
+  $Host->{dirty}++;
+  $Victim->RecordStatus($Sched->{records}, $Victim->Status eq "dirty" ? "dirty poweroff" : "dirty sacrifice");
+  $Victim->RunPowerOff();
+  return 1;
+}
 
-  # Prepare some VMs for the next step of the current jobs
-  foreach my $VMKey (@VMsNext)
+sub _RevertVMs($$)
+{
+  my ($Sched, $NeededVMs) = @_;
+
+  # Sort the VMs that tasks need by decreasing priority order and revert them
+  my @SortedNeededVMs = sort { _CompareNeededVMs($NeededVMs, $a, $b) } keys %{$NeededVMs};
+  foreach my $VMKey (@SortedNeededVMs)
   {
-    last if ($RevertingCount == $MaxReverts);
-    last if ($ActiveCount >= $MaxActiveVMs);
+    my $VM = $Sched->{VMs}->GetItem($VMKey);
+    my $VMStatus = $VM->Status;
+    next if ($VMStatus eq "idle");
 
-    my $VM = $HostVMs->GetItem($VMKey);
-    next if ($VM->Status ne "off");
+    # Check if the host has reached its reverting VMs limit
+    my $Host = _GetSchedHost($Sched, $VM);
+    next if ($Host->{reverting} >= _GetMaxReverts($Host));
 
-    my $ErrMessage = $VM->RunRevert();
-    return $ErrMessage if (defined $ErrMessage);
-    $RevertingCount++;
-    $ActiveCount++;
-  }
+    # Skip this VM if the previous step's tasks are not about to run yet
+    next if (_HasMissingDependencies($Sched, $NeededVMs, $VMKey));
+    next if (!_CanScheduleOnVM($Sched, $VM));
 
-  # Finally, if we are otherwise idle, prepare some VMs for future jobs
-  if ($ActiveCount == $IdleCount && $ActiveCount < $MaxVMsWhenIdle)
-  {
-    # Sort from most important to least important
-    my @SortedVMs = sort { $VMPriorities{$b} <=> $VMPriorities{$a} } keys %VMPriorities;
-    foreach my $VMKey (@SortedVMs)
+    my $NeedsSacrifice;
+    if (_GetNiceness($NeededVMs, $VMKey) >= $FUTURE_BASE)
     {
-      last if ($RevertingCount == $MaxReverts);
-      last if ($ActiveCount >= $MaxVMsWhenIdle);
-
-      my $VM = $HostVMs->GetItem($VMKey);
-      next if ($VM->Status ne "off");
+      if (!exists $Host->{isidle})
+      {
+        # Only start preparing VMs for future jobs on a host which is idle.
+        # FIXME As a proxy we currently check that the host only has idle VMs.
+        # This is a bad proxy because:
+        # - The host could still have pending tasks for a 'next step'. Once
+        #   those get closer to running, preparing those would be better than
+        #   preparing future VMs.
+        # - Checking there are no queued tasks on that host would be better
+        #   but this information is not available on a per-host basis.
+        # - Also the number of queued tasks includes tasks scheduled to run
+        #   on maintenance and retired/deleted VMs. Any such task would prevent
+        #   preparing future VMs for no good reason.
+        # - It forces the host to go through an extra poweroff during which we
+        #   lose track of which VM is 'hot'.
+        # - However on startup this helps ensure that we are not prevented
+        #   from preparing the best VM (e.g. build) just because it is still
+        #   being checked (i.e. marked dirty).
+        $Host->{isidle} = ($Host->{active} == $Host->{idle});
+      }
+      if (!$Host->{isidle} or $Host->{MaxVMsWhenIdle} == 0)
+      {
+        # The TestBot is busy or does not prepare VMs when idle
+        next;
+      }
+      # To not exceed the limit we must take into account VMs that are not yet
+      # idle but will soon be.
+      my $FutureIdle = $Host->{idle} + $Host->{reverting} + $Host->{sleeping} + ($VMStatus eq "off" ? 1 : 0);
+      $NeedsSacrifice = ($FutureIdle > $Host->{MaxVMsWhenIdle});
+    }
+    else
+    {
+      my $FutureActive = $Host->{active} + ($VMStatus eq "off" ? 1 : 0);
+      $NeedsSacrifice = ($FutureActive > $Host->{MaxActiveVMs});
+    }
 
+    if ($NeedsSacrifice)
+    {
+      # Find an active VM to sacrifice so we can revert this VM in the next
+      # scheduler round
+      last if (!_SacrificeVM($Sched, $NeededVMs, $VM));
+      delete $Sched->{lambvms}->{$VMKey};
+      # The $Host counters must account for the coming revert. This means
+      # active is unchanged: -1 for the sacrificed VM and +1 for the revert.
+      $Host->{reverting}++;
+    }
+    else
+    {
+      delete $Sched->{lambvms}->{$VMKey};
+      $Sched->{busyvms}->{$VMKey} = 1;
       my $ErrMessage = $VM->RunRevert();
       return $ErrMessage if (defined $ErrMessage);
-      $RevertingCount++;
-      $ActiveCount++;
+      $Host->{active}++ if ($VMStatus eq "off");
+      $Host->{reverting}++;
     }
   }
+  return undef;
+}
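The sacrifice decision above boils down to a couple of counter checks: reverting a powered off VM adds one active VM, so a sacrifice is needed whenever the resulting count would exceed the host's limit. A minimal standalone sketch (the `_WouldNeedSacrifice` helper and the plain-hash `$Host` structure are illustrative, not part of the patch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative helper: decide whether reverting a VM would push the host
# past its limits, so an active VM must be sacrificed first.
# $Host is a plain hash holding the scheduler's per-host counters.
sub _WouldNeedSacrifice
{
  my ($Host, $VMStatus, $PreparingIdleVMs) = @_;

  # Reverting a powered off VM adds one active VM to the host
  my $Extra = ($VMStatus eq "off" ? 1 : 0);

  if ($PreparingIdleVMs)
  {
    # Count the VMs that are not idle yet but soon will be
    my $FutureIdle = $Host->{idle} + $Host->{reverting} + $Host->{sleeping} + $Extra;
    return $FutureIdle > $Host->{MaxVMsWhenIdle};
  }
  my $FutureActive = $Host->{active} + $Extra;
  return $FutureActive > $Host->{MaxActiveVMs};
}

my $Host = {idle => 1, reverting => 1, sleeping => 0, active => 3,
            MaxVMsWhenIdle => 2, MaxActiveVMs => 3};
print _WouldNeedSacrifice($Host, "off", 1) ? "sacrifice\n" : "revert\n";
```

Note that when a sacrifice is needed the active count stays unchanged (-1 for the sacrificed VM, +1 for the coming revert), which is why only the reverting counter is incremented in that branch.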
 
+sub _PowerOffDirtyVMs($)
+{
+  my ($Sched) = @_;
+
+  # Power off any still dirty VM
+  foreach my $VMKey (keys %{$Sched->{lambvms}})
+  {
+    my $VM = $Sched->{VMs}->GetItem($VMKey);
+    next if ($VM->Status ne "dirty");
+
+    $VM->RecordStatus($Sched->{records}, "dirty poweroff");
+    my $ErrMessage = $VM->RunPowerOff();
+    return $ErrMessage if (defined $ErrMessage);
+  }
   return undef;
 }
 
@@ -744,106 +1118,92 @@ my $_LastTaskCounts = "";
 
 =item C<ScheduleJobs()>
 
-Goes through the WineTestBot hosts and schedules the Job tasks on each of
-them using WineTestBot::Jobs::ScheduleOnHost().
+Goes through the pending Jobs to run their queued Tasks. This implies preparing
+the VMs while staying within the VM hosts' resource limits. In particular this
+means taking the following constraints into account:
+
+=over
+
+=item *
+
+Jobs should be run in decreasing order of priority.
+
+=item *
+
+A Job's Steps must be run in sequential order.
+
+=item *
+
+A Step's Tasks can be run in parallel but only one Task can run on a given VM
+at a time. Also a VM must be prepared before it can run its Task; see the
+VM Statuses.
+
+=item *
+
+The number of active VMs on a host must not exceed $MaxActiveVMs. Any VM
+using resources counts as active, including those being reverted. This limit
+ensures the VM host has enough memory, CPU and I/O resources for all the
+active VMs. Note that it must be respected even if more than one hypervisor
+is running on the host.
+
+=item *
+
+The number of VMs being reverted on a host at any given time must not exceed
+$MaxRevertingVMs, or $MaxRevertsWhileRunningVMs if some VMs are currently
+running tests. This can be set to 1 in case the hypervisor gets confused when
+reverting too many VMs at once.
+
+=item *
+
+Once there are no more jobs to run, the scheduler can prepare up to
+$MaxVMsWhenIdle VMs (or $MaxActiveVMs if unset) for future jobs.
+Setting this to 0 minimizes the TestBot's resource usage when idle.
+It can also be set to a value greater than $MaxActiveVMs: only $MaxActiveVMs
+tasks will then run simultaneously, but the extra idle VMs are kept on standby
+so they are ready when their turn comes.
+
+=back
 
 =back
 =cut
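The revert concurrency constraint can be sketched as a small gate. This is a hypothetical helper, not code from the patch, assuming the per-host counters described above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative: pick the applicable revert limit and check whether one
# more revert can be started on this host. The limit is lower while
# tests are running so the reverts don't starve them of resources.
sub _CanStartRevert
{
  my ($Host) = @_;
  my $Limit = $Host->{running} > 0 ? $Host->{MaxRevertsWhileRunningVMs}
                                   : $Host->{MaxRevertingVMs};
  return $Host->{reverting} < $Limit;
}

my $Host = {running => 2, reverting => 1,
            MaxRevertsWhileRunningVMs => 1, MaxRevertingVMs => 3};
print _CanStartRevert($Host) ? "start revert\n" : "wait\n";
```

With two VMs running tests the stricter limit applies, so the example prints "wait" even though $MaxRevertingVMs alone would allow another revert.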
 
 sub ScheduleJobs()
 {
-  my $RecordGroups = CreateRecordGroups();
-  my $RecordGroup = $RecordGroups->Add();
-  my $Records = $RecordGroup->Records;
-  # Save the new RecordGroup now so its Id is lower than those of the groups
-  # created by the scripts called from the scheduler.
-  $RecordGroups->Save();
+  my ($ErrMessage, $Sched) = _CheckAndClassifyVMs();
+  return $ErrMessage if ($ErrMessage);
 
-  my $Jobs = CreateJobs();
-  $Jobs->AddFilter("Status", ["queued", "running"]);
-  my @SortedJobs = sort CompareJobPriority @{$Jobs->GetItems()};
-  # Note that even if there are no jobs to schedule
-  # we should check if there are VMs to revert
+  my $NeededVMs;
+  ($ErrMessage, $NeededVMs) = _ScheduleTasks($Sched);
+  return $ErrMessage if ($ErrMessage);
 
-  # Count the runnable and queued tasks for the record
-  my ($RunnableTasks, $QueuedTasks) = (0, 0);
-  foreach my $Job (@SortedJobs)
-  {
-    my $Steps = $Job->Steps;
-    $Steps->AddFilter("Status", ["queued", "running"]);
-    my @SortedSteps = sort { $a->No <=> $b->No } @{$Steps->GetItems()};
-    next if (!@SortedSteps);
+  $ErrMessage = _RevertVMs($Sched, $NeededVMs);
+  return $ErrMessage if ($ErrMessage);
 
-    my $Tasks = $SortedSteps[0]->Tasks;
-    $Tasks->AddFilter("Status", ["queued"]);
-    $RunnableTasks += @{$Tasks->GetItems()};
-
-    foreach my $Step (@SortedSteps)
-    {
-      my $Tasks = $Step->Tasks;
-      $Tasks->AddFilter("Status", ["queued"]);
-      $QueuedTasks += scalar(@{$Tasks->GetItems()});
-    }
-  }
-
-  my %Hosts;
-  my $VMs = CreateVMs($Jobs);
-  $VMs->FilterEnabledRole();
-  foreach my $VM (@{$VMs->GetItems()})
-  {
-    my $Host = $VM->GetHost();
-    $Hosts{$Host}->{$VM->VirtURI} = 1;
-  }
-
-  my @ErrMessages;
-  foreach my $Host (keys %Hosts)
-  {
-    my @HostHypervisors = keys %{$Hosts{$Host}};
-    my $HostErrMessage = ScheduleOnHost($Jobs, \@SortedJobs, \@HostHypervisors, $Records);
-    push @ErrMessages, $HostErrMessage if (defined $HostErrMessage);
-  }
+  $ErrMessage = _PowerOffDirtyVMs($Sched);
+  return $ErrMessage if ($ErrMessage);
 
   # Note that any VM Status or Role change will trigger ScheduleJobs() so this
-  # records all VM state changes.
-  $VMs = CreateVMs();
-  map { $_->RecordStatus($Records) } (@{$VMs->GetItems()});
-  if (@{$Records->GetItems()})
+  # records all not yet recorded VM state changes, even those not initiated by
+  # the scheduler.
+  map { $_->RecordStatus($Sched->{records}) } @{$Sched->{VMs}->GetItems()};
+
+  if (@{$Sched->{records}->GetItems()})
   {
     # FIXME Add the number of tasks scheduled to run on a maintenance, retired
     #       or deleted VM...
-    my $TaskCounts = "$RunnableTasks $QueuedTasks 0";
+    my $TaskCounts = "$Sched->{runnable} $Sched->{queued} 0";
     if ($TaskCounts ne $_LastTaskCounts)
     {
-      $Records->AddRecord('tasks', 'counters', $TaskCounts);
+      $Sched->{records}->AddRecord('tasks', 'counters', $TaskCounts);
       $_LastTaskCounts = $TaskCounts;
     }
-    $RecordGroups->Save();
+    $Sched->{recordgroups}->Save();
   }
   else
   {
-    $RecordGroups->DeleteItem($RecordGroup);
+    $Sched->{recordgroups}->DeleteItem($Sched->{recordgroup});
   }
 
-  return @ErrMessages ? join("\n", @ErrMessages) : undef;
-}
-
-=pod
-=over 12
-
-=item C<CheckJobs()>
-
-Goes through the list of Jobs and updates their status. As a side-effect this
-detects failed builds, dead child processes, etc.
-
-=back
-=cut
-
-sub CheckJobs()
-{
-  my $Jobs = CreateJobs();
-  $Jobs->AddFilter("Status", ["queued", "running"]);
-  map { $_->UpdateStatus(); } @{$Jobs->GetItems()};
-
   return undef;
 }
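The $_LastTaskCounts comparison above simply deduplicates consecutive identical counter records. A standalone sketch of that pattern (hypothetical names, not the patch's code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative: only record the "runnable queued bad" counters when they
# differ from the last recorded value, to avoid redundant records.
my $_LastCounts = "";
sub _CountsChanged
{
  my ($Counts) = @_;
  return 0 if ($Counts eq $_LastCounts);
  $_LastCounts = $Counts;
  return 1;
}

print _CountsChanged("3 5 0"), "\n";  # new value
print _CountsChanged("3 5 0"), "\n";  # unchanged, not recorded
print _CountsChanged("2 5 0"), "\n";  # changed again
```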
 
-- 
2.16.1