If Coin can tell for sure that a build failed because of VM flakiness, it should spawn a new VM and re-run that specific workitem once more before failing the whole integration.
For example, see this log: 17 minutes after the workitem started, Coin killed it because of the 15-minute timeout. We can generalise the rule:
- If the agent kills the build because of a timeout error within (timeout + 5 minutes) of the start of the build, then spawn a new VM on a different host and re-run.
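To make the rule concrete, here is a minimal sketch of the check in Python. The function name, its parameters and the `GRACE` constant are hypothetical and are not part of Coin; the only inputs taken from this description are the 5-minute margin and the 15-minute timeout from the example log.

```python
from datetime import timedelta

# Assumption: the "5min" margin from the rule above, kept configurable.
GRACE = timedelta(minutes=5)


def looks_like_vm_flakiness(killed_by_timeout: bool,
                            elapsed: timedelta,
                            timeout: timedelta,
                            grace: timedelta = GRACE) -> bool:
    """Return True if the agent killed the build on a timeout error
    no later than timeout + grace after the build started."""
    return killed_by_timeout and elapsed <= timeout + grace
```

With the numbers from the log above (killed after 17 minutes, 15-minute timeout) the check matches, because 17 minutes is within 15 + 5 minutes. A build killed after 25 minutes would not match and would fail as usual.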
More criteria can be added in the future. Besides saving time and resources by avoiding re-running hundreds of workitems, this will also give us an automated way to recognize flakiness caused by CI factors rather than by code. For example:
- If the re-run workitem mentioned above succeeds, then save all details of the previously failed workitem in influx and mark it as "flaky CI". This will give us many datapoints to investigate.
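The overall flow could look roughly like the sketch below. Everything in it is illustrative: `Result`, `run_with_one_retry` and the two callbacks are hypothetical names, not Coin APIs. `run` stands in for "spawn a VM and execute the workitem, avoiding the given hosts", and `record_flaky` stands in for whatever writes the failed attempt to influx with the "flaky CI" mark.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    succeeded: bool
    vm_flaky: bool = False  # True if the failure matched a VM-flakiness rule
    host: str = ""          # host the VM ran on


def run_with_one_retry(run: Callable[[List[str]], Result],
                       record_flaky: Callable[[Result], None]) -> Result:
    """Run a workitem; on a suspected VM-flakiness failure, re-run it once
    on a different host before reporting the failure."""
    first = run([])  # first attempt: no hosts excluded
    if first.succeeded or not first.vm_flaky:
        return first  # success, or a failure we cannot blame on the VM

    second = run([first.host])  # exactly one retry, avoiding the first host
    if second.succeeded:
        # The retry passed, so the first failure is treated as "flaky CI".
        record_flaky(first)
    return second
```

If the retry also fails, its result is returned unchanged, so the integration fails as it does today and nothing is marked as flaky.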