Coin / COIN-787

In some cases Coin should retry in a new VM before giving up on the whole integration


Details

    • Suggestion
    • Resolution: Unresolved
    • P2: Important
    • None
    • None
    • Other

    Description

      If Coin can tell for sure that a build failed because of VM flakiness, it should spawn a new VM and re-run that specific workitem once before failing the whole integration.

      For example, see this log: 17 minutes after the workitem started, Coin killed it because of the 15-minute timeout. We can generalise the rule:

      • If the agent kills the build because of a timeout error no later than the timeout plus 5 minutes after the build started, then spawn a new VM on a different host and re-run the workitem.
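
A minimal sketch of how this rule could be expressed, for discussion only. The function name, its parameters, and the `GRACE` constant are hypothetical and do not correspond to anything in Coin's code; the sketch also assumes Coin knows how long the workitem ran and whether the agent killed it on timeout.

```python
from datetime import timedelta

# Hypothetical constant: the "5min" margin from the proposed rule.
GRACE = timedelta(minutes=5)


def should_retry_in_new_vm(killed_by_timeout, elapsed, timeout,
                           retries_so_far=0, grace=GRACE):
    """Return True if the failed workitem qualifies for one re-run in a new VM.

    killed_by_timeout: the agent killed the build because of a timeout error.
    elapsed:           time between the start of the build and the kill.
    timeout:           the timeout that was in effect for the build.
    retries_so_far:    how many times this workitem was already re-run.
    """
    if not killed_by_timeout:
        return False
    if retries_so_far >= 1:
        # The proposal allows only a single re-run before failing the integration.
        return False
    # Killed no later than timeout + grace after the start of the build.
    return elapsed <= timeout + grace
```

With the numbers from the log above (killed after 17 minutes with a 15-minute timeout), `should_retry_in_new_vm(True, timedelta(minutes=17), timedelta(minutes=15))` evaluates to `True`, since 17 minutes is within 15 + 5 minutes.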

      More criteria can be added in the future. Besides saving time and resources by avoiding re-runs of hundreds of workitems, this will also give us an automated way to recognize flakiness that is caused by CI factors rather than by code. For example:

      • If the re-run of the workitem mentioned above succeeds, then save all details of the previously failed workitem in influx and mark it as "flaky CI". This will give us many datapoints to investigate.
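
A rough sketch of the "flaky CI" marking, again only as an illustration. The `WorkitemRun` fields and the `flaky_ci_record` helper are hypothetical, and the sketch stops at building the record: how Coin would actually write it to influx is not covered here.

```python
from dataclasses import dataclass, asdict


@dataclass
class WorkitemRun:
    # Hypothetical fields; the real workitem details in Coin may differ.
    workitem_id: str
    host: str
    succeeded: bool
    log_url: str


def flaky_ci_record(first, rerun):
    """Build a "flaky CI" datapoint when the first run failed and the re-run succeeded.

    Returns None when the pair of runs does not indicate CI flakiness.
    """
    if first.succeeded or not rerun.succeeded:
        return None
    record = asdict(first)  # keep all details of the previously failed run
    record["classification"] = "flaky CI"
    record["rerun_host"] = rerun.host
    return record
```

The returned dictionary could then be handed to whatever code stores datapoints, so that failed runs recovered by a re-run accumulate in one place for later investigation.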

            People

              jujokini Jukka Jokiniva
              jimis Dimitrios Apostolou
              Votes: 0
              Watchers: 2
