Coin / COIN-787

In some cases Coin should retry in a new VM before giving up on the whole integration

    Details

    • Type: Suggestion
    • Status: Reported
    • Priority: Not Evaluated
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Other
    • Labels:

      Description

      If Coin can tell for sure that the build failed because of VM flakiness, it should spawn a new VM and re-run that specific workitem once before failing the whole integration.

      For example, see this log: 17 minutes after the workitem started, Coin killed it because of the 15-minute timeout. We can generalise this into a rule:

      • If the agent kills the build because of a timeout error no later than the timeout plus 5 minutes after the build started, then spawn a new VM on a different host and re-run the workitem.
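
The rule above could be sketched roughly as follows. This is only an illustration of the proposed check, not Coin code: the function name, its parameters and the `GRACE` constant are hypothetical, and the "retry at most once" guard reflects the "just once" wording of the description.

```python
from datetime import timedelta

# Hypothetical constant: the "5 min" slack from the proposed rule.
GRACE = timedelta(minutes=5)


def should_retry_in_new_vm(elapsed, timeout, killed_by_timeout, retries_so_far=0):
    """Return True if a failed workitem matches the proposed retry rule.

    elapsed:           time between workitem start and the kill
    timeout:           the timeout that triggered the kill
    killed_by_timeout: True only if the agent killed the build for a timeout
    retries_so_far:    how many times this workitem was already re-run
    """
    if not killed_by_timeout:
        return False  # only timeout kills are covered by this rule
    if retries_so_far >= 1:
        return False  # re-run once at most, then fail the integration
    return elapsed <= timeout + GRACE


# The case from the description: killed after 17 minutes, 15-minute timeout.
print(should_retry_in_new_vm(timedelta(minutes=17), timedelta(minutes=15), True))
```

Spawning the new VM on a different host is left out here, since it depends on how Coin schedules workitems.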

      More criteria can be added in the future. Besides saving time and resources by avoiding re-runs of hundreds of workitems, this would also give us an automated way to recognize flakiness caused by CI factors rather than by code. For example:

      • If the re-run workitem mentioned above succeeds, then save all details of the previously failed workitem in influx and mark it as "flaky CI". This will give us many datapoints to investigate.
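
A minimal sketch of that bookkeeping step, under stated assumptions: the `FailedAttempt` fields and the `flaky_ci_record` helper are hypothetical, and the actual write to influx is deliberately left out because the ticket does not say how Coin stores its data.

```python
from dataclasses import dataclass, asdict


@dataclass
class FailedAttempt:
    # Hypothetical fields; a real record would carry whatever Coin knows
    # about the failed workitem.
    workitem_id: str
    host: str
    elapsed_seconds: int
    failure_reason: str


def flaky_ci_record(first_attempt, rerun_succeeded):
    """Build the record to store when a re-run succeeds, else return None."""
    if not rerun_succeeded:
        return None  # the re-run failed too, so nothing points at the CI
    record = asdict(first_attempt)
    record["classification"] = "flaky CI"
    return record


# Placeholder values, for illustration only.
attempt = FailedAttempt("workitem-1", "host-a", 17 * 60, "timeout")
print(flaky_ci_record(attempt, rerun_succeeded=True))
```

Keeping every detail of the first attempt in the record is what makes the datapoints useful later, for example to spot hosts that fail more often than others.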

            People

            Assignee:
            hehalmet Heikki Halmet
            Reporter:
            jimis Dimitrios Apostolou
            Votes:
            0
            Watchers:
            2
