Details

Type: Suggestion
Resolution: Unresolved
Priority: P2: Important
Fix Version/s: None
Affects Version/s: None
Component/s: Other
Labels:
- Flakiness

Description

If Coin can determine with certainty that a build failed because of VM flakiness, it should spawn a new VM and re-run that specific workitem once more before failing the whole integration.

For example, see this log: 17 minutes after the workitem started, Coin killed it because of the 15-minute timeout. The rule can be generalised:

If the agent kills the build because of a timeout error within (timeout + 5 minutes) of the start of the build, spawn a new VM on a different host and re-run the workitem.
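
The rule can be sketched as a small predicate. This is only an illustration under assumptions, not Coin's actual code: the function name `should_retry_on_new_vm`, its parameters, and the `already_retried` flag are hypothetical. Only the 5-minute margin and the "re-run once" limit come from this description.

```python
from datetime import timedelta

# Margin taken from the proposed rule ("5min+timeout").
GRACE = timedelta(minutes=5)


def should_retry_on_new_vm(killed_by_timeout: bool,
                           elapsed: timedelta,
                           timeout: timedelta,
                           already_retried: bool = False,
                           grace: timedelta = GRACE) -> bool:
    """Return True if a failed workitem qualifies for one re-run on a fresh VM.

    Hypothetical helper: every name here is an assumption for illustration.
    """
    if already_retried:
        # Re-run only once; a second failure fails the integration.
        return False
    if not killed_by_timeout:
        # Only timeout kills are covered by this first criterion.
        return False
    # Killed no later than timeout + grace after the build started.
    return elapsed <= timeout + grace
```

With the numbers from the log above, `should_retry_on_new_vm(True, timedelta(minutes=17), timedelta(minutes=15))` qualifies for a re-run, while a kill 25 minutes after the start does not.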

More criteria can be added in the future. Besides saving time and resources by not re-running hundreds of workitems, this would also give us an automated way to recognize flakiness caused by CI factors rather than by code. For example:

If the re-run workitem succeeds, save all details of the previously failed workitem in influx and mark it as "flaky CI". This will give us many data points to investigate.
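
A minimal sketch of the re-run-and-record flow, under stated assumptions: `Result`, `run_with_one_retry`, the `run` callback, and the record fields are all hypothetical names, and a plain list stands in for the influx store mentioned above. It does not reflect how Coin schedules VMs or writes metrics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Result:
    """Outcome of one workitem run (hypothetical structure)."""
    ok: bool
    host: str
    # True when the failure matches a "VM flakiness" criterion,
    # e.g. an early timeout kill.
    eligible_for_retry: bool = False


def run_with_one_retry(run: Callable[[Optional[str]], Result],
                       flaky_log: List[dict]) -> Result:
    """Run a workitem; on an eligible failure, re-run once on another host.

    `run(avoid_host)` stands in for spawning a VM and running the workitem.
    `flaky_log` stands in for the metrics store.
    """
    first = run(None)
    if first.ok or not first.eligible_for_retry:
        return first
    # Ask for a VM on a different host than the one that failed.
    second = run(first.host)
    if second.ok:
        # The re-run passed, so the first failure is attributed to CI.
        flaky_log.append({"classification": "flaky CI",
                          "failed_host": first.host,
                          "retry_host": second.host})
    return second
```

If the re-run fails as well, nothing is recorded as flaky and the second result is returned, so the integration fails as it does today.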

Issue Links

relates to

COIN-904 Integration should not be returned to gerrit because of CI internal failure (Open)

COIN-890 Retry workitem if it was cancelled because of timeout (Closed)

People

Assignee: Jukka Jokiniva

Reporter: Dimitrios Apostolou

Votes: 0

Watchers: 3

Dates

Created: 17 Dec '21 04:12

Updated: 09 Sep '22 13:20
