Details
- Bug
- Resolution: Won't Do
- P2: Important
- None
- unversioned
- None
Description
Definition of done: When a host goes down, it should reboot automatically without requiring any manual steps
Currently, a VM host crash is detected only through this process:
- VM host crashes
- After 5 hours, a developer sees "Timeout" in Coin
- The developer restages and is happy
- Developer 2 sees "Timeout" in Coin
- Developer 2 restages and is happy
- Developer 3 sees "Timeout" in Coin
- Developer 3 asks on IRC about the timeouts
- Someone from CI restarts the host
What should happen is:
- VM host crashes
- Coin detects that the host has crashed and restarts work items that were running on it (QTQAINFRA-1749)
- Automatic monitoring detects the host is down and restarts the host (QTQAINFRA-1754)
- Root cause of the problem is diagnosed/categorized and reported to CI operators (QTQAINFRA-1778)
Definition of done for this ticket: Automatic monitoring detects that the host is down and restarts it
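As a rough illustration of the monitoring step, here is a minimal Python sketch of a watchdog that restarts a host after several consecutive failed health checks. Nothing in it comes from Coin or the actual CI setup: the `HostWatchdog` class, the `is_alive` probe, the `restart` action, and the `max_failures` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class HostWatchdog:
    """Restart a host after it fails several consecutive health checks."""

    is_alive: Callable[[str], bool]  # hypothetical probe, e.g. a ping or TCP check
    restart: Callable[[str], None]   # hypothetical action, e.g. a power-cycle request
    max_failures: int = 3            # assumed threshold, not taken from the ticket
    failures: Dict[str, int] = field(default_factory=dict)

    def check(self, hosts: Iterable[str]) -> List[str]:
        """Probe each host once and return the hosts restarted in this pass."""
        restarted = []
        for host in hosts:
            if self.is_alive(host):
                # A healthy response clears any earlier failures.
                self.failures[host] = 0
                continue
            self.failures[host] = self.failures.get(host, 0) + 1
            if self.failures[host] >= self.max_failures:
                self.restart(host)
                # Start counting again after the restart.
                self.failures[host] = 0
                restarted.append(host)
        return restarted
```

Requiring several consecutive failures before restarting is one way to avoid rebooting a host over a single dropped probe. In this sketch, `check` would be called periodically (for example from a scheduler), and the returned list could feed whatever reporting the CI operators use.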
Issue Links
- relates to
  - QTQAINFRA-1778 VM host crash reasons should be automatically categorized (Closed)
  - QTQAINFRA-1749 Coin should monitor the state of the hosts in ONE (Closed)
  - COIN-157 Ideas how to make CI more reliable (Closed)