Monitor all kinds of health statistics for all our build and test VMs. Requirements:
- Install a monitoring utility on all of our Tier2 images
- Telegraf? It is already in use on the host machines.
- Must be able to run custom monitoring commands at custom intervals, for example "ioping" against a chosen directory to measure I/O latency.
- Send all statistics to a remote database
- InfluxDB most likely, as it's already used for recording the host machines' metrics
- Make sure the VMs don't cache any metrics but send them immediately. The build VMs are short-lived by definition: they can be killed the moment something goes wrong, and we definitely don't want to miss those metrics.
- Data retention on the database is of secondary importance; it's OK to delete data after a month, or even after only a week.
- We'll most likely need to assign a unique hostname to each build VM in Coin.
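
To make the requirements above concrete, here is a minimal Python sketch of the intended behaviour: run "ioping" on a directory at a fixed interval, tag each sample with the VM's hostname, and send it straight to the database without local buffering. It is an illustration only, not a proposed implementation; Telegraf would normally take this role. Assumptions not stated in the requirements: the database accepts InfluxDB line protocol over an InfluxDB 1.x-style HTTP /write endpoint, "ioping -B" prints whitespace-separated raw statistics with min/avg/max request time in fields 5 to 7 (units depend on the ioping version, so check the man page), and all function names, the measurement name and the URL are hypothetical.

```python
import socket
import subprocess
import time
import urllib.request


def parse_ioping_raw(raw: str) -> dict:
    """Extract latency figures from `ioping -B` raw output.

    Assumption: fields 5, 6 and 7 (1-based) are the min, avg and max
    request time. Verify against the ioping man page for your version.
    """
    fields = raw.split()
    return {
        "min": float(fields[4]),
        "avg": float(fields[5]),
        "max": float(fields[6]),
    }


def escape_tag(value: str) -> str:
    """Escape characters that are special in line protocol tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement: str, tags: dict, fields: dict,
                     timestamp_ns: int) -> str:
    """Format one sample as a single InfluxDB line protocol line."""
    tag_part = "".join(
        f",{key}={escape_tag(val)}" for key, val in sorted(tags.items())
    )
    field_part = ",".join(f"{key}={val}" for key, val in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def send_now(line: str, url: str) -> int:
    """POST one line immediately; nothing is queued on the VM."""
    request = urllib.request.Request(url, data=line.encode("utf-8"),
                                     method="POST")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def sample_ioping(directory: str, count: int = 3) -> dict:
    """Run ioping quietly in batch mode and parse its raw statistics."""
    result = subprocess.run(
        ["ioping", "-c", str(count), "-q", "-B", directory],
        capture_output=True, text=True, check=True,
    )
    return parse_ioping_raw(result.stdout)


def run_forever(directory: str, url: str, interval_s: float = 10.0) -> None:
    """Sample and send in a loop; a failed send is dropped, not buffered."""
    host = socket.gethostname()  # must be unique per build VM to be useful
    while True:
        fields = sample_ioping(directory)
        line = to_line_protocol("ioping", {"host": host, "path": directory},
                                fields, time.time_ns())
        try:
            send_now(line, url)
        except OSError:
            pass  # database unreachable: skip this sample
        time.sleep(interval_s)
```

A call such as run_forever("/home/qt/work", "http://influx.example:8086/write?db=vm_metrics") would start the loop; both the path and the URL are placeholders. Because the hostname is the only tag that identifies the VM, samples from different build VMs can only be told apart if each VM has a unique hostname.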