Details
Type: Task
Resolution: Unresolved
Priority: P2: Important
Description
Our current metrics collection for coin focuses on macro statistics and OS-reported data. This has sometimes helped us monitor overall system health, but it gives little insight into coin internals, and the system still hiccups on unknown processes for unclear reasons.
To help with this, we can wrap a number of functions in coin, in both the server and the agent, so that they report metrics through Prometheus: execution time, intermittent action failure counts, RPC latency, and so on. A Prometheus server collects (scrapes) these statistics from the instrumented processes and stores them as time-series data, exposing this low-level behavior over time.
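A minimal Python sketch of the wrapping idea follows. It is an illustration under stated assumptions, not the planned design: the `Metrics` class, the `timed` decorator, the metric names `coin_function_duration_seconds` and `coin_function_failures_total`, the bucket bounds, and the `dispatch_job` function are all hypothetical. It uses only the standard library so it runs stand-alone; a real module would more likely build on the official `prometheus_client` package (`Histogram`, `Counter`).

```python
import functools
import threading
import time
from collections import defaultdict

# Hypothetical execution-time bucket bounds, in seconds.
DEFAULT_BUCKETS = (0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0)


class Metrics:
    """Minimal thread-safe registry; prometheus_client would normally fill this role."""

    def __init__(self, buckets=DEFAULT_BUCKETS):
        self._lock = threading.Lock()
        self.buckets = tuple(sorted(buckets))
        # Per function: cumulative bucket counts, total seconds, call count, failure count.
        self._bucket_counts = defaultdict(lambda: [0] * len(self.buckets))
        self._sum = defaultdict(float)
        self._count = defaultdict(int)
        self._failures = defaultdict(int)

    def observe(self, name, seconds):
        with self._lock:
            counts = self._bucket_counts[name]
            for i, bound in enumerate(self.buckets):
                if seconds <= bound:  # Prometheus buckets are cumulative ("less than or equal")
                    counts[i] += 1
            self._sum[name] += seconds
            self._count[name] += 1

    def timed(self, func):
        """Decorator: time every call of func and count the calls that raise."""
        name = func.__name__  # used as the "func" label value

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                with self._lock:
                    self._failures[name] += 1
                raise  # re-raise so callers still see the original error
            finally:
                self.observe(name, time.perf_counter() - start)

        return wrapper

    def render(self):
        """Return all metrics in the Prometheus text exposition format."""
        metric = "coin_function_duration_seconds"
        lines = [
            f"# HELP {metric} Execution time of wrapped coin functions.",
            f"# TYPE {metric} histogram",
        ]
        with self._lock:
            for name in sorted(self._count):
                for bound, n in zip(self.buckets, self._bucket_counts[name]):
                    lines.append(f'{metric}_bucket{{func="{name}",le="{bound}"}} {n}')
                lines.append(f'{metric}_bucket{{func="{name}",le="+Inf"}} {self._count[name]}')
                lines.append(f'{metric}_sum{{func="{name}"}} {self._sum[name]}')
                lines.append(f'{metric}_count{{func="{name}"}} {self._count[name]}')
            lines.append("# HELP coin_function_failures_total Wrapped calls that raised.")
            lines.append("# TYPE coin_function_failures_total counter")
            for name in sorted(self._failures):
                lines.append(f'coin_function_failures_total{{func="{name}"}} {self._failures[name]}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = Metrics()

    @metrics.timed
    def dispatch_job(fail=False):  # hypothetical coin server function
        if fail:
            raise RuntimeError("agent did not respond")
        return "ok"

    dispatch_job()
    try:
        dispatch_job(fail=True)
    except RuntimeError:
        pass
    print(metrics.render(), end="")
```

The same decorator could wrap agent-side RPC client calls to capture RPC latency. Note that the instrumented processes do not push to Prometheus: the Prometheus server periodically scrapes an HTTP endpoint (conventionally `/metrics`) that returns text like `render()` produces. With `prometheus_client`, `start_http_server(port)` exposes such an endpoint.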
This task requires installing new software and creating new VMs; that work will be tracked in other tickets. This ticket covers only the development of a metrics module for coin, in Golang and Python as needed.