Details
-
Task
-
Resolution: Unresolved
Description
InfluxDB on testresults still serves as the main database for displaying test results in Grafana (it provides data for the fastcheck, slowcheck, flaky, failed and blacklisted dashboards). While we move to PostgreSQL, we still use it and need to maintain it.
The InfluxDB instance hosts several logical databases, which can be listed with the "show databases" query. The Grafana dashboards use the coin and coin_extra databases, and probably other databases related to benchmarks.
> show databases
name: databases
name
_internal
demo
test
feature_system
qmlbench -> should be cleaned see comment
eriks_playground
qtest_benchmarks
coin
erik_test
creator_benchmarks
core_benchmarks
qt3dstudio_runtime_benchmarks
QtWaylandTests
qmlbench_NDA_devices
3dstudio
qtquick3dTests
coin_test
qmlbench_archive -> to be moved somewhere else
restage_statistics
restage_statistics2
coin_extra
coin_capacity
audun_playground
juho_test
qmlbench_boot2qt
jahelaak_personal
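The listing above can also be collected by a script, which makes it easier to track which databases have no known owner. The following is a minimal Python sketch, assuming an InfluxDB 1.x instance whose HTTP API returns the usual JSON shape for /query. The helper names and the base URL are hypothetical, and the "known" set only contains the two databases named above as Grafana sources.

```python
import json
from urllib.parse import urlencode


def show_databases_url(base_url):
    """Build the /query URL that runs SHOW DATABASES (InfluxDB 1.x HTTP API)."""
    return base_url.rstrip("/") + "/query?" + urlencode({"q": "SHOW DATABASES"})


def parse_database_names(payload):
    """Extract database names from a decoded /query JSON response."""
    names = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            names.extend(row[0] for row in series.get("values", []))
    return names


def split_by_known_use(names, known_in_use):
    """Separate databases with a known consumer from those of unknown status."""
    in_use = [name for name in names if name in known_in_use]
    unknown = [name for name in names if name not in known_in_use]
    return in_use, unknown


if __name__ == "__main__":
    # Hand-written sample in the assumed response shape; a real run would
    # fetch show_databases_url("http://<influx-host>:8086") instead.
    sample = json.loads(
        '{"results": [{"statement_id": 0, "series": [{"name": "databases",'
        ' "columns": ["name"], "values": [["_internal"], ["coin"],'
        ' ["coin_extra"], ["eriks_playground"]]}]}]}'
    )
    in_use, unknown = split_by_known_use(
        parse_database_names(sample), {"coin", "coin_extra"}
    )
    print("in use:", in_use)
    print("status unknown:", unknown)
```

Keeping the parsing separate from the network call means the classification can be checked against a saved response without touching the production instance.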
Recently the InfluxDB instance ran out of memory and kept resetting itself, and the dashboards displayed no data. This happens when we store too much data and InfluxDB consumes all available memory. To recover, we deleted data from before 01/01/2023; that work is tracked in https://bugreports.qt.io/browse/QTQAINFRA-6421. That ticket, however, covers only the 'coin' databases. We do not check how much data is stored in, or written to, the other databases. Some of them contain project benchmark data, and some contain private (individual) data. Ideally, private databases should be created on a separate InfluxDB instance, because a process we do not know about can write too much data and stop InfluxDB from working. We should identify who uses each database and remove the databases that are not used. Private databases should be moved to another InfluxDB instance.
1) Identify which logical databases on the testresults InfluxDB instance are in use, and which have an unknown status:
daniel.smith, ausutter
in use:
coin - coin tasks data
coin_extra - results of preprocessing of coin data, source of information for blacklisted, flaky summary dashboards
coin_capacity - in use - coin capacity data - retention policy of 31 days
not in use
2) We should also check which processes use InfluxDB (especially those issuing write queries). Requests are logged at:
/var/log/influxdb/influxd.log
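To find out who writes to which database, the log can be summarised per client. The following is a minimal Python sketch. It assumes that HTTP request logging is enabled and that the "[httpd]" lines follow the access-log layout commonly produced by InfluxDB 1.x (client, user, timestamp, request line, status, size, referer, user agent); the actual format on testresults should be checked first. The function name is hypothetical.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Assumed layout of an InfluxDB 1.x HTTP access-log line, for example:
# [httpd] 10.0.0.5 - user [12/Mar/2018:10:52:09 +0000] "POST /write?db=coin HTTP/1.1" 204 0 "-" "agent" ...
LINE_RE = re.compile(
    r'\[httpd\] (?P<client>\S+) \S+ (?P<user>\S+) \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_writes(lines):
    """Count POST /write requests per (client address, database, user agent)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or match.group("method") != "POST":
            continue  # not an HTTP access line, or not a write request
        target = urlsplit(match.group("target"))
        if not target.path.startswith("/write"):
            continue
        database = parse_qs(target.query).get("db", ["<unknown>"])[0]
        counts[(match.group("client"), database, match.group("agent"))] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the ticket; reading it may require elevated permissions.
    with open("/var/log/influxdb/influxd.log", errors="replace") as log:
        for (client, database, agent), n in count_writes(log).most_common(20):
            print(f"{n:8d}  {database:30s}  {client:20s}  {agent}")
```

Sorting by request count puts the heaviest writers first, which is where an unknown process filling up a private database would show up.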
3) We should create rules, a script, or a calendar event or task for regular archiving and deletion of data. The exact procedure was already created by jimis.
I suggest doing the next archival around Christmas 2024, when InfluxDB usage is lower than usual. We should do it in advance, before InfluxDB gets bloated and unusable. So far we have done it roughly once a year. The process is documented at https://bugreports.qt.io/browse/QTQAINFRA-3501.
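As an illustration of what a scripted cleanup could generate, here is a minimal Python sketch that only builds InfluxQL statements; it does not replace the documented procedure, and archiving must still happen before anything is deleted. It assumes InfluxDB 1.x InfluxQL syntax. The helper names and the policy name "one_month" are hypothetical; the 31-day duration mirrors the retention policy mentioned for coin_capacity.

```python
from datetime import datetime, timezone


def quote_ident(name):
    """Double-quote an InfluxQL identifier, escaping backslashes and quotes."""
    return '"' + name.replace("\\", "\\\\").replace('"', '\\"') + '"'


def delete_before_statement(cutoff):
    """Build a DELETE for all points older than `cutoff`.

    The statement applies to whichever database the query is sent to
    (the `db` parameter of the HTTP API or `USE` in the influx CLI).
    """
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"DELETE WHERE time < '{stamp}'"


def retention_policy_statement(database, policy, days):
    """Build a CREATE RETENTION POLICY statement that expires data after `days`."""
    return (
        f"CREATE RETENTION POLICY {quote_ident(policy)} "
        f"ON {quote_ident(database)} DURATION {int(days)}d REPLICATION 1"
    )


if __name__ == "__main__":
    # Cutoff used in the previous cleanup (QTQAINFRA-6421).
    print(delete_before_statement(datetime(2023, 1, 1, tzinfo=timezone.utc)))
    print(retention_policy_statement("coin_capacity", "one_month", 31))
```

A retention policy lets InfluxDB expire old points continuously, which could reduce the need for a yearly manual deletion on databases where archiving is not required.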
4) Monitoring InfluxDB availability: this can be done in many ways, but I believe it is the CI/QA team that should discover downtime, not developers.
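One simple option is a periodic probe of the HTTP API. The following is a minimal Python sketch, assuming an InfluxDB 1.x instance where the /ping endpoint answers with status 204 when the server is up. The function name and the URL in the usage example are hypothetical, and how the CI/QA team gets alerted (mail, chat, dashboard) is left open.

```python
import urllib.request


def influx_is_up(base_url, timeout=5.0, fetch=urllib.request.urlopen):
    """Return True if GET <base_url>/ping answers with HTTP 204.

    `fetch` is injectable so the check can be exercised without a server.
    Connection errors, timeouts and HTTP error statuses all count as down.
    """
    url = base_url.rstrip("/") + "/ping"
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 204
    except OSError:  # URLError, HTTPError and socket timeouts derive from OSError
        return False


if __name__ == "__main__":
    import sys

    # Hypothetical address; replace with the real testresults host and port.
    up = influx_is_up("http://localhost:8086")
    print("influx is up" if up else "influx is DOWN")
    sys.exit(0 if up else 1)
```

The non-zero exit code on failure lets a cron job or CI task treat downtime as a failed run and notify the team through its usual channel.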