Details
Bug
Resolution: Fixed
P2: Important
production
Description
Occasionally a batch of CI failures occurs at once: VMs report I/O errors, permission errors, freezes and similar symptoms. This appears to be caused by hosts whose I/O momentarily becomes stuck, for a reason that is not yet known.
When the issue happens, the affected hosts log messages such as "NFS: __nfs4_reclaim_open_state: Lock reclaim failed!"
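To check whether a host was affected during a given window, its kernel log can be searched for the message quoted above. The following is a minimal sketch only: the function names are hypothetical, and it assumes the kernel log has already been captured as text lines (for example from dmesg or journalctl -k output), which this report does not describe.

```python
# Exact kernel message quoted in this report.
MESSAGE = "NFS: __nfs4_reclaim_open_state: Lock reclaim failed!"


def count_reclaim_failures(log_lines):
    """Count the lines that contain the NFS lock reclaim failure message."""
    return sum(1 for line in log_lines if MESSAGE in line)


def first_reclaim_failure(log_lines):
    """Return the first matching line (without its newline), or None."""
    for line in log_lines:
        if MESSAGE in line:
            return line.rstrip("\n")
    return None
```

Both functions accept any iterable of strings, so an open log file can be passed directly. A non-zero count only indicates that the message was logged; it does not by itself confirm that a particular CI failure was caused by it.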
The host metrics show NFS problems at the same time:
https://inframetrics.intra.qt.io/d/h1lbJxcWz/detailed-host-data-for-performance-debugging?orgId=1&from=1685007032885&to=1685007251511&var-host=ace-fawn&var-interval=10s&var-ret_policy=autogen&var-net_interface=All&var-perops=WRITE
Another timestamp:
https://inframetrics.intra.qt.io/d/h1lbJxcWz/detailed-host-data-for-performance-debugging?orgId=1&from=1684831180825&to=1684831987823
The metrics also show that something on the host becomes blocked:
https://inframetrics.intra.qt.io/d/nOAsINNZz/telegraf-hosts?orgId=1&var-server=ace-fawn&var-inter=1s&from=1685007013920&to=1685007279879&viewPanel=28239