The InfluxDB server has an ever-increasing number of TCP connections from Telegraf on the VMs. These connections appear to stay in the ESTABLISHED state and never time out.
The connections look like the following, as reported by netstat -nto. Notice that the connections' timers are off.
There are now almost 112K (earlier counts: 63K, then 94K) supposedly ESTABLISHED connections. The number keeps increasing and is reset only when the influxd process goes down. This caused the server to crash once.
This is most likely happening because:
- The OpenNebula VMs are hard-killed, so Telegraf never sends a TCP FIN.
- InfluxDB does not enable SO_KEEPALIVE on the listening socket, so it keeps waiting for data forever.
I see the following courses of action on this:
- Gracefully kill Telegraf at the end of each build.
  - Coin might have trouble finding the PID of Telegraf, given that it is not always started by Coin itself (provisioning scripts start Telegraf independently).
- On the server, install libkeepalive and run InfluxDB with the following environment: LD_PRELOAD=/usr/lib/libkeepalive.so KEEPIDLE=180 KEEPINTVL=10 KEEPCNT=12
  - (EDIT: this has been tried and it does not seem to help)
- OR, on the server, install libdontdie and run InfluxDB with the following environment: Environment=LD_PRELOAD=/usr/lib/libdontdie.so DD_TCP_KEEPALIVE_TIME=180 DD_TCP_KEEPALIVE_INTVL=10 DD_TCP_KEEPALIVE_PROBES=12
  - This has also been tried and failed! It seems that Go does not use libc functions to make system calls, so LD_PRELOAD does not work.
- File a bug in InfluxDB to enable SO_KEEPALIVE on the accept()ed sockets.
  - InfluxDB issue #9248
  - File a pull request to InfluxDB.