Details
- Change Request
- Resolution: Done
- P2: Important
Description
The InfluxDB server has an ever-increasing number of TCP connections from Telegraf on the VMs. These connections seem to remain in the ESTABLISHED state and never time out.
The following connections, as reported by netstat -nto, are typical. Notice that the connections' timers are off.
Proto Recv-Q Send-Q Local Address       Foreign Address       State       PID/Program name Timer
tcp6       0      0 10.150.153.71:8086  10.225.250.116:49703  ESTABLISHED 21269/influxd    off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086  10.225.249.75:49169   ESTABLISHED 21269/influxd    off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086  10.225.252.224:61714  ESTABLISHED 21269/influxd    off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086  10.225.249.191:49697  ESTABLISHED 21269/influxd    off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086  10.225.138.51:49710   ESTABLISHED 21269/influxd    off (0.00/0/0)
There are now almost 112K supposedly ESTABLISHED connections (up from earlier counts of 63K and 94K), and the number keeps increasing; it resets only when the influxd process goes down. This has caused the server to crash once.
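To track the count over time, the netstat output can be filtered for ESTABLISHED connections on the InfluxDB port whose timer is off. The following is a minimal sketch, assuming the column layout shown in the sample above; the function name `count_timerless_established` and the default port 8086 (taken from the sample) are illustrative choices, not part of any existing tooling.

```python
def count_timerless_established(netstat_output, port=8086):
    """Count ESTABLISHED TCP connections on `port` whose timer column is 'off'.

    Assumes whitespace-separated netstat lines in the layout shown in the
    sample: proto, recv-q, send-q, local, foreign, state, [pid/program], timer.
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP row.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        # The timer column follows the state (and the PID column, if present).
        timer_is_off = "off" in fields[6:]
        if local.endswith(f":{port}") and state == "ESTABLISHED" and timer_is_off:
            count += 1
    return count
```

Feeding it the captured netstat text from a periodic job would give one number per run, which makes the growth rate easier to see than reading raw output.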
This is most likely happening because:
- The OpenNebula VMs are hard-killed so a TCP FIN is never sent from Telegraf
- InfluxDB does not enable SO_KEEPALIVE on the listening socket so it keeps waiting for data, forever.
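For illustration, this is roughly what enabling keepalive on the server side looks like at the socket level. It is a Python sketch of the general technique, not InfluxDB's code (InfluxDB is written in Go); the helper names and the 180/10/12 values are assumptions chosen for the example. With keepalive on, the kernel probes an idle peer and drops the connection when the probes go unanswered, so a vanished VM no longer leaves a socket in ESTABLISHED forever.

```python
import socket


def enable_keepalive(conn, idle=180, interval=10, probes=12):
    """Turn on TCP keepalive for one connected socket (illustrative values)."""
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below use Linux names; skip any the platform lacks.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the kernel gives up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            conn.setsockopt(socket.IPPROTO_TCP, option, value)


def accept_with_keepalive(listener):
    """Accept one connection and enable keepalive on it before returning."""
    conn, addr = listener.accept()
    enable_keepalive(conn)
    return conn, addr
```

With these example values, a dead peer would be detected after roughly 180 + 10 * 12 = 300 seconds of silence.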
I see the following courses of action on this:
- Gracefully kill Telegraf at the end of each build
  - Coin might have trouble finding the PID of Telegraf, given that it's not always started by Coin itself (provisioning scripts start Telegraf independently)
- On the server, install libkeepalive and run InfluxDB with the following environment: LD_PRELOAD=/usr/lib/libkeepalive.so KEEPIDLE=180 KEEPINTVL=10 KEEPCNT=12
  - (EDIT: this has been tried and it does not seem to help)
- OR: on the server, install libdontdie and run InfluxDB with the following environment: Environment=LD_PRELOAD=/usr/lib/libdontdie.so DD_TCP_KEEPALIVE_TIME=180 DD_TCP_KEEPALIVE_INTVL=10 DD_TCP_KEEPALIVE_PROBES=12
  - This has also been tried and failed. It seems that Go does not use libc functions to make system calls, so LD_PRELOAD has no effect.
- File a bug in InfluxDB to enable SO_KEEPALIVE on the accept()ed sockets
  - InfluxDB issue #9248
- File a pull request to InfluxDB
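The first option, stopping Telegraf gracefully before the VM is destroyed, could look something like the sketch below. It assumes the caller holds a `subprocess.Popen` handle for the process, which, as noted in the list, is not always true when provisioning scripts start Telegraf; `stop_gracefully` and its timeout are hypothetical names and values, not Coin's actual implementation. Once the process exits while the VM is still running, the kernel closes its sockets and the server sees the connection end.

```python
import subprocess


def stop_gracefully(proc, timeout=30.0):
    """Ask a child process to exit, escalating to a hard kill if it does not.

    Returns the process's exit code.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX: lets the process shut down cleanly
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort if the process ignores the request
        return proc.wait()
```

The escalation step keeps a hung agent from blocking the end of a build indefinitely.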
Issue Links
- resulted in: COIN-431 Coin agent should terminate the processes it starts in the background (Closed)