Details

Type: Change Request
Resolution: Done
Priority: P2: Important
Fix Version/s: None
Component/s: None
Labels: None

Description

The InfluxDB server is accumulating an ever-increasing number of TCP connections from Telegraf running on the VMs. These connections remain in the ESTABLISHED state and never time out.

The connections look like the following, as reported by netstat -nto. Note that the connections' timers are off, meaning no keepalive or retransmission timer is armed for them.

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp6       0      0 10.150.153.71:8086      10.225.250.116:49703    ESTABLISHED 21269/influxd        off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086      10.225.249.75:49169     ESTABLISHED 21269/influxd        off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086      10.225.252.224:61714    ESTABLISHED 21269/influxd        off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086      10.225.249.191:49697    ESTABLISHED 21269/influxd        off (0.00/0/0)
tcp6       0      0 10.150.153.71:8086      10.225.138.51:49710     ESTABLISHED 21269/influxd        off (0.00/0/0)
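To track how fast these connections pile up, the netstat output can be filtered for ESTABLISHED rows whose timer is off. The following is a minimal sketch, assuming the column layout shown above (state in the sixth column, timer name in the second-to-last column); the function name `count_untimed_established` is hypothetical and not part of any existing tool.

```python
def count_untimed_established(netstat_output: str) -> int:
    """Count ESTABLISHED TCP rows whose netstat timer column reads 'off'.

    Assumes output shaped like `netstat -nto` (optionally with the
    PID/Program name column), where the last two fields are the timer
    name and its parenthesized counters, e.g. `off (0.00/0/0)`.
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol (tcp or tcp6); skip headers and blanks.
        if len(fields) < 8 or not fields[0].startswith("tcp"):
            continue
        state = fields[5]
        timer = fields[-2]
        if state == "ESTABLISHED" and timer == "off":
            count += 1
    return count
```

Feeding it the captured output of netstat at regular intervals would give a rough growth curve for the leaked connections.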

There are now almost ~~63K~~ ~~94K~~ 112K supposedly ESTABLISHED connections. The number keeps increasing and is reset only when the influxd process goes down. This has already caused the server to crash once.

This is most likely happening because:

- The OpenNebula VMs are hard-killed, so Telegraf never sends a TCP FIN.
- InfluxDB does not enable SO_KEEPALIVE on the listening socket, so it keeps waiting for data forever.
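For illustration, this is roughly what enabling keepalive on an accepted connection looks like at the socket level. It is a sketch in Python, not InfluxDB's actual (Go) code; the helper names are hypothetical, the 180/10/12 defaults are simply the values tried elsewhere in this report, and the `TCP_KEEP*` options are platform-specific (Linux), hence the `hasattr` guards.

```python
import socket


def enable_keepalive(conn: socket.socket, idle: int = 180,
                     interval: int = 10, probes: int = 12) -> None:
    """Turn on TCP keepalive probes for one connection socket."""
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning; these constants exist on Linux but not everywhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def accept_with_keepalive(listener: socket.socket) -> socket.socket:
    """Accept one connection and enable keepalive on it before use."""
    conn, _peer = listener.accept()
    enable_keepalive(conn)
    return conn


def detection_time_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate time until a silently vanished peer is detected."""
    # First probe after `idle` seconds, then `probes` unanswered probes
    # spaced `interval` seconds apart before the kernel drops the connection.
    return idle + interval * probes
```

With the values above, a connection to a hard-killed VM would be dropped after roughly 180 + 10 * 12 = 300 seconds instead of lingering indefinitely.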

I see the following possible courses of action:

- Gracefully kill Telegraf at the end of each build
  - Coin might have trouble finding the PID of Telegraf, given that it's not always started by Coin itself (provisioning scripts start Telegraf independently)
- ~~On the server, install libkeepalive and run InfluxDB with the following environment:~~ LD_PRELOAD=/usr/lib/libkeepalive.so KEEPIDLE=180 KEEPINTVL=10 KEEPCNT=12
  - (EDIT: this has been tried and it does not seem to help)
- ~~OR On the server, install libdontdie and run InfluxDB with the following environment:~~ Environment=LD_PRELOAD=/usr/lib/libdontdie.so DD_TCP_KEEPALIVE_TIME=180 DD_TCP_KEEPALIVE_INTVL=10 DD_TCP_KEEPALIVE_PROBES=12
  - This has also been tried and failed! It seems that golang does not use libc functions to make its system calls, so LD_PRELOAD has no effect.
- File a bug in InfluxDB to enable SO_KEEPALIVE on the accept()ed sockets
  - InfluxDB issue #9248
  - File a pull request against InfluxDB
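The first option, terminating background processes gracefully, could be sketched as below. This only covers the case where the agent started the process itself and still holds a handle to it; it does not solve the PID-discovery problem for a Telegraf started by provisioning scripts. The function name `terminate_gracefully` and the 10-second timeout are assumptions for illustration, not Coin's actual implementation.

```python
import subprocess


def terminate_gracefully(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Ask a child process to exit, escalating to a hard kill if it hangs.

    On POSIX, terminate() sends SIGTERM. A process that exits while the
    VM is still up has its sockets closed by the kernel, which sends the
    FIN that the remote server is otherwise left waiting for.
    """
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process ignored the request; force it down and reap it.
        proc.kill()
        return proc.wait()
```

Calling this for every background process before the VM is destroyed gives the remote end a chance to see the connection close, independent of any keepalive settings on the server.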

Issue Links

resulted in: COIN-431 Coin agent should terminate the processes it starts in the background (Closed)

People

Assignee: Dimitrios Apostolou

Reporter: Dimitrios Apostolou

Votes: 0

Watchers: 2

Dates

Created: 16 Oct '19 13:14

Updated: 07 May '20 08:31

Resolved: 07 May '20 08:19