• Type: Change Request
    • Status: Closed
    • Priority: P2: Important
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:


      The InfluxDB server has an ever-increasing number of TCP connections from Telegraf on the VMs. These connections appear to remain in the ESTABLISHED state and never time out.

      The connections look like the following, as reported by netstat -nto. Notice that the connections' timers are off.

      Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
      tcp6       0      0    ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0     ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0    ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0    ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0     ESTABLISHED 21269/influxd        off (0.00/0/0)

      There are now almost 112K (previously 63K, then 94K) supposedly ESTABLISHED connections. The number keeps increasing and is reset only when the influxd process goes down. This caused the server to crash once.

      This is most likely happening because:

      • The OpenNebula VMs are hard-killed, so Telegraf never sends a TCP FIN.
      • InfluxDB does not enable SO_KEEPALIVE on the listening socket, so it keeps waiting for data forever.

      I see the following courses of action on this:

      • Gracefully kill Telegraf at the end of each build
        • Coin might have trouble finding the PID of Telegraf, given that it's not always started by Coin itself (provisioning scripts start Telegraf independently)
      • On the server, install libkeepalive and run InfluxDB with the following environment: LD_PRELOAD=/usr/lib/ KEEPIDLE=180 KEEPINTVL=10 KEEPCNT=12
        • (EDIT: this has been tried and it does not seem to help)
      • Alternatively, on the server, install libdontdie and run InfluxDB with the following environment: Environment=LD_PRELOAD=/usr/lib/ DD_TCP_KEEPALIVE_TIME=180 DD_TCP_KEEPALIVE_INTVL=10 DD_TCP_KEEPALIVE_PROBES=12
        • (EDIT: this has also been tried and failed. It seems that Go does not use libc functions to make its system calls, so LD_PRELOAD has no effect.)
      • File a bug in InfluxDB to enable SO_KEEPALIVE on the accept()ed sockets





              jimis Dimitrios Apostolou
              jimis Dimitrios Apostolou


