Details

    • Change Request
    • Resolution: Done
    • P2: Important

    Description

      The InfluxDB server accumulates an ever-increasing number of TCP connections from Telegraf running on the VMs. These connections remain in the ESTABLISHED state and never time out.

      The connections look like the following, as reported by netstat -nto. Note that their timers are off, meaning that no keepalive timer is armed on them.

      Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
      tcp6       0      0 10.150.153.71:8086      10.225.250.116:49703    ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0 10.150.153.71:8086      10.225.249.75:49169     ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0 10.150.153.71:8086      10.225.252.224:61714    ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0 10.150.153.71:8086      10.225.249.191:49697    ESTABLISHED 21269/influxd        off (0.00/0/0)
      tcp6       0      0 10.150.153.71:8086      10.225.138.51:49710     ESTABLISHED 21269/influxd        off (0.00/0/0)
      

      There are now almost 112K (earlier counts: 63K, then 94K) supposedly ESTABLISHED connections. The number keeps increasing and is reset only when the influxd process goes down. This caused the server to crash once.
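
      To track this count over time, the netstat output can be filtered for ESTABLISHED connections whose timer is off. The following is only a minimal sketch in Go: the function names and the sample rows with 192.0.2.x addresses are made up for illustration, and the parser assumes the column layout shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleOutput mimics `netstat -nto` output. The rows are hypothetical
// (documentation addresses), not taken from the real server.
const sampleOutput = `Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp6 0 0 192.0.2.1:8086 192.0.2.10:49703 ESTABLISHED 21269/influxd off (0.00/0/0)
tcp6 0 0 192.0.2.1:8086 192.0.2.11:49169 ESTABLISHED 21269/influxd off (0.00/0/0)
tcp6 0 0 192.0.2.1:8086 192.0.2.12:61714 ESTABLISHED 21269/influxd keepalive (170.00/0/0)
`

// countUntimedEstablished counts the lines whose state is ESTABLISHED and
// whose timer column (some field after the state) reads "off".
func countUntimedEstablished(netstatOutput string) int {
	count := 0
	sc := bufio.NewScanner(strings.NewReader(netstatOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		for i, f := range fields {
			if f != "ESTABLISHED" {
				continue
			}
			// Look for the "off" timer marker after the state column.
			for _, g := range fields[i+1:] {
				if g == "off" {
					count++
					break
				}
			}
			break
		}
	}
	return count
}

// report formats the count for logging.
func report(netstatOutput string) string {
	return fmt.Sprintf("%d ESTABLISHED connections without a timer",
		countUntimedEstablished(netstatOutput))
}

func main() {
	fmt.Println(report(sampleOutput))
}
```

      In practice the input would be the real netstat output piped to the program instead of the embedded sample.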

      This is most likely happening because:

      • The OpenNebula VMs are hard-killed, so Telegraf never sends a TCP FIN
      • InfluxDB does not enable SO_KEEPALIVE on the listening socket, so it keeps waiting for data forever

      I see the following possible courses of action:

      • Gracefully kill Telegraf at the end of each build
        • Coin might have trouble finding the PID of Telegraf, given that it's not always started by Coin itself (provisioning scripts start Telegraf independently)
      • On the server, install libkeepalive and run InfluxDB with the following environment: LD_PRELOAD=/usr/lib/libkeepalive.so KEEPIDLE=180 KEEPINTVL=10 KEEPCNT=12
        • (EDIT: this has been tried and it does not seem to help)
      • Or, on the server, install libdontdie and run InfluxDB with the following environment: Environment=LD_PRELOAD=/usr/lib/libdontdie.so DD_TCP_KEEPALIVE_TIME=180 DD_TCP_KEEPALIVE_INTVL=10 DD_TCP_KEEPALIVE_PROBES=12
        • (EDIT: this has also been tried and it failed. It seems that Go makes system calls directly instead of going through the libc functions, so LD_PRELOAD has no effect.)
      • File a bug in InfluxDB to enable SO_KEEPALIVE on the accept()ed sockets

            People

              jimis Dimitrios Apostolou