QTBUG-104493

Application performance drops if many threads are running while QProcess creates sub-processes on Unix


Details

    • Type: Bug
    • Resolution: Done
    • Priority: P2: Important
    • Fix Version/s: 6.4.0 RC1, 6.5.0 Beta1
    • Affects Version/s: 5.15.2, 6.3.1, 6.4
    • Component/s: Core: I/O
    • Labels: None
    • Commits: e1a787a76e (qt/qtbase/dev), e1a787a76e (qt/tqtc-qtbase/dev), 0c9fc4bfa7 (qt/qtbase/6.4), 0c9fc4bfa7 (qt/tqtc-qtbase/6.4), 29b2fe40d (dev), 65528387d (6.5), c5221f6be (dev), c5c771291 (dev), 7c4e271fe (dev), 9962441bf (6.7), f0e4f50fd (6.6)

    Description

      This was first detected in KWrite / Kate; see the kwrite-devel mailing list.

      Quoting Milian Wolff in message 3991116.kMKLVPbKup@milian-workstation:

      OK, this is apparently totally unrelated to git and kate. Thiago, do you
      happen to have an insight here maybe? Is it known that using QProcess can
      really badly influence the runtime behavior of malloc in other threads?

      Here's a small example to trigger this behavior already:

      https://invent.kde.org/-/snippets/2239
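
      (For readers without access to the snippet: a minimal sketch of what such a
      reproducer could look like, assuming it spawns one worker per core that churns
      malloc and, with --with-subprocess, also runs `ls /tmp` about a hundred times.
      The thread count, iteration counts and allocation sizes below are illustrative;
      the actual snippet may differ.)

      #include <QCoreApplication>
      #include <QProcess>
      #include <QThread>
      #include <cstdlib>
      #include <cstring>
      #include <vector>

      // Each worker keeps the allocator busy; with --with-subprocess it also
      // launches a trivial external process a hundred times.
      static void worker(bool withSubprocess)
      {
          for (int i = 0; i < 100; ++i) {
              // Allocate and touch fresh memory so malloc keeps taking page faults.
              for (int j = 0; j < 1000; ++j) {
                  void *p = std::malloc(64 * 1024);
                  std::memset(p, 0, 64 * 1024);
                  std::free(p);
              }
              if (withSubprocess) {
                  QProcess proc;
                  proc.start("ls", { "/tmp" });   // fast on its own, see the timing further down
                  proc.waitForFinished();
              }
          }
      }

      int main(int argc, char **argv)
      {
          QCoreApplication app(argc, argv);
          const bool withSubprocess = app.arguments().contains("--with-subprocess");

          std::vector<QThread *> threads;
          for (int i = 0; i < QThread::idealThreadCount(); ++i)
              threads.push_back(QThread::create(worker, withSubprocess));
          for (QThread *t : threads)
              t->start();
          for (QThread *t : threads) {
              t->wait();
              delete t;
          }
          return 0;
      }
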

      I have nproc == 24. Let's run this without any external processes:

      $ perf stat -r 5 ./slow-malloc 
       Performance counter stats for './slow-malloc' (5 runs):
      
                6,868.17 msec task-clock                #   12.781 CPUs utilized            ( +-  0.82% )
                  35,262      context-switches          #    5.078 K/sec                    ( +-  0.73% )
                   1,518      cpu-migrations            #  218.590 /sec                     ( +- 10.47% )
                 477,765      page-faults               #   68.797 K/sec                    ( +-  0.23% )
          27,414,859,033      cycles                    #    3.948 GHz                      ( +-  0.88% )  (84.46%)
           9,269,828,127      stalled-cycles-frontend   #   33.46% frontend cycles idle     ( +-  0.80% )  (84.58%)
           2,503,409,257      stalled-cycles-backend    #    9.04% backend cycles idle      ( +-  1.38% )  (82.85%)
          12,211,168,505      instructions              #    0.44  insn per cycle         
                                                        #    0.77  stalled cycles per insn  ( +-  0.26% )  (82.54%)
           2,699,403,475      branches                  #  388.710 M/sec                    ( +-  0.34% )  (82.99%)
               7,276,801      branch-misses             #    0.27% of all branches          ( +-  0.68% )  (84.56%)
      
                 0.53735 +- 0.00317 seconds time elapsed  ( +-  0.59% )
      

      So far so good. Now let's also run `ls /tmp`, which by itself is plenty fast:

      $ time ls /tmp
      
      real    0m0.006s
      user    0m0.000s
      sys     0m0.006s
      

      Doing that a hundred times per thread as in the example file above should only
      take ~600ms. But instead this is what I observe:

      $ perf stat -r 5 ./slow-malloc --with-subprocess
      
       Performance counter stats for './slow-malloc --with-subprocess' (5 runs):
      
               26,197.00 msec task-clock                #    4.373 CPUs utilized            ( +-  0.29% )
                 148,400      context-switches          #    5.669 K/sec                    ( +-  2.19% )
                  11,287      cpu-migrations            #  431.174 /sec                     ( +-  2.25% )
               1,559,820      page-faults               #   59.587 K/sec                    ( +-  0.22% )
          99,501,234,050      cycles                    #    3.801 GHz                      ( +-  0.15% )  (85.67%)
          30,922,803,968      stalled-cycles-frontend   #   31.18% frontend cycles idle     ( +-  0.17% )  (85.00%)
          21,809,486,987      stalled-cycles-backend    #   21.99% backend cycles idle      ( +-  0.74% )  (84.85%)
          62,524,522,174      instructions              #    0.63  insn per cycle         
                                                        #    0.49  stalled cycles per insn  ( +-  0.17% )  (84.84%)
          14,128,484,480      branches                  #  539.721 M/sec                    ( +-  0.27% )  (85.23%)
             114,841,497      branch-misses             #    0.82% of all branches          ( +-  0.26% )  (85.86%)
      
                  5.9904 +- 0.0258 seconds time elapsed  ( +-  0.43% )
      

      And perf off-CPU profiling with hotspot again shows the excessive wait time in
      rwsem_down_read_slowpath when _int_malloc hits the asm_exc_page_fault code.

      Any insight would be welcome, or suggestions on how to better handle this in
      user code.
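
      (That call chain is consistent with contention on the kernel's mmap lock:
      fork() holds it for writing while it duplicates the parent's page tables, and
      every page fault taken by _int_malloc in the other threads needs the same lock
      for reading. A purely illustrative, Qt-free sketch of the same pattern, plain
      fork()/exec() in one thread while the other threads keep faulting in newly
      malloc'ed pages, can be used to check whether the behaviour is specific to
      QProcess; it is not part of the original report.)

      #include <cstdlib>
      #include <cstring>
      #include <thread>
      #include <vector>
      #include <sys/wait.h>
      #include <unistd.h>

      // One thread fork()/exec()s repeatedly (fork copies the page tables with the
      // mmap lock held for writing) while the remaining threads keep faulting in
      // freshly malloc'ed pages (each fault takes the same lock for reading).
      // Iteration counts and sizes are arbitrary.
      int main()
      {
          std::vector<std::thread> threads;
          for (unsigned i = 1; i < std::thread::hardware_concurrency(); ++i) {
              threads.emplace_back([] {
                  for (int j = 0; j < 100000; ++j) {
                      void *p = std::malloc(64 * 1024);
                      std::memset(p, 0, 64 * 1024);   // touch the pages
                      std::free(p);
                  }
              });
          }

          for (int i = 0; i < 100; ++i) {
              pid_t pid = fork();
              if (pid == 0) {
                  execlp("ls", "ls", "/tmp", static_cast<char *>(nullptr));
                  _exit(127);
              }
              waitpid(pid, nullptr, 0);
          }

          for (std::thread &t : threads)
              t.join();
          return 0;
      }
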

    People

      Assignee: Thiago Macieira
      Reporter: Thiago Macieira