  1. Qt
  2. QTBUG-106795

QPdfDocument::getAllText missing characters

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: P3: Somewhat important
    • Fix Version/s: None
    • Affects Version/s: 6.3.2
    • Component/s: PDF
    • Labels: None
    • Platform/s: Windows

    Description

      QPdfDocument::getAllText does not return all characters; some are missing. Please check the attached out.pdf. The attached out.txt was generated with PyMuPDF (the fitz module):

      python3 -m fitz gettext -pages 1 out.pdf

      That works fine, but the result from QPdfDocument::getAllText is missing some characters. I put that result in the attached getAllText.txt. Here is a diff of one line:

      getAllText: 是一个共享 ,供 个 系 统 (如在计算 机之

      PyMuPDF: 接口是一个共享框架,供 个 系 统 (如在计算机和打印机之间
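      To make the gap easier to see, here is a minimal sketch that lists the runs of characters present in the PyMuPDF line but absent from the getAllText line. It uses only Python's standard difflib module and ignores whitespace, since the two extractors space the text differently. The helper names strip_ws and missing_runs are hypothetical and are not part of Qt or PyMuPDF.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def strip_ws(text: str) -> str:
    """Remove all whitespace so spacing differences are not reported."""
    return "".join(text.split())


def missing_runs(reference: str, extracted: str) -> list[str]:
    """Return runs of characters found in `reference` but not in `extracted`."""
    ref, ext = strip_ws(reference), strip_ws(extracted)
    matcher = SequenceMatcher(None, ext, ref, autojunk=False)
    runs = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both mean the reference has text at j1:j2
        # that the extracted string lacks at that position.
        if tag in ("insert", "replace"):
            runs.append(ref[j1:j2])
    return runs


# The two sample lines quoted in this description.
QT_LINE = "是一个共享 ,供 个 系 统 (如在计算 机之"
PYMUPDF_LINE = "接口是一个共享框架,供 个 系 统 (如在计算机和打印机之间"

if __name__ == "__main__":
    print(missing_runs(PYMUPDF_LINE, QT_LINE))
```

      The same function could be run over the full getAllText.txt and out.txt contents to list every dropped run on the page.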

      As the diff shows, many characters are missing. I think PDFium returned the wrong result, but Chrome handles this PDF correctly (copying text works fine, as it does in other PDF viewers). Maybe this is related to the Chromium version that Qt uses?

      Attachments

        1. 2022-09-20 22_08_33-test2.pdf - SumatraPDF.png
          223 kB
        2. getAllText.txt
          3 kB
        3. out.pdf
          98 kB
        4. out.txt
          4 kB

            People

              Assignee: Shawn Rutledge (srutledg)
              Reporter: zhao hongjian (zhaohongjian000)
              Votes: 0
              Watchers: 2

                Gerrit Reviews

                  There are no open Gerrit changes