PYSIDE-1835 (Qt for Python)

Truncated multibyte unicode strings

Details

    • Type: Bug
    • Resolution: Done
    • Priority: P2: Important
    • Fix Version/s: 5.15.9
    • Affects Version/s: 5.15.2, 6.2.2
    • Component/s: Shiboken
    • Labels: None
    • Platform/s: Windows
    • Commits: 1b22fa59bf (pyside/tqtc-pyside-setup/5.15), 1b22fa59bf (pyside/tqtc-pyside-setup/tqtc/5.15)

    Description

      This appears to only occur if Py_UNICODE_WIDE is not defined.

      The PySide2 build that produces the error has the following preprocessor defines:

      Py_LIMITED_API is not defined.
      QT_VERSION is < QT_VERSION_CHECK(6,0,0)
      Py_UNICODE_WIDE is not defined

      The erroneous code is this:

      // Python to C++ conversions for type 'QString'.
      static void PyUnicode_PythonToCpp_QString(PyObject *pyIn, void *cppOut) {
          // ========================================================================
          // START of custom code block [file: ../glue/qtcore.cpp (conversion-pyunicode)]
          #ifndef Py_LIMITED_API
          Py_UNICODE *unicode = PyUnicode_AS_UNICODE(pyIn);
          #  if defined(Py_UNICODE_WIDE)
          // cast as Py_UNICODE can be a different type
          #    if QT_VERSION >= QT_VERSION_CHECK(6, 0, 0)
          *reinterpret_cast<::QString *>(cppOut) = QString::fromUcs4(reinterpret_cast<const char32_t *>(unicode));
          #    else
          *reinterpret_cast<::QString *>(cppOut) = QString::fromUcs4(reinterpret_cast<const uint *>(unicode));
          #    endif // Qt 6
          #  else // Py_UNICODE_WIDE
          #    if QT_VERSION >= QT_VERSION_CHECK(6, 0, 0)
          *reinterpret_cast<::QString *>(cppOut) = QString::fromUtf16(reinterpret_cast<const char16_t *>(unicode), PepUnicode_GetLength(pyIn));
          #    else
          *reinterpret_cast<::QString *>(cppOut) = QString::fromUtf16(reinterpret_cast<const ushort *>(unicode), PepUnicode_GetLength(pyIn));
          #    endif // Qt 6
          #  endif // Py_UNICODE_WIDE
          #else // Py_LIMITED_API
          wchar_t *temp = PyUnicode_AsWideCharString(pyIn, NULL);
          *reinterpret_cast<::QString *>(cppOut) = QString::fromWCharArray(temp);
          PyMem_Free(temp);
          #endif // Py_LIMITED_API
          // END of custom code block [file: ../glue/qtcore.cpp (conversion-pyunicode)]
          // ========================================================================
      }
      

      With the preprocessor defines listed above, the following branch is compiled and called:

      *reinterpret_cast<::QString *>(cppOut) = QString::fromUtf16(reinterpret_cast<const ushort *>(unicode), PepUnicode_GetLength(pyIn));
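The effect of passing a code point count where a code unit count is expected can be modeled without Qt or Python. This is a minimal sketch in standard C++: `copyUtf16` is a hypothetical stand-in for `QString::fromUtf16(ptr, size)`, and the emoji is assumed to be U+1F50E, which the issue text does not state.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for QString::fromUtf16(ptr, size): copies exactly
// `size` UTF-16 code units starting at `ptr`. Not Qt API.
std::u16string copyUtf16(const char16_t *ptr, std::size_t size)
{
    return std::u16string(ptr, size);
}
```

The literal `u"a\U0001F50Eb"` holds 3 code points but 4 code units, because the emoji is stored as a surrogate pair. `copyUtf16(u"a\U0001F50Eb", 3)` therefore returns the "a" and the emoji but drops the trailing "b", which matches the truncation described in this issue.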
      

      PLEASE NOTE: Qt's Jira apparently does not allow emoji in issue descriptions (it should be upgraded; other Jira instances do support this).
      Below, I refer to the emoji in question as right-angled-magnifying-glass-emoji.

      As far as I can tell, PepUnicode_GetLength(pyIn) returns the length in code points rather than in ushort (UTF-16) code units.

      Looking at the PySide2 code that defines PepUnicode_GetLength and reading the docs, I think the behavior is likely correct with Python 2, because PyUnicode_GetSize is called, which will "Return the size of the deprecated Py_UNICODE representation, in code units (this includes surrogate pairs as 2 units)."†
      The behavior is incorrect with Python 3, because PyUnicode_GetLength is called, which will "Return the length of the Unicode object, in code points."
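The two quoted definitions give different numbers for the same text whenever a surrogate pair is present. A small sketch in standard C++ (the helper name `utf16CodePointCount` is made up for illustration, and the emoji is assumed to be U+1F50E) compares the code point count with the code unit count of a UTF-16 buffer:

```cpp
#include <cstddef>
#include <string>

// Number of Unicode code points in a UTF-16 string. A high surrogate
// (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF) counts once.
std::size_t utf16CodePointCount(const std::u16string &s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        if (isHigh && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i; // skip the low surrogate; the pair is one code point
        ++count;
    }
    return count;
}
```

For `u"a\U0001F50Eb"` this returns 3, while `std::u16string::size()` reports 4. The first number corresponds to a length "in code points", the second to a size "in code units".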

      As I understand it, a code unit is the fixed-size storage unit of an encoding (a 16-bit ushort for UTF-16), while a code point is a single Unicode character value, which may be represented by one code unit or by several, as in the case of right-angled-magnifying-glass-emoji.
      This Wikipedia article provides definitions of "code unit" and "code point". For our purposes the two are not interchangeable: right-angled-magnifying-glass-emoji is one code point, but its UTF-16 representation takes two code units (a surrogate pair).
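To make the relationship between one code point and its UTF-16 code units concrete, here is a short sketch of the standard surrogate-pair encoding. The function name `encodeUtf16` is hypothetical, and the sketch performs no validation of its input.

```cpp
#include <string>

// Encodes a single code point as UTF-16 code units.
// No validation: the input is assumed to be a valid Unicode scalar value
// (not a surrogate, not above 0x10FFFF).
std::u16string encodeUtf16(char32_t codePoint)
{
    std::u16string out;
    if (codePoint < 0x10000) {
        // Basic Multilingual Plane: one code unit.
        out.push_back(static_cast<char16_t>(codePoint));
    } else {
        // Supplementary plane: split the 20-bit offset into a surrogate pair.
        const char32_t v = codePoint - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

Assuming the emoji is U+1F50E, `encodeUtf16(0x1F50E)` yields the two code units 0xD83D and 0xDD0E, whereas `encodeUtf16(U'A')` yields a single code unit.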

      That Wikipedia definition of code unit still leaves me somewhat confused, but it does resolve my earlier confusion: the Python docs show that PyUnicode_GET_SIZE and PyUnicode_GET_DATA_SIZE (which I thought dealt with byte lengths) are deprecated and replaced by PyUnicode_GET_LENGTH (which returns the length in code points).

      So I don't see any Python API that will return the length of a PyUnicode object in UTF-16 code units, which is what QString::fromUtf16 needs as its size argument.

      What I'm trying to sort out from all this: what would be the proper fix in PySide?
      The ../glue/qtcore.cpp (conversion-pyunicode) block does seem to contain a bug and needs changing; I just don't understand how to get the proper length to pass to QString::fromUtf16.
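One way to derive the size that a UTF-16 reader expects, assuming the text is available as a sequence of code points, is to add one extra unit for every code point above U+FFFF. The sketch below is only an illustration in standard C++ with a made-up helper name, not necessarily the change that was made in PySide. Another option might be the size out-parameter of PyUnicode_AsWideCharString, which the Py_LIMITED_API branch quoted above passes as NULL.

```cpp
#include <cstddef>
#include <string>

// UTF-16 code unit count for a sequence of code points: every code point
// above 0xFFFF needs a surrogate pair, i.e. one extra code unit.
std::size_t utf16Length(const std::u32string &codePoints)
{
    std::size_t units = codePoints.size();
    for (char32_t cp : codePoints) {
        if (cp > 0xFFFF)
            ++units;
    }
    return units;
}
```

For `U"a\U0001F50Eb"` (3 code points, emoji assumed to be U+1F50E) this returns 4, the value a call such as QString::fromUtf16 would need in order to read the whole string.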

      † This documentation is from Python 3 but applies to Python 2 as well; it describes the function better than Python 2's own documentation did.

      Attachments

        1. right-handed-magnifying-glass-emoji.png
          0.8 kB
        2. pyside-1835-shiboken-multibyte-emoji.py
          0.1 kB
        3. pyside1835.zip
          2 kB
        4. pyside1835.py
          0.7 kB
        5. pyside1835_diag2.diff
          0.6 kB
        6. pyside1835_diag.diff
          0.6 kB

            People

              kleint Friedemann Kleint
              kkyzivat Keith Kyzivat