  1. Qt
  2. QTBUG-135129

QXmlStreamReader::addData(QASV) overload unconditionally converts UTF-16 and L1 to UTF-8

Details

    • Bug
    • Resolution: Unresolved
    • P2: Important
    • None
    • 6.5, 6.6, 6.7, 6.8, 6.9.0 RC
    • None
    • 8
    • Foundation Sprint 128

    Description

      The addData(QAnyStringView) overload unconditionally converts Latin1 and UTF-16 strings to UTF-8. This is fine if it is the first data added to the reader, or if the previously added data was also UTF-8 (or was converted to it).

      However, consider the case where the reader was first given raw data (via the constructor taking a QByteArray or the addData(QByteArray) overload), and that raw data had a proper XML prolog with the "encoding" attribute set. The specified encoding could be Latin1 or UTF-16, in which case the parser internally uses the respective decoder.

      After that, appending a string in the matching encoding should simply work. In practice, however, it results in corrupted data (for Latin1) or even a parsing error (for UTF-16), because the appended data is unconditionally converted to UTF-8.

      This can be illustrated by a simple example:

      // Assumes <QXmlStreamReader>, <QDebug> and `using namespace Qt::StringLiterals;`
      const QString utf16Str = u"<?xml version=\"1.0\" encoding=\"utf-16\"?>"
                                "<foo><a>Some Data</a>"_s;
      // the same text as a raw byte array (2 bytes per UTF-16 code unit)
      const QByteArray utf16Data{
          reinterpret_cast<const char *>(utf16Str.utf16()),
          utf16Str.size() * 2
      };

      QXmlStreamReader reader(utf16Data);
      bool res = reader.readNextStartElement(); // Read <foo>, OK
      res = reader.readNextStartElement(); // Read <a>, OK
      QString text = reader.readElementText();

      // append more UTF-16 data
      reader.addData(u"<a>Other Data</a>"_s);
      res = reader.readNextStartElement(); // Read <a>, FAIL!
      qDebug() << reader.errorString(); // "Premature end of document."

      The problem is that the newly added data is converted to UTF-8 and therefore occupies half as many bytes as the UTF-16 decoder expects.
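      The size mismatch can be checked without Qt. The following is a minimal standard C++17 sketch, not Qt's actual conversion code; the helper names utf16ByteCount and utf8ByteCount are hypothetical and exist only for this illustration. It counts how many bytes the appended fragment from the example occupies in each encoding.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: bytes occupied when the text is stored as raw
// UTF-16 code units (what a UTF-16 decoder expects to receive).
constexpr std::size_t utf16ByteCount(std::u16string_view s)
{
    return s.size() * sizeof(char16_t);
}

// Hypothetical helper: bytes the same text occupies after conversion to UTF-8.
constexpr std::size_t utf8ByteCount(std::u16string_view s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c < 0x80) {
            n += 1;                      // ASCII
        } else if (c < 0x800) {
            n += 2;
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
                   && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            n += 4;                      // surrogate pair: one 4-byte sequence
            ++i;
        } else {
            n += 3;
        }
    }
    return n;
}

// The fragment appended in the example above.
constexpr std::u16string_view fragment = u"<a>Other Data</a>";

static_assert(utf16ByteCount(fragment) == 34);
static_assert(utf8ByteCount(fragment) == 17);
```

      For this ASCII-only fragment, the UTF-8 form is exactly half the size of the UTF-16 form, which is consistent with the "Premature end of document." error reported above.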

      For the Latin1 + Latin1 case the parsing will work, but non-ASCII characters will be read incorrectly.
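      The Latin1 corruption can be reproduced in the same spirit. This is a minimal standard C++ sketch under the assumption that the reader keeps decoding appended bytes as Latin1; the function names latin1ToUtf8, decodeAsLatin1 and demonstrateMojibake are hypothetical and are not Qt API. It converts Latin1 text to UTF-8 and then decodes the result one byte per character, as a Latin1 decoder does.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: encode Latin-1 text (one byte per character) as UTF-8.
std::string latin1ToUtf8(std::string_view latin1)
{
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Hypothetical helper: decode bytes the way a Latin-1 decoder does,
// treating every byte as exactly one code point.
std::u16string decodeAsLatin1(std::string_view bytes)
{
    std::u16string out;
    for (unsigned char c : bytes)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// "caf\xE9" is "café" in Latin-1. After the UTF-8 round trip through a
// Latin-1 decoder, the single character U+00E9 turns into U+00C3 U+00A9.
void demonstrateMojibake()
{
    const std::u16string decoded = decodeAsLatin1(latin1ToUtf8("caf\xE9"));
    assert(decoded == u"caf\u00C3\u00A9");
}
```

      ASCII characters survive unchanged, which is why parsing still succeeds; only characters above 0x7F are misread.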

            People

              ivan.solovev Ivan Solovev
              ivan.solovev Ivan Solovev
              Vladimir Minenko Vladimir Minenko
              Alex Blasche Alex Blasche