Details

Type: Bug
Resolution: Fixed
Priority: P2: Important
Fix Version/s: 6.5.9, 6.8.4, 6.9.1, 6.10.0 FF
Affects Version/s: 6.5, 6.6, 6.7, 6.8, 6.9.0 RC
Component/s: XML: Stream Reader/Writer
Labels:
None

Story Points:
8
Commits:
046b8523f (dev), 0d2fd73c2 (6.9), b6b725aef (dev), 49d13204f (6.8), 3bcf704e3 (tqtc/lts-6.5), ec8f043ec (6.9), f58657efc (6.8), 692a35624 (tqtc/lts-6.5)
Sprint:
Foundation Sprint 128, Foundation Sprint 129

Description

The addData(QAnyStringView) overload unconditionally converts Latin1 and UTF-16 strings to UTF-8. This is fine if it is the first data added to the reader, or if the previous data was also UTF-8 encoded (or was converted to UTF-8).

However, consider the case where raw data was provided first (via the constructor taking a QByteArray or the addData(QByteArray) overload), and this raw data has a proper XML prolog with an "encoding" attribute set. If the specified encoding is Latin1 or UTF-16, the parser internally uses the respective decoder.

After that, appending a string in the matching encoding should just work. In practice, however, it results in corrupted data (for Latin1) or even a parsing error (for UTF-16), because the appended data is unconditionally converted to UTF-8.

This can be illustrated by a simple example:

// Assumes: using namespace Qt::StringLiterals; (needed for the _s suffix)
const QString utf16Str = u"<?xml version=\"1.0\" encoding=\"utf-16\"?>"
                          "<foo><a>Some Data</a>"_s;
// as byte array
const QByteArray utf16Data{
    reinterpret_cast<const char *>(utf16Str.utf16()),
    utf16Str.size() * 2
};

QXmlStreamReader reader(utf16Data);
bool res = reader.readNextStartElement(); // Read <foo>, OK
res = reader.readNextStartElement(); // Read <a>, OK
QString text = reader.readElementText();

// append more UTF-16 data
reader.addData(u"<a>Other Data</a>"_s);
res = reader.readNextStartElement(); // Read <a>, FAIL!
qDebug() << reader.errorString(); // "Premature end of document."

The problem is that the newly added data is converted to UTF-8, so it has half as many bytes as the UTF-16 decoder expects.
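
To make the size mismatch concrete, here is a minimal sketch in standard C++ (no Qt involved). It is only an illustration of the byte arithmetic, not the actual QXmlStreamReader code: decodeAsUtf16Le and checkFragmentSizes are hypothetical helpers, and little-endian byte order is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: consume raw bytes two at a time as little-endian
// UTF-16 code units. A trailing odd byte cannot form a code unit and is dropped.
std::vector<char16_t> decodeAsUtf16Le(std::string_view bytes)
{
    std::vector<char16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[i]);
        const auto hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}

// Usage sketch for the fragment appended in the report.
void checkFragmentSizes()
{
    const std::u16string_view asUtf16 = u"<a>Other Data</a>";
    const std::string_view asUtf8 = "<a>Other Data</a>"; // ASCII only, so UTF-8 == ASCII

    assert(asUtf16.size() * 2 == 34); // bytes a UTF-16 decoder would need
    assert(asUtf8.size() == 17);      // bytes present after conversion to UTF-8

    // Fed to a UTF-16LE decoder, the UTF-8 bytes pair up into unrelated code units:
    // '<' (0x3C) and 'a' (0x61) become U+613C instead of a start tag.
    const std::vector<char16_t> units = decodeAsUtf16Le(asUtf8);
    assert(units.size() == 8);
    assert(units[0] == 0x613C);
}
```

Because the decoder never sees a '<' code unit in this sketch, it is plausible that the parser finds no start tag in the appended data, which would be consistent with the "Premature end of document." error above.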

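For the Latin1 + Latin1 case, parsing succeeds, but non-ASCII characters are read incorrectly: each one is converted to a two-byte UTF-8 sequence, which the Latin1 decoder then reads as two separate characters.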
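
The Latin1 corruption can be sketched the same way in standard C++. This is an illustration under assumptions, not Qt's implementation: latin1ToUtf8 stands in for the unconditional conversion done by addData(), decodeAsLatin1 stands in for the decoder selected by the XML prolog, and both names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for the conversion: Latin-1 bytes to UTF-8.
// Bytes below 0x80 are copied; every other byte becomes a two-byte sequence.
std::string latin1ToUtf8(std::string_view latin1)
{
    std::string out;
    for (const char ch : latin1) {
        const auto c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Hypothetical stand-in for a Latin-1 decoder: each byte maps to the
// code point with the same numeric value.
std::u16string decodeAsLatin1(std::string_view bytes)
{
    std::u16string out;
    for (const char ch : bytes)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    return out;
}

// Usage sketch: "café" in Latin-1 (0xE9 is 'é') comes back as "cafÃ©".
void checkLatin1RoundTrip()
{
    const std::u16string decoded = decodeAsLatin1(latin1ToUtf8("caf\xE9"));
    const std::u16string expected{u'c', u'a', u'f',
                                  char16_t(0x00C3), char16_t(0x00A9)};
    assert(decoded == expected);
}
```

The text still parses because the markup characters are ASCII and survive both steps unchanged; only characters in the range 0x80 to 0xFF are affected.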

Issue Links

relates to:
QTBUG-135033 QXmlStreamReader::addData() can parse Latin1 data incorrectly (Closed)
QTBUG-124636 QXmlStreamReader: don't convert input (Closed)

Gerrit Reviews

For Gerrit Dashboard: QTBUG-135129
#	Subject	Branch	Project	Status	CR	V
634101,11	QXmlStreamReader: check appending data with unexpected encoding	dev	qt/qtbase	MERGED	+2	0
636060,8	QXmlStreamReader: fix addData() unnecessary conversion to UTF-8	dev	qt/qtbase	MERGED	+2	0
637365,2	QXmlStreamReader: check appending data with unexpected encoding	6.9	qt/qtbase	MERGED	+2	0
637595,2	QXmlStreamReader: check appending data with unexpected encoding	6.8	qt/qtbase	MERGED	+2	0
637610,2	QXmlStreamReader: fix addData() unnecessary conversion to UTF-8	6.9	qt/qtbase	MERGED	+2	0
637639,2	QXmlStreamReader: check appending data with unexpected encoding	tqtc/lts-6.5	qt/tqtc-qtbase	MERGED	+2	0
639721,2	QXmlStreamReader: fix addData() unnecessary conversion to UTF-8	6.8	qt/qtbase	MERGED	+2	0
639759,2	QXmlStreamReader: fix addData() unnecessary conversion to UTF-8	tqtc/lts-6.5	qt/tqtc-qtbase	MERGED	+2	0