Description
QXmlSimpleReaderPrivate::parseReference in sax/qxml.cpp converts character references by doing
tmp = ref().toUInt(&ok, 16); stringAddC(QChar(tmp));
which truncates any code point in the supplementary planes (above U+FFFF) to a single 16-bit QChar. The correct behaviour would be to add the code point's high and low surrogates instead.
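For illustration, the surrogate split can be sketched in plain standard C++ without Qt. This is only a sketch of the arithmetic, not the actual patch: appendCodePoint is a hypothetical helper name, and std::u16string stands in for the reader's internal string buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of Qt: appends one Unicode code point to a
// UTF-16 string, splitting supplementary-plane values into a surrogate pair.
inline void appendCodePoint(std::u16string &out, std::uint32_t codePoint)
{
    assert(codePoint <= 0x10FFFF); // values beyond the Unicode range are a caller error

    if (codePoint < 0x10000) {
        // Basic Multilingual Plane: fits in one 16-bit code unit.
        out.push_back(static_cast<char16_t>(codePoint));
        return;
    }

    // Supplementary planes: subtract 0x10000, then split the remaining 20 bits.
    const std::uint32_t offset = codePoint - 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
}
```

For U+1F469 this appends the two code units 0xD83D and 0xDC69, whereas plain truncation to 16 bits gives 0xF469. In Qt the same split is available through QChar::requiresSurrogates, QChar::highSurrogate and QChar::lowSurrogate, so a fix in parseReference could presumably use those.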
Reproducer:
#include <QDebug>
#include <QDomDocument>
#include <QXmlSimpleReader>
#include <QXmlStreamReader>

int main(int argc, char *argv[])
{
    QString xml = QStringLiteral(u"<test>&#x1F469;</test>");
    QDomDocument doc;
    QString errorMsg;
#if 1
    QXmlSimpleReader reader;
    reader.setFeature("http://qt-project.org/xml/features/report-whitespace-only-CharData", true);
    reader.setFeature("http://xml.org/sax/features/namespaces", false);
    reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
    QXmlInputSource source;
    source.setData(xml);
    doc.setContent(&source, &reader, &errorMsg);
#else
    QXmlStreamReader streamReader(xml);
    doc.setContent(&streamReader, false, &errorMsg);
#endif
    qDebug() << errorMsg;
    qDebug().quote() << doc.documentElement().text();
}
Actual Output:
""
"\uF469"
Expected Output:
""
"👩"
This probably also affects Qt5Compat, though I haven't tested it, as the reproducer doesn't build on Qt 6.
Downstream bug:
https://bugs.kde.org/show_bug.cgi?id=473380