Details
Bug
Resolution: Unresolved
P3: Somewhat important
None
dev
None
21
Foundation PM Prioritized
Description
Discovered at: https://codereview.qt-project.org/c/qt/qtbase/+/488440/23..24/src/corelib/text/qstringconverter.h#b245
QStringDecoder::appendToBuffer(char16_t *out, QByteArrayView ba)
produces inconsistent results when the data is fed piecemeal, compared to feeding the data in one chunk or using alternative decoding facilities such as 'QString::fromUtf8()'.
The Unicode standard doesn't enforce a strict design for handling ill-formed sequences, but I think we can all agree that whichever one we use should be consistent across all the conversion mechanisms we provide.
const char illFormed[] = u8"a\xe0\x9f\x80""a";
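This ill-formed input consists of: 1 code point (a), \xe0 (a legal lead byte), \x9f (illegal: after \xe0 the second byte must be in the range \xa0..\xbf), \x80 (illegal: a continuation byte with no lead byte), 1 code point (a).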
QString::fromUtf8() and QStringDecoder, when given the whole input in one chunk, write a replacement character for every illegal sequence encountered:
"a\uFFFD\uFFFD\uFFFDa"
However, feeding the same data byte by byte to appendToBuffer() produces:
"a\uFFFDa"
That is, only a single replacement character is written for the whole ill-formed sequence. Perhaps some internal state handling in QStringDecoder is causing this inconsistency.
Please find the reproducer attached.
Reproducer output:
"fromUtf8" OK
"decoder" OK
"decoderAppendFull" OK
"decoderAppendPieceMeal" "a�a" != "a���a"
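
The expectation in this report, that chunking must not change the decoded result, can be illustrated without Qt. The following is a minimal standard-C++ sketch and not Qt's implementation: Utf8StreamSketch, decodeInChunks and checkReportSequence are hypothetical names, and the error policy (one U+FFFD per maximal ill-formed subpart, following the recommended practice in the Unicode Standard) is an assumption. For the input from this report that policy yields the same three replacement characters as the one-chunk result quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

constexpr char32_t kReplacement = 0xFFFD;

// Hypothetical helper, not part of Qt: a stateful UTF-8 decoder whose output
// depends only on the concatenated input, not on how the input was chunked.
class Utf8StreamSketch {
public:
    // Decode `chunk`, appending code points to `out`. Bytes of a sequence that
    // is still incomplete at the end of the chunk are kept for the next call.
    void feed(std::string_view chunk, std::u32string &out) {
        pending_.append(chunk);
        pending_.erase(0, decode(out, /*atEnd=*/false));
    }
    // Flush: a sequence still incomplete at end of input becomes one U+FFFD.
    void finish(std::u32string &out) {
        decode(out, /*atEnd=*/true);
        pending_.clear();
    }

private:
    // Returns the number of bytes consumed from pending_.
    std::size_t decode(std::u32string &out, bool atEnd) {
        std::size_t i = 0;
        while (i < pending_.size()) {
            const unsigned char b0 = static_cast<unsigned char>(pending_[i]);
            if (b0 < 0x80) { out.push_back(b0); ++i; continue; }

            // Sequence length and allowed range of the second byte
            // (Unicode Standard, Table 3-7, well-formed UTF-8 byte sequences).
            std::size_t len = 0;
            unsigned char lo = 0x80, hi = 0xBF;
            char32_t cp = 0;
            if (b0 >= 0xC2 && b0 <= 0xDF) {
                len = 2; cp = b0 & 0x1F;
            } else if (b0 >= 0xE0 && b0 <= 0xEF) {
                len = 3; cp = b0 & 0x0F;
                if (b0 == 0xE0) lo = 0xA0;  // rejects overlong forms
                if (b0 == 0xED) hi = 0x9F;  // rejects surrogates
            } else if (b0 >= 0xF0 && b0 <= 0xF4) {
                len = 4; cp = b0 & 0x07;
                if (b0 == 0xF0) lo = 0x90;  // rejects overlong forms
                if (b0 == 0xF4) hi = 0x8F;  // rejects values above U+10FFFF
            } else {
                // 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
                out.push_back(kReplacement); ++i; continue;
            }

            std::size_t k = 1;
            bool ok = true;
            for (; k < len; ++k) {
                if (i + k == pending_.size()) {
                    if (!atEnd) return i;  // incomplete: wait for more input
                    ok = false; break;     // truncated at end of input
                }
                const unsigned char b = static_cast<unsigned char>(pending_[i + k]);
                if (b < lo || b > hi) { ok = false; break; }
                cp = (cp << 6) | (b & 0x3F);
                lo = 0x80; hi = 0xBF;      // later bytes are plain continuations
            }
            out.push_back(ok ? cp : kReplacement);
            i += k;  // on error, skip only the bytes validated so far
        }
        return i;
    }

    std::string pending_;
};

// Decode `bytes` by feeding it in chunks of `chunkSize` bytes.
std::u32string decodeInChunks(std::string_view bytes, std::size_t chunkSize) {
    Utf8StreamSketch dec;
    std::u32string out;
    for (std::size_t pos = 0; pos < bytes.size(); pos += chunkSize)
        dec.feed(bytes.substr(pos, chunkSize), out);
    dec.finish(out);
    return out;
}

// Self-check with the ill-formed input from this report: feeding it in one
// chunk and feeding it byte by byte must give the same result.
void checkReportSequence() {
    const std::string_view illFormed("a\xe0\x9f\x80" "a", 5);
    assert(decodeInChunks(illFormed, 5) == decodeInChunks(illFormed, 1));
}
```

The design choice that makes the result chunk-independent is that the sketch never emits anything for a sequence it cannot yet judge: undecided bytes stay in pending_ and are re-examined together with the next chunk, so the error decision is always made on the same bytes regardless of where the chunk boundary fell.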