Details
Bug
Resolution: Incomplete
Not Evaluated
None
4.8.1
None
Linux, Ubuntu 12.04
Description
When reading an XML document that contains the thin space character (U+2009, written as the character reference &#x2009;) as the text content of an element, QDomDocument discards this character.
Although the QDomDocument documentation says that whitespace is stripped from text nodes, the XML specification does not treat such characters as white space: only "space (#x20) characters, carriage returns, line feeds, or tabs" are white space (see http://www.w3.org/TR/REC-xml/#NT-S ). I've also tried other Unicode whitespace characters (U+20xx); these too are stripped by QDomDocument.
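To make the distinction concrete, here is a minimal sketch of the white space test implied by the S production in the XML specification. The helper name isXmlWhitespace is my own and is not part of Qt or of any XML library; it only illustrates which code points the specification counts as white space.

```cpp
// XML 1.0, production [3]: S ::= (#x20 | #x9 | #xD | #xA)+
// isXmlWhitespace is a hypothetical helper, not a Qt API.
constexpr bool isXmlWhitespace(char32_t c)
{
    return c == 0x20    // space
        || c == 0x09    // tab
        || c == 0x0D    // carriage return
        || c == 0x0A;   // line feed
}

// The four characters named by the specification are white space...
static_assert(isXmlWhitespace(0x20), "space is XML white space");
static_assert(isXmlWhitespace(0x09), "tab is XML white space");
// ...but the thin space (U+2009) is not, so by this definition it is
// ordinary character data rather than strippable white space.
static_assert(!isXmlWhitespace(0x2009), "thin space is not XML white space");
```

By contrast, a test based on Unicode character properties would classify U+2009 as a space, which would explain the stripping if the parser uses such a test. That is a guess on my part and not something I have confirmed in the Qt source.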
This problem was encountered while interpreting a MathML document, where text nodes consisting of whitespace characters are common. Please see this thread, where the bug is explored in more detail (thanks to those in the thread for helping me): http://stackoverflow.com/questions/10968940/why-is-qt-losing-my-thin-space-unicode-character-when-loading-an-xml-file
The problem can be seen by just doing
QDomDocument doc;
doc.setContent(QString("<mtext>&#x2009;</mtext>"));
and examining doc.toString(), which is just "<mtext/>". I've included a more complete example, which escapes any Unicode output on the screen.