Details
Bug
Resolution: Unresolved
P2: Important
None
4.5.1, 5.0.0, 6.8
None
25
Foundation PM Staging
Description
QString::toLower() allegedly maps u"\u0130" to u"\u0069\u0307", confusing the u"\u0049\u0307" decomposition of u"\u0130" (as capital I with combining dot above) with its lower-case form u"\u0069", as supplied by the line
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
of qtbase/util/unicode/data/UnicodeData.txt.
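To make the distinction between the two fields concrete, here is a minimal sketch in plain standard C++ (no Qt) that splits a UnicodeData.txt record and picks out the decomposition (field 5) and the simple case mappings (fields 12 to 14, counting from 0). The names splitFields, CaseInfo and parseUnicodeDataLine are made up for illustration; this is not the code in qtbase/util/unicode/main.cpp, only an indication of which fields a parser should be reading.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split one semicolon-separated record into fields, keeping empty ones.
std::vector<std::string> splitFields(const std::string &line)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = line.find(';', start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Hypothetical holder for the fields relevant to this report.
struct CaseInfo {
    std::string code;          // field 0: code point, hex
    std::string decomposition; // field 5: decomposition mapping
    std::string simpleUpper;   // field 12: simple upper-case mapping
    std::string simpleLower;   // field 13: simple lower-case mapping
    std::string simpleTitle;   // field 14: simple title-case mapping
};

// Extract the case-related fields of a UnicodeData.txt record.
// Returns a default-constructed CaseInfo if the record has too few fields.
CaseInfo parseUnicodeDataLine(const std::string &line)
{
    const std::vector<std::string> f = splitFields(line);
    CaseInfo info;
    if (f.size() < 15)
        return info;
    info.code = f[0];
    info.decomposition = f[5];
    info.simpleUpper = f[12];
    info.simpleLower = f[13];
    info.simpleTitle = f[14];
    return info;
}
```

For the record quoted above, this yields a decomposition of "0049 0307" and a simple lower-case mapping of "0069", with the upper-case and title-case fields empty.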
The bug is surely in the parsing-and-digestion code of qtbase/util/unicode/main.cpp and almost certainly implies there are other errors in the data this generates; the I-with-dot case is just the one paeglis stumbled upon.
The situation is complicated by qtbase/util/unicode/data/SpecialCasing.txt providing both
# Preserve canonical equivalence for I with dot. Turkic is handled below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
and (much later, now locale-specific):
# Turkish and Azeri
# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above
0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE
# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.
0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
# When uppercasing, i turns into a dotted capital I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
# Note: the following case is already in the UnicodeData.txt file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
I don't understand all of that, but my suspicion is that the Turkic special case is somehow the source of our misguided lowercasing of İ: Qt doesn't currently support locale-specific case mappings (unless it has ICU to delegate to), and we're using these Turkic-specific mappings for general, locale-agnostic Unicode.
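One way to check that suspicion would be to see which SpecialCasing.txt records a locale-agnostic reader ought to pick up. The sketch below, again plain standard C++ with made-up names (SpecialCasingEntry, parseSpecialCasingLine, findUnconditionalLower), parses records of the form "code; lower; title; upper; [condition_list;] # comment" and only accepts those with an empty condition list. It is an assumption about how such filtering could look, not a description of what main.cpp does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One data record from SpecialCasing.txt (hypothetical holder type).
struct SpecialCasingEntry {
    std::string code;
    std::string lower;
    std::string title;
    std::string upper;
    std::string conditions; // empty if unconditional, else e.g. "tr After_I"
};

// Remove leading and trailing spaces and tabs.
std::string trimmed(const std::string &s)
{
    const std::string::size_type first = s.find_first_not_of(" \t");
    if (first == std::string::npos)
        return std::string();
    const std::string::size_type last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Parse one line; returns false for blank lines, comments and short records.
bool parseSpecialCasingLine(const std::string &line, SpecialCasingEntry *entry)
{
    // Everything from '#' onwards is a comment.
    const std::string data = trimmed(line.substr(0, line.find('#')));
    if (data.empty())
        return false;
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = data.find(';', start);
        if (end == std::string::npos) {
            fields.push_back(trimmed(data.substr(start)));
            break;
        }
        fields.push_back(trimmed(data.substr(start, end - start)));
        start = end + 1;
    }
    if (fields.size() < 4)
        return false;
    entry->code = fields[0];
    entry->lower = fields[1];
    entry->title = fields[2];
    entry->upper = fields[3];
    entry->conditions = fields.size() > 4 ? fields[4] : std::string();
    return true;
}

// Find the lower-case mapping of `code` among unconditional records only,
// skipping any record that carries a locale or context condition.
bool findUnconditionalLower(const std::vector<std::string> &lines,
                            const std::string &code, std::string *lower)
{
    for (std::vector<std::string>::size_type i = 0; i < lines.size(); ++i) {
        SpecialCasingEntry entry;
        if (!parseSpecialCasingLine(lines[i], &entry))
            continue;
        if (entry.code == code && entry.conditions.empty()) {
            *lower = entry.lower;
            return true;
        }
    }
    return false;
}
```

Fed the records quoted above, the unconditional lookup for 0130 returns "0069 0307" from the "Preserve canonical equivalence" record, while the tr and az records, including the After_I ones for 0307, are skipped.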