Qt / QTBUG-128394

Mishandling UCD data; lowercasing I-with-dot to i-with-extra-dot

Details

    • Bug
    • Resolution: Unresolved
    • P2: Important
    • None
    • 4.5.1, 5.0.0, 6.8
    • None
    • All
    • 25
    • Foundation PM Staging

    Description

      QString::toLower() reportedly maps u"\u0130" to u"\u0069\u0307". That appears to confuse the canonical decomposition of u"\u0130", namely u"\u0049\u0307" (capital I followed by combining dot above), with its lower-case form u"\u0069", as supplied by the line

      0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
      

      of qtbase/util/unicode/data/UnicodeData.txt
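
      To make the two fields in question concrete, here is a minimal sketch (standard C++ only, not the actual code in qtbase/util/unicode/main.cpp) that splits a UnicodeData.txt line and pulls out the decomposition (field 5) and the simple lower-case mapping (field 13). The names UcdEntry, splitFields, parseCodePoints and parseUnicodeDataLine are hypothetical, chosen for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record holding only the fields discussed in this report.
struct UcdEntry {
    char32_t codePoint;
    std::vector<char32_t> decomposition; // field 5
    std::vector<char32_t> simpleLower;   // field 13, zero or one code point
};

// Split on ';', keeping empty fields (including a trailing one).
static std::vector<std::string> splitFields(const std::string &line)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const auto end = line.find(';', start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Parse space-separated hex code points, e.g. "0049 0307".
// Compatibility tags such as "<compat>" are skipped.
static std::vector<char32_t> parseCodePoints(const std::string &text)
{
    std::vector<char32_t> result;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        if (token[0] == '<')
            continue;
        result.push_back(static_cast<char32_t>(std::stoul(token, nullptr, 16)));
    }
    return result;
}

UcdEntry parseUnicodeDataLine(const std::string &line)
{
    const std::vector<std::string> fields = splitFields(line);
    assert(fields.size() == 15); // UnicodeData.txt has 15 fields per line
    UcdEntry entry;
    entry.codePoint = parseCodePoints(fields[0]).at(0);
    entry.decomposition = parseCodePoints(fields[5]);
    entry.simpleLower = parseCodePoints(fields[13]);
    return entry;
}
```

      For the line quoted above, this yields the decomposition 0049 0307 and the simple lower-case mapping 0069 as two separate values, which is the distinction the generated tables need to preserve.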

      The bug is surely in the parsing-and-digestion code of qtbase/util/unicode/main.cpp and almost certainly implies there are other errors in the data this generates; the I-with-dot case is just the one paeglis stumbled upon.

      The situation is complicated by qtbase/util/unicode/data/SpecialCasing.txt providing both

      # Preserve canonical equivalence for I with dot. Turkic is handled below.
      
      0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
      

      and (much later, now locale-specific):

      # Turkish and Azeri
      
      # I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
      # The following rules handle those cases.
      
      0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
      0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
      # When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
      # This matches the behavior of the canonically equivalent I-dot_above
      
      0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
      0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE
      
      # When lowercasing, unless an I is before a dot_above, it turns into a dotless i.
      
      0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
      0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
      
      # When uppercasing, i turns into a dotted capital I
      
      0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
      0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
      
      # Note: the following case is already in the UnicodeData.txt file.
      
      # 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
      

      I don't understand all of that, but my suspicion is that the Turkic special case is somehow the source of our misguided lowercasing of İ: Qt doesn't currently support locale-specific case mappings (unless it has ICU to delegate to), and we're using these Turkic-specific mappings for general, locale-agnostic Unicode.
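
      One way to test that suspicion is to separate the conditional SpecialCasing.txt entries (those with a condition list such as "tr" or "az After_I") from the unconditional ones. The following is a minimal sketch in standard C++, not the code in qtbase/util/unicode/main.cpp; SpecialCase, parseSpecialCasingLine and unconditionalLower are hypothetical names, and the parser assumes the "code; lower; title; upper; [conditions;] # comment" layout seen in the lines quoted above.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: only the lower-case mapping and condition are kept.
struct SpecialCase {
    char32_t codePoint;
    std::vector<char32_t> lower;
    std::string condition; // empty when the mapping is unconditional
};

static std::string trim(const std::string &s)
{
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos)
        return std::string();
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Parse space-separated hex code points; an empty field gives an empty list.
static std::vector<char32_t> parseCodePoints(const std::string &text)
{
    std::vector<char32_t> result;
    std::istringstream in(text);
    std::string token;
    while (in >> token)
        result.push_back(static_cast<char32_t>(std::stoul(token, nullptr, 16)));
    return result;
}

// Parses one line; returns false for blank or comment-only lines.
bool parseSpecialCasingLine(const std::string &rawLine, SpecialCase *out)
{
    // Drop the trailing comment, if any.
    const std::string line = rawLine.substr(0, rawLine.find('#'));
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ';'))
        fields.push_back(trim(field));
    if (fields.size() < 4 || fields[0].empty())
        return false;
    out->codePoint = static_cast<char32_t>(std::stoul(fields[0], nullptr, 16));
    out->lower = parseCodePoints(fields[1]);
    // The fifth field, when present and non-empty, is the condition list.
    out->condition = fields.size() > 4 ? fields[4] : std::string();
    return true;
}

// Locale-agnostic lookup: only entries without a condition are considered.
// Returns an empty list when no unconditional entry exists for cp.
std::vector<char32_t> unconditionalLower(const std::vector<std::string> &lines,
                                         char32_t cp)
{
    for (const std::string &line : lines) {
        SpecialCase sc;
        if (parseSpecialCasingLine(line, &sc) && sc.codePoint == cp
            && sc.condition.empty())
            return sc.lower;
    }
    return {};
}
```

      Applied to the lines quoted above, the tr and az entries for 0130 carry the single code point 0069, while the only entry without a condition is the "Preserve canonical equivalence" one, whose lower-case field is 0069 0307. So it may be worth checking which of the two kinds of entry the generated data actually reflects.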

      Attachments

        1. image-2024-08-29-17-16-14-618.png
          2 kB
          Gatis Paeglis
        2. screenshot-1.png
          0.6 kB
          Gatis Paeglis

          People

            cnn Qt Core & Network
            Eddy Edward Welbourne
            Vladimir Minenko
            Alex Blasche
            Votes: 0
            Watchers: 2

