  1. Qt
  2. QTBUG-141538

QTextBoundaryFinder Incorrectly Handles Regional Indicator Grapheme Boundaries

    • Bug
    • Resolution: Unresolved
    • P2: Important
    • None
    • 6.9, 6.10
    • Core: Other
    • None
    • Linux/Wayland

      Summary

       

      QTextBoundaryFinder with BoundaryType::Grapheme incorrectly treats consecutive Regional Indicator Symbol pairs (emoji flags) as a single grapheme cluster instead of breaking them into separate pairs. This violates UAX #29 rules GB12 and GB13.

      Description

      When processing strings containing multiple consecutive emoji flags (Regional Indicator Symbol pairs), QTextBoundaryFinder fails to detect boundaries between flag pairs and incorrectly merges them into a single grapheme cluster.

      Example

      For the string "A😀🇸🇪🇸🇪A" (S and E below denote the Regional Indicator symbols U+1F1F8 and U+1F1EA):

      • Expected: 5 grapheme clusters: A, 😀, SE, SE, A
      • Actual: 6 grapheme clusters: A, 😀, S, ES, S, A

      Steps to Reproduce

      Test Code:

      #include <QCoreApplication>
      #include <QString>
      #include <QTextBoundaryFinder>
      #include <QDebug>

      void testGraphemeBoundaries(const QString &input, const QString &description) {
          qDebug() << "\n=== Testing:" << description << "===";
          qDebug() << "Input length (UTF-16 code units):" << input.length();

          QTextBoundaryFinder finder(QTextBoundaryFinder::Grapheme, input);

          // Collect every boundary position, starting at the initial position;
          // toNextBoundary() returns -1 when there is no next boundary.
          QList<qsizetype> boundaries;
          qsizetype pos = finder.position();
          while (pos != -1) {
              boundaries.append(pos);
              pos = finder.toNextBoundary();
          }

          qDebug() << "Boundaries found:" << boundaries;
          qDebug() << "Grapheme cluster count:" << (boundaries.size() - 1);

          // Display each grapheme cluster
          for (qsizetype i = 0; i < boundaries.size() - 1; ++i) {
              const qsizetype start = boundaries[i];
              const qsizetype end = boundaries[i + 1];
              const QString cluster = input.mid(start, end - start);
              qDebug() << QString("  Cluster %1: [%2:%3] = %4")
                              .arg(i)
                              .arg(start)
                              .arg(end)
                              .arg(cluster);
          }
      }

      int main(int argc, char *argv[]) {
          QCoreApplication app(argc, argv);
          
          // Test case 1: Two Swedish flags (the reported bug)
          QString twoFlags = QString::fromUtf8("A😀🇸🇪🇸🇪A");
          testGraphemeBoundaries(twoFlags, "Two Swedish flags");
          qDebug() << "Expected: 5 grapheme clusters";
          
          // Test case 2: Three Swedish flags
          QString threeFlags = QString::fromUtf8("A😀🇸🇪🇸🇪🇸🇪A");
          testGraphemeBoundaries(threeFlags, "Three Swedish flags");
          qDebug() << "Expected: 6 grapheme clusters";
          
          // Test case 3: Mixed content with flags
          QString mixed = QString::fromUtf8("Hello 🇸🇪🇸🇪 World");
          testGraphemeBoundaries(mixed, "Mixed content with flags");
          
          // Test case 4: Single flag (should work correctly)
          QString oneFlag = QString::fromUtf8("🇸🇪");
          testGraphemeBoundaries(oneFlag, "Single Swedish flag");
          qDebug() << "Expected: 1 grapheme cluster";
          
          return 0;
      }

       

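      Correct Behavior per UAX #29:
      The string "A😀🇸🇪🇸🇪A" should have 5 grapheme clusters, that is, 6 boundary positions when the start and end of the text are counted (UTF-16 offsets 0, 1, 3, 7, 11, 12).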

      Root Cause

      According to UAX #29 (Unicode Text Segmentation), the Regional Indicator breaking rules are:

      • GB12: sot (RI RI)* RI × RI - Do not break within emoji flag sequences
      • GB13: [^RI] (RI RI)* RI × RI - Do not break within emoji flag sequences

      These rules state that Regional Indicators should pair up - a sequence like RI RI RI RI should break as [RI RI] [RI RI].

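      The critical requirement: lookbehind is necessary to count how many RI characters immediately precede the current position within the same run. A break between two RIs is allowed only when that count is even (0, 2, 4, 6...); when it is odd, the current RI completes a pair and must not be separated from the previous one.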
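
      The lookbehind count can be sketched in standalone C++ as follows. This is an illustration only, not Qt code: the helper names isRegionalIndicator and riBreakAllowed are hypothetical, the input is assumed to be already decoded to code points, and only the RI-versus-RI case of GB12/GB13 is decided.

```cpp
#include <cstddef>
#include <vector>

// True if cp is a Regional Indicator symbol (U+1F1E6..U+1F1FF).
constexpr bool isRegionalIndicator(char32_t cp)
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Hypothetical helper: decides the position between cps[i - 1] and cps[i]
// when both are Regional Indicators (precondition: 0 < i < cps.size()).
// Returns true if a break is allowed there under GB12/GB13.
bool riBreakAllowed(const std::vector<char32_t> &cps, std::size_t i)
{
    // Count the run of RIs that ends immediately before position i.
    // The loop stops at a non-RI (GB13) or at the start of text (GB12).
    std::size_t count = 0;
    for (std::size_t k = i; k > 0 && isRegionalIndicator(cps[k - 1]); --k)
        ++count;

    // Odd count: cps[i] completes a pair, so no break.
    // Even count: cps[i] starts a new pair, so a break is allowed.
    return count % 2 == 0;
}
```

      For the code points of "A😀🇸🇪🇸🇪A", this forbids a break inside each S E pair and allows one between the two flags.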

      Qt's current implementation (in qtbase/src/corelib/text/qunicodetools.cpp) appears to be missing this lookbehind logic, treating all consecutive RI characters as a single grapheme cluster.

      Suggested Fix

       

      // When checking boundary at position between char[i-1] and char[i]
      if (char[i] is Regional_Indicator) {
          if (char[i-1] is Regional_Indicator) {
              // Count consecutive RIs backward, starting at position i-1
              int riCount = countBackwardRI(i - 1);

              if (riCount % 2 == 1) {
                  // Odd number before = we're the 2nd of a pair
                  // Do NOT break (GB12/GB13)
                  return NO_BOUNDARY;
              } else {
                  // Even number = start of new pair
                  return BOUNDARY;
              }
          } else {
              // First RI after non-RI = start of new pair.
              // GB12/GB13 do not apply here; this is a boundary unless
              // another rule forbids it (e.g. GB9b, Prepend ×).
              return BOUNDARY;
          }
      }

       

       

      Implementation Location:
      Modify the grapheme cluster breaking logic in qtbase/src/corelib/text/qunicodetools.cpp, specifically in the function that checks the Grapheme_Cluster_Break property.

      Reference Implementations

      Correct implementations can be found in:

      1. Rust unicode-segmentation: https://github.com/unicode-rs/unicode-segmentation/blob/master/src/grapheme.rs
        • See the Regional state enum and lookbehind logic around line 200
      2. libgrapheme: https://libs.suckless.org/libgrapheme/
        • Uses flip-flop state flag GRAPHEME_FLAG_RI_ODD
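
      As an alternative to counting backward, a single forward pass can carry one boolean that flips on every Regional Indicator, in the spirit of the flip-flop flag mentioned above. The sketch below is not taken from either library or from Qt: the names utf16Length and riAwareBoundaries are hypothetical, and the segmenter is deliberately simplified so that every non-RI code point forms its own cluster (no other UAX #29 rules are applied).

```cpp
#include <cstddef>
#include <vector>

// True if cp is a Regional Indicator symbol (U+1F1E6..U+1F1FF).
constexpr bool isRegionalIndicator(char32_t cp)
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Number of UTF-16 code units needed to encode cp.
constexpr std::size_t utf16Length(char32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}

// Simplified segmenter (assumption: only RI pairing is modelled).
// Returns boundary positions as UTF-16 offsets, including 0 and the end.
std::vector<std::size_t> riAwareBoundaries(const std::vector<char32_t> &cps)
{
    std::vector<std::size_t> boundaries{0};
    std::size_t offset = 0;
    bool riOdd = false; // true while an unpaired RI immediately precedes

    for (std::size_t i = 0; i < cps.size(); ++i) {
        const bool ri = isRegionalIndicator(cps[i]);
        // Break before cps[i] unless it completes an RI pair.
        if (i > 0 && !(ri && riOdd))
            boundaries.push_back(offset);
        // Flip on each RI; reset on anything else.
        riOdd = ri && !riOdd;
        offset += utf16Length(cps[i]);
    }
    if (!cps.empty())
        boundaries.push_back(offset);
    return boundaries;
}
```

      The flag removes the need for lookbehind when iterating forward, at the cost of keeping state between positions; a lookbehind count is still needed if segmentation can start at an arbitrary offset.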


            cnn Qt Core & Network
            kodarn Göran Nylén