- Bug
- Resolution: Unresolved
- P2: Important
- None
- 6.9, 6.10
- None
Summary
QTextBoundaryFinder with BoundaryType::Grapheme incorrectly treats consecutive Regional Indicator Symbol pairs (emoji flags) as a single grapheme cluster instead of breaking them into separate pairs. This violates UAX #29 rules GB12 and GB13.
Description
When processing strings containing multiple consecutive emoji flags (Regional Indicator Symbol pairs), QTextBoundaryFinder fails to detect boundaries between flag pairs and incorrectly merges them into a single grapheme cluster.
Example
For the string "A😀🇸🇪🇸🇪A":
- Expected: 5 grapheme clusters: {A, 😀, SE, SE, A}
- Actual: 6 grapheme clusters: {A, 😀, S, ES, S, A}
Steps to Reproduce
Test Code:
#include <QCoreApplication>
#include <QString>
#include <QTextBoundaryFinder>
#include <QDebug>
void testGraphemeBoundaries(const QString &input, const QString &description) {
    qDebug() << "\n=== Testing:" << description << "===";
    qDebug() << "Input length (UTF-16 code units):" << input.length();
    
    QTextBoundaryFinder finder(QTextBoundaryFinder::Grapheme, input);
    
    QList<int> boundaries;
    int pos = finder.position();
    while (pos != -1) {
        boundaries.append(pos);
        pos = finder.toNextBoundary();
    }
    
    qDebug() << "Boundaries found:" << boundaries;
    qDebug() << "Grapheme cluster count:" << (boundaries.size() - 1);
    
    // Display each grapheme cluster
    for (int i = 0; i < boundaries.size() - 1; ++i) {
        int start = boundaries[i];
        int end = boundaries[i + 1];
        QString cluster = input.mid(start, end - start);
        qDebug() << QString("  Cluster %1: [%2:%3] = %4")
                        .arg(i)
                        .arg(start)
                        .arg(end)
                        .arg(cluster);
    }
}
int main(int argc, char *argv[]) {
    QCoreApplication app(argc, argv);
    
    // Test case 1: Two Swedish flags (the reported bug)
    QString twoFlags = QString::fromUtf8("A😀🇸🇪🇸🇪A");
    testGraphemeBoundaries(twoFlags, "Two Swedish flags");
    qDebug() << "Expected: 5 grapheme clusters";
    
    // Test case 2: Three Swedish flags
    QString threeFlags = QString::fromUtf8("A😀🇸🇪🇸🇪🇸🇪A");
    testGraphemeBoundaries(threeFlags, "Three Swedish flags");
    qDebug() << "Expected: 6 grapheme clusters";
    
    // Test case 3: Mixed content with flags
    QString mixed = QString::fromUtf8("Hello 🇸🇪🇸🇪 World");
    testGraphemeBoundaries(mixed, "Mixed content with flags");
    
    // Test case 4: Single flag (should work correctly)
    QString oneFlag = QString::fromUtf8("🇸🇪");
    testGraphemeBoundaries(oneFlag, "Single Swedish flag");
    qDebug() << "Expected: 1 grapheme cluster";
    
    return 0;
}
Correct Behavior per UAX #29: 
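The string "A😀🇸🇪🇸🇪A" should yield 5 grapheme clusters, i.e. 6 boundary positions counting the start and end of the string (UTF-16 offsets 0, 1, 3, 7, 11, 12).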
Root Cause
According to UAX #29 (Unicode Text Segmentation), the Regional Indicator breaking rules are:
- GB12: sot (RI RI)* RI × RI - Do not break within emoji flag sequences
- GB13: [^RI] (RI RI)* RI × RI - Do not break within emoji flag sequences
These rules state that Regional Indicators should pair up - a sequence like RI RI RI RI should break as [RI RI] [RI RI].
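The critical requirement: lookbehind is necessary to count how many RI characters precede the current position. Within a run of consecutive RIs, a break before an RI is allowed only when an even number of RIs (0, 2, 4, 6...) precede it in that run; when the count is odd, the RI completes a pair and no break is allowed.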
Qt's current implementation (in qtbase/src/corelib/text/qunicodetools.cpp) appears to be missing this lookbehind logic, treating all consecutive RI characters as a single grapheme cluster.
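The parity check described above can be sketched independently of Qt. The following is only an illustration operating on UTF-32 code points, not the code in qunicodetools.cpp; the names isRegionalIndicator and riPairForbidsBreak are hypothetical, and countBackwardRI is assumed to count the run of RIs ending at the given index. It covers GB12/GB13 only and ignores every other grapheme rule.

```cpp
#include <cstddef>
#include <string>

// Regional Indicator Symbols occupy U+1F1E6..U+1F1FF.
bool isRegionalIndicator(char32_t cp) {
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Hypothetical helper: number of consecutive RIs ending at index `last`
// (inclusive), found by scanning backward.
std::size_t countBackwardRI(const std::u32string &text, std::size_t last) {
    std::size_t count = 0;
    for (std::size_t i = last + 1; i > 0 && isRegionalIndicator(text[i - 1]); --i)
        ++count;
    return count;
}

// Returns true if GB12/GB13 forbid a break between text[i-1] and text[i].
// Other UAX #29 rules are deliberately not considered here.
bool riPairForbidsBreak(const std::u32string &text, std::size_t i) {
    if (i == 0 || i >= text.size())
        return false;
    if (!isRegionalIndicator(text[i - 1]) || !isRegionalIndicator(text[i]))
        return false;
    // An odd count means text[i-1] is an unpaired RI, so text[i] completes the pair.
    return countBackwardRI(text, i - 1) % 2 == 1;
}
```

For the code points of "A😀🇸🇪🇸🇪A" (indices 0 to 6), this forbids a break at indices 3 and 5 (inside each flag) and permits one at index 4 (between the two flags).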
Suggested Fix
// When checking boundary at position between char[i-1] and char[i]
if (char[i] is Regional_Indicator) {
    if (char[i-1] is Regional_Indicator) {
        // Count consecutive RIs backward from position i-1
        int riCount = countBackwardRI(i - 1);

        if (riCount % 2 == 1) {
            // Odd number before = we're the 2nd of a pair
            // Do NOT break (GB12/GB13)
            return NO_BOUNDARY;
        } else {
            // Even number = start of new pair
            return BOUNDARY;
        }
    } else {
        // First RI after non-RI = start of new pair
        return BOUNDARY;
    }
}
Implementation Location:
Modify the grapheme cluster breaking logic in qtbase/src/corelib/text/qunicodetools.cpp, specifically in the function that checks the Grapheme_Cluster_Break property.
Reference Implementations
Correct implementations can be found in:
- Rust unicode-segmentation: https://github.com/unicode-rs/unicode-segmentation/blob/master/src/grapheme.rs
  - See the Regional state enum and lookbehind logic around line 200
- libgrapheme: https://libs.suckless.org/libgrapheme/
  - Uses flip-flop state flag GRAPHEME_FLAG_RI_ODD
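For comparison, the flip-flop idea can be sketched as a forward scan that carries one bit of state instead of looking behind. This is an illustration of the general technique only, not code taken from libgrapheme or unicode-segmentation; isRI and riAwareBoundaries are hypothetical names, and the segmenter is deliberately simplified so that every non-RI code point forms its own cluster.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Regional Indicator Symbols occupy U+1F1E6..U+1F1FF.
bool isRI(char32_t cp) {
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Simplified segmenter: applies only the RI pairing rule (GB12/GB13).
// Returns boundary offsets in code points, including 0 and text.size().
std::vector<std::size_t> riAwareBoundaries(const std::u32string &text) {
    std::vector<std::size_t> boundaries;
    bool riOdd = false; // true while an unpaired RI is waiting for its partner
    for (std::size_t i = 0; i < text.size(); ++i) {
        const bool ri = isRI(text[i]);
        if (ri && riOdd) {
            // Second half of a pair: no boundary before it, reset the flag.
            riOdd = false;
        } else {
            // Any other position starts a new cluster in this simplified model.
            boundaries.push_back(i);
            riOdd = ri;
        }
    }
    boundaries.push_back(text.size());
    return boundaries;
}
```

Because the flag is toggled while scanning forward, no backward count is needed; the trade-off is that the state has to be carried along (or recomputed) when segmentation starts in the middle of a string.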