-
Bug
-
Resolution: Unresolved
-
P2: Important
-
None
-
6.9, 6.10
-
None
Summary
QTextBoundaryFinder with BoundaryType::Grapheme incorrectly treats consecutive Regional Indicator Symbol pairs (emoji flags) as a single grapheme cluster instead of breaking them into separate pairs. This violates UAX #29 rules GB12 and GB13.
Description
When processing strings containing multiple consecutive emoji flags (Regional Indicator Symbol pairs), QTextBoundaryFinder fails to detect boundaries between flag pairs and incorrectly merges them into a single grapheme cluster.
Example
For the string "A😀🇸🇪🇸🇪A":
- Expected: 5 grapheme clusters: {{A, 😀, SE, SE, A}}
- Actual: 6 grapheme clusters: {{A, 😀, S, ES, S, A}}
Steps to Reproduce
Test Code:
#include <QCoreApplication>
#include <QString>
#include <QTextBoundaryFinder>
#include <QDebug>
void testGraphemeBoundaries(const QString &input, const QString &description) {
    qDebug() << "\n=== Testing:" << description << "===";
    qDebug() << "Input length (UTF-16 code units):" << input.length();
    QTextBoundaryFinder finder(QTextBoundaryFinder::Grapheme, input);
    // Collect every boundary, starting with the finder's initial position
    QList<int> boundaries;
    int pos = finder.position();
    while (pos != -1) {
        boundaries.append(pos);
        pos = finder.toNextBoundary();
    }
    qDebug() << "Boundaries found:" << boundaries;
    qDebug() << "Grapheme cluster count:" << (boundaries.size() - 1);
    // Display each grapheme cluster
    for (int i = 0; i < boundaries.size() - 1; ++i) {
        int start = boundaries[i];
        int end = boundaries[i + 1];
        QString cluster = input.mid(start, end - start);
        qDebug() << QString(" Cluster %1: [%2:%3] = %4")
                        .arg(i)
                        .arg(start)
                        .arg(end)
                        .arg(cluster);
    }
}
int main(int argc, char *argv[]) {
    QCoreApplication app(argc, argv);
    // Test case 1: Two Swedish flags (the reported bug)
    QString twoFlags = QString::fromUtf8("A😀🇸🇪🇸🇪A");
    testGraphemeBoundaries(twoFlags, "Two Swedish flags");
    // The actual count is printed by testGraphemeBoundaries() above
    qDebug() << "Expected: 5 grapheme clusters";
    // Test case 2: Three Swedish flags
    QString threeFlags = QString::fromUtf8("A😀🇸🇪🇸🇪🇸🇪A");
    testGraphemeBoundaries(threeFlags, "Three Swedish flags");
    qDebug() << "Expected: 6 grapheme clusters";
    // Test case 3: Mixed content with flags
    QString mixed = QString::fromUtf8("Hello 🇸🇪🇸🇪 World");
    testGraphemeBoundaries(mixed, "Mixed content with flags");
    // Test case 4: Single flag (should work correctly)
    QString oneFlag = QString::fromUtf8("🇸🇪");
    testGraphemeBoundaries(oneFlag, "Single Swedish flag");
    qDebug() << "Expected: 1 grapheme cluster";
    return 0;
}
Correct Behavior per UAX #29:
The string should split into 5 grapheme clusters, which means 6 boundary positions when the start and end of the string are counted.
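For reference, the expected UTF-16 boundary offsets for "A😀🇸🇪🇸🇪A" can be worked out by hand. The following is a minimal sketch in plain C++ (no Qt); the helper names `utf16Length` and `clusterEnd` are made up for this illustration and are not Qt API. It only adds up code-unit lengths for the five clusters the report expects.

```cpp
#include <cstddef>

// Number of UTF-16 code units needed to encode one code point.
constexpr std::size_t utf16Length(char32_t cp) {
    return cp >= 0x10000 ? 2 : 1;
}

// UTF-16 offset at which a cluster ends, given the offset where it starts.
// (Hypothetical helper, for illustration only.)
constexpr std::size_t clusterEnd(std::size_t start, const char32_t *cluster,
                                 std::size_t count) {
    std::size_t end = start;
    for (std::size_t i = 0; i < count; ++i)
        end += utf16Length(cluster[i]);
    return end;
}

constexpr char32_t kLetterA[] = {U'A'};
constexpr char32_t kGrinningFace[] = {0x1F600};       // 😀
constexpr char32_t kFlagSE[] = {0x1F1F8, 0x1F1EA};    // 🇸 + 🇪

// Expected boundaries after the initial one at offset 0.
constexpr std::size_t kEnd1 = clusterEnd(0, kLetterA, 1);          // after "A"
constexpr std::size_t kEnd2 = clusterEnd(kEnd1, kGrinningFace, 1); // after 😀
constexpr std::size_t kEnd3 = clusterEnd(kEnd2, kFlagSE, 2);       // after first flag
constexpr std::size_t kEnd4 = clusterEnd(kEnd3, kFlagSE, 2);       // after second flag
constexpr std::size_t kEnd5 = clusterEnd(kEnd4, kLetterA, 1);      // end of string
```

Under these assumptions the boundary list printed by the test code should be 0, 1, 3, 7, 11, 12, because every code point above U+FFFF occupies two UTF-16 code units.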
Root Cause
According to UAX #29 (Unicode Text Segmentation), the Regional Indicator breaking rules are:
- GB12: sot (RI RI)* RI × RI - Do not break within emoji flag sequences
- GB13: [^RI] (RI RI)* RI × RI - Do not break within emoji flag sequences
These rules state that Regional Indicators should pair up - a sequence like RI RI RI RI should break as [RI RI] [RI RI].
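The critical requirement is lookbehind: the implementation must count how many consecutive RI characters precede the current position. Within a run of RIs, a break is allowed only after an even number of them (0, 2, 4, 6, ...); after an odd number, the next RI completes a pair and must not be split off.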
Qt's current implementation (in qtbase/src/corelib/text/qunicodetools.cpp) appears to be missing this lookbehind logic, treating all consecutive RI characters as a single grapheme cluster.
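To make the lookbehind requirement concrete, here is a minimal sketch in plain C++ that works on code points rather than on QString data. It covers only GB12/GB13 and none of the other grapheme rules. The function names are hypothetical and do not correspond to anything in qunicodetools.cpp.

```cpp
#include <cstddef>

// Regional Indicator symbols occupy U+1F1E6..U+1F1FF.
constexpr bool isRegionalIndicator(char32_t cp) {
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Counts the consecutive RIs that end immediately before index `pos`.
constexpr std::size_t countPrecedingRI(const char32_t *text, std::size_t pos) {
    std::size_t count = 0;
    while (pos > 0 && isRegionalIndicator(text[pos - 1])) {
        ++count;
        --pos;
    }
    return count;
}

// True if GB12/GB13 forbid a break between text[pos - 1] and text[pos]:
// text[pos] is an RI and an odd number of RIs precede it, so it completes
// a pair. Other UAX #29 rules are deliberately not considered here.
constexpr bool riRuleForbidsBreak(const char32_t *text, std::size_t length,
                                  std::size_t pos) {
    if (pos == 0 || pos >= length || !isRegionalIndicator(text[pos]))
        return false;
    return countPrecedingRI(text, pos) % 2 == 1;
}

// "A😀🇸🇪🇸🇪A" as code points: indices 2..5 are the four RIs.
constexpr char32_t kSample[] = {U'A', 0x1F600, 0x1F1F8, 0x1F1EA,
                                0x1F1F8, 0x1F1EA, U'A'};
```

For `kSample`, the rule forbids a break at indices 3 and 5 (inside each flag) and leaves index 4 (between the two flags) free to break, which is the pairing the report expects.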
Suggested Fix
// Pseudocode: checking the boundary between char[i-1] and char[i]
if (char[i] is Regional_Indicator) {
    if (char[i-1] is Regional_Indicator) {
        // Count consecutive RIs backward, starting at and including position i-1
        int riCount = countBackwardRI(i - 1);
        if (riCount % 2 == 1) {
            // Odd number before = we're the 2nd of a pair
            // Do NOT break (GB12/GB13)
            return NO_BOUNDARY;
        } else {
            // Even number = start of new pair
            return BOUNDARY;
        }
    } else {
        // First RI after non-RI = start of new pair
        // (unless another rule, such as GB9b "Prepend ×", forbids the break)
        return BOUNDARY;
    }
}
Implementation Location:
Modify the grapheme cluster breaking logic in qtbase/src/corelib/text/qunicodetools.cpp, specifically in the function that checks the Grapheme_Cluster_Break property.
Reference Implementations
Correct implementations can be found in:
- Rust unicode-segmentation: https://github.com/unicode-rs/unicode-segmentation/blob/master/src/grapheme.rs
  - See the Regional state enum and lookbehind logic around line 200
- libgrapheme: https://libs.suckless.org/libgrapheme/
  - Uses flip-flop state flag GRAPHEME_FLAG_RI_ODD
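The flip-flop idea can replace the backward count when the text is scanned forward: a single bit records whether the RI just seen is still waiting for a partner. The sketch below is a plain C++ illustration of that idea only. It is not taken from libgrapheme or unicode-segmentation; the `riOdd` variable and the function name are hypothetical, and all other UAX #29 rules (for example GB9b) are ignored.

```cpp
#include <cstddef>

// Regional Indicator symbols occupy U+1F1E6..U+1F1FF.
constexpr bool isRI(char32_t cp) {
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Counts how many RI pairs (or trailing lone RIs) a forward scan opens,
// keeping one bit of state instead of looking backward.
constexpr std::size_t countFlagClusters(const char32_t *text,
                                        std::size_t length) {
    bool riOdd = false;        // true while the last RI is still unpaired
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (isRI(text[i])) {
            if (!riOdd)
                ++clusters;    // this RI opens a new pair
            riOdd = !riOdd;    // flip-flop: unpaired -> paired -> unpaired
        } else {
            riOdd = false;     // any non-RI ends the current run
        }
    }
    return clusters;
}

// "A😀🇸🇪🇸🇪A" and "A😀🇸🇪🇸🇪🇸🇪A" as code points.
constexpr char32_t kTwoFlags[] = {U'A', 0x1F600, 0x1F1F8, 0x1F1EA,
                                  0x1F1F8, 0x1F1EA, U'A'};
constexpr char32_t kThreeFlags[] = {U'A', 0x1F600, 0x1F1F8, 0x1F1EA,
                                    0x1F1F8, 0x1F1EA, 0x1F1F8, 0x1F1EA, U'A'};
```

With this state bit, the two test strings yield 2 and 3 flag clusters respectively. A run with an odd number of RIs ends in a lone RI that forms its own cluster.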