Details
-
Task
-
Resolution: Unresolved
-
P3: Somewhat important
-
0e67553aa (dev)
Description
Unicode 15.1 introduced new rules for line breaking (LB28a, https://www.unicode.org/reports/tr14/tr14-51.html#LB28a) and grapheme clustering (GB9c, https://www.unicode.org/reports/tr29/#GB9c) for Indic scripts. Neither rule is trivial to add to the existing code, because both have a different form from the rules that are already supported. The grapheme clustering rule also requires extracting data from another Unicode data file (the Indic_Conjunct_Break property from DerivedCoreProperties.txt). In addition, Qt already implements custom segmentation rules for some of these scripts in qunicodetools.cpp.
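To illustrate why GB9c does not fit a rule form that only looks at two adjacent classes, here is a minimal sketch of the check it requires. It assumes each code point has already been mapped to its InCB value; the enum and the function name gb9cNoBreakBefore are hypothetical and do not exist in qunicodetools.cpp.

```cpp
#include <cstddef>
#include <vector>

// Values of the Indic_Conjunct_Break (InCB) property.
enum class InCB { None, Consonant, Extend, Linker };

// Returns true if GB9c forbids a break before props[pos], i.e. if the text
// before pos matches
//   Consonant [Extend Linker]* Linker [Extend Linker]*
// and props[pos] is itself a Consonant.
// Hypothetical helper for illustration only.
bool gb9cNoBreakBefore(const std::vector<InCB> &props, std::size_t pos)
{
    if (pos == 0 || pos >= props.size() || props[pos] != InCB::Consonant)
        return false;

    bool sawLinker = false;
    std::size_t i = pos;
    // Walk backwards over the run of Extend and Linker code points.
    while (i > 0) {
        const InCB p = props[i - 1];
        if (p == InCB::Linker)
            sawLinker = true;
        else if (p != InCB::Extend)
            break;
        --i;
    }
    // The run must contain at least one Linker and follow a Consonant.
    return sawLinker && i > 0 && props[i - 1] == InCB::Consonant;
}
```

The backward scan is unbounded in length, which is the part that a pairwise break table cannot express directly. A real implementation would more likely carry a small state forward while iterating.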
It would be good to adapt the segmentation code to use the Unicode data instead of the custom tables if possible. This would also allow using the Unicode test data for Indic scripts.
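As a rough sketch of what consuming the Unicode test data involves, the following parses one line in the format used by the break test files (÷ marks a permitted break, × a forbidden one, code points are hexadecimal, and # starts a comment). The struct and function names are hypothetical, and the code assumes well-formed input encoded as UTF-8.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct BreakTestCase {
    std::u32string text;             // code points of the test string
    std::vector<std::size_t> breaks; // permitted break positions (code point offsets)
};

// Parses a line such as "÷ 0915 × 094D × 0937 ÷ # comment".
// Hypothetical helper; malformed hex tokens make std::stoul throw.
BreakTestCase parseBreakTestLine(const std::string &line)
{
    static const std::string breakMark = "\xC3\xB7";   // UTF-8 for U+00F7 (÷)
    static const std::string noBreakMark = "\xC3\x97"; // UTF-8 for U+00D7 (×)

    BreakTestCase result;
    // Everything from '#' onwards is a comment.
    std::istringstream tokens(line.substr(0, line.find('#')));
    std::string token;
    while (tokens >> token) {
        if (token == breakMark)
            result.breaks.push_back(result.text.size());
        else if (token != noBreakMark)
            result.text.push_back(static_cast<char32_t>(std::stoul(token, nullptr, 16)));
    }
    return result;
}
```

The expected break positions can then be compared with the boundaries reported by the segmentation code for the same string.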
For now I'm going to disable parts of tst_qtextboundaryfinder that depend on those new rules.
Strangely, the Unicode line breaking and grapheme clustering rules cover disjoint sets of scripts. Line breaking seems to support Balinese, Javanese, Brahmi, Grantha, Dives Akuru and Kawi. InCB values are set for Devanagari, Bengali, Gujarati, Oriya, Telugu and Malayalam (except for the Extend value, which is set for more scripts, some of which are not Indic/Brahmic).
Qt has special rules for Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam and Sinhala, so relying only on the Unicode grapheme clustering rule would be a regression for the scripts in that list that have no InCB data (Gurmukhi, Tamil, Kannada and Sinhala). There are also separate rules for Thai, Tibetan, Myanmar and Khmer.
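Extracting the InCB data mentioned above could look roughly like the sketch below. It assumes the usual DerivedCoreProperties.txt layout for this property, where each entry has the form "094D ; InCB; Linker # comment" or "0915..0939 ; InCB; Consonant # comment". The InCBRange struct and parseInCBLine function are hypothetical names, not part of Qt's Unicode table generator.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

struct InCBRange {
    char32_t first;
    char32_t last;
    std::string value; // "Consonant", "Extend" or "Linker"
};

// Strips leading and trailing spaces and tabs.
static std::string trim(const std::string &s)
{
    const auto b = s.find_first_not_of(" \t");
    if (b == std::string::npos)
        return {};
    const auto e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Returns the parsed range for an InCB entry, or std::nullopt for comments,
// blank lines and entries that belong to other properties.
std::optional<InCBRange> parseInCBLine(const std::string &line)
{
    const std::string data = line.substr(0, line.find('#')); // drop the comment
    const auto semi1 = data.find(';');
    if (semi1 == std::string::npos)
        return std::nullopt;
    const auto semi2 = data.find(';', semi1 + 1);
    if (semi2 == std::string::npos)
        return std::nullopt;
    if (trim(data.substr(semi1 + 1, semi2 - semi1 - 1)) != "InCB")
        return std::nullopt;

    const std::string range = trim(data.substr(0, semi1));
    const std::string value = trim(data.substr(semi2 + 1));
    if (range.empty() || value.empty())
        return std::nullopt;

    const auto dots = range.find("..");
    try {
        const unsigned long first = std::stoul(range.substr(0, dots), nullptr, 16);
        const unsigned long last = dots == std::string::npos
                ? first
                : std::stoul(range.substr(dots + 2), nullptr, 16);
        return InCBRange{static_cast<char32_t>(first), static_cast<char32_t>(last), value};
    } catch (const std::exception &) {
        return std::nullopt; // not a valid hexadecimal code point
    }
}
```

Collecting the ranges per value would also make it easy to list which scripts actually carry Consonant and Linker data and to compare that set against the scripts handled by the custom rules.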
Issue Links
- resulted from
-
QTBUG-121529 Update UCD to version 32 (Unicode 15.1)
- Closed