  1. Qt
  2. QTBUG-121907

Consider using Unicode data for Indic script segmentation

Details

    • Task
    • Resolution: Unresolved
    • P3: Somewhat important
    • All
    • 0e67553aa (dev)

    Description

      Unicode 15.1 introduced new rules for Indic scripts: LB28a for line breaking (https://www.unicode.org/reports/tr14/tr14-51.html#LB28a) and GB9c for grapheme clustering (https://www.unicode.org/reports/tr29/#GB9c). Neither rule is trivial to add to the existing code, because both have a different form than the rules already supported. The grapheme clustering rule also requires data that Qt does not extract yet (the Indic_Conjunct_Break property from DerivedCoreProperties.txt). In addition, Qt already implements custom segmentation rules for some of these scripts in qunicodetools.cpp.
      It would be good to adapt the segmentation code to use the Unicode data instead of the custom tables, if possible. This would also allow using the Unicode test data for Indic scripts.
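
As a rough illustration of the extra data extraction step, the sketch below parses a single Indic_Conjunct_Break (InCB) entry in the format used by DerivedCoreProperties.txt. The names `InCB`, `InCBRange` and `parseInCBLine` are hypothetical and do not exist in Qt; this is not how Qt's own table generator is written, only a minimal example of reading the `code point(s) ; InCB; value # comment` layout.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; they are not part of Qt.
enum class InCB { Linker, Consonant, Extend };

struct InCBRange {
    char32_t first;
    char32_t last;
    InCB value;
};

// Remove leading and trailing spaces and tabs.
static std::string trim(const std::string &s)
{
    const auto b = s.find_first_not_of(" \t");
    if (b == std::string::npos)
        return {};
    const auto e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Parse one line such as
//   "0915..0939    ; InCB; Consonant # Lo  [37] ..."
// Returns std::nullopt for comments, blank lines and other properties.
std::optional<InCBRange> parseInCBLine(const std::string &line)
{
    // Everything after '#' is a comment.
    const std::string data = line.substr(0, line.find('#'));

    // Split the remaining text on ';' and trim each field.
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const auto semi = data.find(';', start);
        const auto len = semi == std::string::npos ? std::string::npos : semi - start;
        fields.push_back(trim(data.substr(start, len)));
        if (semi == std::string::npos)
            break;
        start = semi + 1;
    }
    if (fields.size() != 3 || fields[1] != "InCB")
        return std::nullopt;

    InCB value;
    if (fields[2] == "Linker")
        value = InCB::Linker;
    else if (fields[2] == "Consonant")
        value = InCB::Consonant;
    else if (fields[2] == "Extend")
        value = InCB::Extend;
    else
        return std::nullopt;

    // The first field is either a single code point or "first..last", in hex.
    const std::string &range = fields[0];
    const auto dots = range.find("..");
    try {
        const auto first = std::stoul(range.substr(0, dots), nullptr, 16);
        const auto last = dots == std::string::npos
                ? first
                : std::stoul(range.substr(dots + 2), nullptr, 16);
        return InCBRange{static_cast<char32_t>(first), static_cast<char32_t>(last), value};
    } catch (const std::exception &) {
        return std::nullopt; // malformed code point field
    }
}
```

A real generator would collect these ranges into a lookup table alongside the other per-character properties; that part is left out here.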

      For now I'm going to disable parts of tst_qtextboundaryfinder that depend on those new rules.

      Strangely, the Unicode line breaking and grapheme clustering rules handle disjoint sets of scripts. Line breaking seems to support Balinese, Javanese, Brahmi, Grantha, Dives Akuru and Kawi. The InCB Consonant and Linker values are set for Devanagari, Bengali, Gujarati, Oriya, Telugu and Malayalam; the Extend value is set for more scripts, some of which are not Indic/Brahmic.
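
To show why GB9c has a different form than the pairwise rules, here is a minimal sketch of the check, assuming the InCB value of every code point is already known. GB9c forbids a break before a Consonant when it is preceded by a Consonant followed by any mix of Extend and Linker characters that contains at least one Linker, so the check has to look back over an unbounded run instead of a fixed pair. The names `InCB` and `gb9cForbidsBreak` are hypothetical and are not taken from Qt.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical enum for illustration; None means the code point has no
// InCB value of interest.
enum class InCB { None, Linker, Consonant, Extend };

// Returns true if GB9c forbids a grapheme cluster break immediately
// before props[pos], i.e. the text before pos matches
//   Consonant [Extend Linker]* Linker [Extend Linker]*
// and props[pos] is itself a Consonant.
bool gb9cForbidsBreak(const std::vector<InCB> &props, std::size_t pos)
{
    if (pos == 0 || pos >= props.size() || props[pos] != InCB::Consonant)
        return false;

    bool sawLinker = false;
    std::size_t i = pos;
    while (i > 0) {
        --i;
        switch (props[i]) {
        case InCB::Linker:
            sawLinker = true; // at least one Linker is required
            break;
        case InCB::Extend:
            break;            // Extend may appear anywhere in the run
        case InCB::Consonant:
            return sawLinker; // run is anchored by a Consonant
        case InCB::None:
            return false;     // any other character ends the match
        }
    }
    return false; // reached start of text without finding a Consonant
}
```

For example, the sequence Consonant, Linker, Consonant (such as a Devanagari letter, virama, letter) stays in one cluster, while Consonant, Extend, Consonant does not match because no Linker is present.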

      Qt has special rules for Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam and Sinhala, so using only Unicode grapheme clustering would be a regression for the scripts that have no InCB consonant data (Gurmukhi, Tamil, Kannada and Sinhala). There are also separate rules for Thai, Tibetan, Myanmar and Khmer.

            People

              cnn Qt Core & Network
              Ievgenii Meshcheriakov (ievgenii.meshcheriakov)
              Vladimir Minenko
              Alex Blasche Alex Blasche