  1. Qt
  2. QTBUG-139922

Sanity-check handling of invisible markers in CLDR's Unicode for signs


    • Type: Task
    • Resolution: Unresolved
    • Priority: P2: Important
    • None
    • 5.0.0, 6.10
    • Core: Locales (i18n)
    • None
    • All

      In the course of writing tests for QLocaleData::NumericData I have noticed that several locales that use an ASCII dash as minus sign combine it in various rather haphazard-seeming ways with Unicode BiDi markers and, in one case, the Arabic letter mark.

      Key to shorthand below, dramatis personae or rogues gallery (nick, U+xxxx, Unicode name):

      • PLUS: U+002B PLUS SIGN (ASCII)
      • DASH: U+002D HYPHEN-MINUS (ASCII)
      • ALM: U+061C ARABIC LETTER MARK
      • L2R: U+200E LEFT-TO-RIGHT MARK (BiDi)
      • R2L: U+200F RIGHT-TO-LEFT MARK (BiDi)
      • MINUS: U+2212 MINUS SIGN (canonical)

      Examples found when checking our CLDR-derived data for variations on the minus sign (in each case, with either DASH or MINUS replaced by PLUS for the plus sign, decorated the same way), all in use with Arabic digits:

      • ar (ar-Arab-EG): ALM DASH
      • pa-PK (pa-Arab-PK): L2R DASH L2R
      • ckb-Arab-IQ: R2L DASH (Central Kurdish)
      • ar-Arab-IN: L2R DASH
      • fa-Arab-IR: L2R MINUS
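The shorthand and the examples above can be written down as data, which makes the invisible parts easy to inspect. This is only an illustrative sketch in Python: the helper names `describe` and `invisibles` are made up for this note, and treating "general category Cf" as the test for an invisible marker is an assumption that happens to cover ALM, L2R and R2L.

```python
import unicodedata

# Nicknames from the key above, mapped to their code points.
NICKS = {
    "PLUS": "\u002B",
    "DASH": "\u002D",
    "ALM": "\u061C",
    "L2R": "\u200E",
    "R2L": "\u200F",
    "MINUS": "\u2212",
}
BY_CHAR = {ch: nick for nick, ch in NICKS.items()}

# Minus-sign strings exactly as listed in the examples above.
MINUS_SIGNS = {
    "ar-Arab-EG": "\u061C\u002D",
    "pa-Arab-PK": "\u200E\u002D\u200E",
    "ckb-Arab-IQ": "\u200F\u002D",
    "ar-Arab-IN": "\u200E\u002D",
    "fa-Arab-IR": "\u200E\u2212",
}

def describe(sign):
    """Spell out a sign string as a space-separated list of nicknames."""
    return " ".join(BY_CHAR.get(ch, f"U+{ord(ch):04X}") for ch in sign)

def invisibles(sign):
    """Return the format characters (general category Cf) found in sign."""
    return [ch for ch in sign if unicodedata.category(ch) == "Cf"]

if __name__ == "__main__":
    for locale, sign in MINUS_SIGNS.items():
        print(f"{locale}: {describe(sign)} ({len(invisibles(sign))} invisible)")
```

Printing a sign string directly would show only the dash, which is the point of spelling it out by nickname.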

      Grounds for suspicion:

      • DASH is not a letter, so marking it with ALM is suspect
      • DASH doesn't have an intrinsic left-to-right or right-to-left nature, and is indeed also used in Arabic as a hyphen between words as well as a minus sign, so an L2R before it (as part of a number, with digits read left-to-right, even though surrounding text is right-to-left) makes sense (as in ar-Arab-IN and fa-Arab-IR), but a second one after it (as in pa-Arab-PK) is fatuous.
      • Likewise, an L2R would be superfluous if the sign appeared in a L2R context, such as just after a preceding number.
      • R2L (Central Kurdish, ckb) would seem to suggest digits are in right-to-left order, which is the reverse of Arabic's order.
      • If contributors copied and pasted text into a Unicode.org tool gathering data, they would not see any invisible markers included in what they had copied, so may have thought they were pasting just a DASH or MINUS; this would be consistent with the seemingly haphazard mixture.

      At present we take what we find in CLDR verbatim and only consider a text to have a sign in it if the DASH or MINUS appears with the exact pattern of invisible markers that we recorded; this is very likely misguided.
      We should probably:

      • When parsing the CLDR data, check for any such invisible marks and (at least for R2L scripts) discard them before transcribing to QLocaleXML format.
      • When parsing a sign in QLocale's code, quietly accept such markers (at least for locales whose textDirection() is Qt::RightToLeft) before and/or after the sign.
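The two suggestions above can be sketched as follows. This is Python for brevity and is not QLocale's implementation (which is C++); the names `INVISIBLE`, `strip_invisible`, `match_sign_strict` and `match_sign_lenient` are hypothetical, and the marker set is limited to the three characters seen in this report, which is an assumption.

```python
# The three invisible markers seen in the data above. This set is an
# assumption; other BiDi controls may need to be added.
INVISIBLE = frozenset("\u061C\u200E\u200F")

def strip_invisible(sign):
    """First suggestion: drop invisible markers from a CLDR sign string."""
    return "".join(ch for ch in sign if ch not in INVISIBLE)

def match_sign_strict(text, pos, sign):
    """Verbatim matching: the recorded sign must appear exactly.
    Returns the index just past the sign, or -1 if it is absent."""
    return pos + len(sign) if text.startswith(sign, pos) else -1

def match_sign_lenient(text, pos, sign):
    """Second suggestion: quietly accept invisible markers before and
    after the bare sign. Returns the index just past the sign and any
    trailing markers, or -1 if no sign is found at pos."""
    bare = strip_invisible(sign)
    i = pos
    while i < len(text) and text[i] in INVISIBLE:
        i += 1  # skip leading markers
    if not bare or not text.startswith(bare, i):
        return -1
    i += len(bare)
    while i < len(text) and text[i] in INVISIBLE:
        i += 1  # skip trailing markers
    return i
```

With the recorded sign ALM DASH, a user typing a bare DASH before the digits is rejected by `match_sign_strict` but accepted by `match_sign_lenient`, as is input decorated the pa-Arab-PK way (L2R DASH L2R). Whether leniency should depend on textDirection() is left out of this sketch.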

      It is possible similar treatment may also be appropriate for the various separators in number formats (digit grouping, fractional part, exponent). The LDML spec makes no mention of invisible companions to the sign characters where it describes how they are identified. In no common/main/*.xml file of the CLDR distribution does any locale have anything visible aside from MINUS and DASH within any <minusSign> element; nor does any <plusSign> element contain anything visible aside from PLUS.
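The check on `<minusSign>` and `<plusSign>` contents described above could be repeated along these lines. This is a sketch, not the script actually used: the function name `unexpected_visible` is hypothetical, "invisible" is approximated as general category Cf, and it assumes the CLDR files parse with Python's standard `xml.etree.ElementTree`.

```python
import glob
import os
import unicodedata
import xml.etree.ElementTree as ET

# Visible characters expected in each element, per the observation above.
EXPECTED = {"minusSign": {"\u002D", "\u2212"}, "plusSign": {"\u002B"}}

def unexpected_visible(cldr_main_dir):
    """Yield (file name, tag, text) for each <minusSign> or <plusSign>
    containing a character that is neither expected for that element
    nor a format character (general category Cf)."""
    for path in sorted(glob.glob(os.path.join(cldr_main_dir, "*.xml"))):
        root = ET.parse(path).getroot()
        for tag, allowed in EXPECTED.items():
            for elem in root.iter(tag):
                text = elem.text or ""
                if any(ch not in allowed and unicodedata.category(ch) != "Cf"
                       for ch in text):
                    yield os.path.basename(path), tag, text
```

Run against the common/main directory of a CLDR checkout, `list(unexpected_visible(...))` should come back empty if the observation above holds for that CLDR version.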

      I have asked the Unicode CLDR folks about what's really going on here and shall summarise responses here, then decide what course of action we should pursue. Initial responses point to trusting the CLDR forms as the correct form to use when serialising numbers, but ignoring the invisibles when parsing them.


            Eddy Edward Welbourne
            Eddy Edward Welbourne
            Vladimir Minenko Vladimir Minenko
            Alex Blasche Alex Blasche