Qt / QTBUG-111801

Emoji font selection in Qt

Details

    • Type: User Story
    • Resolution: Unresolved
    • Priority: P2: Important
    • Fix Version/s: None
    • Affects Version/s: 6.5.0 Beta3
    • Component/s: GUI: Font handling
    • Labels: None

    Description

      There are a few related issues with how we do font selection in Qt compared to end user expectations. This becomes especially visible for emojis, since they form an ad hoc quasi-writing system with lots of ad hoc concepts slapped on top. As a result, it looks like all frameworks and web browsers have a lot of special casing for emojis internally, with the main purpose of preferring color fonts over others.

      Some overview: Emojis are not a writing system in themselves; the term usually refers to the color representation of certain characters. While Unicode only defines these as pictographs (which is also the literal meaning of the word), there is a vague idea that a pictograph can have a "text" and an "emoji" representation, which essentially means monochrome versus multicolor.

      The correct representation of emojis is mainly based on OpenType/AAT font rules, such as glyph substitution. Sequences of characters may be treated using approaches similar to those used for ligatures and diacritics during text shaping.

      For instance, a character such as the airplane pictograph, which predates emojis by a couple of decades, may be suffixed by the "diacritic" (non-spacing mark) character variation-selector-16. When this occurs, the user will expect an "emoji" representation of the airplane, which typically means it should be multicolored.

      The following string should show a colorful plane:

      \u2708\ufe0f
      

      In addition, formatting characters such as the zero-width joiner may be used to create additional emojis which are not part of Unicode. For example, the combination of a white flag pictograph and the transgender symbol joined by a zero-width joiner will by some fonts be recognized as the transgender flag, although this flag is not yet recognized as a character in Unicode.

      The following string should be displayed as the transgender flag (e.g. in Noto Color Emoji)

      \ud83c\udff3\u200d\u26a7
      

      Similarly, U+200D may be used, together with modifier characters, to construct emojis with specific skin colors, genders etc.

      These can be combined, so that you can get very complex strings of modifiers to signify a single glyph. For example, this is another way to write the transgender flag emoji above, but with explicit modifiers to ensure color emoji representation for the white flag and transgender symbol characters:

      \ud83c\udff3\ufe0f\u200d\u26a7\ufe0f
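
      The escaped strings above are written as UTF-16 code units, so the white flag appears as a surrogate pair. As a rough illustration (not part of the original report), the following Python sketch combines the surrogates and lists the code points in each string. The helper name scalars and the constant names are made up for this example.

```python
import unicodedata


def scalars(escaped: str) -> list[int]:
    """Combine UTF-16 surrogate pairs in `escaped` into Unicode code points."""
    # "surrogatepass" lets the lone surrogates through the encoder; decoding
    # the resulting UTF-16 bytes then pairs them up into single code points.
    text = escaped.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return [ord(ch) for ch in text]


# The three strings from the description, written as UTF-16 escapes.
AIRPLANE_EMOJI = "\u2708\ufe0f"
FLAG_SHORT = "\ud83c\udff3\u200d\u26a7"
FLAG_FULL = "\ud83c\udff3\ufe0f\u200d\u26a7\ufe0f"

for s in (AIRPLANE_EMOJI, FLAG_SHORT, FLAG_FULL):
    for cp in scalars(s):
        # Print each code point with its Unicode character name.
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
    print()
```

      The shaper has to see each of these lists as one unit in order to apply the font's substitution rules.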
      

      Most of this is based on convention, but it reuses font rules created for tackling complex writing systems, with the benefit that existing font formats and shapers can handle it without any additional work.

      For font selection, however, it poses a challenge, since emojis are not a writing system, but just a vaguely defined expectation that some characters are represented with colors.

      The core issue is that in order for the shaper to work, it must apply the same font to all the characters in the above string. If HarfBuzz does not get the full string of characters, then it cannot apply the correct font rules. Like most other frameworks, Qt supports font merging, which means that if there are characters in the string which are not supported by the selected font, then fallback fonts will be used to make sure you avoid missing character glyphs in the output.

      The way this works in Qt is that we first check which characters are supported by the selected main font. We then do a pass over all the remaining characters and check a platform-specific fallback list for alternative fonts, preferring the first font to contain an entry for the code point in its CMAP.

      There are multiple issues observed with this:

      1. If the selected font supports the pictograph (e.g. \u2708) and not the emoji representation, then you may end up with a broken representation. (QTBUG-108799)
      2. If the font has rules to process something like "\ud83c\udff3\ufe0f\u200d\u26a7\ufe0f" but does not have CMAP entries for individual characters such as \ufe0f, it will not be selected for the full string by Qt's font selection system (the \ufe0f ends up being mapped to a different font instead). (QTBUG-97401)
      3. Since emoji is an expected representation of a pictograph and not a writing system, some fonts may support part of an emoji sequence, but not the full thing. If selected, they will then be used for the characters they support and Qt will only select a fallback for the missing parts. (QTBUG-97400)
      4. Some fonts, such as DejaVu Sans, may contain entries for both the pictographs and the VS-16 character but do not process the sequence according to user expectations, so the end result is monochrome even though the emoji selector is present. This is probably a bug in the font itself, and if a user explicitly selects DejaVu Sans, then the expectation would be that they get the representation designed in that font. However, when you do not manually specify any font family, Qt will pick a single main font to represent the string based on system settings. A possible improvement might be to select different fonts for different substrings (or maybe only special-case emoji substrings, since writing system-based font selection typically works fine). We might also have to add special cases for DejaVu Sans so that it is not selected as a fallback for variation-selector characters, since it does not apply them correctly. (QTBUG-85744)
      5. Characters like joiners and combining characters are also expected to be joined with their neighbors as a single unit before being passed to the shaper, and they get messed up if Qt selects a different font for one compared to the other. (QTBUG-111324 and QTBUG-85014)
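
      The per-character selection described above, and the split it causes in issues 1 and 2, can be modelled roughly as follows. This is a toy sketch, not Qt's implementation: the font names, the cmaps dictionary and the function select_fonts are all hypothetical, and each font is reduced to the set of code points it has CMAP entries for.

```python
from typing import Optional


def select_fonts(text: str, main_font: str, fallbacks: list[str],
                 cmaps: dict[str, set[int]]) -> list[tuple[str, Optional[str]]]:
    """Assign a font to every character independently of its neighbors."""
    result = []
    for ch in text:
        cp = ord(ch)
        if cp in cmaps[main_font]:
            result.append((ch, main_font))
            continue
        # The first fallback whose (toy) cmap contains the code point wins.
        font = next((f for f in fallbacks if cp in cmaps[f]), None)
        result.append((ch, font))
    return result


# Hypothetical fonts. "ColorEmoji" is assumed to have shaping rules for the
# flag sequence, but no cmap entry for VS-16 (U+FE0F).
CMAPS = {
    "TextFont": {0x2708},
    "ColorEmoji": {0x2708, 0x1F3F3, 0x200D, 0x26A7},
    "SymbolFont": {0xFE0F},
}
FALLBACKS = ["ColorEmoji", "SymbolFont"]

flag = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"
runs = select_fonts(flag, "TextFont", FALLBACKS, CMAPS)
for ch, font in runs:
    print(f"U+{ord(ch):04X} -> {font}")
```

      In this model the two VS-16 characters land in a different font than the rest of the sequence, so no single font ever sees the full string.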

      At its core, the issue is not with emojis, but with how Qt applies font selection to individual characters, assuming that the writing system binds them together. Emoji is an ad hoc concept which builds on existing font features but expands quite significantly on how these are applied in real life, and in other engines there are apparently significant amounts of hacks and workarounds to address this.

      One possible approach which could solve several of the cases would be to tokenize the string initially, so that we do not do any matching for individual characters, but always aim to keep sequences together - preferring fonts that support the full sequence above all else. So a combining character or diacritic would always be in a token together with its preceding character, and a joiner would latch onto both the preceding and succeeding characters.
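
      A minimal sketch of such a tokenizer is shown below, assuming that "combining character or diacritic" can be approximated by the Unicode mark categories (Mn, Mc, Me), which also cover variation selectors such as VS-16. The names tokenize and is_attached are invented for this example; a real implementation would more likely follow the grapheme cluster rules in Unicode.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner


def is_attached(ch: str) -> bool:
    """True for characters that belong with the preceding character."""
    # Mn/Mc/Me are the combining-mark categories. Variation selectors such
    # as VS-16 (U+FE0F) have category Mn, so they are covered as well.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def tokenize(text: str) -> list[str]:
    """Split text into tokens that should be matched against one font."""
    tokens: list[str] = []
    join_next = False  # set after a ZWJ: the next character stays in the token
    for ch in text:
        if tokens and (join_next or ch == ZWJ or is_attached(ch)):
            tokens[-1] += ch
        else:
            tokens.append(ch)
        join_next = (ch == ZWJ)
    return tokens


flag = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"
for token in tokenize("a\u2708\ufe0f" + flag + "!"):
    print(" ".join(f"U+{ord(ch):04X}" for ch in token))
```

      Font matching would then run once per token instead of once per character.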

      This would get us a certain distance, but the problem remains that some fonts do not have CMAP entries for all the characters in the sequence, despite supporting them as part of the OpenType rules. Specifically this seems to be an issue with the VS-16 selector. It's hard to make a general approach here: We could say that if no font supports all characters in a token, then we prefer the font which supports the majority of the characters and just assume it will also handle the selector. However, this will still end up preferring DejaVu Sans in some cases, since it does support the full range of characters. Another option might be to run HarfBuzz on the "token" substrings and see which font gives us the fewest glyphs, but that is really arbitrary and probably too much of a performance penalty.
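
      The "full coverage first, then majority" idea can be sketched as below. This is only an illustration under stated assumptions: fonts are again reduced to hypothetical code point sets, pick_font and coverage are made-up names, and "MonoSymbols" stands in for a font that behaves like DejaVu Sans in the description above.

```python
from typing import Optional


def coverage(token: str, cmap: set[int]) -> int:
    """Number of characters in the token that the font has cmap entries for."""
    return sum(1 for ch in token if ord(ch) in cmap)


def pick_font(token: str, fonts: list[str],
              cmaps: dict[str, set[int]]) -> Optional[str]:
    # Prefer the first font that covers the whole token.
    for font in fonts:
        if coverage(token, cmaps[font]) == len(token):
            return font
    # Otherwise take the font covering the most characters and assume it
    # also handles the rest. Ties go to the earlier font in the list.
    best = max(fonts, key=lambda f: coverage(token, cmaps[f]), default=None)
    if best is None or coverage(token, cmaps[best]) == 0:
        return None
    return best


# Hypothetical fonts: "ColorEmoji" lacks a cmap entry for VS-16, while
# "MonoSymbols" has one but only offers monochrome glyphs.
CMAPS = {
    "ColorEmoji": {0x2708, 0x1F3F3, 0x200D, 0x26A7},
    "MonoSymbols": {0x2708, 0xFE0F},
}
FONTS = ["ColorEmoji", "MonoSymbols"]

flag = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"
print(pick_font(flag, FONTS, CMAPS))            # majority rule applies
print(pick_font("\u2708\ufe0f", FONTS, CMAPS))  # full coverage applies
```

      In this model the flag sequence goes to the color font through the majority rule, but the airplane sequence is still taken by the monochrome font because it covers every character, which is the remaining problem described above.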

      We could also ignore the problem of DejaVu Sans "stealing" the tokens for now, and apply some hardcoded blacklisting rules later to work around that, given that this problem is not typical, and that most fonts which support the VS-16 will provide color representations.

      Another issue to address is the fact that we query fallback fonts only based on the writing system, but the platforms typically have a default color emoji font as well. If we do this tokenization, we can annotate the emoji tokens (based on whether the character falls in the emoji range or if the VS-16 character is present) and query the system for the default emoji font instead. This would be more efficient than searching all fallback fonts for the emoji, and it would also improve our chances of getting the correct font rather than some random font which happens to support the characters in the string.
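
      The annotation step could look roughly like this. Note the assumptions: the single range U+1F300..U+1FAFF is only a coarse stand-in for "the emoji range" (a real implementation would more likely use the Unicode emoji property data), and is_emoji_token and candidate_fonts are hypothetical names, not Qt API.

```python
VS16 = 0xFE0F

# Assumption: a coarse approximation of the emoji range, for illustration only.
EMOJI_RANGES = [(0x1F300, 0x1FAFF)]


def is_emoji_token(token: str) -> bool:
    """True if the token contains VS-16 or a character in the emoji range."""
    code_points = [ord(ch) for ch in token]
    if VS16 in code_points:
        return True
    return any(lo <= cp <= hi for cp in code_points for lo, hi in EMOJI_RANGES)


def candidate_fonts(token: str, writing_system_fallbacks: list[str],
                    emoji_font: str) -> list[str]:
    """Fonts to try for a token: only the emoji font for emoji tokens."""
    if is_emoji_token(token):
        return [emoji_font]
    return writing_system_fallbacks


for token in ("\u2708", "\u2708\ufe0f", "\U0001F600", "a"):
    label = " ".join(f"U+{ord(ch):04X}" for ch in token)
    print(label, "->", candidate_fonts(token, ["FallbackA", "FallbackB"], "SystemEmoji"))
```

      With this split, the bare airplane pictograph still goes through the normal writing system fallbacks, while the same character followed by VS-16 is sent straight to the emoji font.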

            People

              Assignee: Eskil Abrahamsen Blomfeldt (esabraha)
              Reporter: Eskil Abrahamsen Blomfeldt (esabraha)
              Votes: 5
              Watchers: 11

                Gerrit Reviews

                  There are 3 open Gerrit changes