Some Japanese fonts have glyphs associated with multiple codepoints. For example, in "Meiryo", one of the standard fonts of the Japanese edition of Windows, glyph ID 2473 is associated with two codepoints: "日" (U+65E5) and "⽇" (U+2F47).
The former is a basic, daily-use Chinese character meaning "day" or "sun"; the latter is the codepoint for the corresponding "radical", i.e. a component of Chinese characters, and is rarely used. There are around 150 "Chinese character / radical" pairs that look identical like the example above. The other standard fonts "Yu Gothic" and "Yu Mincho" also contain these pairs. (I'm not sure, but the situation is perhaps the same in the Traditional Chinese font "Microsoft JhengHei" used in Taiwan.)
gui/text/qfontsubset.cpp seems to generate the ToUnicode table from the glyph-ID subset. When text containing U+65E5 is printed to PDF, the generated ToUnicode table records U+2F47 as the codepoint for glyph 2473, because U+2F47 is found before U+65E5 while building the glyphID-to-Unicode reverseMap. Such PDFs look fine to a human viewer, but the extracted text is seriously garbled, which breaks searching and copy-and-paste.
(This problem is not noticeable when the PDF file is viewed with PDF.js in Chrome or Firefox, because PDF.js normalizes "radicals" to normal Chinese characters. Please check it with Acrobat Reader.)
How to reproduce
Tested environment: Windows 10 Pro (1903) / MSVC 2019 / Qt 5.12.5
Sample program: main.cpp, main.pro
Sample output file: Test.pdf
The program produces Test.pdf. Open Test.pdf with Acrobat Reader (not PDF.js) and copy the second Chinese character. It should be U+65E5, but you get U+2F47.
(The above procedure requires the Japanese font "Meiryo", which is included in the "Japanese Supplemental Fonts" optional feature of Windows 10. But there is no need to reproduce it: the problem is evident in QFontSubset::getReverseMap(), which assumes a single Unicode codepoint per glyph.)
(Even if you do not have Japanese fonts, you can see another form of this issue in the middle part of Test.pdf: Qt cannot generate a correct ToUnicode entry for the ligature glyph 'tti' in the font 'Calibri', because there is a single Unicode codepoint for 'ffi' but none for 'tti'. Improvement is required for European languages too.)
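For reference, the ToUnicode CMap format (PDF 32000-1, §9.10.3) already allows one glyph to map to a *sequence* of UTF-16BE codepoints in a bfchar entry, so a ligature such as 'tti' is representable in principle. A sketch (glyph ID 0x0123 for the ligature is made up; 0x09A9 is glyph 2473 from the Meiryo example):

```
2 beginbfchar
<0123> <007400740069>   % hypothetical ligature glyph -> "tti"
<09A9> <65E5>           % glyph 2473 -> U+65E5
endbfchar
```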
Ideally, glyphs derived from different codepoints should be distinguished by different glyph IDs, so that text is recoverable from PDFs.
For the sample above, when text is extracted from the PDF, the first Chinese character should be U+2F47 and the second one U+65E5.
To support multiple codepoints for a single glyph, I think the glyph subset should be recorded as an array of (glyph, codepoints) pairs, and the ToUnicode CMap should be generated from that array.
I have created a sample patch for this (see the attached fix_pdf_tounicode.patch). It seems to improve the behavior (see the attached Test_patched.pdf). However, I needed to change the definition of the QFontSubset class, so I do not know whether this approach is acceptable. Besides, I also had to implement the ActualText feature, which may affect various other languages/scripts, especially those with complex shaping. For example, an Arabic case is shown at the bottom of Test.pdf, and my patch considerably changes the text-extraction behavior for that case.
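For context, ActualText is a marked-content property (PDF 32000-1, §14.9.4) that attaches replacement text to a run of glyphs; text extractors substitute it for whatever the glyph run would otherwise decode to. A hypothetical content-stream fragment, mapping one glyph run to U+65E5 (the replacement string is UTF-16BE with a BOM; the glyph code is the Meiryo example again):

```
BT
  /F1 12 Tf
  /Span << /ActualText <FEFF65E5> >> BDC
    <09A9> Tj     % show glyph 2473; extractors report the ActualText instead
  EMC
ET
```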
If the above approach is not possible, then as a Plan B, normalizing "radicals" to normal Chinese characters may mitigate the impact of the problem for Japanese, because radicals are used very rarely and this normalization is not critical for most people. In fact, the Unicode NFKD normalization maps radicals to the corresponding normal Chinese characters. Radicals live between U+2E80 and U+2FFF, while normal Chinese characters have higher codepoints. So, as a dirty hack in QFontSubset::getReverseMap(), when the first codepoint found for a glyph is a radical and a second, higher codepoint is also found, we could adopt the higher one. However, normalization makes it more difficult to use radicals and normal Chinese characters separately depending on the context, so it is not a desirable solution.