Unicode normalization for Wine

Tue Jul 25 16:33:51 CDT 2017

Dear All,

My name is Artur and I'm participating in Google Summer of Code 2017 for Wine.
Under Nikolay's supervision, I'm working on implementation of Unicode
normalization. I probably should have introduced myself some time ago to share
results of my research and my ideas, but I also wanted to wait until I could
illustrate my points with some code.

I'm not sure whether everyone is familiar with Unicode normalization forms, so
here's a short summary of concepts and terms I may use. For detailed          
information, please see [1].

- Characters that look the same on the screen may be encoded in a few different
  ways. Accented Latin letters may constitute a familiar example.
  The letter 'é' can be U+00E9 (LATIN SMALL LETTER E WITH ACUTE), but it also
  can be represented as the sequence U+0065 U+0301 (LATIN SMALL LETTER E,
  COMBINING ACUTE ACCENT). Many of 1114112 (if I remember the number correctly)
  Unicode characters are decomposable, i.e. they can also be represented by a
  different sequence of characters.

- Text in normalization form NFD ('D' stands for 'decomposition') does not
  contain any decomposable characters, e.g. there is U+0065 U+0301 instead of
  U+00E9. The combining characters (such as the above-mentioned acute accent)
  have a well defined order, using the concept of combining classes. [2 §3.11]

- Text in normalization form NFKD ('K' stands for 'compatibility') has the
  properties of NFD, plus characters that are typographical variants of other
  characters (e.g. superscripts, ligatures, etc.) are mapped to 'plain'
  characters.

- Text that will be normalized into the form NFC and NFKC ('C' stands for
  'composition') has to be fully decomposed first (NFD and NFKD, respectively)
  and starters (characters with a combining class zero) plus following non-
  starters (combining class greater than zero) can be composed into single
  characters provided that they exist in Unicode and are not present in [3].

I've developed a draft implementation of NFD & NFKD (mapping characters to
decomposed sequences as well as reordering combining characters) and written
a test program that uses official Unicode test data. [4]

My goal was to provide support for codepoints greater than 0xFFFF, encoded
in UTF-16 as surrogate pairs. More on that below.

The attachment [gsoc-v2.patch] is a diff of my working branch against master.
Most notable changes include:

- Necessary changes to tools/make_unicode.

- New function wine_decompose_string().

- Draft implementation of two Windows API functions: NormalizeString() and
  IsNormalizedString() (the latter temporarily returns TRUE). [5][6]

Rationale and comments:

- tools/make_unicode did not collect all necessary data (most notably,
  compatibility mappings), so a new data structure, @decomp_table_full has
  been added. Its unification with @decomp_table is yet to be resolved.

- The procedure dump_decompose_table() in that script has been rewritten
  to accomodate all aspects described above and below. Two helper subroutines
  have been added.

- However, the C code section embedded within the script has grown
  substantially, so it may be desirable to move it to a separate file.

- Adding wine_decompose_string() makes sense since surrogate pairs behave like
  a string and the Unicode Canonical Ordering Algorithm (for combining
  characters) is an in-place bubble sort on the string, per Unicode
  specification.

- While the code in libs/port/decompose.c (and embedded in tools/make_unicode)
  implements normalization, it currently serves the purpose of confirmation
  that the generated data is correct. It hasn't been yet optimized or cleaned
  up, because the organization of the data is open to discussion now, so the
  code may still change substantially.

- It has been previously assumed that a decomposition mapping of any
  character is two characters long. This is certainly not true for NFKD,
  where a character like U+FDFA decomposes to a string of length 18.
  Therefore, the tables in libs/port/decompose.c contain pairs of offsets to
  null-terminated mappings in another table (decomp_data, first in that file),
  one entry for NFD and one for NFKD, instead of two-character mappings.
  Moreover, since many NFD and NFKD are identical, they have been placed in
  the same table, with identical mappings merged and fully expanded (i.e. any
  decomposable characters in the data have been replaced by fully decomposed
  sequences), to minimize runtime overhead.

- The original idea of multi-stage tables (using the most significant bits of
  a codepoint to obtain an offset to another table etc.) has been preserved,
  although the tables are no longer combined into a single table. This is
  because some sub-tables can use a narrower data type, which saves space,
  especially when data for all Unicode codepoints is included.

- Tables of combining classes have a similar structure.

- Mappings for characters above 0xFFFF are encoded as UTF-16 (using surrogate
  pairs), but a single codepoint (UTF-32 if you like) is used for table
  indexing. Setting $utflim in make_unicode to 65536 is the simplest way to
  disable support for such characters, but supporting surrogate pairs should
  not affect any text-related Wine component in a negative way.

- As composition depends directly on decomposition, it is the next thing to
  work on.

The attachment [unorm_test.c] is my test program. It should build fine with
mingw-w64 or winegcc without any specific compiler flags. It looks for [4] in
its directory or accepts a path to that file as an argument. Option -d makes
it run only decomposition tests, option -s limits the range to 0xFFFF. No
tests should fail with the -s option and 80 of 187460 should fail without it
due to incomplete support for surrogate pairs in the reordering part. For
details of failed tests, please run with the -v option. Is there a place for
such a program somewhere in the Wine source tree?

I'd be more than happy to answer your questions, address your comments, and
listen to your opinions regarding Unicode normalization.

Regards,
Artur Świgoń

[1] http://www.unicode.org/reports/tr15/
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
[3] http://www.unicode.org/Public/10.0.0/ucd/CompositionExclusions.txt
[4] http://www.unicode.org/Public/10.0.0/ucd/NormalizationTest.txt
[5] https://msdn.microsoft.com/en-us/library/windows/desktop/dd319093(v=vs.85).aspx
[6] https://msdn.microsoft.com/en-us/library/windows/desktop/dd318671(v=vs.85).aspx
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gsoc-v2.patch
Type: application/octet-stream
Size: 349854 bytes
Desc: not available
URL: <http://www.winehq.org/pipermail/wine-devel/attachments/20170725/5ee84033/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unorm_test.c
Type: application/octet-stream
Size: 10413 bytes
Desc: not available
URL: <http://www.winehq.org/pipermail/wine-devel/attachments/20170725/5ee84033/attachment-0003.obj>
-------------- next part --------------