[PATCH v3 1/2] kernelbase/locale: Implement comparison on top of official unicode weight tables

Wed Mar 4 15:18:00 CST 2020

Hello Alexandre,

> Multi-language support, Japanese, Korean, multi-char sequences,
> surrogates, linguistic mappings, etc.
>
> There are a million things that need to be supported for proper
> sorting. You don't have to implement them all, but it should be clear
> from your approach that they can be added. Which in practice means you
> need to at least prototype most of them.

Well, they can be added; I just left them out of the initial versions.
A short breakdown:

- Multi-language: The character is looked up in the current language; as a
fallback, the default is used. Currently, only the default is implemented.

- Japanese: The main reason I did all of this. It's a special case, but
supported by the tables.

- Korean: Handled under Jamo. It's a special case, but supported by the
tables. I haven't implemented it properly yet because it's a lot of work.

- Multi-char sequences: You mean when a single codepoint is encoded as more
than one WCHAR? That is supported; Windows seems to treat each WCHAR
separately.

- Surrogates: Windows seems to treat each WCHAR on its own.

- Linguistic mappings: Not sure what you mean, sorry
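
To illustrate the multi-language point, here is a minimal sketch of a
per-WCHAR lookup that tries a language-specific table first and falls back
to the default one. It is not taken from the patch: the struct layout, the
function names and the use of 0 for "no weight found" are all assumptions
made for the example, and a real table would not be searched linearly.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short WCHAR;  /* stand-in for the 16-bit Windows WCHAR */

/* Hypothetical table layout: one weight per WCHAR. */
struct weight_entry
{
    WCHAR        ch;
    unsigned int weight;
};

struct weight_table
{
    const struct weight_entry *entries;
    size_t                     count;
};

/* Returns 1 and stores the weight if ch is in the table, 0 otherwise.
 * A NULL table simply contains nothing. */
static int table_lookup(const struct weight_table *table, WCHAR ch, unsigned int *weight)
{
    size_t i;

    if (!table) return 0;
    for (i = 0; i < table->count; i++)
    {
        if (table->entries[i].ch == ch)
        {
            *weight = table->entries[i].weight;
            return 1;
        }
    }
    return 0;
}

/* Language-specific table first (may be NULL), default table as fallback.
 * Returns 0 if neither table knows the character (assumption of this sketch). */
static unsigned int get_weight(const struct weight_table *lang,
                               const struct weight_table *def, WCHAR ch)
{
    unsigned int weight = 0;

    assert(def != NULL);  /* the default table must always exist */
    if (table_lookup(lang, ch, &weight)) return weight;
    table_lookup(def, ch, &weight);
    return weight;
}
```

Since every WCHAR is looked up on its own, each half of a surrogate pair
would go through get_weight() separately in this sketch.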

Question: How should I prove it works? I can't possibly add all of that in
the first draft.

> For instance you do 10 memory allocations before even starting to
> compare anything. That's clearly not cheap.

I understand. But for a dynamically sized sort key I need dynamic buffers.
Maybe I could put the initial buffers on the stack?
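
A rough sketch of what "initial buffers on the stack" could look like: the
buffer starts out inside the struct itself and only moves to the heap once
it overflows, so short strings need no allocation at all. The names and the
initial size of 64 bytes are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_KEY_SIZE 64  /* hypothetical initial capacity */

struct key_buffer
{
    unsigned char  initial[INITIAL_KEY_SIZE]; /* inline storage, no allocation needed */
    unsigned char *data;                      /* points at initial[] or at a heap block */
    size_t         len;
    size_t         capacity;
};

static void key_buffer_init(struct key_buffer *buf)
{
    buf->data = buf->initial;
    buf->len = 0;
    buf->capacity = sizeof(buf->initial);
}

/* Appends one byte. Returns 1 on success, 0 if growing the buffer failed
 * (the existing contents stay valid in that case). */
static int key_buffer_append(struct key_buffer *buf, unsigned char value)
{
    assert(buf->len <= buf->capacity);

    if (buf->len == buf->capacity)
    {
        size_t new_capacity = buf->capacity * 2;
        unsigned char *new_data;

        if (buf->data == buf->initial)
        {
            /* first overflow: move from the inline storage to the heap */
            new_data = malloc(new_capacity);
            if (new_data) memcpy(new_data, buf->initial, buf->len);
        }
        else new_data = realloc(buf->data, new_capacity);

        if (!new_data) return 0;
        buf->data = new_data;
        buf->capacity = new_capacity;
    }
    buf->data[buf->len++] = value;
    return 1;
}

/* Releases the heap block, if any, and resets the buffer to its inline storage. */
static void key_buffer_free(struct key_buffer *buf)
{
    if (buf->data != buf->initial) free(buf->data);
    key_buffer_init(buf);
}
```

One caveat of this layout is that the struct must not be copied by value
while data points at initial[], since the copy would still point into the
original.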

> We only have tests for a very small number of strings, that's clearly
> not proper coverage. Some way of systematically generating test strings
> should be considered.

Like, random strings from a known seed? I intentionally didn't do that,
because of performance concerns.
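
For reference, a minimal sketch of what seeded generation could look like.
It uses a small xorshift generator rather than rand(), so the same seed
yields the same strings on every platform. The function names are
hypothetical, and the number and length of the generated strings would have
to be capped to keep the test run time reasonable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned short WCHAR;  /* stand-in for the 16-bit Windows WCHAR */

/* xorshift32: tiny deterministic PRNG. The state must never be zero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills str with len random WCHARs in the range 0x0001-0xffff and
 * null-terminates it, so str must have room for len + 1 elements.
 * Unpaired surrogates can appear, since each WCHAR is drawn independently. */
static void generate_test_string(uint32_t *state, WCHAR *str, size_t len)
{
    size_t i;

    assert(*state != 0);
    for (i = 0; i < len; i++)
        str[i] = (WCHAR)(1 + next_random(state) % 0xffff);
    str[len] = 0;
}
```

Printing the seed when a comparison fails would make any failing string
reproducible.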

> Also testing sort keys directly, like you did in
> the first try (but without depending on the exact values).

I have that planned, yes. Do you want that in the first version already?
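
One way to test sort keys without depending on their exact values is to
check only their relative order: comparing two keys byte by byte should give
the same answer as comparing the original strings. Below is a minimal sketch
of such a helper. The names are hypothetical, and the result constants are
redefined locally with the same values as CSTR_LESS_THAN, CSTR_EQUAL and
CSTR_GREATER_THAN so that it builds outside of Wine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same values as the Windows CSTR_* constants. */
enum
{
    SKETCH_LESS_THAN    = 1,
    SKETCH_EQUAL        = 2,
    SKETCH_GREATER_THAN = 3
};

/* Orders two sort keys byte by byte. If one key is a prefix of the other,
 * the shorter key sorts first. */
static int compare_sort_keys(const unsigned char *a, size_t a_len,
                             const unsigned char *b, size_t b_len)
{
    size_t min_len = a_len < b_len ? a_len : b_len;
    int ret;

    assert(a && b);  /* memcmp must not be given NULL, even for length 0 */
    ret = memcmp(a, b, min_len);
    if (ret < 0) return SKETCH_LESS_THAN;
    if (ret > 0) return SKETCH_GREATER_THAN;
    if (a_len < b_len) return SKETCH_LESS_THAN;
    if (a_len > b_len) return SKETCH_GREATER_THAN;
    return SKETCH_EQUAL;
}
```

In an actual test, the keys would come from LCMapStringW with LCMAP_SORTKEY,
and the result of this helper would be checked against the return value of
CompareStringW for the same pair of strings and the same flags.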

> When there are differences between Windows versions we want to use the
> latest, since that's the one that will continue to work in the
> future. In this case it means using the most recent table.

Okay then. If that's important, I can change the table.

> Note that we most likely want to use a Windows-compatible NLS file, like
> we are now using for codepage or normalization tables. I can work on
> that part.

I have to admit, I don't know what you mean by that. I don't know about NLS
files.

Regards,
Fabian Maurer