[PATCH 1/5] kernelbase/locale: Implement sortkey generation on official tables

Thu Apr 30 14:00:56 CDT 2020

On 4/30/20 1:42 PM, Fabian Maurer wrote:
> Hello Zebediah,
> 
>> Yes, you'd have to iterate over the string multiple times. However,
>> that's a fixed, small number. I wouldn't be concerned about performance
>> until we find evidence that performance is a concern.
> 
> If you say so. I'll see what I can come up with.
> 
> 
>> I don't think you need to write it such that you switch over which
>> weight to write, either. It's probably easier just to make them separate
>> functions, like in my example below.
> 
> That would lead to quite a bunch of code duplication, especially when we come
> to the more difficult to handle characters. I'd like to avoid that. A switch
> for which weight to write should be a lot simpler.

Could be, though I'd warn that attempts to reduce duplication (e.g. by 
introducing complicated macros) can often make code much less readable.

> 
> 
>>> Also keep in mind, that we need to have some temporary buffer anyways,
>>> since weights that are very small only get added when there is a bigger
>>> one following it. Either that, or we need to backtrack and remove weights
>>> again. Due to those issues, I opted for the current approach.
>>
>> Can you please give an example of this? I'm not sure I see it in any of
>> your patches.
> 
> It's in patch 1, inside APPEND_LIST_TO_SORTKEY, see 'statement_is_ignored'.
> And AFAIK not only diacritics are affected.
> Not sure how I would best handle that when writing directly to the sortkey.

Okay, I'm not sure I'm reading this macro right, but this is all 
internal to diacritic weights, right? If I understand, final weights 
which are 1 or 0 get stripped from the end of the sortkey, right? I 
think you could easily handle that by stripping weights after adding all 
of the diacritics, e.g. to use my previous example code,

if (!(flags & NORM_IGNORENONSPACE))
{
     for (i = 0; i < srclen; ++i)
         used += get_diacritic_weights(src[i], dst + used, dstlen - used);

     while (dst[used - 1] < 2)
         used--;
}

>> Generally, you can continue to calculate size but only write when
>> there's space in the buffer. When I've written functions like this I've
>> found it to be relatively simple, though maybe there's some details here
>> that I'm missing...
> 
> Well, there multiple places where we'd need to check. But maybe it'll work
> out, just seemed more complex to me.
> 
> 
>>>> Is there a reason to use LCMapStringEx() here rather than LCMapString()?
>>>
>>> LCMapStringA/W completely ignore the locale, and use a LCID. We'd need to
>>> fix those functions to properly convert the locale. To start simple, I
>>> decided to use the more low-level function and once that works, built the
>>> rest on top.
>> It seems a simple enough fix; just call LCIDToLocaleName().
> 
> Yeah, probably. But I'd rather get the low-level stuff working first. Then I
> can write tests for the higher level functions and then fix them.
> 
> 
>> Generally I believe that explicitly specifying "struct" is preferred as
>> it makes clear what kind of object you're dealing with.
> 
> Okay then.
> 
> Regards,
> Fabian Maurer
> 
> 
>