[PATCH 1/5] kernelbase/locale: Implement sortkey generation on official tables

Thu Apr 30 13:13:34 CDT 2020

On 4/30/20 12:50 PM, Fabian Maurer wrote:
> Hello Zebediah,
> 
> thanks for your reply!
> 
> 
>> So as far as I understand, the sort key algorithm writes the level 0
>> weights (script and alphabetic weight) for the whole string to the sort
>> key, then the level 1 weights (diacritic), and so on, right?
> 
> Yes.
> 
>> In that case, what seems potentially simpler to me is to calculate those
>> weights one level at a time, rather than one character at a time.
> 
> But that would mean that we would iterate over the string multiple times, and
> in the function have a branch as to which weight to write.
> To me, my current approach seemed easier. I've thought quite a while about
> abandoning my lists, but the results ended up being a lot more complex.

Yes, you'd have to iterate over the string multiple times. However,
that's a fixed, small number. I wouldn't be concerned about performance
until we find evidence that performance is a concern.

If you're concerned about readability, I would submit that from an
outside perspective, the current implementation is not very readable.

I don't think you need to write it such that you switch over which
weight to write, either. It's probably easier just to make them separate
functions, like in my example below.

> 
> 
>> That is, you'd end up doing something like
>>
>> static int get_sortkey( DWORD flags, const WCHAR *src, int srclen, char
>> *dst, int dstlen )
>> {
>>      int used = 0;
>>      for (i = 0; i < srclen; ++i)
>>      {
>>          used += get_main_weights(src[i], dst + used, dstlen - used);
>>          if (!(flags & NORM_IGNORENONSPACE))
>>              used += get_diacritic_weights(src[i], dst + used, dstlen -
>> used);
>>          ...
>>      }
>> }
> 
> This won't work, since we first need all main weights, then the others.

Er, yes. I misthought. What I meant to write was

for (i = 0; i < srclen; ++i)
    used += get_main_weights(src[i], dst + used, dstlen - used);

if (!(flags & NORM_IGNORENONSPACE))
{
    for (i = 0; i < srclen; ++i)
        used += get_diacritic_weights(src[i], dst + used, dstlen - used);
}

...

> Also keep in mind, that we need to have some temporary buffer anyways, since
> weights that are very small only get added when there is a bigger one
> following it. Either that, or we need to backtrack and remove weights again.
> Due to those issues, I opted for the current approach.

Can you please give an example of this? I'm not sure I see it in any of
your patches.

> 
>> As that example shows, I also think it's probably simpler to just pass
>> the buffer directly to whatever functions are writing sortkey bytes into it.
> 
> As noted, it's not that easy.
> Especially when you need to deal with buffers that are too short - we'd need
> to mix in a stop condition but still return the needed length. Sometimes, we
> just need to continue iterating despite knowing we can't copy it into the
> sortkey.

Generally, you can continue to calculate size but only write when
there's space in the buffer. When I've written functions like this I've
found it to be relatively simple, though maybe there's some details here
that I'm missing...

> 
> 
>> Is there a reason to use LCMapStringEx() here rather than LCMapString()?
> 
> LCMapStringA/W completely ignore the locale, and use a LCID. We'd need to fix
> those functions to properly convert the locale. To start simple, I decided to
> use the more low-level function and once that works, built the rest on top.

It seems a simple enough fix; just call LCIDToLocaleName().

It's not terribly important, though.

> 
> 
>> The fact that this test is commented out never struck me as great. I'm
>> pretty sure that with todo_wine added as appropriate, it could pass. A
>> first patch in this series could be to do that.
> 
> Probably. I decided to leave it alone for now, and comment it in once it
> works, but sure, I can change that.
> 
> 
>> Are these comments useful?
> 
> Eh, I used them for a quick visual separation. I can remove them if they're
> too much noise though.
> 
> 
>> I get the impression that typedefs have largely fallen out of favour.
> 
> I always use them when I can, since they make the code shorter and I don't
> think having to add the struct keyword adds much. I've seen both in wine code,
> but given a policy I'll stick to that.

Generally I believe that explicitly specifying "struct" is preferred as
it makes clear what kind of object you're dealing with.

>> The fact that exactly one of these integer cases has a symbolic constant
>> attached seems less than ideal.
> 
> True, I'll add constants for the others as well.
> 
> 
> Regards,
> Fabian Maurer
> 
> 
> 
>