[PATCH 1/3] kernel32: Support UTF-7 in MultiByteToWideChar. (try 3)

Wed Oct 15 16:15:12 CDT 2014

On 10/15/2014 03:16 AM, Alex Henrie wrote:
> 2014-10-14 16:12 GMT-06:00 Michael Stefaniuc <mstefani at redhat.com>:
>>> +/***********************************************************************
>>> + *              utf7_mbstowcs
>>> + *
>>> + * UTF-7 to UTF-16 string conversion, helper for MultiByteToWideChar
>>> + *
>>> + * RETURNS
>>> + *   On success, the number of characters written
>>> + *   On dst buffer overflow, -1
>> What about the case where it fails to decode a character? E.g.
>> wine_utf8_mbstowcs() has -2 for that case which then translates to
>> SetLastError(ERROR_NO_UNICODE_TRANSLATION). I looked at the tests but
>> you didn't seem to test the GetLastError in the cases with the invalid
>> input characters.
> 
> The UTF-7 routines in Windows do not check for Unicode compliance.
> Still, I've added error code checking to all the tests now, which will
> hopefully make that more clear.
> 
>>> +    static const WCHAR base64_decoding_table[] = {
>> Why is this table an array of WCHAR? The entries aren't WCHARs.
> 
> I've changed them to unsigned chars.
You don't have to use the smallest possible type. A short or even int
would work too. And then you can use a signed type which doesn't
visually conflicts with the -1 in the table.

>>> +        /* \0   */ -1,    /* \x01 */ -1,    /* \x02 */ -1,    /* \x03 */ -1,
>> This table is really hard to read even in a editor that does syntax
>> highlighting. As you just need 4 characters per column you could have
>> sixteen columns and merge the comments to one comment at the end of the
>> line.  Something like this would be more readable as well as being more
>> compact:
>>         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>>  /* 0x00 - 0x15 */
> 
> I've changed it to something similar while still listing every
> individual character:
> 
>         /* \0     \x01   \x02   \x03   \x04   \x05   \x06   \a   */
>            -1,    -1,    -1,    -1,    -1,    -1,    -1,    -1,
Still not as compact as I would have done it but it is readable and
that's what matters.

>>> +    /* MultiByteToWideChar guarantees that srclen > 0 */
>>> +    assert(srclen > 0);
>> This assert isn't needed. Just rewriting the first do..while loop to a
>> while loop avoids it. That eliminates 3 lines of code of which one is a
>> comment
>> pointing to an unneeded assert.
> 
> A while loop adds an unnecessary check, but whatever. You and
> Sebastian Lackner both requested it, so I concede.
Uh? Trying to avoid a loop check with a do..while loop is a
micro-optimization. That's the job of the compiler. Our job is to
optimize the code for readability.

>>> +                    sextet = base64_decoding_table[sextet];
>>> +                    if (sextet == (WCHAR)-1)
>> Unneeded cast. Using a signed type for sextet also avoids the
>> -Wsign-compare compiler warning. Alternatively use ~0 instead of -1 if
>> you want to keep an unsigned type.
> 
> Using a signed char in an array subscript causes Wchar-subscripts. I'd
> just be trading one warning for another.
No, it doesn't. Wchar-subscripts complains only about the naked char as
that can be either signed or unsigned depending on the platform. Using
signed char lets the compiler know that you're aware of the issue and it
will leave you alone.

>>> +                        /* -1 means that the next character of src is not part of a base64 sequence */
>> This upper part of the comment is better placed in the declaration of
>> base64_decoding_table.
> 
> I really think it makes more sense where it is.
I was trying to reduce the verbosity of utf7_mbstowcs(). The multi line
comments drown the code in between. If you can fit the essence into one
comment line, that works too. utf7_wcstombs(), which has less comments,
is much more pleasant to read.

> 
> 2014-10-14 16:59 GMT-06:00 Michael Stefaniuc <mstefani at redhat.com>:
>>>  dlls/kernel32/tests/codepage.c | 751 ++++++++++++++++++++++++++++++++++
>>> +++++++ 1 file changed, 751 insertions(+)
>> while more tests are good that's a lot of lines to have for just one
>> function. So splitting the WCHAR => UTF-7 from the UTF-7 => WCHAR tests
>> should help make it more manageable.
> 
> The trouble is that the utf16_to_utf7_test's are reversible, so I
> can't cleanly categorize the tests as "encoding tests" or "decoding
> tests", because some tests are both.
Aha.

>> Also some of the strings are huge. Most of those seem to be for the
>> WCHAR => UTF-7 and could be probably generated by just using the
>> directly_encodable_table[] from patch 2.
>> If the tests cannot be generated from the table then please split the
>> strings into more manageable chunks, as is they are unreadable /
>> unmaintainable.
> 
> Those strings never need to be changed, ever. It's by far easier to
Being able to change them is just the icing on the top.
We cannot read them, that's the problem.

> generate them once and inline them. I'm not sure that splitting them
> up will make them much easier to read, either.
How should we know those are generated? Or that you didn't miss one of
the 256 characters? Those are just big blobs of data.
Tests in Wine serve a dual purpose. They don't only make sure that Wine
code won't regress but they also document the found behavior. For that
the tests need to be readable, akin to comments.

If you generate the tests then generate them programmatically in the
test. Something along the lines of:

   /* tests which one-byte characters are base64-encoded and which are
not */
   for (i = 0; i < 256; i++) {
       wstr[0] = i;
       ret = WideCharToMultiByte(CP_UTF7, flags, wstr, 2, str,
sizeof(str), ...)
       if (utf7_can_directly_encode(i))
           ok(ret == 1 && str[0] == i, "'%c' not represented directly in
UTF-7\n", i);
       else
           ok(ret == x && str[0] == '+' && str[x-1] == '-', "...");
   }

It is:
- shorter than the big blob approach
- far more readable
- it better conveys the intent of the test
- trivially to validate

Other tests from wcstombs_tests can be done similarly, with a separate
for loop for better readability.

> 
>> Also the due to the size of the patch it is hard to discern what the
>> difference between the different test structs are. Especially as the
>> names are basically equivalent in this context:
>> - utf16_to_utf7_tests and wcstombs_tests
>> - utf7_to_utf16_tests and mbstowcs_tests
> 
> Could you suggest better names?
Not at this stage.

> 
> Anyway, thanks for taking the time to review this patch set. You made
> a lot of good points. The latest changes (incorporating your
> suggestions) are at https://github.com/alexhenrie/wine
You're welcome.

bye
        michael