[PATCH 1/3] kernel32: Support UTF-7 in MultiByteToWideChar. (try 3)

Wed Oct 15 18:38:27 CDT 2014

2014-10-15 15:15 GMT-06:00 Michael Stefaniuc <mstefani at redhat.com>:
> On 10/15/2014 03:16 AM, Alex Henrie wrote:
>> 2014-10-14 16:12 GMT-06:00 Michael Stefaniuc <mstefani at redhat.com>:
>>>> +    /* MultiByteToWideChar guarantees that srclen > 0 */
>>>> +    assert(srclen > 0);
>>> This assert isn't needed. Just rewriting the first do..while loop to a
>>> while loop avoids it. That eliminates 3 lines of code of which one is a
>>> comment
>>> pointing to an unneeded assert.
>>
>> A while loop adds an unnecessary check, but whatever. You and
>> Sebastian Lackner both requested it, so I concede.
> Uh? Trying to avoid a loop check with a do..while loop is a
> micro-optimization. That's the job of the compiler. Our job is to
> optimize the code for readability.

There's no way that the compiler is verifying that the passed srclen
is always greater than 0 and internally converting the loop to a
do-while loop as a consequence. The compiler is not that smart.
Nevertheless, I'll leave it as a while loop for you.

>>>> +                    sextet = base64_decoding_table[sextet];
>>>> +                    if (sextet == (WCHAR)-1)
>>> Unneeded cast. Using a signed type for sextet also avoids the
>>> -Wsign-compare compiler warning. Alternatively use ~0 instead of -1 if
>>> you want to keep an unsigned type.
>>
>> Using a signed char in an array subscript causes Wchar-subscripts. I'd
>> just be trading one warning for another.
> No, it doesn't. Wchar-subscripts complains only about the naked char as
> that can be either signed or unsigned depending on the platform. Using
> signed char lets the compiler know that you're aware of the issue and it
> will leave you alone.

OK. I've switched to signed char and eliminated this typecast.

>>>> +                        /* -1 means that the next character of src is not part of a base64 sequence */
>>> This upper part of the comment is better placed in the declaration of
>>> base64_decoding_table.
>>
>> I really think it makes more sense where it is.
> I was trying to reduce the verbosity of utf7_mbstowcs(). The multi line
> comments drown the code in between. If you can fit the essence into one
> comment line, that works too. utf7_wcstombs(), which has less comments,
> is much more pleasant to read.

This is where we just disagree. I think the comments are helpful, both
because they help the reader to understand the UTF-7 algorithm and
because they explain what behavior is a bug in Windows and what is
not. I have, however, removed a couple of the smaller comments where
the code is self-explanatory.

>>> Also some of the strings are huge. Most of those seem to be for the
>>> WCHAR => UTF-7 and could be probably generated by just using the
>>> directly_encodable_table[] from patch 2.
>>> If the tests cannot be generated from the table then please split the
>>> strings into more manageable chunks, as is they are unreadable /
>>> unmaintainable.
>>
>> Those strings never need to be changed, ever. It's by far easier to
> Being able to change them is just the icing on the top.
> We cannot read them, that's the problem.
>
>> generate them once and inline them. I'm not sure that splitting them
>> up will make them much easier to read, either.
> How should we know those are generated? Or that you didn't miss one of
> the 256 characters? Those are just big blobs of data.
> Tests in Wine serve a dual purpose. They don't only make sure that Wine
> code won't regress but they also document the found behavior. For that
> the tests need to be readable, akin to comments.
>
> If you generate the tests then generate them programmatically in the
> test. Something along the lines of:
>
>    /* tests which one-byte characters are base64-encoded and which are
> not */
>    for (i = 0; i < 256; i++) {
>        wstr[0] = i;
>        ret = WideCharToMultiByte(CP_UTF7, flags, wstr, 2, str,
> sizeof(str), ...)
>        if (utf7_can_directly_encode(i))
>            ok(ret == 1 && str[0] == i, "'%c' not represented directly in
> UTF-7\n", i);
>        else
>            ok(ret == x && str[0] == '+' && str[x-1] == '-', "...");
>    }

This algorithm doesn't check whether the character was actually
encoded the way that Windows would encode it. Checking that the
character and subsequent characters are encoded correctly is
particularly important because of the Windows bug with characters < 0.

I have now rewritten the tests to programatically generate the input,
but there is no way to programatically generate the output without
duplicating pretty much the entire Windows UTF-7 algorithms in the
tests, which would defeat the purpose of having tests.

As before, the latest code is sitting in
https://github.com/alexhenrie/wine/commits/master

-Alex