[PATCH 1/3] kernel32: Support UTF-7 in MultiByteToWideChar. (try 3)

Thu Oct 16 16:48:33 CDT 2014

2014-10-16 6:25 GMT-06:00 Michael Stefaniuc <mstefani at redhat.com>:
> On 10/16/2014 01:38 AM, Alex Henrie wrote:
>> 2014-10-15 15:15 GMT-06:00 Michael Stefaniuc
>> <mstefani at redhat.com>:
>>> On 10/15/2014 03:16 AM, Alex Henrie wrote:
>>>> 2014-10-14 16:12 GMT-06:00 Michael Stefaniuc
>>>> <mstefani at redhat.com>:
>>>>>> +    /* MultiByteToWideChar guarantees that srclen > 0 */ +
>>>>>> assert(srclen > 0);
>>>>> This assert isn't needed. Just rewriting the first do..while
>>>>> loop to a while loop avoids it. That eliminates 3 lines of
>>>>> code of which one is a comment pointing to an unneeded
>>>>> assert.
>>>>
>>>> A while loop adds an unnecessary check, but whatever. You and
>>>> Sebastian Lackner both requested it, so I concede.
>>> Uh? Trying to avoid a loop check with a do..while loop is a
>>> micro-optimization. That's the job of the compiler. Our job is
>>> to optimize the code for readability.
>>
>> There's no way that the compiler is verifying that the passed
>> srclen is always greater than 0 and internally converting the loop
>> to a
> It doesn't have to verify srclen as it already knows it is > 0.
> As MultiByteToWideChar() is the sole user of your function and it uses
> it only from one place it will most likely just inline it.

OK, you're right. At O1 or higher, the compiler appears to inline the
function and rewrite the loop as a do-while loop.

> Reviewer Cost:
> ==============
> Now to the costs of the do while loop on the reviewer side:
> - Look at the diff in the email client
> - What's that first do while loop for? Scroll down as the big comments
>   make the loop end not fit on one page.
> - Uh, he isn't checking srclen to be > 0 and uses it as array index.
>   Meh.

Let's imagine that the reviewer is looking at the while loop version.
He might say "why is this code using a while loop if the condition is
guaranteed to be true the first time through?" Then he'd have to try
to think through whether the compiler will optimize that out or not.

This whole problem could be avoided with a do-while loop and one line
of comment explaining that srclen is guaranteed to be > 0.

>> I have now rewritten the tests to programatically generate the
>> input, but there is no way to programatically generate the output
>> without duplicating pretty much the entire Windows UTF-7 algorithms
>> in the tests, which would defeat the purpose of having tests.
> If you are just want to check the exact mappings that Windows does
> between WCHAR and UTF-7 it isn't hard to generate.  And you can even
> check the encoding for the whole WCHAR range with a for loop from 0 to
> WCHAR_MAX:
> - At 64K WCHARs the src buffer will eat just 128KB. Fill it with a for
>   loop.
> - The dest buffer assuming a worst case 1 WCHAR to 4 char encoding will
>   be just 256KB.
> - WideCharToMultiByte() that.
> - Calculate the SHA1 sum of the UTF-7 stream and compare it to a
>   previously computed SHA1.
> - MultiByteToWideChar the UTF-7 stream you got. Either compare it
>   directly to the WCHAR src if the operation happens to be symmetric or
>   calculate again the SHA1 sum and compare it to the stored.
>
> Then you can reserve the other tests for the interesting stuff where you
> really have to store both src and dest. But those should be short
> strings anyway.

Let's imagine that someone else tinkers with the UTF-7 code, and
accidentally introduces a bug. If the tests only compare hashes, how
would he have any idea what exactly broke?

If the purpose of the tests is purely to check conformance, then
hashes would be fine. But if you want them to aid in debugging as
well, they need to compare against unprocessed output.

On second thought, I've rewritten the large tests per your previous
suggestion. They no longer check that the exact UTF-7 encoding is
correct, but this is not a big deal because other tests already check
exact UTF-7 encodings thoroughly. The new tests instead test each
character, including characters 0 through FFFF if testing WCHARs.

-Alex