[PATCH 1/3] kernel32: Support UTF-7 in MultiByteToWideChar. (try 3)

Michael Stefaniuc mstefani at redhat.com
Thu Oct 16 07:25:36 CDT 2014


On 10/16/2014 01:38 AM, Alex Henrie wrote:
> 2014-10-15 15:15 GMT-06:00 Michael Stefaniuc
> <mstefani at redhat.com>:
>> On 10/15/2014 03:16 AM, Alex Henrie wrote:
>>> 2014-10-14 16:12 GMT-06:00 Michael Stefaniuc
>>> <mstefani at redhat.com>:
>>>>> +    /* MultiByteToWideChar guarantees that srclen > 0 */
>>>>> +    assert(srclen > 0);
>>>> This assert isn't needed. Just rewriting the first do..while
>>>> loop to a while loop avoids it. That eliminates 3 lines of
>>>> code of which one is a comment pointing to an unneeded
>>>> assert.
>>> 
>>> A while loop adds an unnecessary check, but whatever. You and 
>>> Sebastian Lackner both requested it, so I concede.
>> Uh? Trying to avoid a loop check with a do..while loop is a 
>> micro-optimization. That's the job of the compiler. Our job is
>> to optimize the code for readability.
> 
> There's no way that the compiler is verifying that the passed
> srclen is always greater than 0 and internally converting the loop
> to a
It doesn't have to verify srclen, as it already knows it is > 0.
Since MultiByteToWideChar() is the sole user of your function, and calls
it from only one place, the compiler will most likely just inline it.

> do-while loop as a consequence. The compiler is not that smart.
Even if the compiler isn't that smart, it just doesn't matter.  Even
ignoring the fact that you're trading a check in the while loop for a
check in assert(), it still doesn't matter.

> Nevertheless, I'll leave it as a while loop for you.
As I'm passionate about this topic, and to save myself time in the
future by just referencing this email, here is the full explanation.
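
To make the two shapes concrete, here is a minimal sketch of a
simplified copying loop; it is illustrative only and not the code from
the patch:

/* Illustrative only: the do..while form silently relies on the caller
 * guaranteeing srclen > 0, the while form pays one extra check up front
 * but has no hidden precondition. */
static void copy_do_while(char *dst, const char *src, int srclen)
{
    do
    {
        *dst++ = *src++;    /* runs off both buffers if srclen <= 0 */
    } while (--srclen);
}

static void copy_while(char *dst, const char *src, int srclen)
{
    while (srclen--)
        *dst++ = *src++;    /* also correct for srclen == 0 */
}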

Computer Cost:
==============
Here is a quick back of the envelope calculation for the costs of the
extra check:
- The cost for the check in the do while loop will be srclen * cost
  whereas for the while loop it will be (srclen+1) * cost.
- For big srclen the +1 becomes irrelevant. Worst case is srclen == 1
  with two times the cost of the check.
- What's the cost of the loop check compared to the body of the loop?
  Let's pick 10%, so even in the worst case of srclen == 1 the extra
  check adds only about 10% of one loop iteration.
- Relative to everything else MultiByteToWideChar() does, that overhead
  shrinks again.
- Factor in the fact that the call to MultiByteToWideChar() won't be in
  the hot path of the application.
- Also, the absolute cost of the extra check is tiny, and a modern CPU
  might just paper over it with its branch prediction.
- So even with a synthetic test using srclen == 2 I'd bet you would have
  a hard time measuring the overhead.  And that's only if the compiler
  is stupid, of course.


Reviewer Cost:
==============
Now to the costs of the do..while loop on the reviewer side:
- Look at the diff in the email client
- What's that first do..while loop for? Scroll down, as the big comments
  push the end of the loop off the page.
- Uh, he isn't checking that srclen is > 0 but still uses it as an array
  index.  Meh.

The reviewer will branch at that point:
- The code is so bad it isn't worth spending more time on it.  Leave it
  to Alexandre to reply with "you're not checking the srclen".  But I
  figure others have done it too.

A few will continue:
- Why is he doing that? At 2+ million lines of code we can't just
  remember all the gory details.
- Switch to the window / terminal / desktop with the Wine source code
- Load the locale.c file into the editor
- Look for MultiByteToWideChar()
- Check that indeed srclen can never be <= 0
- Go back to reviewing the diff.  Oh, where was I?

So minutes of reviewer time are wasted on a micro-optimization, and the
number of reviewers willing to look at your patch is heavily reduced.
For what? A couple of CPU cycles, saved if and only if the compiler is
that stupid and the CPU's branch prediction isn't smart enough to guess
it right.

Reviewer time is the *most scarce* resource, even scarcer than developer
time writing the code in the first place.
CPU cycles on the other hand are plenty and cheap.

>>>>> +                    sextet = base64_decoding_table[sextet];
>>>>> +                    if (sextet == (WCHAR)-1)
>>>> Unneeded cast. Using a signed type for sextet also avoids
>>>> the -Wsign-compare compiler warning. Alternatively use ~0
>>>> instead of -1 if you want to keep an unsigned type.
>>> 
>>> Using a signed char as an array subscript causes -Wchar-subscripts.
>>> I'd just be trading one warning for another.
>> No, it doesn't. -Wchar-subscripts complains only about the naked
>> char as that can be either signed or unsigned depending on the
>> platform. Using signed char lets the compiler know that you're
>> aware of the issue and it will leave you alone.
> 
> OK. I've switched to signed char and eliminated this typecast.
> 
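
For the record, here is a tiny standalone example of that pattern; the
table and names are illustrative, not the actual Wine code.  A signed
char subscript does not trigger -Wchar-subscripts, and comparing a
signed sextet against -1 needs no cast:

#include <stdio.h>
#include <string.h>

/* Illustrative table, not Wine's: -1 marks bytes that are not part of
 * the base64 alphabet. */
static signed char base64_decoding_table[128];

static void init_table(void)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    int i;

    memset(base64_decoding_table, -1, sizeof(base64_decoding_table));
    for (i = 0; alphabet[i]; i++)
        base64_decoding_table[(unsigned char)alphabet[i]] = i;
}

int main(void)
{
    signed char sextet = 'Q';                /* next input byte */

    init_table();
    sextet = base64_decoding_table[sextet];  /* signed char subscript: no -Wchar-subscripts */
    if (sextet == -1)                        /* signed comparison: no -Wsign-compare, no cast */
        printf("not part of a base64 sequence\n");
    else
        printf("sextet value %d\n", sextet);
    return 0;
}
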
>>>>> +                        /* -1 means that the next
>>>>> character of src is not part of a base64 sequence */
>>>> This upper part of the comment is better placed in the
>>>> declaration of base64_decoding_table.
>>> 
>>> I really think it makes more sense where it is.
>> I was trying to reduce the verbosity of utf7_mbstowcs(). The
>> multi line comments drown the code in between. If you can fit the
>> essence into one comment line, that works too. utf7_wcstombs(),
>> which has less comments, is much more pleasant to read.
> 
> This is where we just disagree. I think the comments are helpful,
> both
Sorry, I didn't mean having just one line of comment for the whole
function, just condensing the multi-line comments into preferably one or
two lines each.  E.g. on the first review I was wondering why you have
an else branch with nothing but comments in it; I had totally overlooked
the break in there.

> because they help the reader to understand the UTF-7 algorithm and 
> because they explain what behavior is a bug in Windows and what is 
> not. I have, however, removed a couple of the smaller comments
> where the code is self-explanatory.
Good. Self-explanatory code trumps comments any time.

>>>> Also some of the strings are huge. Most of those seem to be
>>>> for the WCHAR => UTF-7 and could be probably generated by
>>>> just using the directly_encodable_table[] from patch 2. If
>>>> the tests cannot be generated from the table then please
>>>> split the strings into more manageable chunks, as is they are
>>>> unreadable / unmaintainable.
>>> 
>>> Those strings never need to be changed, ever. It's by far
>>> easier to
>> Being able to change them is just the icing on the top. We cannot
>> read them, that's the problem.
>> 
>>> generate them once and inline them. I'm not sure that splitting
>>> them up will make them much easier to read, either.
>> How should we know those are generated? Or that you didn't miss
>> one of the 256 characters? Those are just big blobs of data. 
>> Tests in Wine serve a dual purpose. They don't only make sure
>> that Wine code won't regress but they also document the found
>> behavior. For that the tests need to be readable, akin to
>> comments.
>> 
>> If you generate the tests then generate them programmatically in
>> the test. Something along the lines of:
>> 
>> /* tests which one-byte characters are base64-encoded and which are not */
>> for (i = 0; i < 256; i++)
>> {
>>     wstr[0] = i;
>>     ret = WideCharToMultiByte(CP_UTF7, flags, wstr, 2, str, sizeof(str), ...);
>>     if (utf7_can_directly_encode(i))
>>         ok(ret == 1 && str[0] == i, "'%c' not represented directly in UTF-7\n", i);
>>     else
>>         ok(ret == x && str[0] == '+' && str[x-1] == '-', "...");
>> }
> 
> This algorithm doesn't check whether the character was actually 
> encoded the way that Windows would encode it. Checking that the
That's not the essence of the test. It just tests what your comment says
it should test.

> character and subsequent characters are encoded correctly is 
> particularly important because of the Windows bug with characters <
> 0.
WCHAR is unsigned, so there are no WCHARs < 0. I guess you're talking
about WCHARs in the range 128 - 255. That would of course need an extra
test loop.  But see below, too.

> I have now rewritten the tests to programmatically generate the
> input, but there is no way to programmatically generate the output
> without duplicating pretty much the entire Windows UTF-7 algorithms
> in the tests, which would defeat the purpose of having tests.
If you just want to check the exact mappings that Windows does between
WCHAR and UTF-7, they aren't hard to generate.  You can even check the
encoding for the whole WCHAR range with a for loop from 0 to WCHAR_MAX
(see the sketch after this list):
- At 64K WCHARs the src buffer will eat just 128KB. Fill it with a for
  loop.
- The dest buffer, assuming a worst case of 4 chars per WCHAR, will be
  just 256KB.
- WideCharToMultiByte() that.
- Calculate the SHA1 sum of the UTF-7 stream and compare it to a
  previously computed SHA1.
- MultiByteToWideChar the UTF-7 stream you got.  Either compare it
  directly to the WCHAR src if the operation happens to be symmetric, or
  calculate the SHA1 sum again and compare it to the stored one.
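
As a rough sketch of that whole-range round trip, in the style of the
Wine tests (buffer sizes and the test name are mine, the SHA1 comparison
against a stored sum is left out, and only the symmetric round trip is
checked):

#include <string.h>
#include <windows.h>
#include "wine/test.h"

static void test_utf7_round_trip(void)
{
    static WCHAR src[0x10000], back[0x10000];
    static char utf7[4 * 0x10000];  /* worst case estimate: 4 chars per WCHAR */
    int i, utf7_len, back_len;

    /* every WCHAR value, 0x0000 through 0xffff */
    for (i = 0; i < 0x10000; i++) src[i] = i;

    utf7_len = WideCharToMultiByte(CP_UTF7, 0, src, 0x10000, utf7, sizeof(utf7), NULL, NULL);
    ok(utf7_len > 0, "conversion to UTF-7 failed\n");

    back_len = MultiByteToWideChar(CP_UTF7, 0, utf7, utf7_len, back, 0x10000);
    ok(back_len == 0x10000, "expected 0x10000 WCHARs back, got %d\n", back_len);
    ok(!memcmp(src, back, sizeof(src)), "UTF-7 round trip is not symmetric\n");
}

START_TEST(utf7)
{
    test_utf7_round_trip();
}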

Then you can reserve the other tests for the interesting stuff where you
really have to store both src and dest. But those should be short
strings anyway.

> 
> As before, the latest code is sitting in 
> https://github.com/alexhenrie/wine/commits/master


bye
	michael


