[PATCH] msvcrt: SSE2 implementation of memcmp for x86_64.

Sun Apr 3 07:59:53 CDT 2022

On 4/3/22 04:35, Elaine Lefler wrote:
>> On 4/2/22 13:19, Rémi Bernon wrote:
>>> (I personally, believe that the efficient C implementation should come
>>> first, so that any non-supported hardware will at least benefit from it)
>> I also think that it will be good to add more efficient C implementation
>> first (it will also show if SSE2 implementation is really needed).
>>
>> Thanks,
>> Piotr
>>
> 
> I can't speak definitively, because it looks a little different for
> every function. But, overwhelmingly, my experience has been that
> nothing will run measurably faster than byte-by-byte functions without
> using vector instructions. Because the bottleneck isn't CPU power, the
> bottleneck is memory access. Like I said, vectors were created
> specifically to solve this problem, and IME you won't find notable
> performance gains without using them.
> 

Vectorized instructions and intrinsics are just an extension of the idea
of using larger types to process more data at a time. You can already do
that to some extent using standard C and, if you write the code in a
nice enough way, the compiler may even be able to understand the intent
and extend it further with vectorized instructions when it believes it's
useful.
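
As a rough illustration of code written so that the compiler can see the
intent, here is a minimal sketch of a "do these buffers differ?" check.
The helper name buffers_differ is made up for this example and is not
taken from msvcrt or glibc. Whether a given compiler actually vectorizes
the loop depends on its version and optimization flags.

```c
#include <stddef.h>

/* Hypothetical helper: returns nonzero if the two buffers differ anywhere
 * in their first n bytes. */
static int buffers_differ(const unsigned char *a, const unsigned char *b,
                          size_t n)
{
    unsigned char acc = 0;
    size_t i;

    /* No early exit: each iteration is independent of the others, which
     * is the loop shape auto-vectorizers generally cope with best. */
    for (i = 0; i < n; i++)
        acc |= a[i] ^ b[i];

    return acc != 0;
}
```

The price of dropping the early exit is that all n bytes of both buffers
are always read, so this only fits cases where both buffers are known to
be fully valid.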

Then it's always a trade-off between optimizing for the large data case
and optimizing for the small data case. The larger the building blocks
you use, the more you will cripple the small data case, as you will need
to carefully handle the data alignment and the border cases.

For this specific memcmp case, I believe that by using larger data types
and avoiding unnecessary branches, you can already improve the C code
well enough.
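
To make that concrete, here is a minimal sketch of the idea, not a
proposed patch: compare 8 bytes per iteration, and only fall back to
bytes to locate the first difference or to handle the tail. The name
memcmp_words is hypothetical, and the use of memcpy for the loads is my
assumption about how to keep unaligned accesses well-defined in plain C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; not the msvcrt or glibc implementation. */
static int memcmp_words(const void *s1, const void *s2, size_t n)
{
    const unsigned char *a = s1, *b = s2;

    /* One comparison and one branch per 8 bytes. memcpy keeps the loads
     * well-defined for unaligned pointers; compilers typically turn each
     * call into a single load. */
    while (n >= sizeof(uint64_t))
    {
        uint64_t x, y;

        memcpy(&x, a, sizeof(x));
        memcpy(&y, b, sizeof(y));
        if (x != y) break; /* first difference is within these 8 bytes */
        a += sizeof(x);
        b += sizeof(y);
        n -= sizeof(x);
    }

    /* Runs at most 8 times: either fewer than 8 bytes remain, or the
     * word loop stopped on a word known to contain a difference. */
    for (; n; n--, a++, b++)
        if (*a != *b) return *a < *b ? -1 : 1;

    return 0;
}
```

Locating the differing byte with a byte loop keeps the sketch independent
of endianness, at the cost of a few extra iterations on a mismatch.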

Note that, especially for the functions which are supposed to stop their
iteration early, you also need to consider whether buffers are always
entirely valid and if you are allowed to read larger chunks of data at a
time. It seems to be the case for memcmp, but not for memchr, for
instance. [1]
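
A small sketch of why memchr is different, using a made-up byte-wise
helper named find_byte rather than the real function: because the search
stops at the first match, the bytes after the match are never read, so a
caller can pass a length larger than the valid part of the buffer. A
version reading larger chunks would have to guarantee it never touches
invalid memory past the match.

```c
#include <stddef.h>

/* Hypothetical byte-wise search with memchr-like semantics: reads bytes
 * in order and stops as soon as a match is found. */
static const void *find_byte(const void *buf, int c, size_t n)
{
    const unsigned char *p = buf;

    for (; n; n--, p++)
        if (*p == (unsigned char)c) return p;

    return NULL;
}
```

With a 4-byte buffer holding "abc", find_byte(buf, 'b', 100) only reads
the first two bytes, even though the length claims far more than the
buffer holds.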

[1] 
https://trust-in-soft.com/blog/2015/12/21/memcmp-requires-pointers-to-fully-valid-buffers/

> Personally I think Jinoh's suggestion to find a compatible-licensed
> library and copy their code is best. Otherwise I sense this will
> become an endless circle of "do we really need it?" (yes, but this
> type of code is annoying to review) and Wine could benefit from using
> an implementation that's already widely-tested.

I personally don't like the idea at all. Copying code from another
library is just the best way to end up with code that has no history,
and whose characteristics and rationale no one really understands.

Like I said in another thread, the memcpy C code that's been adapted 
from glibc to msvcrt is IMHO a good example. It may very well be 
correct, but looking at it I'm simply unable to say that it is.

Maybe I'm unable to read code, but my first and only impression is that
it's unnecessarily complex. I don't know why it is the way it is,
probably for some obscure historical or target-specific optimization,
and if for some reason we needed to optimize it further I would be
unable to do so without rewriting it entirely.

Cheers,
-- 
Rémi Bernon <rbernon at codeweavers.com>