[PATCH] msvcrt: SSE2 implementation of memcmp for x86_64.

Sat Apr 2 08:24:52 CDT 2022

On 4/2/22 20:19, Rémi Bernon wrote:
> On 4/2/22 12:51, Jin-oh Kang wrote:
>> Wouldn't it make much more sense if we simply copied optimized copy
>> routines from other libc implementations? They have specialised
>> implementations for various architectures and microarchitectures (e.g.
>> cache line size), not to mention the performance enhancements that have
>> accumulated over time.
>>
> 
> 
> The question is, do we really need and want the complexity induced by hand-crafted assembly (or intrinsics) routines?
> 
> * at build time but also runtime, we'll need to carefully check hardware capability,

We already do this for SSE2 on i386, and for FXSAVE/XSAVE/XSAVEC on both i386 and x86-64.
At build time we can simply disable the SSE/AVX routines when the GCC in use is too old to support them.
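
To make the two checks above concrete, here is a rough sketch of what a guarded selection could look like.  It is not Wine's actual detection code: the names memcmp_c, memcmp_sse2 and select_memcmp are made up for illustration, and it assumes GCC or Clang, whose __builtin_cpu_supports() stands in for whatever capability flag the real code would consult.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; only pulled in when the compiler targets SSE2 */
#endif

/* Portable byte-wise fallback (hypothetical name). */
static int memcmp_c(const void *p1, const void *p2, size_t n)
{
    const unsigned char *a = p1, *b = p2;

    for (; n; n--, a++, b++)
        if (*a != *b) return *a < *b ? -1 : 1;
    return 0;
}

#if defined(__SSE2__)
/* Compares 16 bytes per iteration (hypothetical name). */
static int memcmp_sse2(const void *p1, const void *p2, size_t n)
{
    const unsigned char *a = p1, *b = p2;

    while (n >= 16)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        /* one mask bit per equal byte; 0xffff means all 16 bytes match */
        unsigned int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));

        /* on a mismatch, let the C loop find the first differing byte */
        if (mask != 0xffff) return memcmp_c(a, b, 16);
        a += 16; b += 16; n -= 16;
    }
    return memcmp_c(a, b, n);   /* tail of fewer than 16 bytes */
}
#endif

typedef int (*memcmp_fn)(const void *, const void *, size_t);

/* Build-time guard (#if) plus runtime guard (CPU feature query). */
static memcmp_fn select_memcmp(void)
{
#if defined(__SSE2__) && defined(__GNUC__)
    if (__builtin_cpu_supports("sse2")) return memcmp_sse2;
#endif
    return memcmp_c;
}
```

On x86-64 SSE2 is part of the baseline, so the runtime branch only really matters for i386.  There a real build would presumably compile just the SIMD routine for SSE2 instead of relying on a global -msse2, which this sketch glosses over.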

> 
> * it increases maintenance burden as they may need to be updated when hardware performance profile changes, or when new features are added,

As long as correctness and (any sort of) performance advantages are preserved, no further maintenance effort would be _strictly_ necessary.

We can set up performance regression tests for C vs. SIMD implementations, and possibly revert to C version if the gap ever becomes severe enough (which is unlikely).
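
As a sketch of what such a test could check, assuming nothing about Wine's existing test infrastructure: the helpers below (results_agree, seconds_per_call, regressed, all hypothetical names) first verify that a candidate routine returns the same sign as a reference C routine, then time both so that a caller can flag the candidate when it falls behind by more than a chosen tolerance.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef int (*memcmp_fn)(const void *, const void *, size_t);

/* Reference C implementation (hypothetical name). */
static int memcmp_ref(const void *p1, const void *p2, size_t n)
{
    const unsigned char *a = p1, *b = p2;

    for (; n; n--, a++, b++)
        if (*a != *b) return *a < *b ? -1 : 1;
    return 0;
}

/* memcmp only promises the sign of the result, so compare signs. */
static int sign(int v) { return (v > 0) - (v < 0); }

/* Returns 1 if ref and cand agree for every length up to max_len
 * (capped at 256) and every position of a single differing byte. */
static int results_agree(memcmp_fn ref, memcmp_fn cand, size_t max_len)
{
    unsigned char a[256], b[256];
    size_t len, pos, i;

    if (max_len > sizeof(a)) max_len = sizeof(a);
    for (i = 0; i < sizeof(a); i++) a[i] = b[i] = (unsigned char)(i * 7 + 1);

    for (len = 0; len <= max_len; len++)
    {
        if (sign(ref(a, b, len)) != sign(cand(a, b, len))) return 0;
        for (pos = 0; pos < len; pos++)
        {
            b[pos] ^= 0x80;   /* make exactly one byte differ */
            if (sign(ref(a, b, len)) != sign(cand(a, b, len))) return 0;
            if (sign(ref(b, a, len)) != sign(cand(b, a, len))) return 0;
            b[pos] ^= 0x80;   /* restore */
        }
    }
    return 1;
}

/* Average CPU time per call, measured with clock(). */
static double seconds_per_call(memcmp_fn fn, const void *a, const void *b,
                               size_t n, unsigned int iters)
{
    volatile int sink = 0;   /* keeps the calls from being optimized away */
    clock_t start = clock();
    unsigned int i;

    for (i = 0; i < iters; i++) sink += fn(a, b, n);
    (void)sink;
    return iters ? (double)(clock() - start) / CLOCKS_PER_SEC / iters : 0.0;
}

/* Nonzero if the SIMD time exceeds the C time by more than the given factor. */
static int regressed(double c_time, double simd_time, double tolerance)
{
    return simd_time > c_time * tolerance;
}
```

A caller would pass the SIMD routine as cand, for instance results_agree(memcmp_ref, memcmp, 128) when checking the host libc.  Timings like these are noisy, so the tolerance would need to be generous and the buffers large enough to dominate the call overhead.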

> 
> * other libc implementation may be hard to integrate in our code base, especially if they rely on some dispatch mechanism or assembly source,

We only copy (and adapt) the implementation and _not_ its supporting infrastructure.  Also, we don't have to do it for every string routine, only for the crucial ones.

> 
> Or do we want to rely as much as possible on the compiler to do it for us?
> 
> I don't know the rationale behind the choice of the other libc, but as far as I understand for Wine an efficient C implementation is usually preferred over assembly,

We may as well copy efficient C implementations from other libcs.  That would also avoid the three problems you've pointed out.

> unless a convincing argument is made that doing it in assembly significantly improves things for some applications.

The fact that most modern libcs (and presumably MSVCRT as well) use SSE/AVX is a convincing argument in and of itself.

I think what we're specifically asking here is whether "the performance benefits reaped from optimizing string routines outweigh the maintenance burden imposed by the use of machine-specific instructions."
Personally, I haven't run into a case where msvcrt string routines show up as a bottleneck in perf profiling, but others can chime in and share their numbers.

> 
> 
> (I personally, believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
> 
> 
> 
>> Also worth noting is that Wine is licensed under LGPL, which makes it
>> compatible with most open-source libcs out there. Basically what we would
>> need is some ABI adaptations, such as calling convention adjustment and SEH.
>>
>> Another option is to just call system libc routines directly, although in
>> this case it might interfere with stack unwinding, clear PE/unix
>> separation, and msvcrt hotpatching.
>>
> 
> 
> Calling the system libc will need a "syscall", and will most likely defeat any performance improvement it could bring.

Yes.  This is exactly what I meant by interference with/from "clear PE/unix separation," or lack thereof.

-- 
Sincerely,
Jinoh Kang