[PATCH] msvcrt: SSE2 implementation of memcmp for x86_64.

Sun Apr 3 00:58:15 CDT 2022

On Sat, Apr 2, 2022 at 11:09 PM Jin-oh Kang <jinoh.kang.kr at gmail.com> wrote:
>
> It's not a real syscall per se; rather, it's more like a gate between the PE side (corresponding to Windows userspace) and the Unix side (Wine's pseudo kernel space which interacts directly with the host OS). The PE/Unix separation is designed so that every interaction with the system goes to the syscall gate, just like on Windows (we're not there yet, but we'll eventually). This helps satisfy video game anti-cheat technologies and conceal the Unix (.so) code which would otherwise cause confusion for Win32 apps and debuggers tracing the execution path.
>

Ah. That makes sense. In this case I think Remi is correct that
there's too much overhead.

>>
>> I can't speak definitively, because it looks a little different for
>> every function. But, overwhelmingly, my experience has been that
>> nothing will run measurably faster than byte-by-byte functions without
>> using vector instructions. Because the bottleneck isn't CPU power, the
>> bottleneck is memory access.
>
>
> It should be.
>

It's a margin of ~25%, versus a margin of ~500%. Unless you're moving
gigabytes it's unlikely to be noticeable.

That said, another confounding factor is that a large number of small
moves will have very different performance characteristics from a
small number of large moves. It's possible there are cases where
using, say, dwords would be much faster than trying to vectorize. I
haven't found them in testing, but this is another argument for using
someone else's code rather than trying to roll our own - because a
library dedicated to this purpose has likely done all kinds of
profiling to find exactly where that threshold lies.
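
To make the threshold idea concrete, here is a rough sketch of what a
size-based dispatch could look like. The names (sketch_memcmp,
compare_bytes, compare_dwords) and the cutoff value are made up for
illustration; they are not from the patch, and a real cutoff would
have to come from profiling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff; a real value would have to come from profiling. */
#define SMALL_COMPARE_THRESHOLD 16

/* Plain byte-by-byte comparison, the baseline. Returns -1, 0 or 1. */
static int compare_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}

/* Skip over equal data four bytes at a time, then let the byte loop
 * locate the first differing byte and handle the tail. */
static int compare_dwords(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
    {
        uint32_t x, y;
        /* memcpy avoids unaligned-access and aliasing problems */
        memcpy(&x, a + i, 4);
        memcpy(&y, b + i, 4);
        if (x != y)
            break;
    }
    return compare_bytes(a + i, b + i, n - i);
}

/* Pick a strategy based on the length of the comparison. */
int sketch_memcmp(const void *p1, const void *p2, size_t n)
{
    const unsigned char *a = p1, *b = p2;

    if (n < SMALL_COMPARE_THRESHOLD)
        return compare_bytes(a, b, n);
    return compare_dwords(a, b, n);
}
```

Whether the dword path actually wins anywhere, and at what size, is
exactly the part that needs measuring.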

>
> What you're thinking of is a SIMD abstraction library. I don't see how it would be highly necessary, since we're okay with vendor-specific code blocks as long as they are justified. Note that we now only support 4 architectures (IA-32, x86-64, ARM AArch32, and ARM AArch64).
>

Right. I bring it up because it would satisfy the requirement to be
portable (as long as you stick to the abstraction library, you're
writing regular C) and would get you close enough to the performance
of real intrinsics that there should be no need for inline asm. So if
we don't want to import another library, this may be the best
compromise between speed and simplicity.
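
As a rough illustration of vector code with no inline asm and no
vendor intrinsics, here is a sketch that uses the GCC/Clang
vector_size extension as a stand-in for an abstraction layer. This is
an assumption for illustration only: it is not the API of any
particular SIMD library, the names v16u8, blocks_equal16 and
vec_memcmp are made up, and the extension is compiler-specific.

```c
#include <stddef.h>
#include <string.h>

/* 16 unsigned bytes in one vector (GCC/Clang extension, assumed available). */
typedef unsigned char v16u8 __attribute__((vector_size(16)));

/* Return nonzero if the 16 bytes at a and b are identical. */
static int blocks_equal16(const unsigned char *a, const unsigned char *b)
{
    v16u8 x, y, diff;
    unsigned char acc = 0;

    /* memcpy keeps the loads safe for unaligned pointers */
    memcpy(&x, a, sizeof(x));
    memcpy(&y, b, sizeof(y));
    diff = x ^ y;                 /* element-wise XOR, zero where equal */
    for (int i = 0; i < 16; i++)
        acc |= diff[i];
    return acc == 0;
}

/* Skip equal 16-byte blocks, then find the first differing byte. */
int vec_memcmp(const void *p1, const void *p2, size_t n)
{
    const unsigned char *a = p1, *b = p2;
    size_t i = 0;

    while (i + 16 <= n && blocks_equal16(a + i, b + i))
        i += 16;
    for (; i < n; i++)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}
```

The compiler decides how to lower the vector operations for each
target, so how close this gets to hand-written intrinsics would still
need to be measured.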