[PATCH] msvcrt: SSE2 implementation of memcmp for x86_64.

Elaine Lefler elaineclefler at gmail.com
Wed Apr 6 20:56:04 CDT 2022


On Tue, Apr 5, 2022 at 2:14 AM Jan Sikorski <jsikorski at codeweavers.com> wrote:
>
> Hello everyone,
>
> > On 2 Apr 2022, at 06:44, Elaine Lefler <elaineclefler at gmail.com> wrote:
> >
> > Should be noted that SSE2 also exists on 32-bit processors, and in
> > this same file you can find usage of "sse2_supported", which would
> > enable you to use this code path on i386. You can put
> > __attribute__((target("sse2"))) on the declaration of sse2_memcmp to
> > allow GCC to emit SSE2 instructions even when the file's architecture
> > forbids it.
>
> True, I intentionally left it out in this patch, because it’s possibly more compiler dependent.
>

AFAIK this dll will only ever be compiled with mingw-gcc. Should be
safe to assume GCC unless there are plans to support other
cross-compilers.
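
Something like this is what I had in mind for i386 (an illustrative
sketch, not the actual patch; sse2_supported is the flag that already
exists in this file, the function name is made up):

    #include <emmintrin.h>

    /* The target attribute lets GCC emit SSE2 instructions for this one
     * function even when the file is compiled for plain i386. Callers
     * still have to gate on sse2_supported before reaching it. */
    __attribute__((target("sse2")))
    static int sse2_equal16(const unsigned char *p1, const unsigned char *p2)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)p1);
        __m128i b = _mm_loadu_si128((const __m128i *)p2);
        /* 0xffff means all 16 bytes compared equal */
        return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xffff;
    }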

> > I think this could be even faster if you forced ptr1 to be aligned by
> > byte-comparing up to ((p1 + 15) & ~15) at the beginning. Can't
> > reasonably force-align both pointers, but aligning at least one should
> > give measurably better performance.
>
> Right, this memcmp isn’t really an optimized routine, it was not supposed to be. It’s just to get baseline reasonable performance with the simplest possible code. More careful optimizations can follow.

Based on the discussion in these threads, it looks like there's a lot
of inertia in merging patches like this, so it's probably best to be
as optimal as possible on the first go. Just my two cents.
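
For reference, the alignment prologue I suggested would look something
like this (a rough, untested sketch, not the actual patch):

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Byte-compare until p1 reaches a 16-byte boundary, then the main
     * loop can use aligned loads on p1 (p2 generally stays unaligned). */
    __attribute__((target("sse2")))
    static int sse2_memcmp(const void *ptr1, const void *ptr2, size_t n)
    {
        const unsigned char *p1 = ptr1, *p2 = ptr2;

        while (n && ((uintptr_t)p1 & 15))   /* align p1, byte by byte */
        {
            if (*p1 != *p2) return *p1 < *p2 ? -1 : 1;
            p1++; p2++; n--;
        }
        while (n >= 16)
        {
            __m128i a = _mm_load_si128((const __m128i *)p1);  /* aligned */
            __m128i b = _mm_loadu_si128((const __m128i *)p2); /* unaligned */
            unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
            if (mask != 0xffff)
            {
                unsigned i = __builtin_ctz(~mask); /* first differing byte */
                return p1[i] < p2[i] ? -1 : 1;
            }
            p1 += 16; p2 += 16; n -= 16;
        }
        while (n--)                          /* tail bytes */
        {
            if (*p1 != *p2) return *p1 < *p2 ? -1 : 1;
            p1++; p2++;
        }
        return 0;
    }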

>
> > memcmp is also a function that appears in
> > both dlls. Do you have any input on that?
>
> Not really, I don’t know who uses the one in ntdll and how much they care about speed. Just copy it if needed?
>

Copying it is ok, although I'm concerned about the code duplication.
If you delete the implementation from msvcrt (which links to ntdll)
and then put the optimized version in ntdll, it should result in
msvcrt using the ntdll implementation, which removes the duplication
and gives the optimized code to both dlls.

>
> Besides, a platform independent C version wouldn’t be any simpler--barring Intel’s wonderful naming, I’d say the code is about as trivial as it can get. I don’t think a plain C version would be useful, but if that’s the bar for getting it done, so be it..
>

+1 on this. I don't think a C version would be any simpler.

On Wed, Apr 6, 2022 at 6:02 AM Jinoh Kang <jinoh.kang.kr at gmail.com> wrote:
> >
> > If anything, I'd say a standardized 16-byte data size is the cleaner
> > solution, because you can write preprocessor macros to parse it
>
> Did you mean "maximum vector register size supported by the processor?"
> Note that AVX2 has 32-byte registers, and AVX512 takes it further with 64-byte ones.
> Also worth noting is that ARM64 has Scalable Vector Extensions, which do *not* define the vector size at all. It leaves it to the processor implementation, allowing for seamless vector size expansion.
>

You can't safely run AVX2 instructions without checking the CPU first.
AVX-512 is a waste of time imo - Intel actually stopped supporting it
on Alder Lake; it's complicated, and developers don't like to work
with it. AMD has never supported AVX-512 at all.

It's safe to assume any 64-bit cpu will support 16-byte vectors. If
you want something bigger then you need to write CPU-specific code.
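
For the record, gating an AVX2 path would look something like this
(illustrative names, not Wine code; the dispatch uses GCC's
__builtin_cpu_supports, which is fine since we're assuming GCC anyway):

    #include <immintrin.h>
    #include <string.h>

    /* The AVX2 path needs both the target attribute and a runtime CPU
     * check. The 16-byte SSE2 path needs neither on x86_64, where SSE2
     * is part of the baseline ISA. */
    __attribute__((target("avx2")))
    static int avx2_equal32(const unsigned char *p1, const unsigned char *p2)
    {
        __m256i a = _mm256_loadu_si256((const __m256i *)p1);
        __m256i b = _mm256_loadu_si256((const __m256i *)p2);
        return _mm256_movemask_epi8(_mm256_cmpeq_epi8(a, b)) == -1;
    }

    static int equal32(const unsigned char *p1, const unsigned char *p2)
    {
        if (__builtin_cpu_supports("avx2"))
            return avx2_equal32(p1, p2);
        return memcmp(p1, p2, 32) == 0; /* portable fallback */
    }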

> > it's not very smart. For instance, we have a function named
> memset_aligned_32, which writes aligned 32-byte chunks. But
> > GCC doesn't know that. It just writes regular quadword instructions.
>
> Wouldn't __attribute__((aligned(32))) work?
>

I think the aligned type needs to be wrapped in a struct as well for
GCC to take advantage of it. I use objdump after compiling to make
sure the expected instructions are actually present.
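
The struct trick I mean looks roughly like this (names illustrative,
not the Wine source; whether GCC actually vectorizes it depends on the
optimization flags, hence the objdump check):

    #include <stddef.h>
    #include <stdint.h>

    /* The struct type carries the 32-byte alignment, so GCC knows every
     * store through it is an aligned 32-byte block and can emit vector
     * moves (e.g. vmovdqa with -mavx) instead of plain quadword stores. */
    struct block32
    {
        uint64_t q[4];
    } __attribute__((aligned(32)));

    static void set_blocks32(struct block32 *dst, uint64_t v, size_t count)
    {
        struct block32 b = {{ v, v, v, v }};
        while (count--) *dst++ = b;  /* one aligned 32-byte store per block */
    }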

> > So that's some complicated code which isn't actually better than a
> > straightforward uint64_t loop. I think that's the reason I prefer
> > seeing intrinsics - granted, I have a lot of experience reading them,
> > and I understand they're unfriendly to people who aren't familiar -
> > but they give you assurance that the compiler actually works as
> > expected.
>
> I think writing assembly directly is still best for performance, since we can control instruction scheduling that way.
>

IME, it's almost impossible to hand-write ASM that outperforms a
compiler. You might have to rearrange the C code a bit, but
well-written intrinsics are usually just as good (within margin of
error) as ASM, and much more flexible.

When writing high-performance code it's usually necessary to try
multiple variations in order to find the fastest path. You can't
easily do that with ASM, and it leads to incomplete optimization. For
instance, I was able to write a C memcpy function that outperforms
msvcrt's hand-written assembly, for the following reasons:
- The ASM version has too many branches for small copies; branch
misprediction is a huge source of latency.
- The ASM version puts the main loop in the middle of the function,
leading to a very large jump when writing small data. GCC puts it at
the end, which is more optimal (basically, a large copy can afford to
eat the lag from the jump, but a small copy can't).
- When copying large data it's better to force alignment on both src
and dst, even though you have to do some math up front.
- Stores should be done with MOVNTDQ rather than MOVDQA. MOVNTDQ is a
non-temporal store that bypasses the cache, so a large copy doesn't
evict the caller's working set and the cache doesn't need refilling
after memcpy returns (see the sketch after this list). Note that this
would have been an easy one-line change even for ASM, but I suspect
the developer was too exhausted to experiment.
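
The streaming-store part looks roughly like this (an illustrative
sketch, not Wine's actual memcpy; it assumes dst is 16-byte aligned
and skips the head/tail handling a real version needs):

    #include <emmintrin.h>
    #include <stddef.h>

    /* _mm_stream_si128 compiles to MOVNTDQ, a non-temporal store that
     * bypasses the cache, so the copy doesn't evict the caller's data. */
    static void copy_stream(unsigned char *dst, const unsigned char *src,
                            size_t n)
    {
        while (n >= 16)
        {
            __m128i v = _mm_loadu_si128((const __m128i *)src);
            _mm_stream_si128((__m128i *)dst, v);
            src += 16; dst += 16; n -= 16;
        }
        _mm_sfence(); /* order the streamed stores before returning */
    }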

So I'm strongly opposed to ASM unless a C equivalent is completely
impossible. The code that a developer _thinks_ will be fast and the
code that is _actually_ fast are often not the same thing. C makes it
much easier to tweak.

> >
> >> Then it's always a matter of a trade-off between optimizing for the
> >> large data case vs optimizing for the small data case. The larger the
> >> building blocks you use, the more you will cripple the small data case,
> >> as you will need to carefully handle the data alignment and handle the
> >> border case.
> >>
> >
> > Bear in mind that most of these functions read single bytes,
> > presently. So it can't get slower than it already is.
>
> Well, it *does* get slower due to extra branches if the caller makes frequent calls to mem* functions with small data size (e.g. 1-15).
>

True. I think I mentioned earlier that it's usually faster for small
copies to go byte by byte instead of min-maxing the chunk size. This
is the reason why.
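
i.e. the small-size fast path is just this (hypothetical sketch):

    #include <stddef.h>

    /* Below the vector size (roughly 1-15 bytes), a plain byte loop adds
     * no alignment or dispatch branches, so it usually wins. */
    static int memcmp_small(const unsigned char *p1, const unsigned char *p2,
                            size_t n)
    {
        while (n--)
        {
            int diff = *p1++ - *p2++;   /* unsigned chars promote to int */
            if (diff) return diff;
        }
        return 0;
    }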


