[PATCH] msvcrt: SSE2 implementation of memcmp for x86_64.

Tue Apr 5 03:14:02 CDT 2022

Hello everyone,

> On 2 Apr 2022, at 06:44, Elaine Lefler <elaineclefler at gmail.com> wrote:
> 
> Should be noted that SSE2 also exists on 32-bit processors, and in
> this same file you can find usage of "sse2_supported", which would
> enable you to use this code path on i386. You can put
> __attribute__((target("sse2"))) on the declaration of sse2_memcmp to
> allow GCC to emit SSE2 instructions even when the file's architecture
> forbids it.

True. I intentionally left it out of this patch because it’s likely to be more compiler-dependent.

> I think this could be even faster if you forced ptr1 to be aligned by
> byte-comparing up to ((p1 + 15) & ~15) at the beginning. Can't
> reasonably force-align both pointers, but aligning at least one should
> give measurably better performance.

Right, this memcmp isn’t really an optimized routine, and it was not supposed to be. It’s just meant to get reasonable baseline performance with the simplest possible code. More careful optimizations can follow.
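
For readers following the thread, a minimal sketch of what "simplest possible" can look like: 16 bytes per iteration with unaligned loads and no alignment prologue. This is not the code from the patch. The name sketch_sse2_memcmp and the SKETCH_HAVE_SSE2 guard are made up for illustration, and the guard simply falls back to the byte loop where SSE2 is unavailable at compile time.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1   /* hypothetical guard, not from the patch */
#endif

/* Hypothetical name; illustrates the technique, not the submitted code. */
int sketch_sse2_memcmp(const void *ptr1, const void *ptr2, size_t n)
{
    const unsigned char *p1 = ptr1, *p2 = ptr2;
    size_t i = 0;

#ifdef SKETCH_HAVE_SSE2
    /* Compare 16 bytes per iteration using unaligned loads. */
    for (; i + 16 <= n; i += 16)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)(p1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(p2 + i));
        /* One mask bit per byte; 0xffff means all 16 bytes are equal. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) != 0xffff)
            break; /* the byte loop below locates the differing byte */
    }
#endif

    /* Tail bytes, or the 16-byte block that contains a difference. */
    for (; i < n; i++)
        if (p1[i] != p2[i]) return p1[i] < p2[i] ? -1 : 1;
    return 0;
}
```

The alignment prologue suggested above would go in front of the vector loop; it is left out here to keep the sketch at the baseline being described.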

> memcmp is also a function that appears in
> both dlls. Do you have any input on that?

Not really. I don’t know who uses the one in ntdll or how much they care about speed. Just copy it if needed?

> On 2 Apr 2022, at 13:19, Rémi Bernon <rbernon at codeweavers.com> wrote:
> 
> The question is, do we really need and want the complexity induced by hand-crafted assembly (or intrinsics) routines?

I’d argue there’s not much complexity here.

> * at build time but also runtime, we'll need to carefully check hardware capability,

AMD64 always has SSE2; it is part of the baseline instruction set. What is there to carefully check?

> * it increases maintenance burden as they may need to be updated when hardware performance profile changes, or when new features are added,

If we want to stay on top in terms of speed, yes, but that’s the work. What else is there to do?

> Or do we want to rely as much as possible on the compiler to do it for us?

I hope not. I mean, if the compiler included its own, known good, mem* intrinsics, sure, but I wouldn’t count on it recognizing patterns in C code, unless we want to deal with compiler regressions as well.

> On 3 Apr 2022, at 14:59, Rémi Bernon <rbernon at codeweavers.com> wrote:
> 
> Vectorized instructions and intrinsics is just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent using standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes it's useful.

Same as above: I’d rather not rely on compiler smartness.

> Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and handle the border case.

I’d say if a program is bottlenecked by tiny memcmp calls, it’s the program that’s slow, not the memcmp. That’s what you generally get for dealing with many little things one at a time.

> On 2 Apr 2022, at 17:00, Piotr Caban <piotr.caban at gmail.com> wrote:
> 
> On 4/2/22 13:19, Rémi Bernon wrote:
>> (I personally, believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
> I also think that it will be good to add more efficient C implementation first (it will also show if SSE2 implementation is really needed).

It wouldn’t hurt, sure. For speed, we probably can’t expect more than 8x from comparing 8 bytes at a time, which is less than the 13x I measured. And I want it to be reasonably fast for wined3d. Besides, a platform-independent C version wouldn’t be any simpler: barring Intel’s wonderful naming, I’d say the code is about as trivial as it can get. I don’t think a plain C version would be useful, but if that’s the bar for getting it done, so be it.
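
For comparison, a minimal sketch of the plain C approach under discussion, comparing 8 bytes per iteration. The name sketch_u64_memcmp is hypothetical, and this is not code from any submitted patch. It assumes the compiler lowers a fixed-size memcpy to a single load, which common compilers typically do but which is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a word-at-a-time loop in portable C. */
int sketch_u64_memcmp(const void *ptr1, const void *ptr2, size_t n)
{
    const unsigned char *p1 = ptr1, *p2 = ptr2;
    size_t i = 0;

    /* Compare 8 bytes per iteration. */
    for (; i + 8 <= n; i += 8)
    {
        uint64_t a, b;
        /* Fixed-size memcpy is the portable way to do an unaligned load. */
        memcpy(&a, p1 + i, sizeof(a));
        memcpy(&b, p2 + i, sizeof(b));
        if (a != b)
            break; /* the byte loop below locates the differing byte */
    }

    /* Tail bytes, or the 8-byte word that contains a difference. */
    for (; i < n; i++)
        if (p1[i] != p2[i]) return p1[i] < p2[i] ? -1 : 1;
    return 0;
}
```

Structurally this is the same loop as a 16-byte SSE2 version with half the block size, which is the point made above about it not being any simpler.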

- Jan