[PATCH] ntdll: Optimize memcpy for x86-64.

Elaine Lefler elaineclefler at gmail.com
Wed Mar 30 21:51:59 CDT 2022


On Wed, Mar 30, 2022 at 11:46 AM Gabriel Ivăncescu
<gabrielopcode at gmail.com> wrote:
>
> Why not just copy pasting it from msvcrt since it's already done?
>

This is interesting. Mine is more efficient than the msvcrt version
(at least for sse2), because the msvcrt version relies on unaligned
moves whereas mine uses bit shifting to allow both src and dst to be
aligned. Also, in my testing I found that it's most efficient to use
plain old byte copies for the first/last parts of the function;
anything else just causes branch mispredictions.

It would also be optimal to write an AVX2 version of this function for
systems that support it (which is most of them, nowadays). AVX2 has
32-byte registers and can move data a lot faster. It needs a cpuid
check though, and it's probably not worth it for anything other than
memcpy.
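
A minimal sketch of what such a cpuid-gated dispatch could look like.
It assumes GCC or Clang on x86-64 and uses their __builtin_cpu_supports
builtin instead of a raw cpuid call; the names copy_baseline, copy_avx2
and pick_copy are hypothetical and do not come from the patch.

```c
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Stand-in for the existing non-AVX2 routine (hypothetical name). */
static void *copy_baseline(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Compiled for AVX2 regardless of the global -m flags, so it must only
 * be called after the runtime check below has succeeded. */
__attribute__((target("avx2")))
static void *copy_avx2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 32) {
        /* One 32-byte register moves twice as much as an SSE2 one. */
        __m256i v = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, v);
        d += 32; s += 32; n -= 32;
    }
    while (n--) *d++ = *s++;   /* plain bytes for the tail */
    return dst;
}

/* Pick an implementation once; callers keep the returned pointer. */
static copy_fn pick_copy(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? copy_avx2 : copy_baseline;
}
```

The check runs once and the result is cached in a function pointer, so
the per-call cost is a single indirect call rather than a cpuid query.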

As for memset, while the code looks like it deals in aligned pointers,
examining the output with objdump shows that GCC fails to produce any
MOVDQA/MOVAPS/MOVNTDQ instructions. So it's not really doing what it
claims to do. In order to emit those instructions you need to define
an aligned 16-byte structure and use it in your pointers.
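
As a rough illustration of the "aligned 16-byte structure" idea, here is
a sketch using GCC/Clang type attributes. The type name block16 and the
function fill_aligned are hypothetical, and whether a given compiler
version really emits MOVDQA/MOVAPS for the store still has to be
confirmed with objdump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 16-byte type that tells the compiler the pointer is 16-byte aligned.
 * may_alias lets it overlay arbitrary buffers without violating strict
 * aliasing (GCC/Clang extension). */
typedef struct {
    uint64_t lo, hi;
} __attribute__((aligned(16), may_alias)) block16;

/* Fill n_blocks 16-byte blocks starting at an already aligned dst.
 * Head and tail bytes are the caller's job in this sketch. */
void fill_aligned(void *dst, unsigned char c, size_t n_blocks)
{
    assert(((uintptr_t)dst & 15) == 0);

    uint64_t v = 0x0101010101010101ull * c;  /* replicate c into 8 bytes */
    block16 pattern = { v, v };
    block16 *p = dst;

    for (size_t i = 0; i < n_blocks; i++)
        p[i] = pattern;   /* whole-struct store through an aligned type */
}
```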

On Wed, Mar 30, 2022 at 10:34 AM Rémi Bernon <rbernon at codeweavers.com> wrote:
> IIUC upstream isn't very interested in assembly optimized routine,
> unless really necessary.

I don't think hand-optimized asm is necessary here. Though the usage
of _mm_ functions is pretty close to it. I think my routine can be
refactored to be more platform-independent. The main annoyance is the
fact that PSLLDQ/PSRLDQ only accept compile-time constants as shift
values, meaning each of the fastcpy functions has slightly different
machine code - hence the jump table. You really don't want conditional
jumps inside that inner loop. For memcpy I think the annoyance is
worth it, but I'd hesitate to use it elsewhere.
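
To make the constraint concrete, here is a simplified sketch of the
technique (not the code from the patch). It assumes SSE2 and a GCC or
Clang style compiler; shift_copy, body_K and body_table are hypothetical
names. Each body is instantiated with a literal shift count, and a
table indexed by the source misalignment picks one outside the loop.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Source already aligned: plain aligned loads and stores. */
static void body_0(__m128i *d, const __m128i *s, size_t blocks)
{
    for (size_t i = 0; i < blocks; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

/* s points at the aligned block holding the first source byte at offset
 * K. Every output block is stitched from two neighbouring aligned loads.
 * K must be a literal because PSRLDQ/PSLLDQ only take immediates. */
#define DEFINE_BODY(K)                                                 \
    static void body_##K(__m128i *d, const __m128i *s, size_t blocks)  \
    {                                                                  \
        __m128i lo = _mm_load_si128(s);                                \
        for (size_t i = 0; i < blocks; i++) {                          \
            __m128i hi = _mm_load_si128(s + i + 1);                    \
            _mm_store_si128(d + i,                                     \
                _mm_or_si128(_mm_srli_si128(lo, K),                    \
                             _mm_slli_si128(hi, 16 - (K))));           \
            lo = hi;                                                   \
        }                                                              \
    }

DEFINE_BODY(1)  DEFINE_BODY(2)  DEFINE_BODY(3)  DEFINE_BODY(4)
DEFINE_BODY(5)  DEFINE_BODY(6)  DEFINE_BODY(7)  DEFINE_BODY(8)
DEFINE_BODY(9)  DEFINE_BODY(10) DEFINE_BODY(11) DEFINE_BODY(12)
DEFINE_BODY(13) DEFINE_BODY(14) DEFINE_BODY(15)

typedef void (*body_fn)(__m128i *, const __m128i *, size_t);
static const body_fn body_table[16] = {
    body_0,  body_1,  body_2,  body_3,  body_4,  body_5,  body_6,  body_7,
    body_8,  body_9,  body_10, body_11, body_12, body_13, body_14, body_15
};

void *shift_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 64) {
        /* Head: plain bytes until d is aligned, plus 16 more so that the
         * aligned block just below s still lies inside the source. */
        size_t head = ((size_t)(-(uintptr_t)d) & 15) + 16;
        n -= head;
        while (head--) *d++ = *s++;
        assert(((uintptr_t)d & 15) == 0);

        size_t k = (uintptr_t)s & 15;   /* remaining source misalignment */
        size_t blocks = n / 16 - 1;     /* the body loads one block ahead */
        body_table[k]((__m128i *)d, (const __m128i *)(s - k), blocks);
        d += blocks * 16; s += blocks * 16; n -= blocks * 16;
    }
    while (n--) *d++ = *s++;            /* tail (or short copy): bytes */
    return dst;
}
```

The single indirect call through body_table happens once per copy, so
the inner loop of each body contains no conditional jump on the shift
amount.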

To be honest, I had planned to submit patches for several of these
functions, including memset and strlen. Most of these can be done
without the need for special intrinsics, but the two-argument
functions like strcmp get a lot uglier. It might be best to create
a separate .c file for platform-specific implementations. IMO,
string.c is widely-used enough to justify extra complexity, but of
course I'm not the project maintainer.

The duplication between msvcrt and ntdll seems strange. From what I
can tell (may be wrong), most apps are probably using the msvcrt
version, but it looks like Wine libraries run the ntdll version
instead. Also, msvcrt seems to have some routines that are more
optimized than ntdll, but memset, and only memset, has the optimized
version in both. So it looks like I'm not the first person to miss
this detail. Is it possible to remove the code from ntdll and only
call msvcrt (or the other way around)? Or, worst case, have them both
compile the same C file so we don't have the same function in two
places?

- Elaine


