[PATCH 4/4] msvcrt: Add an SSE2 memset_aligned_32 implementation.

Rémi Bernon rbernon at codeweavers.com
Mon Sep 13 10:16:13 CDT 2021


On 9/13/21 4:50 PM, Piotr Caban wrote:
> Hi Rémi,
> 
> I think you're undervaluing the SSE2 codepath. While ERMS was introduced 
> on Intel CPUs quite long ago, it's a fairly new thing on AMD CPUs (as 
> far as I understand, the first AMD CPU to set the cpuid flag was released 
> in mid 2019).
> 

Okay, I admit that I don't know precisely which CPU generations are 
covered. But even in that case, I'm not sure that it's worth introducing 
an SSE2 code path.
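
For reference, checking the flag at runtime could look roughly like the
sketch below. It assumes GCC or clang and their <cpuid.h> helper;
cpu_has_erms is a made-up name and not something from this series. ERMS
is reported in CPUID leaf 7, subleaf 0, EBX bit 9.

```c
#include <stdbool.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Hypothetical helper, not part of the patch: returns true when the CPU
 * advertises ERMS (CPUID leaf 7, subleaf 0, EBX bit 9). */
static bool cpu_has_erms(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count returns 0 when leaf 7 is not available. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return (ebx >> 9) & 1;
#else
    /* Not an x86 CPU, so there is no ERMS to report. */
    return false;
#endif
}
```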

Although the non-vectorized code is two to three times slower than 
what SSE2 could do, it's still 25 times faster than the current code, 
which IMHO is good enough for most CPUs, and it doesn't need any 
specific instructions.

I'm also sure that SSE2 (and ERMS as well) has a lot of quirks, and that 
its performance profile varies across CPU models, so I feel that trying 
to finely optimize for them can be very tricky and not really worthwhile.

That said, I'm only arguing because I felt it was possible to write a 
good enough implementation in C. I don't mind very much in the end.
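
To illustrate what I mean by a plain C version, here is a minimal
sketch. It is not the code from the earlier patches in this series, and
memset_plain_c and store64 are names made up for the example. The idea
is simply to replicate the byte into a 64-bit pattern and to let the
last store overlap the previous one instead of handling the tail byte
by byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store one 64-bit pattern at p. Going through memcpy keeps the access
 * well-defined for any alignment; compilers usually emit a single store. */
static inline void store64(unsigned char *p, uint64_t v)
{
    memcpy(p, &v, sizeof(v));
}

/* Hypothetical sketch: fill n bytes at dst with the byte c. */
static void *memset_plain_c(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    /* Replicate the byte into all eight lanes of a 64-bit word. */
    uint64_t v = 0x0101010101010101ull * (unsigned char)c;
    size_t i;

    if (n < 8)
    {
        /* Too small for a 64-bit store, fall back to bytes. */
        for (i = 0; i < n; i++) d[i] = (unsigned char)c;
        return dst;
    }

    /* Whole 8-byte words first... */
    for (i = 0; i + 8 <= n; i += 8) store64(d + i, v);
    /* ...then one overlapping store to cover whatever tail is left. */
    store64(d + n - 8, v);
    return dst;
}
```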

> On 9/13/21 2:23 PM, Rémi Bernon wrote:
> 
>> +#ifdef __i386__
>> +    if (n < 2048 && sse2_supported)
> if ((n < 2048 && sse2_supported) || !erms_supported)
>> +#else
>> +    if (n < 2048)
> if (n < 2048 || !erms_supported)
>> +#endif
>> +    {
>> +        __asm__ __volatile__ (
>> +            "movd %1, %%xmm0\n\t"
>> +            "pshufd $0, %%xmm0, %%xmm0\n\t"
>> +            "test $0x20, %2\n\t"
>> +            "je 1f\n\t"
>> +            "sub $0x20, %2\n\t"
>> +            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
>> +            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
>> +            "je 2f\n\t"
>> +            "1:\n\t"
>> +            "sub $0x40, %2\n\t"
>> +            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
>> +            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
>> +            "movdqa %%xmm0, 0x20(%0,%2)\n\t"
>> +            "movdqa %%xmm0, 0x30(%0,%2)\n\t"
>> +            "ja 1b\n\t"
>> +            "2:\n\t"
>> +            :
>> +            : "r"(d), "r"((uint32_t)v), "c"(n)
>> +            : "memory"
>> +        );
> Shouldn't xmm0 be added to clobbered registers list?
> 

I guess so, yes, and "cc" for the flags too, I suppose.
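
Something along these lines, as an untested sketch rather than the next
revision of the patch. Besides the "xmm0" and "cc" clobbers, it also
makes n a read-write operand, since the asm modifies it. The function
name, the named operands, the assert and the non-SSE2 fallback are all
additions for the sake of the example. It assumes d is 16-byte aligned
(movdqa requires it) and n is a non-zero multiple of 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: fills n bytes at d with the 32-bit pattern v.
 * Assumes d is 16-byte aligned and n is a non-zero multiple of 32. */
static void memset_aligned_32_sketch(unsigned char *d, uint32_t v, size_t n)
{
    assert(n && !(n & 31) && !((uintptr_t)d & 15));
#if defined(__SSE2__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__ (
        "movd %[v], %%xmm0\n\t"
        "pshufd $0, %%xmm0, %%xmm0\n\t"
        "test $0x20, %[n]\n\t"
        "je 1f\n\t"
        "sub $0x20, %[n]\n\t"
        "movdqa %%xmm0, 0x00(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x10(%[d],%[n])\n\t"
        "je 2f\n\t"
        "1:\n\t"
        "sub $0x40, %[n]\n\t"
        "movdqa %%xmm0, 0x00(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x10(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x20(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x30(%[d],%[n])\n\t"
        "ja 1b\n\t"
        "2:\n\t"
        : [n] "+r"(n)              /* modified by the sub instructions */
        : [d] "r"(d), [v] "r"(v)
        : "xmm0", "memory", "cc"   /* register, stores and flags */
    );
#else
    /* Fallback without SSE2: same byte layout as the little-endian movd. */
    size_t i;
    for (i = 0; i < n; i++) d[i] = (unsigned char)(v >> (8 * (i & 3)));
#endif
}
```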
-- 
Rémi Bernon <rbernon at codeweavers.com>


