[PATCH] msvcrt: Improve memset performance on i386 and x86_64 architectures.

Rémi Bernon rbernon at codeweavers.com
Sat Sep 11 13:30:20 CDT 2021


On 9/11/21 7:38 PM, Piotr Caban wrote:
> On 9/11/21 4:41 PM, Rémi Bernon wrote:
>> On 9/11/21 8:51 AM, Piotr Caban wrote:
>>> Signed-off-by: Piotr Caban <piotr at codeweavers.com>
>>> ---
>>>   dlls/msvcrt/string.c | 126 +++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 126 insertions(+)
>>>
>>>
>>
>> FWIW, as far as I can see in my simple throughput benchmarks, with 
>> the default optimization flags (-O2), the unrolled C version:
>>
>> * Outperforms the SSE2 assembly on x86_64 for n <= 32 (20GB/s vs 
>> 12GB/s for n = 32), and performs equally well for "aligned" 
>> operations on larger sizes.
>>
>> * Runs at roughly a third of the throughput (25GB/s vs 70GB/s on my 
>> computer) for unaligned operations like memset(dst + 1, src, n) with n >= 256.
>>
>> * On i686 it performs equally well for small sizes (n <= 128), then 
>> runs at half the throughput (35GB/s vs 70GB/s) for aligned 
>> operations and a third for unaligned ones.
>>
>> It still has the advantage of being C code, benefiting all architectures.
> I think we should also improve the C implementation (I was planning to 
> encourage you to upstream it).
> 

Sure, I will then.

> I don't have your full benchmark results, but I think the general 
> conclusion is that the SSE implementation is equally good or much faster 
> for n >= 64. I will need to improve the n < 64 case.
> 
> Here are some results from my machine (x86_64, it shows how SSE 
> implementation compares to yours):
>   - 64MB aligned block - 1.2x faster
>   - 64MB unaligned - 1.3x faster
>   - 1MB aligned - 2x faster
>   - 1MB unaligned - 5x faster
>   - 32 bytes aligned - 2x slower
>   - 32 bytes unaligned - 2.3x slower
>   - 9 bytes - 1.3x slower
> 
> Thanks,
> Piotr

The SSE2 version is definitely still better in a lot of cases, 
especially for large sizes.

For these cases, I think the ERMS approach (rep stosb) is probably the 
most future-proof.

Its implementation is very simple, and it provides the best possible 
performance for large sizes, at least on recent CPUs: usually faster 
than SSE2, and possibly friendlier to the CPU cache (I believe).

It also looks like Intel and AMD intend to keep improving rep 
movsb/stosb performance and to make it the preferred way to copy or 
clear memory, over any vectorized implementation, so it could even 
become the best option for small sizes.
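For reference, the ERMS approach boils down to a single string 
instruction. A minimal sketch (not the actual patch; erms_memset is a 
hypothetical name, GCC/Clang extended inline asm, i386/x86_64 only):

```c
#include <stddef.h>

/* Hypothetical sketch: memset via "rep stosb". On CPUs advertising the
 * ERMS feature bit, the microcoded string store is often the fastest way
 * to fill large blocks. AL holds the byte, RDI/EDI the destination,
 * RCX/ECX the count; the "memory" clobber keeps stores ordered. */
static void *erms_memset(void *dst, int c, size_t n)
{
    void *d = dst;
    __asm__ volatile ("rep stosb"
                      : "+D" (d), "+c" (n)
                      : "a" (c)
                      : "memory");
    return dst;
}
```

A real implementation would still want a small-size fallback, since the 
rep setup overhead dominates for short fills.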

I'm attaching the results I have for all the versions, including an AVX 
implementation I had (written with Intel intrinsics instead of 
assembly).
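The intrinsics approach looks roughly like this. A hypothetical sketch, 
not the attached implementation (avx_memset is an assumed name; requires 
a compiler supporting the AVX target attribute and an AVX-capable CPU):

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical sketch: broadcast the fill byte into a 256-bit register
 * and store 32 bytes per iteration with unaligned stores, then finish
 * the tail one byte at a time. The target attribute enables AVX codegen
 * for this function without compiling the whole file with -mavx. */
__attribute__((target("avx")))
static void *avx_memset(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    __m256i v = _mm256_set1_epi8((char)c);

    while (n >= 32)
    {
        _mm256_storeu_si256((__m256i *)d, v); /* unaligned 32-byte store */
        d += 32;
        n -= 32;
    }
    while (n--) *d++ = (unsigned char)c;
    return dst;
}
```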

(I actually modified the unrolled C version a bit for these results: 
reversing the order of the assignments in the loops seemed to improve 
performance somehow for the unaligned cases.)
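To illustrate the shape of that tweak, here is a hypothetical sketch of 
an unrolled C fill loop (unrolled_memset is an assumed name, not the 
actual patch):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: replicate the byte across a 64-bit word and store
 * four words per iteration. The stores are written last-to-first within
 * each 32-byte block -- the "reversed assignments" tweak; the compiler
 * may reorder them, but the source order can change the generated code.
 * memcpy keeps the word stores alignment-safe and strict-aliasing-safe. */
static void *unrolled_memset(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    uint64_t v = 0x0101010101010101ULL * (unsigned char)c;

    while (n >= 32)
    {
        memcpy(d + 24, &v, 8);
        memcpy(d + 16, &v, 8);
        memcpy(d + 8,  &v, 8);
        memcpy(d + 0,  &v, 8);
        d += 32;
        n -= 32;
    }
    while (n--) *d++ = (unsigned char)c; /* scalar tail */
    return dst;
}
```

With -O2 the memcpy calls compile down to plain 8-byte stores, so this 
stays portable C while matching hand-written word stores.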

Cheers,
-- 
Rémi Bernon <rbernon at codeweavers.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memset-avx.log
Type: text/x-log
Size: 6523 bytes
Desc: not available
URL: <http://www.winehq.org/pipermail/wine-devel/attachments/20210911/73453a01/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memset-erms.log
Type: text/x-log
Size: 6523 bytes
Desc: not available
URL: <http://www.winehq.org/pipermail/wine-devel/attachments/20210911/73453a01/attachment-0005.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memset-sse2.log
Type: text/x-log
Size: 6520 bytes
Desc: not available
URL: <http://www.winehq.org/pipermail/wine-devel/attachments/20210911/73453a01/attachment-0006.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memset-unroll.log
Type: text/x-log
Size: 6526 bytes
Desc: not available
URL: <http://www.winehq.org/pipermail/wine-devel/attachments/20210911/73453a01/attachment-0007.bin>
