[PATCH] msvcrt: Improve memset performance on i386 and x86_64 architectures.
Rémi Bernon
rbernon at codeweavers.com
Sun Sep 12 03:35:57 CDT 2021
On 9/11/21 8:30 PM, Rémi Bernon wrote:
> On 9/11/21 7:38 PM, Piotr Caban wrote:
>> On 9/11/21 4:41 PM, Rémi Bernon wrote:
>>> On 9/11/21 8:51 AM, Piotr Caban wrote:
>>>> Signed-off-by: Piotr Caban <piotr at codeweavers.com>
>>>> ---
>>>> dlls/msvcrt/string.c | 126 +++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 126 insertions(+)
>>>>
>>>>
>>>
>>> FWIW as far as I can see on my simple throughput benchmarks, and with
>>> the default optimization flags (-O2), the unrolled C version:
>>>
>>> * Outperforms the SSE2 assembly on x86_64 for n <= 32 (20GB/s vs
>>> 12GB/s for n = 32), and performs equally well for "aligned"
>>> operations on larger sizes.
>>>
>>> * It runs at roughly a third of the SSE2 throughput (25GB/s vs
>>> 70GB/s on my computer) for unaligned operations like
>>> memset(dst + 1, src, n) with n >= 256.
>>>
>>> * On i686 it performs equally well for small sizes (n <= 128), then
>>> runs at half the SSE2 throughput (35GB/s vs 70GB/s) for aligned
>>> operations and at a third for unaligned ones.
>>>
>>> It still has the advantage of being plain C code, benefiting all
>>> architectures (a rough sketch of what I mean follows below).
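>>>
>>> (A rough, illustrative sketch of the kind of unrolled loop meant here;
>>> this is not the actual patch, and the 32-byte block size and helper
>>> name are just assumptions:)
>>>
>>>     #include <stddef.h>
>>>     #include <stdint.h>
>>>
>>>     static void *memset_unrolled(void *dst, int c, size_t n)
>>>     {
>>>         unsigned char *d = dst;
>>>         /* replicate the fill byte into a 64-bit word */
>>>         uint64_t v = 0x0101010101010101ull * (unsigned char)c;
>>>
>>>         /* assumes the target tolerates unaligned 8-byte stores,
>>>          * which is the case on i386 and x86_64 */
>>>         while (n >= 32)
>>>         {
>>>             ((uint64_t *)d)[0] = v;
>>>             ((uint64_t *)d)[1] = v;
>>>             ((uint64_t *)d)[2] = v;
>>>             ((uint64_t *)d)[3] = v;
>>>             d += 32;
>>>             n -= 32;
>>>         }
>>>         while (n--) *d++ = (unsigned char)c;
>>>         return dst;
>>>     }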
>> I think we should also improve the C implementation (I was planning to
>> encourage you to upstream it).
>>
>
> Sure, I will then.
>
>> I don't have your full benchmark results, but I think the general
>> conclusion is that the SSE implementation is equally good or much
>> faster for n >= 64. I will need to improve the n < 64 case.
>>
>> Here are some results from my machine (x86_64; they show how the SSE
>> implementation compares to yours):
>> - 64MB aligned block - 1.2x faster
>> - 64MB unaligned     - 1.3x faster
>> - 1MB aligned        - 2x faster
>> - 1MB unaligned      - 5x faster
>> - 32 bytes aligned   - 2x slower
>> - 32 bytes unaligned - 2.3x slower
>> - 9 bytes            - 1.3x slower
>>
>> Thanks,
>> Piotr
>
> The SSE2 version is definitely still better in a lot of cases, and
> especially for large sizes.
>
> For these cases, though, I'm thinking that the ERMS (enhanced rep
> movsb/stosb) approach is probably the most future-proof.
>
> Its implementation is very simple and it provides the best possible
> performance for large sizes, at least on recent CPUs: usually faster
> than SSE2, and possibly friendlier to the CPU cache (I believe).
>
> It also looks like Intel and AMD intend to keep improving rep
> movsb/stosb performance and make it the preferred way to copy or clear
> memory, over any vectorized implementation, so it could even end up
> being the best option for small sizes.
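>
> (Illustrative only, not the code under discussion: with GCC-style
> inline assembly on i386/x86_64, an ERMS based memset can be as simple
> as something like this; the helper name is made up:)
>
>     #include <stddef.h>
>
>     static void *memset_erms(void *dst, int c, size_t n)
>     {
>         void *d = dst;
>         /* rep stosb stores AL into n bytes at [edi/rdi]; ERMS-capable
>          * CPUs choose an efficient fill strategy in microcode */
>         __asm__ volatile ("rep stosb"
>                           : "+D" (d), "+c" (n)
>                           : "a" (c)
>                           : "memory");
>         return dst;
>     }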
>
> I'm attaching the results I have for all the versions, including an
> AVX implementation that I had (although I wrote it with Intel
> intrinsics instead of assembly).
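>
> (Again illustrative only, not the attached implementation: an AVX fill
> loop written with intrinsics could look roughly like this, with the
> remainder handling kept naive; it needs to be built with -mavx:)
>
>     #include <immintrin.h>
>     #include <stddef.h>
>
>     static void *memset_avx(void *dst, int c, size_t n)
>     {
>         unsigned char *d = dst;
>         __m256i v = _mm256_set1_epi8((char)c);
>
>         while (n >= 32)
>         {
>             /* unaligned 32-byte store */
>             _mm256_storeu_si256((__m256i *)d, v);
>             d += 32;
>             n -= 32;
>         }
>         while (n--) *d++ = (unsigned char)c;
>         return dst;
>     }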
>
> (I actually modified the unrolled C version a bit for these results,
> as reversing the order of the assignments in the loops seemed to
> somewhat improve performance for the unaligned cases.)
>
> Cheers,
Forgot to mention it, but I think this is an interesting read on that topic:
https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/
--
Rémi Bernon <rbernon at codeweavers.com>