[PATCH] msvcrt: Improve memset performance on i386 and x86_64 architectures.
Piotr Caban
piotr.caban at gmail.com
Sat Sep 11 12:38:56 CDT 2021
On 9/11/21 4:41 PM, Rémi Bernon wrote:
> On 9/11/21 8:51 AM, Piotr Caban wrote:
>> Signed-off-by: Piotr Caban <piotr at codeweavers.com>
>> ---
>> dlls/msvcrt/string.c | 126 +++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 126 insertions(+)
>>
>>
>
> FWIW as far as I can see on my simple throughput benchmarks, and with
> the default optimization flags (-O2), the unrolled C version:
>
> * Outperforms the SSE2 assembly on x86_64 for n <= 32 (20GB/s vs 12GB/s
> for n = 32), and performs equally well for "aligned" operations on
> larger sizes.
>
> * Reaches roughly a third of the throughput (25GB/s vs 70GB/s on my
> computer) on unaligned operations like memset(dst + 1, src, n) with
> n >= 256.
>
> * On i686 it performs equally for small sizes (n <= 128) and then
> performs at half the throughput (35GB/s vs 70GB/s) for aligned
> operations and a third for unaligned ones.
>
> It still has the advantage of being C code, benefiting all architectures.
I think we should also improve the C implementation (I was planning to
encourage you to upstream it).
I don't have your full benchmark results, but I think the general
conclusion is that the SSE implementation is as fast or much faster for
n >= 64. I still need to improve the n < 64 case.
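
For illustration only, one common way to handle the n < 64 case without a
byte loop is to issue a few wide stores from both ends of the buffer and
let them overlap in the middle. The sketch below is not taken from the
patch or from the C version under discussion; small_memset and the
store16/store32/store64 helpers are hypothetical names, and it assumes the
compiler turns the fixed-size memcpy calls into plain unaligned stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned stores expressed through memcpy, which is well defined in C. */
static void store16(unsigned char *p, uint16_t v) { memcpy(p, &v, sizeof(v)); }
static void store32(unsigned char *p, uint32_t v) { memcpy(p, &v, sizeof(v)); }
static void store64(unsigned char *p, uint64_t v) { memcpy(p, &v, sizeof(v)); }

/* Hypothetical helper: fill n bytes with overlapping stores for n < 64.
 * Larger sizes are simply forwarded to memset. */
static void *small_memset(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    /* Replicate the fill byte into all eight bytes of a 64-bit word. */
    uint64_t v = 0x0101010101010101ULL * (unsigned char)c;

    if (n >= 64) return memset(dst, c, n);

    if (n >= 32)
    {
        /* 32..63 bytes: 32 bytes from the start, 32 bytes from the end. */
        store64(d, v);          store64(d + 8, v);
        store64(d + 16, v);     store64(d + 24, v);
        store64(d + n - 32, v); store64(d + n - 24, v);
        store64(d + n - 16, v); store64(d + n - 8, v);
    }
    else if (n >= 16)
    {
        /* 16..31 bytes: 16 bytes from each end. */
        store64(d, v);          store64(d + 8, v);
        store64(d + n - 16, v); store64(d + n - 8, v);
    }
    else if (n >= 8)
    {
        store64(d, v);
        store64(d + n - 8, v);
    }
    else if (n >= 4)
    {
        store32(d, (uint32_t)v);
        store32(d + n - 4, (uint32_t)v);
    }
    else if (n >= 2)
    {
        store16(d, (uint16_t)v);
        store16(d + n - 2, (uint16_t)v);
    }
    else if (n == 1)
    {
        d[0] = (unsigned char)c;
    }
    return dst;
}
```

Each branch writes at most eight words and has no loop, so the cost for
small sizes is mostly the size dispatch itself. Whether this beats the
existing code would have to be measured.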
Here are some results from my machine (x86_64), showing how the SSE
implementation compares to yours:
- 64 MB aligned block - 1.2x faster
- 64 MB unaligned - 1.3x faster
- 1 MB aligned - 2x faster
- 1 MB unaligned - 5x faster
- 32 bytes aligned - 2x slower
- 32 bytes unaligned - 2.3x slower
- 9 bytes - 1.3x slower
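
For anyone who wants to reproduce this kind of comparison, a rough
throughput harness could look like the sketch below. It is not the
benchmark that produced the numbers above; memset_throughput_gbps is a
hypothetical helper, it times with clock() (CPU time, which is close to
wall time for a single-threaded loop), and it treats "unaligned" as a byte
offset from a 64-byte boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: memset throughput in GB/s for blocks of `size`
 * bytes starting `offset` bytes past a 64-byte aligned address.
 * Returns 0.0 on allocation failure or if the run was too short to time. */
static double memset_throughput_gbps(size_t size, size_t offset,
                                     unsigned int iterations)
{
    /* Calling through a volatile pointer keeps the compiler from
     * eliding or inlining the calls at -O2. */
    void *(*volatile fill)(void *, int, size_t) = memset;
    unsigned char *raw, *base;
    clock_t start, end;
    double seconds;
    unsigned int i;

    raw = malloc(size + offset + 64);
    if (!raw) return 0.0;

    /* Round up to a 64-byte boundary so that offset 0 is the aligned case. */
    base = (unsigned char *)(((uintptr_t)raw + 63) & ~(uintptr_t)63);

    /* Warm-up call so page faults are not counted in the timed loop. */
    fill(base + offset, 0, size);

    start = clock();
    for (i = 0; i < iterations; i++)
        fill(base + offset, (int)i, size);
    end = clock();

    free(raw);

    seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0) return 0.0;
    return (double)size * iterations / seconds / 1e9;
}
```

For example, memset_throughput_gbps(1 << 20, 0, 1000) would correspond to
the 1 MB aligned case and memset_throughput_gbps(1 << 20, 1, 1000) to the
1 MB unaligned one. To compare two implementations, point the function
pointer at each of them in turn.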
Thanks,
Piotr