[PATCH v4 1/3] msvcrt: Improve memset performance using overlapping stores.

Martin Storsjö martin at martin.st
Wed Sep 15 14:49:46 CDT 2021


On Tue, 14 Sep 2021, Rémi Bernon wrote:

> For n larger than 16 we store 16 bytes on each end of the buffer,
> eventually overlapping, and then 16 additional bytes for n > 32.
>
> Then we can find a 32-byte aligned range overlapping the remaining part
> of the destination buffer, which is filled 32 bytes at a time in a loop.
>
> Signed-off-by: Rémi Bernon <rbernon at codeweavers.com>
> ---
> dlls/msvcrt/string.c | 60 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 57 insertions(+), 3 deletions(-)


> -    volatile unsigned char *d = dst;  /* avoid gcc optimizations */
> -    while (n--) *d++ = c;
> +    uint64_t v = 0x101010101010101ull * (unsigned char)c;
> +    unsigned char *d = (unsigned char *)dst;
> +    size_t a = 0x20 - ((uintptr_t)d & 0x1f);
> +
> +    if (n >= 16)
> +    {
> +        *(uint64_t *)(d + 0) = v;
> +        *(uint64_t *)(d + 8) = v;
> +        *(uint64_t *)(d + n - 16) = v;
> +        *(uint64_t *)(d + n - 8) = v;

FYI this broke memset on ARM (32 bit) due to misalignment. ARM used to be 
quite alignment-picky in older versions, but since ARMv7, 32 bit register 
loads/stores can be unaligned. For 64 bit writes, there's an instruction 
STRD, which can't be used unaligned though, but in these cases, the 
compiler is free to use it.

The surprising thing about STRD is that it only requires 32 bit alignment, 
even if it writes 64 bit. First I tried to replace

         *(uint64_t *)(d + 0) = v;

with

         *(uint32_t *)(d + 0) = v;
         *(uint32_t *)(d + 4) = v;

hoping to use 32 bit stores (which work unaligned). However, after casting 
to uint32_t*, the compiler is free to assume that the resulting pointer is 
32 bit aligned, and STRD only requires 32 bit alignment, so the compiler 
can still fuse these two stores into one single STRD.

By using

         *(volatile uint32_t *)(d + 0) = v;
         *(volatile uint32_t *)(d + 4) = v;

the compiler emits them as two separate 32 bit stores though (which work 
fine with any alignment).

I'll send a PoC patch that fixes things for me.

// Martin


More information about the wine-devel mailing list