[PATCH v4 1/3] msvcrt: Improve memset performance using overlapping stores.
Rémi Bernon
rbernon at codeweavers.com
Wed Sep 15 16:51:23 CDT 2021
On 9/15/21 9:49 PM, Martin Storsjö wrote:
> On Tue, 14 Sep 2021, Rémi Bernon wrote:
>
>> For n larger than 16 we store 16 bytes on each end of the buffer,
>> eventually overlapping, and then 16 additional bytes for n > 32.
>>
>> Then we can find a 32-byte aligned range overlapping the remaining part
>> of the destination buffer, which is filled 32 bytes at a time in a loop.
>>
>> Signed-off-by: Rémi Bernon <rbernon at codeweavers.com>
>> ---
>> dlls/msvcrt/string.c | 60 +++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 57 insertions(+), 3 deletions(-)
>
>
>> - volatile unsigned char *d = dst; /* avoid gcc optimizations */
>> - while (n--) *d++ = c;
>> + uint64_t v = 0x101010101010101ull * (unsigned char)c;
>> + unsigned char *d = (unsigned char *)dst;
>> + size_t a = 0x20 - ((uintptr_t)d & 0x1f);
>> +
>> + if (n >= 16)
>> + {
>> + *(uint64_t *)(d + 0) = v;
>> + *(uint64_t *)(d + 8) = v;
>> + *(uint64_t *)(d + n - 16) = v;
>> + *(uint64_t *)(d + n - 8) = v;
>
> FYI this broke memset on ARM (32 bit) due to misalignment. ARM used to
> be quite alignment-picky in older versions, but since ARMv7, 32 bit
> register loads/stores can be unaligned. For 64 bit writes, there's an
> instruction STRD, which can't be used unaligned though, but in these
> cases, the compiler is free to use it.
>
> The surprising thing about STRD is that it only requires 32 bit
> alignment, even if it writes 64 bit. First I tried to replace
>
> *(uint64_t *)(d + 0) = v;
>
> with
>
> *(uint32_t *)(d + 0) = v;
> *(uint32_t *)(d + 4) = v;
>
> hoping to use 32 bit stores (which work unaligned). However, after
> casting to uint32_t*, the compiler is free to assume that the resulting
> pointer is 32 bit aligned, and STRD only requires 32 bit alignment, so
> the compiler can still fuse these two stores into one single STRD.
>
> By using
>
> *(volatile uint32_t *)(d + 0) = v;
> *(volatile uint32_t *)(d + 4) = v;
>
> the compiler emits them as two separate 32 bit stores though (which work
> fine with any alignment).
>
> I'll send a PoC patch that fixes things for me.
>
> // Martin
Ouch, okay, sorry about that and thanks for the information.
--
Rémi Bernon <rbernon at codeweavers.com>
More information about the wine-devel
mailing list