[PATCH v4 1/3] msvcrt: Improve memset performance using overlapping stores.

Rémi Bernon rbernon at codeweavers.com
Wed Sep 15 16:51:23 CDT 2021


On 9/15/21 9:49 PM, Martin Storsjö wrote:
> On Tue, 14 Sep 2021, Rémi Bernon wrote:
> 
>> For n larger than 16 we store 16 bytes on each end of the buffer,
>> eventually overlapping, and then 16 additional bytes for n > 32.
>>
>> Then we can find a 32-byte aligned range overlapping the remaining part
>> of the destination buffer, which is filled 32 bytes at a time in a loop.
>>
>> Signed-off-by: Rémi Bernon <rbernon at codeweavers.com>
>> ---
>> dlls/msvcrt/string.c | 60 +++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 57 insertions(+), 3 deletions(-)
> 
> 
>> -    volatile unsigned char *d = dst;  /* avoid gcc optimizations */
>> -    while (n--) *d++ = c;
>> +    uint64_t v = 0x101010101010101ull * (unsigned char)c;
>> +    unsigned char *d = (unsigned char *)dst;
>> +    size_t a = 0x20 - ((uintptr_t)d & 0x1f);
>> +
>> +    if (n >= 16)
>> +    {
>> +        *(uint64_t *)(d + 0) = v;
>> +        *(uint64_t *)(d + 8) = v;
>> +        *(uint64_t *)(d + n - 16) = v;
>> +        *(uint64_t *)(d + n - 8) = v;
> 
> FYI this broke memset on ARM (32 bit) due to misalignment. ARM used to 
> be quite alignment-picky in older versions, but since ARMv7, 32 bit 
> register loads/stores can be unaligned. For 64 bit writes, there's an 
> instruction STRD, which can't be used unaligned though, but in these 
> cases, the compiler is free to use it.
> 
> The surprising thing about STRD is that it only requires 32 bit 
> alignment, even if it writes 64 bit. First I tried to replace
> 
>          *(uint64_t *)(d + 0) = v;
> 
> with
> 
>          *(uint32_t *)(d + 0) = v;
>          *(uint32_t *)(d + 4) = v;
> 
> hoping to use 32 bit stores (which work unaligned). However, after 
> casting to uint32_t*, the compiler is free to assume that the resulting 
> pointer is 32 bit aligned, and STRD only requires 32 bit alignment, so 
> the compiler can still fuse these two stores into one single STRD.
> 
> By using
> 
>          *(volatile uint32_t *)(d + 0) = v;
>          *(volatile uint32_t *)(d + 4) = v;
> 
> the compiler emits them as two separate 32 bit stores though (which work 
> fine with any alignment).
> 
> I'll send a PoC patch that fixes things for me.
> 
> // Martin

Ouch, okay, sorry about that and thanks for the information.
-- 
Rémi Bernon <rbernon at codeweavers.com>



More information about the wine-devel mailing list