[PATCH] msvcrt: Faster memcmp().

Tue Apr 19 03:39:25 CDT 2022

On 4/13/22 15:50, Piotr Caban wrote:
> Hi Jan,
> 
> On 4/6/22 19:14, Jan Sikorski wrote:
>> -/*********************************************************************
>> - *                  memcmp (MSVCRT.@)
>> - */
>> -int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
>> +static inline int memcmp_unaligned(const void *ptr1, const void 
>> *ptr2, size_t n)
>>   {
>>       const unsigned char *p1, *p2;
>> @@ -2690,6 +2687,64 @@ int __cdecl memcmp(const void *ptr1, const void 
>> *ptr2, size_t n)
>>       return 0;
>>   }
> I think it would be good to optimize memcmp_unaligned a little. I'm 
> thinking about something along these lines (untested):
> static int memcmp_size_t(size_t s1, size_t s2)
> {
>      const uint8_t *p1 = (const uint8_t*)&s1, *p2 = (const uint8_t*)&s2;
>      while (*p1 == *p2)
>      {
>          p1++;
>          p2++;
>      }
>      return *p1 > *p2 ? 1 : -1;
> }
> 

You can make the small case branchless (and slightly faster) by 
converting to big endian then comparing the value at once.

> static int memcmp_unaligned(const char *c1, const char *c2, size_t len)
> {
>      int sh1 = 8 * ((size_t)c2 % sizeof(size_t));
>      int sh2 = 8 * sizeof(size_t) - sh1;
>      const size_t *s1 = (const size_t*)c1;
>      const size_t *s2 = (const size_t*)(c2 - sh1 / 8);
>      size_t x, y, m;
> 
>      x = s2[0];
>      do
>      {
>          y = s2[1];
>          m = MERGE(x, sh1, y, sh2);
>          if (*s1 != m)
>              return memcmp_size_t(*s1, m);
>          s1++;
>          s2++;
>          len--;
>          x = y;
>      } while (len);
>      return 0;
> }
> 
> Where MERGE is already defined in string.c file, len is length in 
> sizeof(size_t) blocks instead of bytes. It may be even better to switch 
> to uint64_t instead of using size_t like in memset. You can also take a 
> look on glibc platform independent implementation (it uses the MERGE 
> trick + loop unrolling, according to some random benchmark it's on par 
> with your implementation performance wise on i386/x86_64 and is much 
> faster on arm).
> 
> Thanks,
> Piotr
> 

FWIW for me and with a basic benchmark this version is 3x slower than 
Jan's version on 32bit x86, 2x slower on 64bit.

-- 
Rémi Bernon <rbernon at codeweavers.com>