[PATCH] msvcrt: Faster memcmp().

Tue Apr 19 05:02:48 CDT 2022

On 4/19/22 10:39, Rémi Bernon wrote:
> On 4/13/22 15:50, Piotr Caban wrote:
>> Hi Jan,
>>
>> On 4/6/22 19:14, Jan Sikorski wrote:
>>> -/*********************************************************************
>>> - *                  memcmp (MSVCRT.@)
>>> - */
>>> -int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
>>> +static inline int memcmp_unaligned(const void *ptr1, const void 
>>> *ptr2, size_t n)
>>>   {
>>>       const unsigned char *p1, *p2;
>>> @@ -2690,6 +2687,64 @@ int __cdecl memcmp(const void *ptr1, const 
>>> void *ptr2, size_t n)
>>>       return 0;
>>>   }
>> I think it would be good to optimize memcmp_unaligned a little. I'm 
>> thinking about something along these lines (untested):
>> static int memcmp_size_t(size_t s1, size_t s2)
>> {
>>      const uint8_t *p1 = (const uint8_t*)&s1, *p2 = (const uint8_t*)&s2;
>>      while (*p1 == *p2)
>>      {
>>          p1++;
>>          p2++;
>>      }
>>      return *p1 > *p2 ? 1 : -1;
>> }
>>
> 
> 
> You can make the small case branchless (and slightly faster) by 
> converting to big endian then comparing the value at once.
> 
> 
>> static int memcmp_unaligned(const char *c1, const char *c2, size_t len)
>> {
>>      int sh1 = 8 * ((size_t)c2 % sizeof(size_t));
>>      int sh2 = 8 * sizeof(size_t) - sh1;
>>      const size_t *s1 = (const size_t*)c1;
>>      const size_t *s2 = (const size_t*)(c2 - sh1 / 8);
>>      size_t x, y, m;
>>
>>      x = s2[0];
>>      do
>>      {
>>          y = s2[1];
>>          m = MERGE(x, sh1, y, sh2);
>>          if (*s1 != m)
>>              return memcmp_size_t(*s1, m);
>>          s1++;
>>          s2++;
>>          len--;
>>          x = y;
>>      } while (len);
>>      return 0;
>> }
>>
>> Where MERGE is already defined in string.c file, len is length in 
>> sizeof(size_t) blocks instead of bytes. It may be even better to 
>> switch to uint64_t instead of using size_t like in memset. You can 
>> also take a look on glibc platform independent implementation (it uses 
>> the MERGE trick + loop unrolling, according to some random benchmark 
>> it's on par with your implementation performance wise on i386/x86_64 
>> and is much faster on arm).
>>
>> Thanks,
>> Piotr
>>
> 
> 
> FWIW for me and with a basic benchmark this version is 3x slower than 
> Jan's version on 32bit x86, 2x slower on 64bit.
> 
To make sure we're talking about the same thing - I've meant that 
glibc's (platform independent) version has similar performance. The code 
I have attached here was only added to show general idea while keeping 
the code as simple as possible.