[PATCH] msvcrt: Faster memcmp().

Wed Apr 13 08:50:20 CDT 2022

Hi Jan,

On 4/6/22 19:14, Jan Sikorski wrote:
> -/*********************************************************************
> - *                  memcmp (MSVCRT.@)
> - */
> -int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
> +static inline int memcmp_unaligned(const void *ptr1, const void *ptr2, size_t n)
>   {
>       const unsigned char *p1, *p2;
>   
> @@ -2690,6 +2687,64 @@ int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
>       return 0;
>   }

I think it would be good to optimize memcmp_unaligned a little. I'm
thinking about something along these lines (untested):

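/* Compare two words that are known to differ, by their first differing
 * byte in memory order; the loop relies on s1 != s2 to terminate. */
static int memcmp_size_t(size_t s1, size_t s2)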
{
     const uint8_t *p1 = (const uint8_t*)&s1, *p2 = (const uint8_t*)&s2;
     while (*p1 == *p2)
     {
         p1++;
         p2++;
     }
     return *p1 > *p2 ? 1 : -1;
}

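/* Assumes c1 is size_t-aligned, c2 is not (sh1 == 0 would make sh2 a
 * full-width shift, which is undefined), and len > 0, with len counted
 * in size_t blocks. */
static int memcmp_unaligned(const char *c1, const char *c2, size_t len)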
{
     int sh1 = 8 * ((size_t)c2 % sizeof(size_t));
     int sh2 = 8 * sizeof(size_t) - sh1;
     const size_t *s1 = (const size_t*)c1;
     const size_t *s2 = (const size_t*)(c2 - sh1 / 8);
     size_t x, y, m;

     x = s2[0];
     do
     {
         y = s2[1];
         m = MERGE(x, sh1, y, sh2);
         if (*s1 != m)
             return memcmp_size_t(*s1, m);
         s1++;
         s2++;
         len--;
         x = y;
     } while (len);
     return 0;
}

Here MERGE is the macro already defined in string.c, and len is the
length in sizeof(size_t) blocks rather than bytes. It may be even better
to use uint64_t instead of size_t, like memset does. You can also take a
look at glibc's platform-independent implementation (it uses the MERGE
trick plus loop unrolling; according to some random benchmark it's on
par with your implementation performance-wise on i386/x86_64 and is much
faster on ARM).
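
For reference, here is a minimal, self-contained sketch of the merge idea
outside of Wine. It is an illustration under assumptions, not the code
from string.c: MERGE_LE, cmp_word_bytes and cmp_merged are hypothetical
names, the macro only handles little-endian byte order, and the unaligned
operand is modelled as a byte offset (1..7) into an aligned uint64_t
array so that every load stays aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MERGE, little-endian only: take the upper
 * bytes of w1 and the lower bytes of w2. Needs 0 < sh1 < 64. */
#define MERGE_LE(w1, sh1, w2, sh2) (((w1) >> (sh1)) | ((w2) << (sh2)))

/* Order two words that are known to differ by their first differing
 * byte in memory, the way memcmp() would. */
static int cmp_word_bytes(uint64_t a, uint64_t b)
{
    const unsigned char *pa = (const unsigned char *)&a;
    const unsigned char *pb = (const unsigned char *)&b;
    size_t i = 0;

    while (i < sizeof(a) - 1 && pa[i] == pb[i])
        i++;
    return pa[i] > pb[i] ? 1 : -1;
}

/* Compare nwords 64-bit words. w1 is aligned; the other operand starts
 * byte_off bytes into the aligned array w2, which must have at least
 * nwords + 1 readable words. */
static int cmp_merged(const uint64_t *w1, const uint64_t *w2,
                      size_t byte_off, size_t nwords)
{
    int sh1 = 8 * (int)byte_off;
    int sh2 = 64 - sh1;
    uint64_t x, y, m;
    size_t i;

    /* byte_off == 0 would mean a shift by 64, which is undefined. */
    assert(byte_off > 0 && byte_off < sizeof(uint64_t));

    x = w2[0];
    for (i = 0; i < nwords; i++)
    {
        y = w2[i + 1];
        m = MERGE_LE(x, sh1, y, sh2); /* rebuild the unaligned word */
        if (w1[i] != m)
            return cmp_word_bytes(w1[i], m);
        x = y;                        /* reuse the load next round */
    }
    return 0;
}
```

Each iteration does one new aligned load and reuses the previous one, so
the unaligned side costs about the same number of loads as the aligned
side. The extra word at the end of w2 is the aligned word that holds the
tail of the unaligned operand.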

Thanks,
Piotr