[PATCH] msvcrt: Faster memcmp().
Rémi Bernon
rbernon at codeweavers.com
Tue Apr 19 03:39:25 CDT 2022
On 4/13/22 15:50, Piotr Caban wrote:
> Hi Jan,
>
> On 4/6/22 19:14, Jan Sikorski wrote:
>> -/*********************************************************************
>> - * memcmp (MSVCRT.@)
>> - */
>> -int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
>> +static inline int memcmp_unaligned(const void *ptr1, const void
>> *ptr2, size_t n)
>> {
>> const unsigned char *p1, *p2;
>> @@ -2690,6 +2687,64 @@ int __cdecl memcmp(const void *ptr1, const void
>> *ptr2, size_t n)
>> return 0;
>> }
> I think it would be good to optimize memcmp_unaligned a little. I'm
> thinking about something along these lines (untested):
> static int memcmp_size_t(size_t s1, size_t s2)
> {
> const uint8_t *p1 = (const uint8_t*)&s1, *p2 = (const uint8_t*)&s2;
> while (*p1 == *p2)
> {
> p1++;
> p2++;
> }
> return *p1 > *p2 ? 1 : -1;
> }
>
You can make the small case branchless (and slightly faster) by
converting to big endian then comparing the value at once.
> static int memcmp_unaligned(const char *c1, const char *c2, size_t len)
> {
> int sh1 = 8 * ((size_t)c2 % sizeof(size_t));
> int sh2 = 8 * sizeof(size_t) - sh1;
> const size_t *s1 = (const size_t*)c1;
> const size_t *s2 = (const size_t*)(c2 - sh1 / 8);
> size_t x, y, m;
>
> x = s2[0];
> do
> {
> y = s2[1];
> m = MERGE(x, sh1, y, sh2);
> if (*s1 != m)
> return memcmp_size_t(*s1, m);
> s1++;
> s2++;
> len--;
> x = y;
> } while (len);
> return 0;
> }
>
> Where MERGE is already defined in string.c file, len is length in
> sizeof(size_t) blocks instead of bytes. It may be even better to switch
> to uint64_t instead of using size_t like in memset. You can also take a
> look on glibc platform independent implementation (it uses the MERGE
> trick + loop unrolling, according to some random benchmark it's on par
> with your implementation performance wise on i386/x86_64 and is much
> faster on arm).
>
> Thanks,
> Piotr
>
FWIW for me and with a basic benchmark this version is 3x slower than
Jan's version on 32bit x86, 2x slower on 64bit.
--
Rémi Bernon <rbernon at codeweavers.com>
More information about the wine-devel
mailing list