[PATCH] msvcrt: Faster memcmp().
Piotr Caban
piotr.caban at gmail.com
Tue Apr 19 05:02:48 CDT 2022
On 4/19/22 10:39, Rémi Bernon wrote:
> On 4/13/22 15:50, Piotr Caban wrote:
>> Hi Jan,
>>
>> On 4/6/22 19:14, Jan Sikorski wrote:
>>> -/*********************************************************************
>>> - * memcmp (MSVCRT.@)
>>> - */
>>> -int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
>>> +static inline int memcmp_unaligned(const void *ptr1, const void
>>> *ptr2, size_t n)
>>> {
>>> const unsigned char *p1, *p2;
>>> @@ -2690,6 +2687,64 @@ int __cdecl memcmp(const void *ptr1, const
>>> void *ptr2, size_t n)
>>> return 0;
>>> }
>> I think it would be good to optimize memcmp_unaligned a little. I'm
>> thinking about something along these lines (untested):
>> static int memcmp_size_t(size_t s1, size_t s2)
>> {
>> const uint8_t *p1 = (const uint8_t*)&s1, *p2 = (const uint8_t*)&s2;
>> while (*p1 == *p2)
>> {
>> p1++;
>> p2++;
>> }
>> return *p1 > *p2 ? 1 : -1;
>> }
>>
>
>
> You can make the small case branchless (and slightly faster) by
> converting to big endian then comparing the value at once.
>
>
>> static int memcmp_unaligned(const char *c1, const char *c2, size_t len)
>> {
>> int sh1 = 8 * ((size_t)c2 % sizeof(size_t));
>> int sh2 = 8 * sizeof(size_t) - sh1;
>> const size_t *s1 = (const size_t*)c1;
>> const size_t *s2 = (const size_t*)(c2 - sh1 / 8);
>> size_t x, y, m;
>>
>> x = s2[0];
>> do
>> {
>> y = s2[1];
>> m = MERGE(x, sh1, y, sh2);
>> if (*s1 != m)
>> return memcmp_size_t(*s1, m);
>> s1++;
>> s2++;
>> len--;
>> x = y;
>> } while (len);
>> return 0;
>> }
>>
>> Where MERGE is already defined in string.c file, len is length in
>> sizeof(size_t) blocks instead of bytes. It may be even better to
>> switch to uint64_t instead of using size_t like in memset. You can
>> also take a look on glibc platform independent implementation (it uses
>> the MERGE trick + loop unrolling, according to some random benchmark
>> it's on par with your implementation performance wise on i386/x86_64
>> and is much faster on arm).
>>
>> Thanks,
>> Piotr
>>
>
>
> FWIW for me and with a basic benchmark this version is 3x slower than
> Jan's version on 32bit x86, 2x slower on 64bit.
>
To make sure we're talking about the same thing - I've meant that
glibc's (platform independent) version has similar performance. The code
I have attached here was only added to show general idea while keeping
the code as simple as possible.
More information about the wine-devel
mailing list