[PATCH] msvcrt: SSE2 implementation of memcmp for x86_64.

Elaine Lefler elaineclefler at gmail.com
Wed Apr 6 22:59:20 CDT 2022


On Wed, Apr 6, 2022 at 8:28 PM Jin-oh Kang <jinoh.kang.kr at gmail.com> wrote:
>
> Perhaps I've misphrased myself here. Note that "direct assembly != completely hand-written assembly." It's a bold claim that a *human* could outperform a compiler in machine code optimization in the first place. I said we should stick to assembler because instruction scheduling is more predictable across compilers that way, *not* because a human could do better at scheduling. We can take the assembly output from the best of the compilers and do whatever we please on it. (That's even how it's usually done!) This will bring the optimization work to much older and/or less capable compilers, since we're not relying on the user's compiler's performance.
>
> llvm-mca simulates CPU pipeline and shows how well your code would perform on a superscalar architecture. Perhaps we can use that as well.
>

>
> Yeah, we can first write the first version in C with intrinsics, look at differences between outputs of several compilers, and choose the best one.
>

Fair points, although inline ASM is still a pain for code reviewers.

I have noticed that Clang generates much better code than GCC here. In
my case I solved that by rewriting the C code with different
intrinsics, so the output is now essentially the same on both
compilers. It's definitely worth _looking_ at the output from multiple
compilers, but I'm not sure how often (if ever) a difference can't be
fixed by rewriting your C code.
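
For context, the kind of intrinsics code in question looks roughly
like this (a simplified sketch, not the code from this patch, and the
function name is made up); GCC and Clang both lower it to the expected
movdqu/pcmpeqb/pmovmskb sequence:

#include <emmintrin.h>
#include <stddef.h>

static int compare_block16(const unsigned char *a, const unsigned char *b)
{
    /* Unaligned 16-byte loads; pcmpeqb + pmovmskb turn the comparison
     * into a 16-bit mask with one bit set per equal byte. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));

    if (mask == 0xffff) return 0;        /* all 16 bytes equal */
    size_t i = __builtin_ctz(~mask);     /* index of first difference */
    return a[i] - b[i];
}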

> Note that Wine still supports GCC 4.x. Also, future compiler regressions may affect the performance of the optimized code (as Jan puts it).

True, regressions happen, but that could affect any code, not just the
optimized stuff.

In my mind, if you're compiling Wine with a very old GCC, and it
performs poorly as a result, that's a "you" problem. Binary packagers
should be using more modern compilers. You're not going to have a good
experience running the latest AAA games on a system that's too old for
GCC 11, even if it's theoretically possible.

>
> These improvements (except code placement for I-cache utilization) have nothing to do with compiler optimization. The programmer can make these mistakes either way (ASM or C).
>

You're correct. My point is that it's much harder to _fix_ these
mistakes when writing ASM. And code that is difficult to fix is often
left untouched.

>>
>> So I'm strongly opposed to ASM unless a C equivalent is completely
>> impossible. The code that a developer _thinks_ will be fast and the
>> code that is _actually_ fast are often not the same thing. C makes it
>> much easier to tweak.
>
>
> Or rather harder to tweak, since code arrangement is not something programmer has control over.

In most cases, performance is dictated by memory access patterns
rather than by instruction arrangement, and compilers generally won't
change your access patterns. CPUs also don't execute instructions in
the order they're written, but in whatever order the out-of-order
hardware decides is fastest.
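
A stock example, nothing to do with memcmp, just to illustrate the
point: both functions below do the same arithmetic, but the first
walks memory sequentially while the second strides across it, and that
access pattern alone typically makes the first one several times
faster.

#define N 1024

static long sum_row_major(const int m[N][N])
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];          /* contiguous accesses, cache-friendly */
    return s;
}

static long sum_col_major(const int m[N][N])
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];          /* N*sizeof(int) stride, cache-hostile */
    return s;
}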

As a developer, you mainly want to avoid dependency chains (i.e. code
that requires a strict "A, then B, then C" order of operations) but
you can do that in either C or in ASM. Branches are also a bottleneck,
but those are hint-able, and even an old compiler likely understands
them better than a human.
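
A contrived sketch of both points in plain C (names made up, not taken
from the patch): the two accumulators below are independent chains the
CPU can overlap, and __builtin_expect marks the early-exit branch as
the cold path.

#include <stddef.h>

static int ranges_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
    unsigned d0 = 0, d1 = 0;
    size_t i;

    for (i = 0; i + 2 <= n; i += 2)
    {
        d0 |= a[i]     ^ b[i];      /* chain 0 */
        d1 |= a[i + 1] ^ b[i + 1];  /* chain 1, independent of chain 0 */
        if (__builtin_expect(d0 | d1, 0)) return 0;   /* unlikely: bail out */
    }
    if (i < n) d0 |= a[i] ^ b[i];   /* odd trailing byte */
    return !(d0 | d1);
}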

Also, by the same token as "compiler regressions could hamper
performance", compiler improvements can produce better code with no
further developer work. Pinning the ASM, by contrast, means someone
has to revisit it later to see whether code generation has improved.

To be clear, I don't think it's impossible for inline ASM to be
superior; I just think it's unlikely to be worth the effort.


