<div dir="auto"><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Apr 7, 2022, 10:56 AM Elaine Lefler <<a href="mailto:elaineclefler@gmail.com">elaineclefler@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Wed, Apr 6, 2022 at 6:02 AM Jinoh Kang <<a href="mailto:jinoh.kang.kr@gmail.com" target="_blank" rel="noreferrer">jinoh.kang.kr@gmail.com</a>> wrote:<br><br>

> > So that's some complicated code which isn't actually better than a<br>

> > straightforward uint64_t loop. I think that's the reason I prefer<br>

> > seeing intrinsics - granted, I have a lot of experience reading them,<br>

> > and I understand they're unfriendly to people who aren't familiar -<br>

> > but they give you assurance that the compiler actually works as<br>

> > expected.<br>

><br>

> I think writing assembly directly is still best for performance, since we can control instruction scheduling that way.<br>

><br>

<br>

IME, it's almost impossible to hand-write ASM that outperforms a<br>

compiler. You might have to rearrange the C code a bit, but<br>

well-written intrinsics are usually just as good (within margin of<br>

error) as ASM, and much more flexible.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Perhaps I phrased that poorly. Note that "direct assembly != completely hand-written assembly." It's a bold claim that a *human* could outperform a compiler in machine code optimization in the first place. I said we should stick to assembler because instruction scheduling is more predictable across compilers that way, *not* because a human could do better at scheduling. We can take the assembly output from the best of the compilers and do whatever we please with it. (That's even how it's usually done!) This brings the optimizations to much older and/or less capable compilers as well, since we're no longer relying on how well the user's compiler optimizes. Note that Wine still supports GCC 4.x. Also, future compiler regressions may affect the performance of the optimized code (as Jan puts it).</div><div dir="auto"><br></div><div dir="auto">llvm-mca simulates the CPU pipeline and shows how well your code would perform on a superscalar architecture. Perhaps we can use that as well.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

When writing high-performance code it's usually necessary to try<br>

multiple variations in order to find the fastest path. You can't<br>

easily do that with ASM, and it leads to incomplete optimization.</blockquote></div></div><div dir="auto"><br></div><div dir="auto">Yeah, we can write the first version in C with intrinsics, compare the output of several compilers, and choose the best one.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> For<br>

instance, I was able to write a C memcpy function that outperforms<br>

msvcrt's hand-written assembly, for the following reasons:<br>

- The ASM version has too many branches for small copies, branch<br>

misprediction is a huge source of latency.<br>

- The ASM version puts the main loop in the middle of the function,<br>

leading to a very large jump when writing small data. GCC puts it at<br>

the end, which is more optimal (basically, a large copy can afford to<br>

eat the lag from the jump, but a small copy can't).<br>

- When copying large data it's better to force alignment on both src<br>

and dst, even though you have to do some math.<br>

- Stores should be done with MOVNTDQ rather than MOVDQA. MOVNTDQ<br>

avoids cache evictions so you don't have to refill the entire cache<br>

pool after memcpy returns. Note that this would have been an easy<br>

one-line change even for ASM, but I suspect the developer was too<br>

exhausted to experiment.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">These improvements (except code placement for I-cache utilization) have nothing to do with compiler optimization. A programmer can make these mistakes either way, whether writing ASM or C.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

So I'm strongly opposed to ASM unless a C equivalent is completely<br>

impossible. The code that a developer _thinks_ will be fast and the<br>

code that is _actually_ fast are often not the same thing. C makes it<br>

much easier to tweak.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Or rather harder to tweak, since code arrangement is not something the programmer has control over in C.</div></div>
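<div dir="auto"><br></div><div dir="auto">To make two of the memcpy points quoted above concrete (aligning the destination and storing with MOVNTDQ), here is a minimal sketch in C with SSE2 intrinsics. It is an illustration only, not Wine's or msvcrt's code: the name copy_streaming is made up, it aligns only dst and uses unaligned loads for src, and whether non-temporal stores actually win depends on copy size and CPU, so it would have to be measured.</div>

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration; not taken from Wine or msvcrt. */
static void *copy_streaming(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy a short head so that d becomes 16-byte aligned.
     * MOVNTDQ (_mm_stream_si128) requires an aligned destination. */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Main loop: unaligned loads, aligned non-temporal stores that
     * bypass the cache instead of evicting existing lines. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }

    /* Non-temporal stores are weakly ordered; fence before returning
     * so later stores cannot be observed ahead of the copied data. */
    _mm_sfence();

    /* Copy the remaining tail of fewer than 16 bytes. */
    memcpy(d, s, n);
    return dst;
}
```

<div dir="auto">The same source can be built with several compilers (e.g. with -O2 -S) to compare the generated assembly, which is the workflow suggested earlier in this mail.</div>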