[PATCH] msvcrt: Import memmove from musl

Wed Aug 26 10:19:40 CDT 2020

On 26/08/2020 17:01, Gabriel Ivăncescu wrote:
> On 25/08/2020 20:15, Piotr Caban wrote:
>> On 8/22/20 5:10 PM, Gabriel Ivăncescu wrote:
>>> I understand `rep movsl` is faster even in the first test than `rep 
>>> movsb`?
>> No, it was faster in "Non-aligned", "Aligned overlap" and "Non-aligned 
>> overlap" tests. In the "Aligned" case the performance was identical no 
>> matter if movsb or movsl was used.
>>
>> I'm also attaching simple sse2 implementation for comparison. It's 
>> faster than the previous one on my machine. I'm also attaching results 
>> from running the test on Windows (in VM).
>>
>> Thanks,
>> Piotr
> 
> In most cases, the SSE version performs very well, in fact slightly 
> better than the Windows implementation, and does very well for small moves.
> 
> Unfortunately, for some reason, it seems it's quite significantly slower 
> (20% or more) only on the "non-overlapped" case. Attached results.
> 
> Thanks,
> Gabriel

Also, sorry, I forgot to mention one small thing: is there a reason
you're using movdq(a|u) instead of movaps/movups (which are SSE1 rather
than SSE2)? They have a shorter encoding (no mandatory prefix byte),
which should help the instruction cache very slightly, and no CPU cares
about float vs. integer state when doing only moves. Even if some broken
CPU did have a false dependency there, most SSE operations tend to be on
floats anyway, but I doubt any does.
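
To make the suggestion concrete, here is a minimal sketch in C intrinsics rather than the assembly in the patch. The helper names (copy_fwd_movups, copy_fwd_movdqu, copy_variants_agree) are made up for illustration and are not from the attached implementation. It only handles a forward, non-overlapping copy. _mm_loadu_ps/_mm_storeu_ps normally compile to movups (0F 10 / 0F 11), and _mm_loadu_si128/_mm_storeu_si128 to movdqu (F3 0F 6F / F3 0F 7F), although the compiler is free to pick either encoding, so check the disassembly before relying on it.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE1: _mm_loadu_ps / _mm_storeu_ps */

/* Hypothetical helper: forward copy using the float-domain unaligned move. */
static void copy_fwd_movups(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    assert(dst && src);
    /* 16 bytes per iteration; the move does not modify the bit pattern. */
    for (; i + 16 <= n; i += 16)
        _mm_storeu_ps((float *)(dst + i), _mm_loadu_ps((const float *)(src + i)));
    /* Copy the remaining 0..15 bytes one at a time. */
    for (; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: the same loop using the integer-domain unaligned move. */
static void copy_fwd_movdqu(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    assert(dst && src);
    for (; i + 16 <= n; i += 16)
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_loadu_si128((const __m128i *)(src + i)));
    for (; i < n; i++)
        dst[i] = src[i];
}

/* Returns 1 when both variants produce a byte-identical copy of the source. */
static int copy_variants_agree(void)
{
    unsigned char src[67], a[67], b[67];
    size_t i;

    for (i = 0; i < sizeof(src); i++)
        src[i] = (unsigned char)(i * 37u + 11u);
    memset(a, 0, sizeof(a));
    memset(b, 0, sizeof(b));

    /* Offset by one byte so the 16-byte moves are deliberately unaligned. */
    copy_fwd_movups(a + 1, src + 1, sizeof(src) - 1);
    copy_fwd_movdqu(b + 1, src + 1, sizeof(src) - 1);

    return memcmp(a + 1, src + 1, sizeof(src) - 1) == 0 &&
           memcmp(a, b, sizeof(a)) == 0;
}
```

Both loops move the same bytes, so any difference would come down to code size only: each movups/movaps is one byte shorter than the matching movdqu/movdqa because it lacks the F3/66 prefix.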