wineport: Add support for ctz().

Thu Mar 17 03:23:17 CDT 2011

On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
> 
> __builtin_ctz() compiles to:
> mov    0x8(%ebp),%eax
> bsf    %eax,%eax
> 
> (ffs()-1) compiles to:
> mov    $0xffffffff,%edx
> bsf    0x8(%ebp),%eax
> cmove  %edx,%eax
...
> 
> So yes, there is a reason, ctz() is at least 50% faster.

I'm not sure where you get 50% from!

I've read both the Intel and AMD x86 instruction performance manuals
(but can't claim to remember all of it!).
The 'bsf' will be a slow instruction (with constraints on where it
executes, and what can execute in parallel).
The 'cmove' has even worse constraints since it can't execute until
the 'flags' from the previous instruction are known.
cmove is only slightly better than a mis-predicted branch!
In this case there will be a complete pipeline stall between the 'bsf'
and the 'cmove'.

ffs would probably execute faster with a forwards conditional branch 
(predicted not taken) in the 'return -1' path.
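
As a rough sketch of that branch-based variant (assuming gcc-style
builtins; the name ffs_branch is made up for illustration and is not
anything in wineport), something like the following gives the same
result as ffs()-1, with the zero case moved into a branch the compiler
is told to treat as unlikely. Whether the compiler really emits a
forward branch here, and whether it beats the cmove version, would
need checking against the generated code and a benchmark.

```c
/* Hypothetical helper: same result as ffs(x) - 1, written with an
 * explicit branch for the zero case rather than relying on cmove. */
static int ffs_branch(unsigned int x)
{
    /* __builtin_expect hints that x == 0 is the unlikely path, so the
     * compiler may lay out 'return -1' as a not-taken forward branch. */
    if (__builtin_expect(x == 0, 0))
        return -1;

    /* x is non-zero here, so __builtin_ctz() is well defined
     * (its result is undefined for an argument of 0). */
    return __builtin_ctz(x);
}
```

Note that plain __builtin_ctz() has no defined result for 0, so the
two sequences quoted above are not doing quite the same job.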

	David

-- 
David Laight: david at l8s.co.uk