wineport: Add support for ctz().

Thu Mar 17 12:20:53 CDT 2011

On 03/17/2011 03:23 AM, David Laight wrote:
> On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
>> __builtin_ctz() compiles to:
>> mov    0x8(%ebp),%eax
>> bsf    %eax,%eax
>>
>> (ffs()-1) compiles to:
>> mov    $0xffffffff,%edx
>> bsf    0x8(%ebp),%eax
>> cmove  %edx,%eax
> ...
>> So yes, there is a reason, ctz() is at least 50% faster.
> I'm not where you get 50% from!
>
> I've read both the intel and amd x86 instruction performance manuals
> (but can't clain to remember all of it!).
> The 'bsf' will be a slow instruction (with constraints on where it
> exectutes, and what can execute in parallel).
> The 'cmove' has even worse constraints since it can't execute until
> the 'flags' from the previous instruction are known.
> cmove is only slightly better than a mis-predicted branch!
> In this case there will be complete pipeline stall between the 'bsf'
> and the 'cmove'.
>
> ffs would probably execute faster with a forwards conditional branch
> (predicted not taken) in the 'return -1' path.
>
> 	David
>
For my purposes the 'return -1' path will never be taken, so it's not 
needed.  I want to be able to do:

while (bitmap)
{
     i = ctz(bitmap); /* get the LSB index */
     bitmap ^= 1 << i; /* zero LSB */

     /* for each set bit i... */
}

If there's a faster way to do this I'm all ears.