ntdll: Use lockfree implementation for get_cached_fd.

Sat May 31 14:13:55 CDT 2014

Hi Ken,

thanks for taking a look at it and for the detailed answer.

[...]
> 
> Thanks for working on this.  Interesting stuff.
> 
> I think there are still some issues.
> 
> In order to ensure a consistent view of the ordering of memory operations, memory barriers have to be used in both the writer and the reader.  The writer issues a write barrier after it has written and the reader issues a read barrier before it reads.
> 
> The appropriate barriers are implicit in interlocked_xchg_add() for incrementing fd_cache_epoch, but the first read of it in get_cached_fd() has no read barrier before it.  I think this is benign, though, because I think this can only result in the code thinking the epoch changed when it hasn't.  It may retry unnecessarily.

get_cached_fd() will only have to run multiple iterations if a second
thread updated the cache in the meantime. As it is more likely to have a
cache hit I would assume that its okay, when fd_cache_epoch returns a
cached value - but of course this depends on the exact program you're
running.

> More problematic is the writing of the fd and its attributes.  There are barriers implicit in the interlocked_xchg() which clears the fd, but there's no write barrier after the final write to it.  There's no read barrier before the first read of them.
> 
> You mention the order of writing, attributes first then fd, as being important.  But there's no barrier in between the two operations, so either the compiler or the CPU could reorder either the writes or the reads, thwarting your design.  The compiler can also reorder the reading of the fd_cache_epoch with respect to the reading of the fd or attributes.  I think you can get this as the effective sequence:
> 
> thread1 / get_cached_fd / read fd, get old fd
> thread2 / add_fd_to_cache / clear fd
> thread2 / add_fd_to_cache / increment epoch
> thread2 / add_fd_to_cache / set fd
> thread1 / get_cached_fd / read epoch
> thread1 / get_cached_fd / check if epoch has changed, conclude it hasn't
> thread1 / get_cached_fd / return old fd
> 
> I think it's sufficient to have a write barrier after writing the fd and then a read barrier just before reading it.  I _think_ that this obviates the need for the barrier at the bottom of the do-while loop.  That is, you can just move that barrier up a line.  I'm not 100% confident of my analysis of that.  (Well, frankly, I'm never 100% certain of my analysis of this sort of code.)

I have to agree that this is a bit tricky - but I don't think that
moving the line one up would completely solve it. The compiler could
still reorder (read epoch) and (read attributes), which could result in
using wrong attributes. Moreover it there is no barrier at the original
place it could still reorder (read epoch 2) and (read fd). To be
completely sure it would be necessary to add compiler memory barriers
everywhere inbetween ... which would make the code look very ugly... xD

Nevertheless, the example above couldn't happen that way: As
add_fd_to_cache() is only used to set the file descriptor, not to
replace it, the old fd would be (-1) in this case. ;)

> 
> In the email thread regarding Daniel Horn's patch, I suggested an approach using interlocked_cmpxchg() to check if the fd had changed after reading the attributes.  Was there a reason you rejected that approach?  You mentioned an approach you experimented with using interlocked_cmpxchg64().  What was that about?  Were you treating the fd and attributes combined as a single 64-bit value and then checking them both?  Was that safer than my suggestion for some reason?
> 
> If the only advantage is a slight performance improvement (especially in light of the huge difference that either approach has to the status quo, judging by your chart), then I'd argue for the simpler approach that's more easily analyzed for correctness.
> 

My approach was indeed treating both the fd and the attributes as a
single 64-bit number. This way we can read/write both fd and attributes
at the same time, which definitely solves all multithreading issues. I
still think that this is probably the best way to do it - with the only
disadvantage that it doesn't work on PowerPC. :/ I have attached a patch
showing this approach to the mail.

Using interlocked_cmpxchg() to verify only the fd doesn't seem to be
safe to me:

thread1 / get_cached_fd / read fd and attributes
thread2 / release old handle
thread2 / allocate new object with the same handle (unlikely, but ...)
thread1 / get_cached_fd / compares fd one more time -> looks like
everything okay

This basically means that applications could assume that wrong
attributes are associated with a specific file handle. I know that this
is very unlikely (especially because the handles contain the generation
in the upper 16bit), but it is still possible... Or is there any reason
why we can be sure that this doesn't happen?

> 
> If the above issue is addressed, that leaves one remaining problem: get_cached_fd() can encounter a false negative:
> 
> thread1 / add_fd_to_cache / set fd to -1
> thread1 / add_fd_to_cache / increment epoch
> thread2 / get_cached_fd / read epoch
> thread2 / get_cached_fd / read attributes
> thread2 / get_cached_fd / read fd, gets -1
> thread2 / get_cached_fd / check if epoch has changed, conclude it hasn't
> thread1 / add_fd_to_cache / set attributes
> thread1 / add_fd_to_cache / set fd
> thread2 / get_cached_fd / return -1
> thread2 / server_get_unix_fd / see get_cached_fd as having failed since it returns -1
> thread2 / server_get_unix_fd / do get_handle_fd server request
> thread2 / server_get_unix_fd / call add_fd_to_cache to cache the new fd
> 
> I think you want to try get_cached_fd() again, inside the critical section, if the first call fails.

This should be fixed in try 2.

>> > * Replace interlocked_xchg() with an assert(). The original code looks
>> > like its valid that the fd is nonzero at this point, but in fact this
>> > would cause releasing a file descriptor which is still in use.
> I'll leave it to others to comment on that.
>
>> > Replacing with an assert() also saves a couple of CPU cycles.
> I understand that this is an attempt to improve performance, but I
don't think it's necessary to squeeze every last CPU cycle out of this code.

This was not done to improve the performance, but instead to detect
errors more easily. If the fd is already set it always means that
something is going wrong. The current code hides these kind of errors by
just closing the file handle and replacing it - that was also the reason
why I assumed one call to get_cached_fd() is sufficient, as it looks
like the code handles already the case if fd is set - but in fact it
will close a handle which is still in use...

> 
> -Ken
> 

What do you think about the alternative (slightly slower, but much
easier to verify) solution attached to this mail?

Regards,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ntdll-Use-lockfree-implementation-for-get_cached_fd.patch
Type: text/x-patch
Size: 4589 bytes
Desc: not available
URL: <http://www.winehq.org/pipermail/wine-devel/attachments/20140531/f23ca097/attachment.bin>