[PATCH v2 1/1] ntdll: Fix reading stale Information from IOSB.

Mon May 30 21:10:04 CDT 2022

On 5/30/22 20:23, Jin-oh Kang wrote:
> On Tue, May 31, 2022, 7:52 AM Zebediah Figura wrote:
> 
>> On 5/30/22 14:20, Jinoh Kang wrote:
>>> +
>>> +    /* GetOverlappedResultEx() skips waiting for the event/file handle
>> if the
>>> +     * Status field indicates completion, and returns the result from
>> the
>>> +     * Information field.
>>> +     *
>>> +     * Hence, we have to ensure that writes to the Information field
>> are seen
>>> +     * from the other threads *before* writes to the Status field.
>>> +     *
>>> +     * Otherwise, a race condition results: if writing the Status field
>> has
>>> +     * been completed but the subsequent update to the Information
>> field has
>>> +     * not, the application may end up reading stale value from it.
>>> +     *
>>> +     * The race condition can be especially problematic with
>> applications that
>>> +     * use the HasOverlappedIoCompleted() macro in a tight loop to
>> busy-wait
>>> +     * for completion of overlapped I/O.
>>> +     */
>>> +    MemoryBarrier();  /* TODO: consider replacing with a store-release.
>> */
>>> +    *(volatile NTSTATUS *)status_ptr = status;
>>>    }
>>>
>>>    static inline client_ptr_t iosb_client_ptr( IO_STATUS_BLOCK *io )
>>
>> Unpaired MemoryBarrier() is pretty much always wrong. (In fact, I think
>> unpaired anything would be, here.)
>>
> 
> By pairing, do you mean that another MemoryBarrier() should follow the
> status write?

No. We need another memory barrier in GetOverlappedResult().

The point of a memory barrier is that if you have two pairs of memory 
operations (read or write) which must be perceived as happening "in 
order" across threads, you place a memory barrier in the middle of each 
pair. That is, if thread 1 executes ops A and B, and thread 2 executes 
ops C and D, you need a barrier between A and B, but also a barrier 
between C and D.

The following image (stolen and adapted from the Linux kernel 
documentation) may be illustrative:

         CPU 1                       CPU 2
         ===============             ===============
         WRITE_ONCE(a, 1);           x = READ_ONCE(b);
         <write barrier>  <--------> <read barrier>
         WRITE_ONCE(b, 2);           y = READ_ONCE(a);

You need to do this because the compiler—and, more importantly, the 
CPU—is free to reorder the instructions on *both* sides. If, as in the 
above example, you need "x == 2" to imply "y == 1", the write barrier 
alone is not enough to achieve this, because the CPU can *also* reorder 
the reads.

The same applies to atomic operations that aren't just RMW. E.g. a 
store-release (to Status) would be correct here, and more efficient than 
a barrier, but it has to be paired with a read-acquire (also of Status) 
in GetOverlappedResult().