[RFC PATCH v4 1/5] server: Allow calling async_handoff() with status code STATUS_ALERTED.

Thu Feb 3 23:50:35 CST 2022

On 2/4/22 03:20, Zebediah Figura wrote:
> On 2/3/22 09:28, Jinoh Kang wrote:
>> On 2/3/22 08:45, Zebediah Figura wrote:
>>> On 1/29/22 01:47, Jinoh Kang wrote:
>>
>> ...
>>
>>>> diff --git a/server/async.c b/server/async.c
>>>> index 7aef28355f0..e169bb23225 100644
>>>> --- a/server/async.c
>>>> +++ b/server/async.c
>>>> @@ -338,6 +338,30 @@ obj_handle_t async_handoff( struct async *async, data_size_t *result, int force_
>>>>            return async->wait_handle;
>>>>        }
>>>>    +    if (!async->pending && get_error() == STATUS_ALERTED)
>>>
>>> Why "!async->pending"?
>>
>> The rationale is twofold:
>>
>> 1. async_handoff() can be called with "async->pending = 1".
>>     Notably the device manager and some sock IOCTLs call set_async_pending() before returning.
>>     Also, there exists another conditional statement below which explicitly tests for "async->pending".
> 
> Yes, but they don't set the last error to STATUS_ALERTED, do they?

They don't (yet), hence the idea of "assert()".
I suppose it's better to quit early so it'd be easiler to debug if STATUS_ALERTED _somehow_ finds its way to sneak in and screw up everything due to missing APC...

> 
> Anything that's triggered below should only apply to asyncs whose initial status is known. Consider that "unknown_status" not only means the initial NTSTATUS is unknown, but it also means the initial pending state is unknown.

It appears that existing code dictates that the initial pending state of unknown_status async is, well, "not pending."  More on it below.

> 
>>
>> 2. If "async->pending && get_error() == STATUS_ALERTED", the async will be left in an inconsistent state.
>>
>>     - There exists an established invariant that "async->unknown_status" implies "!async->pending".
>>       It would be invalid that "async->unknown_status && async->pending", since an async
>>       "even the initial state" of which "is not known yet" cannot possibly be queued as STATUS_PENDING.
>>
>>     - Unless "async->direct_result", an APC_ASYNC_IO is required for the async to progress;
>>       Our code path however issues no APC calls, leaving the async in a stale, hard-to-debug state.
>>
>> That said, an alternative would be to simply "assert( !async->pending )" inside the new code path.
>> However, I don't really see a reason to reject STATUS_ALERTED on a pending async; we can let it
>> simply follow the normal path, allowing the async to be terminated with STATUS_ALERTED as usual.
>> I think this best sticks to the principle of least surprise.  It's entirely my own perspective, though...
> 
> If you consider the above, "async->pending" is semantically undefined if async->unknown_status is 1. We shouldn't depend on it being one thing or the other, and checking for that (even in an assert) I think gives the wrong idea that we care.

On the contrary, there are a total of three places in async.c that do rely on the fact that "async->unknown_status" implies "!async->pending".
For example, note the following code in async_set_result():
>        /* don't signal completion if the async failed synchronously
>         * this can happen if the initial status was unknown (i.e. for device files)
>         * note that we check the IOSB status here, not the initial status */
>        if (async->pending || !NT_ERROR( status ))
>        {
>            if (async->data.apc)
>            {
>                apc_call_t data;
>                memset( &data, 0, sizeof(data) );
>                data.type         = APC_USER;
>                data.user.func    = async->data.apc;
>                data.user.args[0] = async->data.apc_context;
>                data.user.args[1] = async->data.iosb;
>                data.user.args[2] = 0;
>                thread_queue_apc( NULL, async->thread, NULL, &data );
>            }
>            else if (async->data.apc_context && (async->pending ||
>                     !(async->comp_flags & FILE_SKIP_COMPLETION_PORT_ON_SUCCESS)))
>            {
>                add_async_completion( async, async->data.apc_context, status, total );
>            }
>
>            if (async->event) set_event( async->event );
>            else if (async->fd) set_fd_signaled( async->fd, 1 );
>        }

There's another one in async_terminate():
>        /* this can happen if the initial status was unknown (i.e. for device
>         * files). the client should not fill the IOSB in this case; pass it as
>         * NULL to communicate that.
>         * note that we check the IOSB status and not the initial status */
>        if (NT_ERROR( status ) && (!is_fd_overlapped( async->fd ) || !async->pending))
>            data.async_io.sb = 0;
>        else
>            data.async_io.sb = async->data.iosb;

(The other one is in async_handoff, but the check is not done if unknown_status is set anyway.)

Notice that they are using "!async->pending" as a covering condition for "async->unknown_status".
So I'd argue it's not semantically undefined at all; the semantics of "unknown_status" actually _entails_ that the async is not "pending".
It turns out this plays a significant role in the lifecycle of IRP-based asyncs.  Assuming the device file is not opened for synchronous I/O, the process is as follows:

1. Initially, queue_irp() marks the IRP async as having unknown status.

2. async_handoff() sees that the unknown_status flag is set, and keeps the wait handle open. The wait handle returned to the client.

3. The client starts to block on the wait handle.

4. Later, the driver signals the initial status of the IRP as STATUS_PENDING (the IRP has been queued internally by the driver).

5. The get_next_device_request handler sets the initial status and pending state of async, and wakes up the client.

6. The client stops waiting and returns the status, which is possibly STATUS_PENDING.

In steps 1-4, the async stays in the "not pending" state.  Had the async been in "pending" state and the I/O failed, step 5 would erroneously trigger APC / IOCP notification mechanism since the failure is treated as asynchronous instead of synchronous.

In this regard, I would argue that the pending flag is best documented as "has the async entered the asynchronous phrase, so that the status would be reported to the client even if it failed?".

In conclusion, we depend on it and we should care.

> 
>>
>>>
>>>> +    {
>>>> +        /* Give the client opportunity to complete synchronously.
>>>> +         * If it turns out that the I/O request is not actually immediately satiable,
>>>> +         * the client may then choose to re-queue the async (with STATUS_PENDING).
>>>
>>> I'd explicitly mention *how* the async should be requeued here.
>>
>> I worded it that way since each I/O operation can have their own way to manage their async queue
>> (e.g. socket operations would use sock->read_q/write_q, device managers would use its own IRP queue,
>> while others may just use fd->read_q/write_q).  I chose to retain the abstractness so that the comment would't get easily stale or out of date.
>>
>> That said, I suppose you specifically meant mentioning how to _transition_ the async to STATUS_PENDING.
>> I agree with you in that case.  I'll add a mention of the new wineserver call here.
> 
> Yeah, that's what I meant :-)
> 
>>
>>>
>>>> +         */
>>>> +        async->unknown_status = 1;
>>>> +
>>>> +        /* Deactivate async so that it can be safely queued.
>>>> +         * Otherwise, the async will be erroneously alerted the next time
>>>> +         * someone calls async_wake_up() on the queue.
>>>
>>> Not to put too much weight on my own perspective, but I think it's clearer to explain *why* to do something, rather than why not.
>>
>> By "why not" did you meant the "otherwise" part?
>>
>>> In this case, what we're doing is mimicking a normal "alert" case, except that we're not using an APC.
>>
>> If I understood correctly, I should just say instead:
>>> Emulate calling async_terminate() with STATUS_ALERTED (request client to complete I/O),
>>> but don't actually queue APC_ASYNC_IO under any circumstances.
>> Is this correct?
> 
> Something more like that, yeah. Although, to contradict my earlier assertion, I think it'd also help to explain what's wrong with using APC_ASYNC_IO here.

-- 
Sincerely,
Jinoh Kang