[PATCH v3 0/2] MR135: Cancel asyncs queued by thread when the thread is terminated

Fri Jun 3 20:30:09 CDT 2022

On 6/3/22 15:12, Paul Gofman (@gofman) wrote:
> Inspired by the discussion for MR 127 (https://www.winehq.org/pipermail/wine-devel/2022-May/217832.html) and is aimed to replace the second patch (ws2_32: Use allocated IO status block in select().) in that MR.
> 
> When the thread calls server's terminate_thread for itself queuing the system APC though SIGUSR1 might race with the thread exit, there seems no easy and reliable way to be sure that signal is processed before the thread's stack and TEB are freed. Thus I am making the system APCs to be queued to the thread to be processed by server_select() after (terminate_thread) server call.
> 
> When terminate_thread is called for the other thread the system APC will always be delivered through some other thread. Thread stack and TEB are not freed when the thread is killed externally, so completing the async from another thread should probably be fine.

So, to be clear, in the offending program:

(1) thread terminates itself via NtTerminateThread();

(2) thread calls exit_thread(), which means that its TEB and stack are 
queued to be freed but are not actually freed yet;

(3) thread closes its server socket, causing the server to notice and 
kill_thread( thread, 0 );

(4) another thread exits, and the TEB of the first thread is freed

(5) select async is eventually signaled (we never cancel it, although we 
should), and the server ends up queueing SIGUSR1 to some random thread, 
since the original thread is terminated

(6) APC_ASYNC_IO handler in said random thread tries to access the 
original thread's stack and crashes

Do I have that right?

So, with that said, thoughts:

* If a thread dies violently, we can cancel its asyncs from the server 
side [in kill_thread(), probably? or the terminate_thread handler; see 
below], and we don't need to worry about them accessing the stack, since 
the stack is never freed.

* If a thread dies naturally (terminates itself), we can cancel its 
asyncs from the terminate_thread handler, and then, in the client side 
but before calling exit_thread(), do a zero-length wait with 
SELECT_INTERRUPTIBLE, exactly as in v1 of this patch set. My 
understanding is that these alone are actually sufficient to ensure that 
any APC_ASYNC_IO resulting from the async termination are executed by 
the thread itself before it frees its stack. I.e. we don't need any of 
the "queue_only" bits that were in v1. This holds because:

   - terminate_thread doesn't set the state to TERMINATED yet in the 
"self" case;

   - as a result queue_apc() will queue the (system) APC to that 
specific thread;

   - server_select( SELECT_INTERRUPTIBLE ) is actually sufficient to 
process the entire APC queue. It doesn't matter if SIGUSR1 is sent (and 
in this case it will always be—we're in the middle of a server request 
but not a server *select* at the point the APC is queued), SIGUSR1 is 
only needed to wake up a thread that isn't otherwise paying attention.

* There are some other potential bugs in this area I noticed while 
looking around (e.g. we should probably requeue system APCs of a 
terminated thread to another thread instead of dropping them on the 
floor, also apparently cancellation should always wait), but those seem 
orthogonal to the problem here and so can wait.

So to fix the bug it should be sufficient to

(a) cancel all thread asyncs in the terminate_thread handler,

(b) server_select( SELECT_INTERRUPTIBLE ) before calling exit_thread()

Does that sound right? Am I missing anything in my logic above?