Optimizing synchronization objects
Daniel Santos
daniel.santos at pobox.com
Sun Sep 13 17:47:44 CDT 2015
Alexandre,
First off I apologize for the size of this email, I'm trying to keep it
as concise as possible.
I've been experimenting with ways to optimize synchronization objects
and have implemented a promising proof of concept for semaphores using
glibc's nptl (posix) semaphore implementation. I posted revision 3 of
this today, although I appear to have used the wrong msg id in the
--in-reply-to header. :( So my goal is to eventually make similar
optimizations for all synchronization objects, or at least those that
have demonstrable performance problems.
The basic theory of operation is that when a client sends a
create_semaphore, the server creates a posix semaphore with a unique
name, which it passes to the client process so that it can open it
locally. This allows the client to perform ReleaseSemaphore without a
server call as well as WaitFor(Multiple|Single)Object(s) for cases where
the wait condition can be determined to be satisfied without a server
call (i.e., either bWaitAll = FALSE and a signalled semaphore is found
in the handle list prior to any non-semaphore object, or bWaitAll = TRUE
and all handles are signalled semaphores). For all other conditions, it
uses a traditional server call.
However, it has two problems:
1. It relies on glibc's implementation of POSIX semaphores, which shares
them with other processes via shared memory, and
2. It relies on glibc's implementation of POSIX semaphores, which are
incompatible between 32- and 64-bit ABI processes.
I have not been able to find any more flaws in the case where the program
and the wineserver have the same ABI. All tests pass and I've added one more
(although more tests are clearly needed). Since this implementation only
uses sem_trywait (and never sem_wait or sem_timedwait), we don't really
even need a full-featured semaphore -- a simple 32- or 16-bit number
that's accessed atomically would suffice as a replacement. Although I
did plan to eventually explore having a client program block w/o calling
the server, the benefit of that is minimal compared to the benefit of
being able to avoid the server call for releasing a semaphore and
"wait"ing when the semaphore is already available.
So now I want to understand the minimum threshold of acceptability in
wine for such a mechanism. We discussed this quite a bit in chat, and I
can see many possibilities, each with its own particular issues. I'm
listing them in order of my personal preference (most preferred first).
Option 1: Simple shared memory & roll our own semaphore
Similar to what glibc's NPTL semaphores are doing, except that we would
only need a single integral value and not even a futex. The obvious
downside is that a process can corrupt this memory and cause dysfunction
of other processes that also have semaphores in that page. This could be
minimized by giving every process its own page that is only shared
between the server and the process unless a semaphore in that process is
shared with another program, at which time the memory page could be
shared with that process as well. Thus, the scope of possible corruption
is limited to how far you share the object.
In the worst case of memory corruption, the wineserver would either
leave a thread of one of these processes hung, release one when it
shouldn't be released, or detect that the memory is corrupted, issue
an error message, set the last error to something appropriate and return
WAIT_FAILED.
Option 2: System V semaphores
On Linux, these are hosted in the kernel, so you can't just accidentally
overwrite them. They will be slightly slower than shared memory due to
the system call overhead. You probably know them better than I, but at
the risk of stating the obvious, the following are their limitations.
Their max value on Linux is SHRT_MAX (SEMVMX), so any request for a higher
lMaximumCount would have to be clipped. There are also limits on Linux
that can be adjusted by root if needed for some application, but the
defaults are a maximum of 32000 (SEMMNS) total semaphores on the system,
128 (SEMMNI) semaphore sets and a max of 250 (SEMMSL) semaphores
per set. They are also persistent, so if the wineserver crashes, they
can leave behind clutter.
Option 3: Move semaphores completely into the client
In this scenario, the wine server can never be exposed to corrupted
data. It is very fast when locking can be performed in the client, but
very complicated and potentially slower for mixed locks. Calls to
WaitForMultipleObjectsEx containing both semaphores and other objects
(especially with bWaitAll = TRUE) may require multiple request/reply
cycles to complete. The client must successfully lock the semaphores
prior to the server calling satisfied on the server-side objects.
Here is an optimistic use case that only requires a single request/reply
cycle:
1. WaitForMultipleObjectsEx is called with bWaitAll = TRUE and a mix of
semaphores and other objects
2. Client calls trywait on all semaphores, which succeeds.
3. Client passes request to server (with semaphore states) and blocks
on pipe
4. Server gets value of all server-side objects and determines that the
condition can be satisfied, so calls satisfied on all objects
5. Server sends response to client
6. Client wakes up and completes the wait call.
Here is a slightly less optimistic case:
1. WaitForMultipleObjectsEx is called with bWaitAll = TRUE and a mix of
semaphores and other objects
2. Client calls trywait on all semaphores, which fails on one semaphore.
3. Client rolls back locks on all which had succeeded.
4. Client passes request to server (with semaphore states)
5. Client blocks on the semaphore that was locked and the server pipe
6. Server updates thread status (blocking on native object)
7. Semaphore is signaled and client wakes up
8. Lock is obtained on semaphore that was previously locked
9. Client now calls trywait on the remaining semaphores, which succeeds
this time.
10. Client sends update to server and blocks on pipe
11. Server checks all server-side objects, which are all signaled, so
calls satisfied on all objects
12. Server updates thread status and notifies client
13. Client wakes up and completes wait call.
As you can see, this can get more complicated. If the server discovers
that a server object isn't signaled, it will have to notify the client
to roll back the locks and wait for the server objects to be ready.
So which of these solutions is most appealing to you?