Fast thread-local storage for OpenGL drivers

Fri Feb 21 14:11:16 CST 2003

It is critically important for OpenGL drivers to have fast
(single-instruction) access to thread local variables.  I'd be happy to
provide more information to anyone who's interested, but a typical case
where TLS access can severely hurt performance is at the very front-end of
an OpenGL library.  Ideally, you'd like something like the following:

	libGL.so:
		// This function loads a dispatch table pointer from
		// thread-local storage and jumps through to the
		// backend function (which typically resides in a
		// different shared library).
		glTexCoord2f:
			mov %fs:DISPATCH_TABLE_OFFSET, %eax
			jmp *__glapi_TexCoord2f(%eax) // Points to __my_TexCoord2f

	libGLcore.so:
		// This function copies some data into the OpenGL
		// context, sets some magic flags to record what data
		// was copied, and returns.
		__my_TexCoord2f:
			mov %fs:CONTEXT_OFFSET, %eax
			// Copy 2 floats into the context
			// Set a flag
			ret

All in all, that is two TLS accesses in fewer than ten instructions or so.
Even if you don't understand exactly what's going on here, you can see that
it is important to have fast access to thread-local data.
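
For anyone who reads C more easily than assembly, here is a rough sketch of
the same two-stage pattern.  It is an illustration only, not code from any
real driver: the struct layouts and the names gl_dispatch, gl_context,
sketch_glTexCoord2f, sketch_make_current and DIRTY_TEXCOORD are made up for
this example, and it uses the '__thread' keyword (a GCC/Clang extension)
rather than a raw %fs offset.  How cheaply the compiler can implement each
access depends on the TLS model in use, which this sketch does not address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dispatch table: one function pointer per GL entry point. */
struct gl_dispatch {
    void (*TexCoord2f)(float s, float t);
};

/* Hypothetical per-thread context: current texcoord plus a dirty flag. */
struct gl_context {
    float    texcoord[2];
    unsigned dirty;
};

#define DIRTY_TEXCOORD 0x1u

/* The two thread-local pointers the front end and back end rely on. */
static __thread struct gl_dispatch *current_dispatch;
static __thread struct gl_context  *current_context;

/* Back end ("libGLcore.so" above): copy two floats, set a flag, return. */
void my_TexCoord2f(float s, float t)
{
    struct gl_context *ctx = current_context;   /* TLS access #2 */
    ctx->texcoord[0] = s;
    ctx->texcoord[1] = t;
    ctx->dirty |= DIRTY_TEXCOORD;
}

/* Front end ("libGL.so" above): load the table pointer and jump through. */
void sketch_glTexCoord2f(float s, float t)
{
    current_dispatch->TexCoord2f(s, t);         /* TLS access #1 */
}

/* Bind a dispatch table and a context to the calling thread. */
void sketch_make_current(struct gl_dispatch *d, struct gl_context *c)
{
    assert(d != NULL && c != NULL);
    current_dispatch = d;
    current_context  = c;
}
```

A thread would call sketch_make_current() once and then
sketch_glTexCoord2f() many times per frame, which is why the cost of the
two TLS loads dominates.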

While glibc's new thread library implementation has many benefits,
particularly to application programmers (with support for the new keyword
'__thread', and so on), it basically forces a function call per thread local
variable access for situations like the one I described above.  This is
clearly unacceptable for a high-performance OpenGL driver.  Furthermore, the
glibc developers have been completely unwilling to work with OpenGL driver
developers (Open Source or otherwise) to provide a mechanism to access
thread-local data in a way that meets our performance requirements.

Therefore, I'd like to propose a solution where Wine and the OpenGL driver
cooperate to provide such a TLS access mechanism (at least on x86
platforms).  Wine currently uses %fs to access the Windows Thread
Environment Block (TEB), while glibc uses %gs to access its per-thread data.
With the following patch to Wine's TEB structure:

--- include/thread.h	2002-12-17 16:06:25.000000000 -0500
+++ include/thread.h.new	2003-02-21 14:27:50.000000000 -0500
@@ -116,10 +116,12 @@
     DWORD        alarms;         /* --3 22c Data for vm86 mode */
     DWORD        vm86_pending;   /* --3 230 Data for vm86 mode */
     void        *vm86_ptr;       /* --3 234 Data for vm86 mode */
+
     /* here is plenty space for wine specific fields (don't forget to change pad6!!) */
+    DWORD        pad6[608];      /* --n 238 */
+    DWORD        ogl_data[16];   /* --n bb8 OpenGL driver private data */
 
     /* the following are nt specific fields */
-    DWORD        pad6[624];                  /* --n 238 */
     UNICODE_STRING StaticUnicodeString;      /* -2- bf8 used by advapi32 */
     USHORT       StaticUnicodeBuffer[261];   /* -2- c00 used by advapi32 */
     void        *stack_base;                 /* -2- e0c Base of the stack */

we reserve the 64 bytes from %fs:0xbb8 up to (but not including) %fs:0xbf8
for use by the OpenGL driver.  Any and all OpenGL implementations can use
this area, and we agree that when Wine is present, it leaves this area
untouched.  The question of who allocates the TEB should be pretty
straightforward: when an OpenGL driver is first loaded, if the TEB is
missing, the driver allocates it.  I would imagine that when Wine is running
it would have the chance to allocate the TEB before the OpenGL driver is
loaded, and thus the OpenGL driver wouldn't have to do anything.  The size of
the reserved area should be sufficient, although we can debate that if
required.
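
As a sanity check on the offsets in the patch, the arithmetic can be
verified with a small C sketch.  This is not Wine's actual TEB definition:
teb_tail_sketch is a made-up stand-in that replaces everything before pad6
with 0x238 bytes of filler, and next_field is a placeholder for whatever
follows ogl_data.  Only the pad6 and ogl_data sizes are taken from the diff
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DWORD;   /* 32-bit, as on Windows */

/* Hypothetical stand-in for the tail of the patched TEB. */
struct teb_tail_sketch {
    unsigned char before_pad6[0x238]; /* filler for the fields before pad6 */
    DWORD         pad6[608];          /* 0x238: shrunk from 624 entries    */
    DWORD         ogl_data[16];       /* 0xbb8: OpenGL driver private data */
    DWORD         next_field;         /* 0xbf8: placeholder for what follows */
};

/* 0x238 + 608*4 = 0xbb8, and 0xbb8 + 16*4 = 0xbf8. */
static_assert(offsetof(struct teb_tail_sketch, ogl_data) == 0xbb8,
              "ogl_data should start at offset 0xbb8");
static_assert(offsetof(struct teb_tail_sketch, next_field) == 0xbf8,
              "the field after ogl_data should stay at offset 0xbf8");

/* Size in bytes of the area reserved for the OpenGL driver. */
size_t ogl_data_size(void)
{
    return sizeof(((struct teb_tail_sketch *)0)->ogl_data);
}
```

The point of shrinking pad6 by exactly 16 entries is that every field after
ogl_data keeps its original offset.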

Comments and questions are welcome.  I've CC'ed Brian Paul and Keith Whitwell
of Mesa/DRI fame, as I know they are interested in this issue.  Please CC us
on any replies, as we are not subscribed to the list.

--
Gareth Hughes (gareth at nvidia.com)
OpenGL Developer, NVIDIA Corporation