Fast thread-local storage for OpenGL drivers

Sat Feb 22 12:06:35 CST 2003

On Sat, Feb 22, 2003 at 09:51:26AM -0800, Gareth Hughes wrote:
> Roland McGrath wrote:
> > 
> > These people clearly haven't read all of the TLS paper, or looked at the
> > GCC implementation of __thread long enough to notice -ftls-model and
> > __attribute__ ((tls_model)).
> 
> This is what I was talking about.  I've read the entire document several
> times, and still can't see a way that a dynamically loadable shared library
> can be guaranteed to use the single-instruction Local Exec access model.  If
> I'm wrong, please explain why.
> 
> > I think the TLS document intends to explain what the models mean in
> > practical terms on each architecture, but I can believe it's not all
> > that clear.  The GCC manual doesn't explain the access models and code
> > sequences, just tells you how to tell the compiler what you want in the
> > terms that the TLS document defines.  
> > 
> > If you want maximal flexibility, i.e. to always work with dlopen, then
> > indeed you must use the "dynamic" TLS access models (GD or LD).  You can
> > use the Initial Exec model if you want faster accesses at the cost of some
> > flexibility.
> 
> libGL.so simply has to work with dlopen -- if for no other reason than
> essentially all major 3D games (Quake3, Doom3, UT2003 etc) dlopen libGL.so
> rather than linking with it.  This is not going to change.

Note the "always" in Roland's paragraph: Initial Exec can still work with
dlopen, just not unconditionally.

> > In glibc, we actually allocate some excess space in the thread-local
> > storage area layout determined at startup time.  This lets a dynamically
> > loaded module use static TLS if its PT_TLS segment fits in the available
> > surplus.  (In sysdeps/generic/dl-tls.c, see TLS_STATIC_SURPLUS.)  If there
> > is insufficient space preallocated, then loading the module will fail.  In
> > fact, we put this feature there with GL in mind and can adjust the
> > preallocated surplus for what is most useful in practice.
> 
> I think the set of performance critical thread-local variables is something
> like two or three (depending on the implementation).  The libGL.so API
> dispatcher needs fast access to one or two of these (dispatch table
> pointers), while the driver backend needs fast access to all of them
> (context pointer and dispatch table pointers).  The other thread-local
> variables are generally not accessed in performance-critical situations.

When you say two or three, are these two or three pointers or two or
three large tables?

In any case, it sounds like you could:
 - Select the thread-local variables that you need fast access to.
 - Arrange for those variables to be tagged with
   __attribute__((tls_model("initial-exec"))), or something similar.
 - Make sure TLS_STATIC_SURPLUS is big enough to hold them.
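
Here is a minimal sketch of the first two steps, assuming GCC's __thread
extension. The type and function names (gl_context, gl_dispatch,
make_current and so on) are made up for illustration and are not taken
from any real libGL. Only the two hot pointers are forced to Initial Exec;
the cold variable keeps whatever model the compiler picks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types; a real driver would have its own definitions. */
struct gl_context  { int id; };
struct gl_dispatch { void (*flush)(void); };

/* Hot path: explicitly request the Initial Exec access model, so these
 * are placed in static TLS even when the code is built as a shared
 * library. */
static __thread struct gl_context *current_context
    __attribute__((tls_model("initial-exec")));
static __thread struct gl_dispatch *current_dispatch
    __attribute__((tls_model("initial-exec")));

/* Cold path: no attribute, so the compiler's default model applies
 * (normally one of the dynamic models when compiling with -fPIC). */
static __thread int last_error;

/* Bind a context and its dispatch table to the calling thread. */
void make_current(struct gl_context *ctx, struct gl_dispatch *table)
{
    assert(table != NULL);
    current_context = ctx;
    current_dispatch = table;
    last_error = 0;
}

struct gl_context *get_current_context(void)
{
    return current_context;
}

struct gl_dispatch *get_current_dispatch(void)
{
    return current_dispatch;
}

int get_last_error(void)
{
    return last_error;
}
```

Built with something like "gcc -fPIC -shared", the two tagged pointers
take up static TLS space, so whether a later dlopen succeeds depends on the
surplus Roland describes. That is the third step, and it is on the glibc
side rather than in this code.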

> Another issue I forgot to mention, or forgot to make clear, is that we need
> to be able to access these thread-local variables in runtime generated code.
> A driver's top-level API functions are often generated at runtime, and need
> to be able to do things like switch dispatch tables (obviously, they'd have
> direct access to the context they were associated with, and so wouldn't need
> to go through the pointer in TLS).  Are we guaranteed that the __thread
> variables aren't going to move around?  How would we work out what code to
> generate to access a given __thread variable?

I don't see a problem, but you'd have to do some serious reading of the
TLS ABI documents; they're quite thorough.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer