HLSL offsetting

Wed Jun 8 20:33:02 CDT 2022

The following thread is based partly on, and makes reference to, private 
conversation, but for the sake of openness I've elected to post it to 
wine-devel.

A long time ago, HLSL_IR_LOAD—then called HLSL_IR_DEREF—was this:

         enum hlsl_ir_deref_type
         {
             HLSL_IR_DEREF_VAR,
             HLSL_IR_DEREF_ARRAY,
             HLSL_IR_DEREF_RECORD,
         };

         struct hlsl_deref
         {
             enum hlsl_ir_deref_type type;
             union
             {
                 struct hlsl_ir_var *var;
                 struct
                 {
                     struct hlsl_ir_node *array;
                     struct hlsl_ir_node *index;
                 } array;
                 struct
                 {
                     struct hlsl_ir_node *record;
                     struct hlsl_struct_field *field;
                 } record;
             } v;
         };

         struct hlsl_ir_deref
         {
             struct hlsl_ir_node node;
             struct hlsl_deref src;
         };

Now, one problem with this is that it was kind of mean to register
allocation (RA) and liveness analysis. For example, a line of HLSL like

     var.a.b = 2.0;

produced the following IR:

     2: 2.0
     3: deref(var)
     4: @3.a
     5: @4.b = @2

This is annoying because:

  * to discover that "var" is written, @5 needs to reach upwards through 
a deref chain;

  * reaching through the deref chain requires lots of assert() statements;

  * @3 implies that "var" is read, which it isn't (and, if we reach 
upwards through the deref chain, @4 implies the same thing).
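
To make the first two points concrete, here is a rough sketch of what
"reaching upwards" looks like. The toy_* types are simplified stand-ins
invented for illustration, not the real structures: in particular the
parent is typed as a deref here, whereas in the IR above it was a generic
node, which is where the extra assert() calls come from.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the structures quoted above. */
enum toy_deref_type
{
    TOY_DEREF_VAR,
    TOY_DEREF_ARRAY,
    TOY_DEREF_RECORD,
};

struct toy_var
{
    const char *name;
};

struct toy_deref
{
    enum toy_deref_type type;
    struct toy_var *var;        /* valid for TOY_DEREF_VAR only */
    struct toy_deref *parent;   /* the array or record being indexed */
};

/* Follow the chain upwards until the variable deref is reached. */
static struct toy_var *toy_deref_get_var(const struct toy_deref *deref)
{
    while (deref->type != TOY_DEREF_VAR)
    {
        /* With generic node pointers, each step would also need to
         * assert that the parent really is a deref. */
        assert(deref->parent);
        deref = deref->parent;
    }
    assert(deref->var);
    return deref->var;
}

/* Builds a three-link chain (variable, field, field) like the one the
 * assignment above produces; returns 1 if the walk finds the variable. */
static int toy_example_finds_var(void)
{
    static struct toy_var var = {"var"};
    static struct toy_deref d3 = {TOY_DEREF_VAR, &var, NULL};
    static struct toy_deref d4 = {TOY_DEREF_RECORD, NULL, &d3};
    static struct toy_deref d5 = {TOY_DEREF_RECORD, NULL, &d4};

    return toy_deref_get_var(&d5) == &var;
}
```

Every pass that cares which variable is written has to repeat a walk like
this one, and the intermediate links still look like reads of "var".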

I proposed that instead of using generic node pointers, we could have
arbitrarily long deref chains encoded in the hlsl_deref structure 
itself. [1]

There was some discussion of that, mostly concentrated in that thread
and on IRC. Most of the concern was about being nicer to liveness
analysis and RA.

What ultimately ended up happening is that Matteo proposed numeric 
(register) offsets calculated at parse time, which is fundamentally 
similar to my idea except that it's a lot simpler to work with.

Interestingly, the problem of multiple register sets was brought up [2]:

     From my testing it essentially does, yes, i.e. if you have

     struct { int unused; float f; bool b; } s;
     float4 main(float4 pos : POSITION) : POSITION
     {
         if (s.b)
             return s.f;
         return pos;
     }

     then "s" gets allocated to registers b0-b2 and c0-c1, but only b2
     and c1 are ever used.

     So yeah, it makes things pretty simple. I can see how it would have
     been a lot uglier otherwise.

I guess we've finally run into that ugliness now :-(

The ultimate conclusions to draw from this historical exercise are:

- what I said about "we used to have derefs handled like that" is mostly
correct, although not quite. We did use to have richer type
information, and we did decide that offsets calculated at parse time
were preferable to that type information, although I thought we at one
point had something like [1] in the tree, which we didn't. Anyway, the
decision to use offsets calculated at parse time seems to have been
motivated only by simplicity. To be fair, at the time, it *was* simpler.

- [1] and the later patch that replaced it were mostly motivated by RA. 
We will probably end up doing RA after SMxIR translation, but we may 
very likely do RA *before* it as well (tracking e.g. SMx instructions 
with register numbers instead of having def-use chains.) A more salient 
concern is that I still don't like the idea of having instructions in 
the tree that aren't actually translated (or translatable) to SMxIR, 
which means that we shouldn't have instructions that yield e.g. structs.

The ugliness that we've run into is: how do we emit IR for the following 
variable load?

     struct apple
     {
         int a;
         struct
         {
             Texture2D b;
             int c;
         } s;
     } a;

     /* in some expression */
     func(a.s);

Unlike the SM1 example above, the register numbers don't match up. 
Separately, it's kind of ugly that backend-specific details regarding 
register size and alignment are leaking into the frontend so much. 
Similarly, the amount of code that has to deal with matrix majority is 
unfortunate.

The former problem can potentially be solved by embedding multiple
register offsets into hlsl_deref (one per register type). Neither this
nor the latter problem is prohibitive, and I was at one point in favour
of continuing to use register offsets everywhere, but at this point my
feeling has changed, and I think using register offsets is looking
uglier than the alternatives. I get the impression that Francisco
disagrees, though, which is why we should probably hash this out now.
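
For reference, a deref with one offset per register type might look
roughly like the sketch below. Everything named toy_* is hypothetical,
and the sizes assume a deliberately crude layout rule (every numeric
field starts on its own register, every texture takes one texture
register). It is only meant to show why a single offset cannot describe
"a.s" in the struct apple example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register sets; the names are illustrative only. */
enum toy_regset
{
    TOY_REGSET_NUMERIC,
    TOY_REGSET_TEXTURE,
    TOY_REGSET_COUNT,
};

struct toy_field
{
    const char *name;
    /* Registers the field occupies in each set (assumed layout rule). */
    unsigned int reg_size[TOY_REGSET_COUNT];
};

/* A deref carrying one constant offset per register set. */
struct toy_deref
{
    const char *var_name;
    unsigned int offset[TOY_REGSET_COUNT];
};

/* The offset of fields[index] is the sum of the sizes of the fields
 * before it, computed independently for each register set. */
static void toy_field_offsets(const struct toy_field *fields, size_t count,
        size_t index, unsigned int offset[TOY_REGSET_COUNT])
{
    size_t i;
    int set;

    assert(index < count);
    for (set = 0; set < TOY_REGSET_COUNT; ++set)
    {
        offset[set] = 0;
        for (i = 0; i < index; ++i)
            offset[set] += fields[i].reg_size[set];
    }
}

/* "struct apple": "a" is one int; "s" holds one texture and one int. */
static unsigned int toy_apple_s_offset(enum toy_regset set)
{
    static const struct toy_field apple_fields[] =
    {
        {"a", {1, 0}},
        {"s", {1, 1}},
    };
    struct toy_deref deref = {"a", {0, 0}};

    toy_field_offsets(apple_fields, 2, 1, deref.offset);
    return deref.offset[set];
}
```

Under those assumptions "a.s" starts at numeric register 1 but texture
register 0, so the deref needs both numbers.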

Nor do I think we should use both register offsets and component offsets 
(either in the same node type, or in different node types). That just 
makes the IR way more complicated. Rather, I think we should be doing 
everything in *just* component offsets until translation from HLSL IR to 
SMx IR.

In order to deal with the problem of translating dynamic offsets from 
components to registers, I see three options:

(a) emit code that does the conversion at runtime, or do some
sophisticated lowering,

(b) use special offsetof and sizeof nodes,

(c) introduce a structured deref type, much like [1]. Francisco was 
actually proposing something like this, although with an array instead 
of a recursive structure, which strikes me as an improvement.

My guess is that (a) is very hard. I haven't really tried to reason it 
out, though.

Given a choice between (b) and (c), I'm more inclined to pick (c). It 
makes the IR structure more restrictive, and those restrictions 
fundamentally match the structured nature of the language we're working 
with, both things I tend to like.

Note that either way we're going to need specialized functions to 
resolve deref offsets in one step. I also think that should depend on 
the domain—e.g. for copy-prop we'll actually want to do everything in 
component counts, but when translating to SMxIR we'll evaluate given the 
register alignment constraints of the shader model. In the case of (b) 
it's not going to be as simple as running the existing constant folding 
pass, because we can't actually fold the sizeof/offsetof constants 
(unless we dup the node list, evaluate, and then fold, which seems very 
hairy and more work than the alternative).
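
As a rough illustration of what "resolve in one step, depending on the
domain" could mean for (c), here is a sketch with an array-of-indices
path. All toy_* names are made up, only constant indices are handled,
and the register rule (each numeric value takes exactly one register) is
a placeholder, not the packing rule of any actual shader model.

```c
#include <assert.h>
#include <stddef.h>

enum toy_kind { TOY_NUMERIC, TOY_STRUCT, TOY_ARRAY };
enum toy_domain { TOY_DOMAIN_COMPONENTS, TOY_DOMAIN_REGISTERS };

struct toy_type
{
    enum toy_kind kind;
    unsigned int dimx;                      /* components, TOY_NUMERIC */
    const struct toy_type *const *fields;   /* TOY_STRUCT */
    size_t field_count;
    const struct toy_type *elem;            /* TOY_ARRAY */
    unsigned int elem_count;
};

/* Size of a type in the requested domain. */
static unsigned int toy_type_size(const struct toy_type *type,
        enum toy_domain domain)
{
    unsigned int size = 0;
    size_t i;

    switch (type->kind)
    {
        case TOY_NUMERIC:
            /* Placeholder rule: one register per numeric value. */
            return domain == TOY_DOMAIN_COMPONENTS ? type->dimx : 1;
        case TOY_ARRAY:
            return type->elem_count * toy_type_size(type->elem, domain);
        case TOY_STRUCT:
            for (i = 0; i < type->field_count; ++i)
                size += toy_type_size(type->fields[i], domain);
            return size;
    }
    return 0;
}

/* Resolve a path of constant indices to a single offset in one pass. */
static unsigned int toy_resolve_path(const struct toy_type *type,
        const unsigned int *path, size_t path_len, enum toy_domain domain)
{
    unsigned int offset = 0;
    size_t i, j;

    for (i = 0; i < path_len; ++i)
    {
        if (type->kind == TOY_ARRAY)
        {
            assert(path[i] < type->elem_count);
            type = type->elem;
            offset += path[i] * toy_type_size(type, domain);
        }
        else
        {
            assert(type->kind == TOY_STRUCT && path[i] < type->field_count);
            for (j = 0; j < path[i]; ++j)
                offset += toy_type_size(type->fields[j], domain);
            type = type->fields[path[i]];
        }
    }
    return offset;
}

/* struct { float a; float2 b[3]; float c; } v;  path {1, 2} is v.b[2]. */
static unsigned int toy_example_offset(enum toy_domain domain)
{
    static const struct toy_type float1 = {TOY_NUMERIC, 1, NULL, 0, NULL, 0};
    static const struct toy_type float2 = {TOY_NUMERIC, 2, NULL, 0, NULL, 0};
    static const struct toy_type arr = {TOY_ARRAY, 0, NULL, 0, &float2, 3};
    static const struct toy_type *const fields[] = {&float1, &arr, &float1};
    static const struct toy_type s = {TOY_STRUCT, 0, fields, 3, NULL, 0};
    static const unsigned int path[] = {1, 2};

    return toy_resolve_path(&s, path, 2, domain);
}
```

The same path resolves to component offset 5 for something like
copy-prop and to register offset 3 under the placeholder register rule;
only the size function differs between the two domains.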

I invite thoughts—especially from Matteo, since we discussed this sort 
of problem ages ago.

ἔρρωσθε,
Zeb

[1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html

[2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html