HLSL offsetting

Thu Jun 9 04:04:53 CDT 2022

First of all, thanks for starting this thread: very nice overview of
the problem and what we tried and figured out over time.

On Thu, Jun 9, 2022 at 3:33 AM Zebediah Figura <zfigura at codeweavers.com> wrote:
>
> The following thread is based partly on, and makes reference to, private
> conversation, but for the sake of openness I've elected to post it to
> wine-devel.
>
> A long time ago, HLSL_IR_LOAD—then called HLSL_IR_DEREF—was this:
>
>          enum hlsl_ir_deref_type
>          {
>              HLSL_IR_DEREF_VAR,
>              HLSL_IR_DEREF_ARRAY,
>              HLSL_IR_DEREF_RECORD,
>          };
>
>          struct hlsl_deref
>          {
>              enum hlsl_ir_deref_type type;
>              union
>              {
>                  struct hlsl_ir_var *var;
>                  struct
>                  {
>                      struct hlsl_ir_node *array;
>                      struct hlsl_ir_node *index;
>                  } array;
>                  struct
>                  {
>                      struct hlsl_ir_node *record;
>                      struct hlsl_struct_field *field;
>                  } record;
>              } v;
>          };
>
>          struct hlsl_ir_deref
>          {
>              struct hlsl_ir_node node;
>              struct hlsl_deref src;
>          };
>
> Now, one problem with this is that it was kind of mean to RA and
> liveness analysis. For example, a line of HLSL like
>
>      var.a.b = 2.0;
>
> produced the following IR:
>
>      2: 2.0
>      3: deref(var)
>      4: @3.a
>      5: @4.b = @2
>
> This is annoying because:
>
>   * to discover that "var" is written, @5 needs to reach upwards through
> a deref chain;
>
>   * reaching through the deref chain requires lots of assert() statements;
>
>   * @3 implies that "var" is read, which it isn't (and, if we reach
> upwards through the deref chain, @4 implies the same thing).
>
> I proposed that instead of using generic node pointers, we could have
> arbitrarily long deref chains encoded in the hlsl_deref structure
> itself. [1]
>
> There was some discussion on that—which is mostly concentrated in that
> thread, and also IRC. Most of the concern is about being nicer to
> liveness analysis and RA.
>
> What ultimately ended up happening is that Matteo proposed numeric
> (register) offsets calculated at parse time, which is fundamentally
> similar to my idea except that it's a lot simpler to work with.
>
> Interestingly, the problem of multiple register sets was brought up [2]:
>
>      From my testing it essentially does, yes, i.e. if you have
>
>      struct { int unused; float f; bool b; } s;
>      float4 main(float4 pos : POSITION) : POSITION
>      {
>          if (s.b)
>              return s.f;
>          return pos;
>      }
>
>      then "s" gets allocated to registers b0-b2 and c0-c2, but only b2
>      and c1 are ever used.
>
>      So yeah, it makes things pretty simple. I can see how it would have
>      been a lot uglier otherwise.
>
> I guess we've finally run into that ugliness now :-(
>
> The ultimate conclusions to draw from this historical exercise are:
>
> - what I said about "we used to have derefs handled like that" is mostly
> correct, although not quite. We did use to have richer type
> information, and we did decide that offsets calculated at parse time
> were preferable to that type information, although I thought we at one
> point had something like [1] in the tree, which we didn't. Anyway the
> decision to use offsets calculated at parse time seems to have been
> motivated only by simplicity. To be fair, at the time, it *was* simpler.

It still is simpler, as a baseline. Unfortunately it's now clear that
in general it comes with a cost: register offsets, i.e. SM-dependent
details, leak into the higher-level IR. I don't think that should
automatically make the current solution null and void, though.

I'll also note that we could potentially support two different ways to
do "derefs", the current one and something similar to the old way (or
one of the replacements mentioned later). Probably not worth the
complexity but it is a possibility.

> - [1] and the later patch that replaced it were mostly motivated by RA.
> We will probably end up doing RA after SMxIR translation, but we may
> very likely do RA *before* it as well (tracking e.g. SMx instructions
> with register numbers instead of having def-use chains.)

Yes. We should at the very least keep the door open to that.

> A more salient
> concern is that I still don't like the idea of having instructions in
> the tree that aren't actually translated (or translatable) to SMxIR,
> which means that we shouldn't have instructions that yield e.g. structs.

It seems like a fair design choice.

>
> The ugliness that we've run into is: how do we emit IR for the following
> variable load?
>
>      struct apple
>      {
>          int a;
>          struct
>          {
>              Texture2D b;
>              int c;
>          } s;
>      } a;
>
>      /* in some expression */
>      func(a.s);
>
> Unlike the SM1 example above, the register numbers don't match up.
> Separately, it's kind of ugly that backend-specific details regarding
> register size and alignment are leaking into the frontend so much.

I think most of that can be hidden or contained with some proper
abstraction (and generous handwaving).
Basically, the load above could probably be represented in the IR as
separate copies of the individual fields of the structure, rather than
as a single "struct deref". Clearly it can become more complex depending
on the type of the variable, but I think it should be doable.

> Similarly, the amount of code that has to deal with matrix majority is
> unfortunate.

Personally, I find that one more annoying. Although it's not clear to
me that handling matrix majority at a later stage is necessarily any
better.

> The former problem can potentially be solved by embedding multiple
> register offsets into hlsl_deref (one per register type). Neither this
> nor the latter problem is prohibitive, and I was at one point in favour
> of continuing to use register offsets everywhere, but at this point my
> feeling has changed, and I think using register offsets is looking more
> ugly than the alternatives. I get the impression that Francisco
> disagrees, though, which is why we should probably hash this out now.

As I mention below, I currently see two options as the most appealing.
This one (multiple register offsets) sits somewhat in the middle and
it feels like it would be best to go to one of the extremes instead.
It's also possible that this middle ground solution would end up being
nicer in practice. At any rate, I certainly wouldn't flat out discount
it.
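
To make the "one offset per register type" idea a bit more concrete,
here is a rough sketch. Everything in it is made up for illustration:
the toy_* names, the two register sets and the assumption that every
field takes exactly one register are mine, not actual vkd3d-shader code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register sets; a real backend would have more. */
enum toy_regset { TOY_REGSET_NUMERIC, TOY_REGSET_TEXTURE, TOY_REGSET_COUNT };

/* One offset per register set, as it could be embedded in a deref. */
struct toy_offsets { unsigned int reg[TOY_REGSET_COUNT]; };

/* A flattened field: the register set it lives in and its size in registers. */
struct toy_field { enum toy_regset set; unsigned int reg_count; };

/* Offsets of field "index": for each register set, sum the sizes of all
 * preceding fields that live in that set. */
static struct toy_offsets toy_field_offsets(const struct toy_field *fields,
        size_t count, size_t index)
{
    struct toy_offsets o = {{0}};
    size_t i;

    assert(index < count);
    for (i = 0; i < index; ++i)
        o.reg[fields[i].set] += fields[i].reg_count;
    return o;
}

/* "struct apple" from the quoted example, flattened to a, s.b, s.c.
 * Assumption: one register per field. */
static const struct toy_field toy_apple[] =
{
    {TOY_REGSET_NUMERIC, 1}, /* int a */
    {TOY_REGSET_TEXTURE, 1}, /* Texture2D s.b */
    {TOY_REGSET_NUMERIC, 1}, /* int s.c */
};
```

Under those assumptions, toy_field_offsets(toy_apple, 3, 1) (the start
of "a.s") gives numeric offset 1 but texture offset 0, which is exactly
the mismatch a single register offset cannot express.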

> Nor do I think we should use both register offsets and component offsets
> (either in the same node type, or in different node types). That just
> makes the IR way more complicated. Rather, I think we should be doing
> everything in *just* component offsets until translation from HLSL IR to
> SMx IR.

I touched on this earlier and I agree that the additional complexity
is unlikely to be worth it. Admittedly we're in a limbo right now
where SMxIR isn't quite there yet, which makes reasoning on some of
these details a bit fuzzy.

> In order to deal with the problem of translating dynamic offsets from
> components to registers, I see three options:
>
> (a) emit code at runtime, or do some sophisticated lowering,
>
> (b) use special offsetof and sizeof nodes,
>
> (c) introduce a structured deref type, much like [1]. Francisco was
> actually proposing something like this, although with an array instead
> of a recursive structure, which strikes me as an improvement.
>
> My guess is that (a) is very hard. I haven't really tried to reason it
> out, though.
>
> Given a choice between (b) and (c), I'm more inclined to pick (c). It
> makes the IR structure more restrictive, and those restrictions
> fundamentally match the structured nature of the language we're working
> with, both things I tend to like.

After giving it some thought I think that's certainly fine *for the
higher level IR*. At the same time it seems to me that, if we go that
route, eventually we also want to have real SMxIR with register
offsets, and make sure that we can optimize constant offsets (thus
expressions) at that level.

As I see it (as of current time and date, can't guarantee that I won't
change my mind again...) we either push the backend-specific info up
(register offsets all the way) or down (component offsets with
structured deref / type info in the generic IR, transformation into
register offsets in the SMxIR). I think either option works and it's
mostly a matter of preference and which one fits / feels better with
the rest of the compiler.
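
For the "down" option, this is roughly what I have in mind: the deref
carries a path of indices, and the same path is resolved either in
components or in registers depending on who is asking. Again, just a
sketch: the toy_* types are hypothetical, and the register rule used
here (every vector occupies one whole register) is a simplifying
assumption, not the real packing rules of any shader model.

```c
#include <assert.h>
#include <stddef.h>

enum toy_kind { TOY_VECTOR, TOY_ARRAY, TOY_STRUCT };

struct toy_type
{
    enum toy_kind kind;
    unsigned int dimx;                    /* TOY_VECTOR: component count */
    const struct toy_type *elem;          /* TOY_ARRAY: element type */
    unsigned int elem_count;              /* TOY_ARRAY: element count */
    const struct toy_type *const *fields; /* TOY_STRUCT: field types */
    unsigned int field_count;             /* TOY_STRUCT: field count */
};

/* Size of a type in components (regs == 0) or in registers (regs != 0). */
static unsigned int toy_size(const struct toy_type *t, int regs)
{
    unsigned int i, size = 0;

    switch (t->kind)
    {
        case TOY_VECTOR:
            return regs ? 1 : t->dimx;
        case TOY_ARRAY:
            return t->elem_count * toy_size(t->elem, regs);
        case TOY_STRUCT:
            for (i = 0; i < t->field_count; ++i)
                size += toy_size(t->fields[i], regs);
            return size;
    }
    return 0;
}

/* Walk a path of array / field indices and return the resulting offset,
 * in the domain selected by "regs". */
static unsigned int toy_resolve(const struct toy_type *t,
        const unsigned int *path, unsigned int path_len, int regs)
{
    unsigned int i, j, offset = 0;

    for (i = 0; i < path_len; ++i)
    {
        if (t->kind == TOY_ARRAY)
        {
            assert(path[i] < t->elem_count);
            offset += path[i] * toy_size(t->elem, regs);
            t = t->elem;
        }
        else
        {
            assert(t->kind == TOY_STRUCT && path[i] < t->field_count);
            for (j = 0; j < path[i]; ++j)
                offset += toy_size(t->fields[j], regs);
            t = t->fields[path[i]];
        }
    }
    return offset;
}

/* struct { float2 a; struct { float3 b; float c; } s[2]; } */
static const struct toy_type toy_float1 = {TOY_VECTOR, 1, NULL, 0, NULL, 0};
static const struct toy_type toy_float2 = {TOY_VECTOR, 2, NULL, 0, NULL, 0};
static const struct toy_type toy_float3 = {TOY_VECTOR, 3, NULL, 0, NULL, 0};
static const struct toy_type *const toy_inner_fields[] = {&toy_float3, &toy_float1};
static const struct toy_type toy_inner = {TOY_STRUCT, 0, NULL, 0, toy_inner_fields, 2};
static const struct toy_type toy_inner_array = {TOY_ARRAY, 0, &toy_inner, 2, NULL, 0};
static const struct toy_type *const toy_outer_fields[] = {&toy_float2, &toy_inner_array};
static const struct toy_type toy_outer = {TOY_STRUCT, 0, NULL, 0, toy_outer_fields, 2};
```

With the path {1, 1, 1} (i.e. "s[1].c"), resolving in components gives
2 + 4 + 3 = 9, while resolving in registers gives 1 + 2 + 1 = 4. The
generic IR would only ever see the path; the register number shows up
when lowering to SMxIR.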

> Note that either way we're going to need specialized functions to
> resolve deref offsets in one step. I also think that should depend on
> the domain—e.g. for copy-prop we'll actually want to do everything in
> component counts, but when translating to SMxIR we'll evaluate given the
> register alignment constraints of the shader model. In the case of (b)
> it's not going to be as simple as running the existing constant folding
> pass, because we can't actually fold the sizeof/offsetof constants
> (unless we dup the node list, evaluate, and then fold, which seems very
> hairy and more work than the alternative).

Right, each option will have different tradeoffs WRT optimization
passes. But e.g. copy-prop should be doable even with register
offsets, we "just" need to make sure to always map the component
offsets to their respective register offsets.
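
As an illustration of that mapping, something like the following would
let a copy-prop pass keep tracking values per component even when the
deref itself stores a register offset plus a writemask. This is a
sketch under the assumption of 4-component registers; the function name
and signature are hypothetical.

```c
#include <assert.h>

#define TOY_REG_COMPONENTS 4

/* Store in "slots" the flat component index of every component selected by
 * "writemask" in register "reg_offset". Returns the number of slots written
 * (at most TOY_REG_COMPONENTS). */
static unsigned int toy_reg_to_components(unsigned int reg_offset,
        unsigned int writemask, unsigned int slots[TOY_REG_COMPONENTS])
{
    unsigned int i, count = 0;

    assert(!(writemask & ~0xfu));
    for (i = 0; i < TOY_REG_COMPONENTS; ++i)
    {
        if (writemask & (1u << i))
            slots[count++] = reg_offset * TOY_REG_COMPONENTS + i;
    }
    return count;
}
```

For instance, register offset 2 with writemask .yz (0x6) maps to flat
components 9 and 10.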

> I invite thoughts—especially from Matteo, since we discussed this sort
> of problem ages ago.

Yep, hope that my comments make sense. I want to hear from the others too.

>
> ἔρρωσθε,
> Zeb
>
>
> [1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html
>
> [2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html
>