HLSL offsetting

Thu Jun 9 17:38:58 CDT 2022

On Thu, Jun 9, 2022 at 9:17 PM Zebediah Figura <zfigura at codeweavers.com> wrote:
>
> On 6/9/22 04:04, Matteo Bruni wrote:
> >> The ugliness that we've run into is: how do we emit IR for the following
> >> variable load?
> >>
> >>       struct apple
> >>       {
> >>           int a;
> >>           struct
> >>           {
> >>               Texture2D b;
> >>               int c;
> >>           } s;
> >>       } a;
> >>
> >>       /* in some expression */
> >>       func(a.s);
> >>
> >> Unlike the SM1 example above, the register numbers don't match up.
> >> Separately, it's kind of ugly that backend-specific details regarding
> >> register size and alignment are leaking into the frontend so much.
> >
> > I think most of that can be hidden or contained with some proper
> > abstraction. And generous handwaving.
> > But basically, that probably could be represented in the IR as copying
> > around individual fields of the structure separately, rather than a
> > single "struct deref". Clearly it can become more complex depending on
> > the type of the variable but I think it should be doable.
>
> Yeah, it could. Like I said it's not prohibitive. I'm just not sure it's
> the best option at this point.
>
> It's worth pointing out that, at parse time, we want and need load
> instructions (and therefore probably also store instructions) to have
> larger-than-vector types; that is, load instructions can produce structs,
> and store instructions can consume them. But we don't want that for
> SMxIR, and I believe we don't want that for the "final form" of HLSL IR
> either. That's the way the code is currently arranged and I see no
> reason not to keep it that way.

Right, and agreed, but that's currently taken care of by
split_struct_copies() and I don't think that's a (large) problem for
the register offset approach.
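
To make the "copy individual fields separately" idea concrete, here is a
minimal sketch of splitting one struct-typed copy into per-leaf copies. The
type model and the names (struct type, struct field_copy, split_copy) are
hypothetical and heavily simplified; this is not the actual
split_struct_copies() code. It also assumes that an object such as a
Texture2D counts as a single component.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified type model; not the real compiler structures. */
enum type_kind { TYPE_NUMERIC, TYPE_OBJECT, TYPE_STRUCT };

struct type
{
    enum type_kind kind;
    unsigned int component_count;     /* TYPE_NUMERIC only: vector width. */
    size_t field_count;               /* TYPE_STRUCT only. */
    const struct type *const *fields; /* TYPE_STRUCT only. */
};

/* One vector-sized (or object-sized) copy produced by the split. */
struct field_copy
{
    unsigned int component_offset; /* From the start of the variable. */
    const struct type *type;
};

/* Assumption: an object occupies exactly one component. */
static unsigned int type_get_component_count(const struct type *type)
{
    unsigned int count = 0;
    size_t i;

    if (type->kind == TYPE_NUMERIC)
        return type->component_count;
    if (type->kind == TYPE_OBJECT)
        return 1;
    for (i = 0; i < type->field_count; ++i)
        count += type_get_component_count(type->fields[i]);
    return count;
}

/* Append one copy per leaf of "type", starting at component offset "base".
 * Returns the new number of entries in "copies". */
static size_t split_copy(const struct type *type, unsigned int base,
        struct field_copy *copies, size_t count, size_t capacity)
{
    size_t i;

    if (type->kind != TYPE_STRUCT)
    {
        assert(count < capacity);
        copies[count].component_offset = base;
        copies[count].type = type;
        return count + 1;
    }

    for (i = 0; i < type->field_count; ++i)
    {
        count = split_copy(type->fields[i], base, copies, count, capacity);
        base += type_get_component_count(type->fields[i]);
    }
    return count;
}
```

With the "struct apple" example quoted above, splitting the load of a.s
(which starts at component 1 under these assumptions) yields two copies: the
texture b at component 1 and the int c at component 2.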

> >> Similarly, the amount of code that has to deal with matrix majority is
> >> unfortunate.
> >
> > That personally seems more annoying. Although it's not clear to me
> > that handling matrix majority at a later stage is necessarily any
> > better.
>
> The main idea is that we could handle it something closer to once (well,
> once per backend), at HLSL -> SMx translation.
>
> That doesn't necessarily mean requiring that all matrix loads and stores
> are done on a single scalar—after all, we could translate a single
> vector load to multiple MOV instructions if it can't actually be
> represented by one.
>
> It does potentially mean doing vectorization passes on SMxIR, though.
> Hard to tell this far in advance, and it's also hard to tell if that's
> something we're going to need anyway.

Yeah, that's the sort of possible snag that I was thinking about WRT
moving the matrix majority handling "down".
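
As a rough illustration of what handling majority at HLSL -> SMx translation
would involve, the sketch below maps a matrix element to a register and a
component within that register, and counts how many source registers a
single row load touches. The names are hypothetical, and it assumes each
major vector (a row for row-major, a column for column-major) lives in its
own register and that the lowering emits one MOV per source register.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a matrix type. */
struct matrix_type
{
    unsigned int rows, columns;
    bool row_major;
};

struct reg_component
{
    unsigned int reg;       /* Register index relative to the variable. */
    unsigned int component; /* Component within that register. */
};

/* Assumption: each major vector occupies its own register. */
static struct reg_component matrix_element_location(
        const struct matrix_type *type, unsigned int row, unsigned int column)
{
    struct reg_component loc;

    assert(row < type->rows && column < type->columns);
    if (type->row_major)
    {
        loc.reg = row;
        loc.component = column;
    }
    else
    {
        loc.reg = column;
        loc.component = row;
    }
    return loc;
}

/* Number of MOVs needed to load one row as a vector, assuming one MOV per
 * source register touched. */
static unsigned int row_load_mov_count(const struct matrix_type *type)
{
    return type->row_major ? 1 : type->columns;
}
```

For a column-major 2x3 matrix, element (1, 2) lands in register 2,
component 1, and loading a row needs three MOVs; for the row-major
equivalent the same element is in register 1, component 2, and a row load is
a single MOV. That gap is where a later vectorization pass on SMxIR might
become necessary.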

> >> The former problem can potentially be solved by embedding multiple
> >> register offsets into hlsl_deref (one per register type). Neither this
> >> nor the latter problem is prohibitive, and I was at one point in favour
> >> of continuing to use register offsets everywhere, but at this point my
> >> feeling has changed, and I think using register offsets is looking more
> >> ugly than the alternatives. I get the impression that Francisco
> >> disagrees, though, which is why we should probably hash this out now.
> >
> > As I mention below, I currently see two options as the most appealing.
> > This one (multiple register offsets) sits somewhat in the middle and
> > it feels like it would be best to go to one of the extremes instead.
> > It's also possible that this middle ground solution would end up being
> > nicer in practice. At any rate, I certainly wouldn't flat out discount
> > it.
> >
> >> Nor do I think we should use both register offsets and component offsets
> >> (either in the same node type, or in different node types). That just
> >> makes the IR way more complicated. Rather, I think we should be doing
> >> everything in *just* component offsets until translation from HLSL IR to
> >> SMx IR.
> >
> > I touched on this earlier and I agree that the additional complexity
> > is unlikely to be worth it. Admittedly we're in a limbo right now
> > where SMxIR isn't quite there yet, which makes reasoning on some of
> > these details a bit fuzzy.
> >
> >> In order to deal with the problem of translating dynamic offsets from
> >> components to registers, I see three options:
> >>
> >> (a) emit code at runtime, or do some sophisticated lowering,
> >>
> >> (b) use special offsetof and sizeof nodes,
> >>
> >> (c) introduce a structured deref type, much like [1]. Francisco was
> >> actually proposing something like this, although with an array instead
> >> of a recursive structure, which strikes me as an improvement.
> >>
> >> My guess is that (a) is very hard. I haven't really tried to reason it
> >> out, though.
> >>
> >> Given a choice between (b) and (c), I'm more inclined to pick (c). It
> >> makes the IR structure more restrictive, and those restrictions
> >> fundamentally match the structured nature of the language we're working
> >> with, both things I tend to like.
> >
> > After giving it some thought I think that's certainly fine *for the
> > higher level IR*. At the same time it seems to me that, if we go that
> > route, eventually we also want to have real SMxIR with register
> > offsets, and make sure that we can optimize constant offsets (thus
> > expressions) at that level.
> >
> > As I see it (as of current time and date, can't guarantee that I won't
> > change my mind again...) we either push the backend-specific info up
> > (register offsets all the way) or down (component offsets with
> > structured deref / type info in the generic IR, transformation into
> > register offsets in the SMxIR). I think either option works and it's
> > mostly a matter of preference and which one fits / feels better with
> > the rest of the compiler.
>
> Yeah, that general approach makes sense to me. And yes, of course the
> SMxIR should deal entirely in register offsets.
>
> My current vision of SMxIR is that it should be a one-to-one
> representation of actual instructions, writable without any lowering
> passes (and hence any passes that are done on it should be optimization
> only, with the *possible* exception of RA.) In a sense, it's what we
> have already with sm4_instruction and such, except that we'd be storing
> it and doing passes on it rather than just writing it directly.

Right, I agree with the general idea. In practice it might turn out to
be useful to relax the 1:1 requirements a bit and introduce some
"extra" instructions (that would be quickly lowered to real ones) if
that makes the HIR->SMxIR conversion easier.

Something related we already discussed: SM1IR and SM4IR are going to
be different, obviously, but we want to try and architect them so that
the optimization passes machinery and most of the actual passes can be
shared between the two.

> Between those two extremes—well, what we currently have basically *is*
> the first extreme, with register offsets pushed all the way up to parse
> time. It's just causing some friction that makes me think the latter
> extreme is probably going to be prettier.

It is, but it sort of happened by accident. My guess is that an
explicit effort to "clean up" the compiler around register offsets
from the top might lessen some of that friction.

It's fair to say that, at this point, we probably have more visibility
into the latter, which arguably makes that option more compelling.

> >> Note that either way we're going to need specialized functions to
> >> resolve deref offsets in one step. I also think that should depend on
> >> the domain—e.g. for copy-prop we'll actually want to do everything in
> >> component counts, but when translating to SMxIR we'll evaluate given the
> >> register alignment constraints of the shader model. In the case of (b)
> >> it's not going to be as simple as running the existing constant folding
> >> pass, because we can't actually fold the sizeof/offsetof constants
> >> (unless we dup the node list, evaluate, and then fold, which seems very
> >> hairy and more work than the alternative).
> >
> > Right, each option will have different tradeoffs WRT optimization
> > passes. But e.g. copy-prop should be doable even with register
> > offsets, we "just" need to make sure to always map the component
> > offsets to their respective register offsets.
>
> Quite, in fact we're already doing it that way. But it's probably better
> to work with components, since we (a) don't waste space tracking padding
> [not very important], and (b) don't have to deal with multiple register
> sets [more important].

Yeah. I guess my point would have been better served by pointing out
passes that become easier or more powerful with register offsets;
vectorization is probably one of them.
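
For reference, here is a sketch of resolving a structured deref (a path of
field indices, in the array form rather than the recursive one) in both
domains at once: a component offset, plus one register offset per register
set. Everything here is hypothetical, including the names and the packing
rule, which is deliberately simplistic: every numeric leaf takes one numeric
register and every object takes one object register. Real shader models
have more involved alignment rules.

```c
#include <assert.h>
#include <stddef.h>

enum type_kind { TYPE_NUMERIC, TYPE_OBJECT, TYPE_STRUCT };
enum regset { REGSET_NUMERIC, REGSET_OBJECT, REGSET_COUNT };

/* Hypothetical, simplified type model. */
struct type
{
    enum type_kind kind;
    unsigned int component_count;     /* TYPE_NUMERIC only. */
    size_t field_count;               /* TYPE_STRUCT only. */
    const struct type *const *fields; /* TYPE_STRUCT only. */
};

/* A size or an offset, expressed in every domain we care about. */
struct type_size
{
    unsigned int components;
    unsigned int regs[REGSET_COUNT];
};

static void size_add(struct type_size *dst, const struct type_size *src)
{
    unsigned int k;

    dst->components += src->components;
    for (k = 0; k < REGSET_COUNT; ++k)
        dst->regs[k] += src->regs[k];
}

/* Assumed packing rule: one register per leaf, in the set matching its kind. */
static struct type_size type_get_size(const struct type *type)
{
    struct type_size size = {0, {0, 0}}, field;
    size_t i;

    if (type->kind == TYPE_NUMERIC)
    {
        size.components = type->component_count;
        size.regs[REGSET_NUMERIC] = 1;
    }
    else if (type->kind == TYPE_OBJECT)
    {
        size.components = 1;
        size.regs[REGSET_OBJECT] = 1;
    }
    else
    {
        for (i = 0; i < type->field_count; ++i)
        {
            field = type_get_size(type->fields[i]);
            size_add(&size, &field);
        }
    }
    return size;
}

/* Walk the path of field indices and accumulate the sizes of all fields
 * that precede the selected one at each level. */
static struct type_size deref_resolve(const struct type *type,
        const unsigned int *path, size_t path_len)
{
    struct type_size offset = {0, {0, 0}}, field;
    size_t i;
    unsigned int j;

    for (i = 0; i < path_len; ++i)
    {
        assert(type->kind == TYPE_STRUCT && path[i] < type->field_count);
        for (j = 0; j < path[i]; ++j)
        {
            field = type_get_size(type->fields[j]);
            size_add(&offset, &field);
        }
        type = type->fields[path[i]];
    }
    return offset;
}
```

For the "struct apple" example, the path {1, 1} (a.s.c) resolves to component
offset 2 and numeric register offset 1, while the object register set has
advanced by 1 past the texture. A pass like copy-prop would only look at the
component offset; translation to SMxIR would pick the register offset for the
relevant set.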

> >> I invite thoughts—especially from Matteo, since we discussed this sort
> >> of problem ages ago.
> >
> > Yep, hope that my comments make sense. I want to hear from the others too.
> >
> >>
> >> ἔρρωσθε,
> >> Zeb
> >>
> >>
> >> [1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html
> >>
> >> [2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html
> >>