[PATCH 2/5] d3dcompiler: Introduce a separate structure for lvalue derefs.

Matteo Bruni matteo.mystral at gmail.com
Tue Apr 28 16:22:02 CDT 2020


On Tue, Apr 28, 2020 at 10:47 PM Zebediah Figura
<zfigura at codeweavers.com> wrote:
>
> On 4/28/20 6:38 AM, Matteo Bruni wrote:
> > On Sun, Apr 19, 2020 at 11:06 PM Zebediah Figura
> > <zfigura at codeweavers.com> wrote:
> >>
> >> On 4/17/20 5:43 PM, Matteo Bruni wrote:
> >>> I see your point. I guess a possibility opened by the flat derefs is
> >>> that you could share intermediate derefs, although that might well
> >>> make things more complicated for no benefit.
> >>
> >> I mean, it could potentially be done. I guess the easiest way that
> >> occurs to me is to set a flag that just states whether a HLSL_IR_DEREF
> >> is read from...
> >>
> >>>> I think I understand why flattening is desirable in general, but it's
> >>>> not clear to me how that would apply to lvalues (or even to rvalue deref
> >>>> chains, honestly), since at codegen time we essentially resolve
> >>>> arbitrarily complex deref chains into one register (well, except for
> >>>> non-constant array indexing...)
> >>>
> >>> True, but the same can be said in principle for any other instruction.
> >>
> >> What I meant to say is that an expression like "(a + b) * c" is
> >> necessarily two instructions, whereas "a.b.c.<...>.e = f" is always one,
> >> regardless of the length of the deref chain.
> >
> > Not if that's e.g. a matrix assignment, which needs to become multiple
> > mov instructions at some point.
> > Stretching it a bit, the former can potentially be just one instruction (MAD).
>
> Sure, I guess I'm assuming that we've already gone through splitting
> passes so that's not a concern.
>
> >>> Also if you always "go flat" you have only one way to follow sources,
> >>> instead of having to mix tree traversing inside a single instruction
> >>> with instruction traversing.
> >>> Relatedly, if we start doing more invasive IR transformations, having
> >>> simple, "atomic" instructions might make things easier. E.g. with
> >>> structure splitting, I think you'd end up replacing one
> >>> instruction (as in one instruction list entry) without changing the
> >>> rest of the IR, while in the other case you have to replace a piece of
> >>> a larger instruction which might be buried arbitrarily deep in the
> >>> tree.
> >>> Admittedly this is all very abstract and I might not be seeing a
> >>> sensible overall picture.
> >>>
> >>
> >> I guess the way I look at it is, emitting an assignment instruction
> >> means we have to reach up the entire chain at codegen time anyway, i.e.
> >> I'm not sure how you can emit an instruction for each individual lvalue
> >> deref node, whereas you can emit (at least) one instruction for each
> >> node currently.
> >
> > My point is that, during compilation, we probably have to traverse the
> > instruction list to look up sources for non-deref nodes anyway. At that
> > point, if derefs are also full instructions / nodes, you get those for
> > free. If, on the other hand, derefs are stored in an arbitrarily deep
> > tree inside other instructions, we need to have a way to traverse
> > derefs in addition to the "non-derefs" instructions. That's just me
> > guessing though.
>
> I don't particularly understand this assertion (despite our conversation
> in IRC about it). Maybe you could describe a specific optimization that
> would be helped by this, if you have any in mind?

My thought was that any pass that replaces an instruction might have
to also look at the sources. But during our IRC conversation we came
up with at least two ways to avoid that, so maybe not.
Passes that attempt to combine instructions (like the MAD case I
mentioned earlier) might use this kind of traversal. But again, after
more thinking it seems like it should also be possible without.
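
To make that a bit more concrete, this is roughly the kind of pass I
was picturing, written against a made-up toy IR rather than the actual
d3dcompiler structures:

/* Toy IR, not the real thing: every instruction is an entry in one
 * flat list and every source operand is simply a pointer to an
 * earlier entry, derefs included. */
enum toy_op { TOY_CONST, TOY_DEREF, TOY_ADD, TOY_MUL, TOY_MAD };

struct toy_node
{
    enum toy_op op;
    struct toy_node *src1, *src2, *src3;
    struct toy_node *next;  /* next entry in the instruction list */
};

/* Fold "a * b + c" into a single MAD by looking only at the sources
 * of each ADD; there is no per-instruction tree to walk, derefs are
 * just more list entries. Multiple uses of the MUL result are ignored
 * here and the now dead MUL is left to a later DCE pass. */
static void combine_mad(struct toy_node *list)
{
    struct toy_node *node;

    for (node = list; node; node = node->next)
    {
        if (node->op != TOY_ADD)
            continue;
        if (node->src1->op == TOY_MUL)
        {
            struct toy_node *mul = node->src1;

            node->op = TOY_MAD;
            node->src3 = node->src2;
            node->src2 = mul->src2;
            node->src1 = mul->src1;
        }
    }
}

Nothing definitive of course, just to show what "only one way to
follow sources" would mean in practice.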

> Fundamentally I can understand that things have to recursively reach
> through deref nodes (well, unless we collapse them down to a single
> offset, like your suggestion below), but it's not clear to me that this
> is necessary for anything but struct or array derefs (and in that case,
> I don't see a lot of problem with treating them as a special case).

Certainly not a lot of problem, but it would be nicer to avoid the
recursive search entirely, if possible.
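
For reference, the recursive walk I'd like to avoid is roughly this
(made-up types again, just to illustrate):

struct toy_var;

/* Hypothetical deref chain: each link selects a field or an array
 * element of whatever the inner link produces, bottoming out at a
 * variable. */
struct toy_deref
{
    struct toy_var *var;        /* set only on the innermost link */
    struct toy_deref *inner;
    unsigned int offset;        /* register offset added by this link */
};

/* Codegen (and liveness tracking) has to recurse all the way down to
 * find the variable and the accumulated offset; non-constant array
 * indices would need extra handling on top of this. */
static unsigned int resolve_offset(const struct toy_deref *deref,
        struct toy_var **var)
{
    if (!deref->inner)
    {
        *var = deref->var;
        return deref->offset;
    }
    return deref->offset + resolve_offset(deref->inner, var);
}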

> Not that this is hugely important, though, if I pursue the offsetting
> approach and it works.
>
> >> To be clear, I don't think it's awful to emit nodes for lvalue derefs,
> >> I'm just not convinced it's better—but I'll defer to the maintainer in
> >> any case.
> >
> > Well, sure, but what I'm doing is mostly picturing things in my head,
> > which doesn't always work right.
> >
> >>>> It's also not obvious to me that it's prettier to flatten it, given that
> >>>> on the other hand we're also now inserting nodes into the IR list that would
> >>>> presumably not participate in RA or generate code.
> >>>
> >>> Right, but that's a bit like doing an optimization right while parsing.
> >>> I guess we want lvalues and rvalues / derefs to either be both flat or
> >>> both tree-like. Let's focus on rvalues for a moment.
> >>> Imagine you have a complicated deref chain, eventually picking a
> >>> single float out of something larger and complex. If you have a very
> >>> naive compiler you can imagine that each step of the deref is
> >>> implemented as a copy of whatever the result of the deref step is into
> >>> a temporary variable. This might mean e.g. temporarily copying a full
> >>> matrix before getting at the individual component. Incredibly
> >>> inefficient, but simple and consistent.
> >>> Obviously this isn't very interesting and can be simplified right
> >>> away... except when it can't, like the non-constant array indexing
> >>> case you mention. Or copying a whole input structure to an output
> >>> variable.
> >>> For lvalues you can imagine the mirror-image situation, where the naive
> >>> compiler stores results into a temporary variable and then copies that
> >>> into the actual destination.
> >>>
> >>> It seems to me that it would be nicer to have a simple and flat IR
> >>> structure where each entry can do very little, or even nothing, in
> >>> practice, and handle simplifications and IR restructuring over a
> >>> number of passes rather than the opposite. Somewhat related, you can
> >>> already get instructions that eventually do nothing of consequence
> >>> (e.g. dead code, swizzling into the exact same thing, ...) and I don't
> >>> think that necessarily has to be that different from the compiler's
> >>> point of view.
> >>
> >> So I'm not entirely sure from this if you have something specific in
> >> mind, but I'll at least run through my thoughts on the matter.
> >>
> >> The compiler that exists in at least two of my local trees does emit a
> >> copy for every HLSL_IR_DEREF, i.e. from a variable into a temp node.
> >> This is awful, inasmuch as it matters, but the bottom line is that, by
> >> design, HLSL_IR_DEREF is essentially redundant, as you say.
> >>
> >> One way to get around this—and the approach implied if we do make each
> >> deref an explicit node—is to reach through all the derefs at codegen
> >> time and grab the relevant register. We'd skip actually emitting code
> >> for any deref node. Hence an expression like "func().b.c + 1" would
> >> generate IR like
> >>
> >> 2: func()
> >> 3: @2.b
> >> 4: @3.c
> >> 5: 1
> >> 6: @4 + @5
> >>
> >> Inlining aside, we'd only allocate registers for 2, 5, 6, and the source
> >> to node 6 would offset the register allocated to node 2 by the offset of
> >> "b.c".
> >>
> >> Note that liveness analysis would also have to recursively reach through
> >> deref nodes.
> >>
> >> The advantage of this, I guess, is that if we really do want to make
> >> lvalues separate nodes, then rvalues should probably be separate nodes too.
> >
> > Right. One way to see it is that this is essentially a specialized
> > form of copy propagation that only applies to derefs (which is
> > admittedly the most important case, but not the only one). If we have
> > actual copy (or maybe even constant) propagation we become able to
> > 'skip' register allocation for other node types too. I already
> > mentioned swizzling as something we can do without a separate instruction
> > and register, but it's not limited to that.
> >
> >> The second alternative that I was looking at was to make this a bit more
> >> explicit, and essentially let the hlsl_src structure contain a union of
> >> hlsl_ir_node and hlsl_deref (note that hlsl_deref.array/record would
> >> then have to be a pointer to hlsl_src instead of the flat structure as
> >> patch 0005 has it). Then we'd essentially generate
> >>
> >> 2: func()
> >> 3: 1
> >> 4: @2.b.c + @3
> >>
> >> I mildly prefer this, since I like having the IR language be explicit
> >> about what's allowed and to resemble the code we produce, and it's not
> >> difficult to form this even at parse time. We can't easily get rid of
> >> HLSL_IR_DEREF entirely here, but we can guarantee it doesn't occur
> >> except at parse time (plus maybe a DCE run?).
> >
> > I guess the downside of this approach is that derefs are a special
> > case to some degree, which isn't necessarily bad but feels a little
> > limiting, as in my comment above.
> >
> >> A third alternative is to adopt my patches 010* from my 6 April mail,
> >> i.e. to make both lvalue and rvalue derefs explicitly chains of derefs,
> >> and synthesize variables when necessary. That doesn't preclude folding
> >> hlsl_deref into hlsl_src, of course.
> >
> > Right. I feel this ends up conflating a few different things, though.
> > Representing lvalues and rvalues as chains of (only) derefs is an IR
> > policy; synthesizing temporary variables is the mechanism; the
> > relation between hlsl_deref and hlsl_src is the "enforcement". Those
> > are all good but they are separate pieces of the puzzle. E.g. as I
> > mentioned earlier, synthesizing variables is a mechanism that will
> > probably come in handy for one reason or another even if we decide to
> > go with a different route for derefs.
> >
> >
> > I have another proposal, as a way to sidestep the whole deref story.
> > As usual, with the caveat that this is all armchair planning and it is
> > known that plans rarely survive contact with the enemy. Anyway...
> >
> > With the current IR we end up having arbitrarily complex deref chains
> > to represent the composition of multiple levels of array and struct
> > "indexing", which are necessarily cumbersome to handle. What if we
> > transform all of that into offsetting instead? An arbitrary variable
> > dereference would be represented as (variable, offset), with offset an
> > arbitrary expression with an integer scalar value. That requires that
> > we know the layout of complex data types early on, which now we can
> > do, I think. The offsets could be in scalar or vec4 units; I can see
> > benefits for either choice. Ideally we would directly generate offsets
> > right during parsing and skip derefs altogether, although that might
> > not be practical.
> > An obvious downside coming from this is that constant folding becomes
> > pretty much required, to compute a register index at codegen time. It
> > does have the advantage that non-constant offsets are no different
> > from constant ones as far as IR is concerned. Obviously we can skip
> > supporting those initially, regardless of all this.
>
> I like this inasmuch as it's basically "hlsl_deref chains, but simpler";
> I guess it requires no recursion anywhere, and so kind of sidesteps the
> constraints that led us here in the first place.

Yeah, I didn't really love any of the options coming from derefs, so I
decided to throw them away and see what happens :P
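
To sketch the idea out (field names purely indicative, none of this
exists anywhere yet):

struct toy_var;
struct toy_node;

/* A dereference is just a variable plus an offset expression. The
 * offset is itself a node in the IR, so constant and non-constant
 * offsets look exactly the same; constant folding is what turns the
 * common case back into a plain number at codegen time. */
struct toy_deref
{
    struct toy_var *var;
    struct toy_node *offset;    /* integer scalar expression, in
                                 * scalar (or vec4) units */
};

So e.g. "a.b[i].c" would become a single deref of "a" whose offset is
ordinary IR arithmetic, something like offset(b) + i * stride(b) +
offset(c), and when "i" is constant the whole thing folds down to one
number.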

> > What do you think? I have a complication right away, which is the fact
> > that in SM3 struct variables can be allocated into multiple register
> > sets (see http://www.winehq.org/pipermail/wine-devel/2013-August/100705.html).
> > I don't think it's a blocker but it is annoying...
>
> The way I'd solve that is just to allow for potentially multiple
> hlsl_reg structures to any given uniform. That could be as simple as
> including three separate hlsl_reg structures inside hlsl_var. SM2-3
> apparently allocates structs by limiting the size to include up to the
> last used element (whereas earlier unused elements remain allocated).
> That'd just have to be done per type.

Hmm, not sure if it does allocate "unnecessary" elements but, if
that's the case, that's quite nice indeed since it means offsets (when
counting in units of actual SM2 registers) are the same regardless of
the register set.
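
FWIW that could indeed look as simple as something like this (again
just a sketch with invented names):

/* One register allocation per SM2-3 constant register set, so a
 * struct uniform that ends up in more than one set simply gets more
 * than one base register. */
enum toy_regset
{
    TOY_REGSET_FLOAT,   /* c# */
    TOY_REGSET_INT,     /* i# */
    TOY_REGSET_BOOL,    /* b# */
    TOY_REGSET_COUNT,
};

struct toy_reg
{
    unsigned int id;
    int allocated;
};

struct toy_var
{
    const char *name;
    struct toy_reg regs[TOY_REGSET_COUNT];
};

The (variable, offset) part of a deref wouldn't change; only the base
register the offset gets added to at codegen time depends on which
register set the access ends up in.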

> I'm admittedly not sure how changing our representation of derefs
> affects this specific problem, but maybe I'm missing something.

It doesn't seem insurmountable either way, just something that's maybe
a bit more of a hassle with offsets vs derefs. Or not, if the above is
how it works.

> To my simultaneous delight and chagrin I can't come up with anything
> else that this approach would break.

I call it a success! Until you find it while updating the code :D


