[PATCH 2/5] d3dcompiler: Introduce a separate structure for lvalue derefs.

Matteo Bruni matteo.mystral at gmail.com
Tue Apr 28 06:38:05 CDT 2020


On Sun, Apr 19, 2020 at 11:06 PM Zebediah Figura
<zfigura at codeweavers.com> wrote:
>
> On 4/17/20 5:43 PM, Matteo Bruni wrote:
> > I see your point. I guess a possibility opened by the flat derefs is
> > that you could share intermediate derefs, although that might well
> > make things more complicated for no benefit.
>
> I mean, it could potentially be done. I guess the easiest way that
> occurs to me is to set a flag that just states whether a HLSL_IR_DEREF
> is read from...
>
> >> I think I understand why flattening is desirable in general, but it's
> >> not clear to me how that would apply to lvalues (or even to rvalue deref
> >> chains, honestly), since at codegen time we essentially resolve
> >> arbitrarily complex deref chains into one register (well, except for
> >> non-constant array indexing...)
> >
> > True, but the same can be said in principle for any other instruction.
>
> What I meant to say is that an expression like "(a + b) * c" is
> necessarily two instructions, whereas "a.b.c.<...>.e = f" is always one,
> regardless of the length of the deref chain.

Not if it's e.g. a matrix assignment, which needs to become multiple
mov instructions at some point.
Stretching it a bit, the former can potentially be just one instruction (MAD).

> > Also if you always "go flat" you have only one way to follow sources,
> > instead of having to mix tree traversing inside a single instruction
> > with instruction traversing.
> > Relatedly, if we start doing more invasive IR transformations, having
> > simple, "atomic" instructions might make things easier. E.g. with
> > structure splitting, I think you'd end up with replacing one
> > instruction (as in one instruction list entry) without changing the
> > rest of the IR, while in the other case you have to replace a piece of
> > a larger instruction which might be buried arbitrarily deep in the
> > tree.
> > Admittedly this is all very abstract and I might not be seeing a
> > sensible overall picture.
> >
>
> I guess the way I look at it is, emitting an assignment instruction
> means we have to reach up the entire chain at codegen time anyway, i.e.
> I'm not sure how you can emit an instruction for each individual lvalue
> deref node, whereas you can emit (at least) one instruction for each
> node currently.

My point is that, during compilation, we probably have to traverse the
instruction list to look up sources for non-deref nodes anyway. At that
point, if derefs are also full instructions / nodes, you get those for
free. If, on the other hand, derefs are stored in an arbitrarily deep
tree inside other instructions, we need a separate way to traverse
derefs in addition to the "non-deref" instructions. That's just me
guessing though.
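
To make that a bit more concrete, here's roughly what I'm picturing
(all the helper names and fields below are made up for illustration,
it's not code from the patches):

    /* With flat deref nodes, a single loop over the instruction list is
     * enough for a pass to see everything, derefs included: */
    struct hlsl_ir_node *instr;

    LIST_FOR_EACH_ENTRY(instr, instrs, struct hlsl_ir_node, entry)
        do_something_with_node(instr);

    /* With tree-like derefs, the same pass also needs a recursive helper
     * for the deref part of every source, something like: */
    static void visit_deref(struct hlsl_deref *deref)
    {
        if (deref->type != HLSL_IR_DEREF_VAR)
            visit_deref(deref_inner(deref)); /* the array/record being dereferenced */
        do_something_with_deref(deref);
    }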

> To be clear, I don't think it's awful to emit nodes for lvalue derefs,
> I'm just not convinced it's better—but I'll defer to the maintainer in
> any case.

Well, sure, but what I'm doing is mostly picturing things in my head,
which doesn't always work right.

> >> It's also not obvious to me that it's prettier to flatten it, given on the
> >> other hand we're also now inserting nodes into the IR list that would
> >> presumably not participate in RA or generate code.
> >
> > Right, but that's a bit like doing an optimization right while parsing.
> > I guess we want lvalues and rvalues / derefs to either be both flat or
> > both tree-like. Let's focus on rvalues for a moment.
> > Imagine you have a complicated deref chain, eventually picking a
> > single float out of something larger and complex. If you have a very
> > naive compiler you can imagine that each step of the deref is
> > implemented as a copy of whatever the result of the deref step is into
> > a temporary variable. This might mean e.g. temporarily copying a full
> > matrix before getting at the individual component. Incredibly
> > inefficient, but simple and consistent.
> > Obviously this isn't very interesting and can be simplified right
> > away... except when it can't, like the non-constant array indexing
> > case you mention. Or copying a whole input structure to an output
> > variable.
> > For lvalues you can imagine the specular thing, where the naive
> > compiler stores results into a temporary variable and then copies that
> > into the actual destination.
> >
> > It seems to me that it would be nicer to have a simple and flat IR
> > structure where each entry can do very little, or even nothing, in
> > practice, and handle simplifications and IR restructuring over a
> > number of passes rather than the opposite. Somewhat related, you can
> > already get instructions that eventually do nothing of consequence
> > (e.g. dead code, swizzling into the exact same thing, ...) and I don't
> > think that has to be necessarily that different from the compiler's
> > point of view.
>
> So I'm not entirely sure from this if you have something specific in
> mind, but I'll at least run through my thoughts on the matter.
>
> The compiler that exists in at least two of my local trees does emit a
> copy for every HLSL_IR_DEREF, i.e. from a variable into a temp node.
> This is awful, inasmuch as it matters, but the bottom line is by design,
> HLSL_IR_DEREF is essentially completely redundant, as you say.
>
> One way to get around this—and the approach implied if we do make each
> deref an explicit node—is to reach through all the derefs at codegen
> time and grab the relevant register. We'd skip actually emitting code
> for any deref node. Hence an expression like "func().b.c + 1" would
> generate IR like
>
> 2: func()
> 3: @2.b
> 4: @3.c
> 5: 1
> 6: @4 + @5
>
> Inlining aside, we'd only allocate registers for 2, 5, 6, and the source
> to node 6 would offset the register allocated to node 2 by the offset of
> "b.c".
>
> Note that liveness analysis would also have to recursively reach through
> deref nodes.
>
> The advantage of this, I guess, is that if we really do want to make
> lvalues separate nodes, then rvalues should probably be separate nodes too.

Right. One way to see it is that this is essentially a specialized
form of copy propagation that only applies to derefs (which is
admittedly the most important case, but not the only one). If we have
actual copy (or maybe even constant) propagation, we can 'skip'
register allocation for other node types too. I already mentioned
swizzling as something we can do without a separate instruction and
register, but it's not limited to that.
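
For the swizzle case, for instance, I'm imagining that the source
lookup at codegen time could simply reach through the swizzle node
instead of allocating a register for it. Very roughly, with all the
helper and type names invented:

    static struct src_reg node_to_src(struct hlsl_ir_node *node)
    {
        if (node->type == HLSL_IR_SWIZZLE)
        {
            struct hlsl_ir_swizzle *swizzle = swizzle_from_node(node);
            struct src_reg src = node_to_src(swizzle->val);

            /* Compose the swizzles instead of emitting a mov into a
             * freshly allocated register. */
            src.swizzle = combine_swizzles(src.swizzle, swizzle->swizzle);
            return src;
        }
        return src_from_allocated_reg(node);
    }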

> The second alternative that I was looking at was to make this a bit more
> explicit, and essentially let the hlsl_src structure contain a union of
> hlsl_ir_node and hlsl_deref (note that hlsl_deref.array/record would
> then have to be a pointer to hlsl_src instead of the flat structure as
> patch 0005 has it). Then we'd essentially generate
>
> 2: func()
> 3: 1
> 4: @2.b.c + 3
>
> I mildly prefer this, since I like having the IR language be explicit
> about what's allowed and to resemble the code we produce, and it's not
> difficult to form this even at parse time. We can't easily get rid of
> HLSL_IR_DEREF entirely here, but we can guarantee it doesn't occur
> except at parse time (plus maybe a dce run?)

I guess the downside of this approach is that derefs are a special
case to some degree, which isn't necessarily bad but feels a little
limiting, as in my comment above.
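
Just so we're picturing the same thing, I imagine the hlsl_src from
that alternative would look roughly like this (entirely made-up
sketch, field names included):

    struct hlsl_src
    {
        enum
        {
            HLSL_SRC_NODE,
            HLSL_SRC_DEREF,
        } type;
        union
        {
            struct hlsl_ir_node *node;
            struct hlsl_deref deref;    /* with array/record as struct hlsl_src * */
        } u;
        struct list entry;              /* entry in the uses list */
    };

i.e. every pass that looks at sources has to handle both arms of the
union, which is more or less what I mean by derefs being a special case.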

> A third alternative is to adopt my patches 010* from my 6 April mail,
> i.e. to make both lvalue and rvalue derefs explicitly chains of derefs,
> and synthesize variables when necessary. That doesn't preclude folding
> hlsl_deref into hlsl_src, of course.

Right. I feel this ends up conflating a few things, though.
Representing lvalues and rvalues as chains of (only) derefs is an IR
policy; synthesizing temporary variables is the mechanism; the
relation between hlsl_deref and hlsl_src is the "enforcement". Those
are all good, but they are separate pieces of the puzzle. E.g. as I
mentioned earlier, synthesizing variables is a mechanism that will
probably come in handy for one reason or another even if we decide to
take a different route for derefs.


I have another proposal, as a way to sidestep the whole deref story.
As usual, with the caveat that this is all armchair planning and it is
known that plans rarely survive contact with the enemy. Anyway...

With the current IR we end up with arbitrarily complex deref chains
to represent the composition of multiple levels of array and struct
"indexing", which are necessarily cumbersome to handle. What if we
transformed all of that into offsetting instead? An arbitrary variable
dereference would be represented as (variable, offset), with the
offset being an arbitrary expression yielding an integer scalar value.
That requires that we know the layout of complex data types early on,
which I think we can do now. The offsets could be in scalar or vec4
units; I can see benefits to either choice. Ideally we would generate
offsets directly during parsing and skip derefs altogether, although
that might not be practical.
An obvious downside is that constant folding becomes pretty much
required in order to compute a register index at codegen time. On the
other hand, non-constant offsets are then no different from constant
ones as far as the IR is concerned. Obviously we can skip supporting
those initially, regardless of all this.
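
In (made-up) code, a deref would then boil down to something like

    struct hlsl_deref
    {
        struct hlsl_ir_var *var;
        struct hlsl_ir_node *offset;    /* integer scalar expression, in
                                         * scalar or vec4 units, TBD */
    };

and e.g. "a.b[i].c" would turn into IR along the lines of

1: i
2: @1 * <size of one element of a.b>
3: @2 + <offset of b within a> + <offset of c within the element>

with @3 used as the offset of the deref, everything folding down to a
single constant whenever "i" is constant.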

What do you think? One complication comes up right away: in SM3,
struct variables can be allocated into multiple register sets
(see http://www.winehq.org/pipermail/wine-devel/2013-August/100705.html).
I don't think it's a blocker, but it is annoying...


