[PATCH vkd3d 2/5] vkd3d-shader/hlsl: Perform a copy propagation pass.

Wed Nov 10 10:33:21 CST 2021

On 11/10/21 03:57, Giovanni Mascellani wrote:
> Hi,
> 
> On 09/11/21 23:21, Zebediah Figura wrote:
>> Overall this patch looks a lot better than I was afraid of. It's a lot 
>> of code that's intimidating to review, but once you ignore the rbtree 
>> boilerplate it's simple enough and seems about in line with what I 
>> expect. There's quite a few things that I think can be simplified, but 
>> the basic structure seems sound, so nice work.
> 
> Great, that's good! Thanks for your review, I think most if not all of 
> the points you wrote should be easy to fix.
> 
>> The CF part is probably going to be a lot trickier, but fortunately I 
>> think that this patch alone is the one that really matters. Frankly, 
>> if we were to take this patch, and then a second patch that does 
>> copy-prop on interior CF blocks with a fresh copy_propagation_state(), 
>> we might even cover enough that it's not even worrying about CF...
> 
> Mmh, note that you have to handle control flow in some way or another, 
> you cannot just ignore it. This patch doesn't know about control flow, 
> but the only way it can do it is by bailing out at the first control 
> flow instruction in the program. It's not enough to ignore the control 
> flow block and resume processing after it, because the inner block might 
> invalidate some of your knowledge about the program state (in 3/5 patch 
> this is done by copy_propagation_invalidate_from_block).

Right. I should have said to invalidate the whole copyprop state both 
when entering a CF block and when leaving one. The point is that we can 
do copyprop within every BB, not just the first one, without having to 
worry about things like partial invalidation.

> 
>>> +/* struct copy_propagation_state represents the accumulated knowledge
>>> + * of the copy propagation pass while it scans through the code. Field
>>> + * "variables" is a tree whose elements have type struct
>>> + * copy_propagation_varible, and represent each of the variables the
>>> + * pass has already encountered (except those with special semantics,
>>> + * which are ignored). For each variable, the array "values" (whose
>>> + * length is the register size of the variable) represent which node
>>> + * and which index inside that node (i.e., which of the at most four
>>> + * entries of a vector) provided that value last time. Field "node"
>>> + * can be NULL, meaning that the pass was not able to statically
>>> + * determine the node.
>>> + */
>>
>> This comment feels a bit too low-level? I dunno, comments like this 
>> are hard to review, but my inclination is to describe what you're 
>> doing at a high level, along the lines of "we track the last known 
>> value of each component of each variable", and let the code itself 
>> explain the details.
> 
> Personally I would have loved to meet comments like mine in other areas 
> of Wine: knowing what data read or written by an algorithm are meant to 
> represent makes it much easier to understand why the algorithm does what 
> it does. I agree that this is not the most complicated case around, but 
> I thought this would be helpful for someone not already knowing my code. 
> OTOH I understand that also a high-level description is useful: would 
> you consider it satisfying to have both?

Eh, maybe. I would at least start with the high-level explanation. When 
I read the above comment my eyes kind of glaze over, and it doesn't tell 
me anything that the code already doesn't (even the struct definitions 
alone tell me as much without checking how they're used.)

I dunno, I'll see what you end up coming up with.

BTW, I notice a spelling error ("copy_propagation_varible").

> 
>>> +static int copy_propagation_variable_compare(const void *key, const 
>>> struct rb_entry *entry)
>>> +{
>>> +    struct copy_propagation_variable *variable = 
>>> RB_ENTRY_VALUE(entry, struct copy_propagation_variable, entry);
>>> +    uintptr_t key_int = (uintptr_t)key, entry_int = 
>>> (uintptr_t)variable->var;
>>> +
>>> +    if (key_int < entry_int)
>>> +        return -1;
>>> +    else if (key_int > entry_int)
>>> +        return 1;
>>> +    else
>>> +        return 0;
>>
>> Could we just modify the rbtree implementation to use uintptr_t 
>> instead, and then do a direct subtraction?
> 
> I am not sure of what you mean: the pointer in struct 
> copy_propagation_variable is used as a pointer, it would not be very 
> logical to store it as an integer.
> 
> Also, do you get the right think if you subtract two unsigned integers? 
> For one thing the result is unsigned, but even if you cast it to signed 
> you don't get an order, do you?
> 
> Even if you could, you would get a shorter code, but I am not sure it 
> would be a more legible one.

Er, sorry, I should have said intptr_t. I mean to return intptr_t from 
the rbtree compare callback, and then all you need is

     return (intptr_t)key - (intptr_t) variable->var;

> 
>>> +static struct copy_propagation_variable 
>>> *copy_propagation_get_variable(struct hlsl_ctx *ctx,
>>> +        struct copy_propagation_state *state, struct hlsl_ir_var *var)
>>> +{
>>> +    struct rb_entry *entry = rb_get(&state->variables, var);
>>> +    struct copy_propagation_variable *variable;
>>> +    int res;
>>> +
>>> +    if (entry)
>>> +        return RB_ENTRY_VALUE(entry, struct 
>>> copy_propagation_variable, entry);
>>> +
>>> +    variable = hlsl_alloc(ctx, sizeof(*variable));
>>> +    if (!variable)
>>> +        return NULL;
>>> +
>>> +    variable->var = var;
>>> +    variable->values = hlsl_alloc(ctx, sizeof(*variable->values) * 
>>> var->data_type->reg_size);
>>> +    if (!variable->values)
>>> +    {
>>> +        vkd3d_free(variable);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    res = rb_put(&state->variables, var, &variable->entry);
>>> +    assert(!res);
>>
>> Although this is a bit awkward because most of the function is only 
>> relevant for stores, not loads, and I think it would actually be 
>> better to reflect that in the code, one way or another (so we can do 
>> less effort).
> 
> Do less effort at what? It's true that when loading I could return NULL 
> when the variable has never been seen before (meaning that we don't know 
> anything about that variable, so the load cannot be removed) instead of 
> creating an "empty" variable placeholder and return that one, but I 
> don't really see an advantage doing that. It would seem more effort 
> rather than less to me, because there are more possible branches to check.

Less effort done by the code, I mean, i.e. we use less memory that way 
and spend less time searching it. It requires a bit more code, but I 
think that if you slightly reorganize the callbacks it's still perfectly 
idiomatic.

> 
> (BTW, that would also mean that the variable is read before being 
> written, which is something that shouldn't normally happen, but I don't 
> think it's sensible to count on that)

I don't understand what you mean by this; can you please clarify?

> 
>>> +            if (TRACE_ON())
>>> +            {
>>> +                char buf[32];
>>> +                if (!node)
>>> +                    sprintf(buf, "(nil)");
>>
>> Can that happen?
> 
> Not yet, but it will as soon as loops and conditionals enter the scene, 
> because they can cause the last store to a variable to become unknown 
> (at compilation time).

Ah, of course, that makes sense.

> 
> Now, I know that usually we don't like to have dead code around. I think 
> this case could be accepted, since this code is just a line long and 
> becomes alive at 3/5. Also, I find it easier to review this piece of 
> code at once, instead of first reviewing a part, checking that the 
> missing case will never happen and then review the missing case and 
> checking that it fits well in the already-reviewed part.
> 
> But if you really don't like it, I can defer the NULL case to 3/5.

Yeah, even here I find it easier to review if this code is moved to 3/5.

> 
>>> +                else if (node->index)
>>> +                    sprintf(buf, "@%u", node->index);
>>> +                else
>>> +                    sprintf(buf, "%p", node);
>>
>> Can that happen?
> 
> Definitely. Arguably, it might be the common case, given that the 
> earlier optimization passes (including this pass itself, given that it 
> might be executed more than once) might synthesize a lot of nodes, which 
> won't have an index until compute_liveness is executed.
> 
Ah. I assumed that you would actually be indexing right before copyprop, 
and didn't check whether that was actually the case.

Although on reflection now I'm not sure if we *should* index. We only 
dump the IR once, at the end, and I feel like we risk confusion by 
printing instruction indices that won't match. Hmm, maybe the code makes 
sense as it is...

>>> +                TRACE("variable %s[%d] is written by %p[%d]\n", 
>>> variable->var->name, offset + index, buf, index);
>>
>> This trace doesn't really match the usual format, but more saliently, 
>> the hardcoded buffer is ugly. Assuming the other two cases really can 
>> happen, I'd rather see individual traces spelled out.
>>
>> Same thing below.
> 
> Ok. As usual, I found find that uglier, but I can live with that.
> 
> BTW, having the traces is not a condicio sine qua non for me. They were 
> useful for me to develop the patch, and I left them because I figured 
> that they might also be useful in the future, but we can happily get rid 
> of them.

I'm typically in favor of leaving in traces if they were helpful during 
debugging.

For what it's worth, I'd also be in favor of a "debugstr_node" helper or 
something. Maybe even if it uses a fixed-size buffer, although I'm not 
sure that Matteo or Henri will agree with that ;-)

> 
>>> +    struct hlsl_deref *src = &load->src;
>>> +    struct hlsl_ir_var *var = src->var;
>>> +    struct hlsl_ir_swizzle *swizzle;
>>> +    DWORD s;
>>> +
>>> +    if (var->is_input_semantic || var->is_output_semantic || 
>>> var->is_uniform)
>>> +        return false;
>>
>> For input semantics and uniforms: yeah, but do we need to? The 
>> important question is "can we reconstruct a store for this variable", 
>> and this should already be false.
>>
>> For output semantics, we should never get here in the first place.
> 
> I think you're right. I had put this check initially because I was 
> concentrated on other aspects of the algorithm, but it's true that it is 
> overly cautious.
> 
>>> +
>>> +    if (type->type != HLSL_CLASS_SCALAR && type->type != 
>>> HLSL_CLASS_VECTOR)
>>> +        return false;
>>> +
>>> +    offset = hlsl_offset_from_deref(src);
>>
>> The problem with hlsl_offset_from_deref(), and the reason I probably 
>> should have fought against 62b25bc52b in its current form, is that it 
>> not only doesn't deal with non-constant offsets, but doesn't really 
>> give the caller a way to bail either. We should probably be returning 
>> bool from it, and then aborting here.
> 
> I agree. Actually, the first version of my patch set had a variant of 
> hlsl_offset_from_deref that acknowledged the possibility that it could 
> not be possible to statically determine the offset of a deref. This 
> condition is difficult to handle at code generation time, but it is not 
> particularly problematic here: at load time just ignore the load, and at 
> store time invalidate the whole variable.
> 
> I can totally fix hlsl_offset_from_deref so that this happens.
> 
> (speaking of constant expressions in the code, I am a bit bothered that 
> evaluate_array_dimension is not able to figure out the value of constant 
> expressions that are not immediate constants, and unfortunately fixing 
> that is not as easy as running a fold_constants pass; I don't in mind a 
> clean solution for that that doesn't require re-implementing constant 
> folding there; that's another matter, of course)

Yeah, that one is worrying...

I think we might have to reimplement constant folding. Probably we can 
at least use a common helper, though.

> 
>>> +
>>> +    variable = copy_propagation_get_variable(ctx, state, var);
>>> +    if (!variable)
>>> +        return false;
>>> +
>>> +    new_node = copy_propagation_reconstruct_node(variable, offset, 
>>> type->dimx, indices);
>>> +
>>> +    if (TRACE_ON())
>>> +    {
>>> +        char buf[32];
>>> +        if (!new_node)
>>> +            sprintf(buf, "(nil)");
>>
>> Is this useful to trace?
> 
> As before, it is a matter of tastes, I believe. I can drop it, just as I 
> can drop the traces above.

If it helps, feel free to leave it in. FWIW, I'm mostly talking about 
the "null" case, not the others.

> 
>>> +        else if (new_node->index)
>>> +            sprintf(buf, "@%u", new_node->index);
>>> +        else
>>> +            sprintf(buf, "%p", new_node);
>>
>> Can this happen?
> 
> Yes, just as above.
> 
>>> +    variable = copy_propagation_get_variable(ctx, state, var);
>>> +    if (!variable)
>>> +        return false;
>>> +
>>> +    copy_propagation_set_value(variable, 
>>> hlsl_offset_from_deref(lhs), store->writemask, store->rhs.node);
>>> +
>>> +    return false;
>>> +}
>>
>> Shouldn't this function just return void?
> 
> It's a matter of philosophy. If you ask me, the switch in 
> copy_propagation_recursive should be rather oblivious of what the 
> various helpers actually do, it shouldn't have an insight that 
> copy_propagation_store never returns true. For its point of view, any 
> helper can potentially modify the code. Though I can change to void if 
> you prefer.

It's a fair point, but I guess I look at copy_propagation_recursive() 
and think "recording stores should never make any progress; all the 
magic happens in rewriting loads". This becomes especially true if this 
function gets renamed to copy_prop_record_store() or something, which I 
think it should.

> 
>>> +static bool copy_propagation_recursive(struct hlsl_ctx *ctx, struct 
>>> hlsl_block *block,
>>> +        struct copy_propagation_state *state)
> 
> So, no comment for a function called "recursive" which is not (yet) 
> actually recursive? :-P

I'll just go with "I requested the functions' names be changed anyway" ;-)

I don't particularly mind this being called copy_prop_recurse(), even if 
it doesn't recurse yet, since I think it probably should (even if it's 
just as simple as invalidating the whole state on entry and exit of a CF 
block). On the other hand renaming to copy_prop_transform_block() or 
something would also be reasonable.

> 
>>> @@ -1354,7 +1601,12 @@ int hlsl_emit_dxbc(struct hlsl_ctx *ctx, 
>>> struct hlsl_ir_function_decl *entry_fun
>>>           progress |= transform_ir(ctx, split_struct_copies, body, 
>>> NULL);
>>>       }
>>>       while (progress);
>>> -    while (transform_ir(ctx, fold_constants, body, NULL));
>>> +    do
>>> +    {
>>> +        progress = transform_ir(ctx, fold_constants, body, NULL);
>>> +        progress |= copy_propagation_pass(ctx, body);
>>
>> This probably isn't worth examining, but I'm curious why constant 
>> folding is useful after copy-prop.
> 
> Copy propagation might enable some additional steps of constant folding, 
> if a constant is first stored, than loaded and then an operation is 
> carried on it. OTOH, constant folding can enable some additional step of 
> copy propagation if it makes an offset statically visible.

Ah, of course, that should have been obvious...