[PATCH] d3dx9: Improve performance and memory usage in preshader constants setting.

Tue May 31 04:48:23 CDT 2016

On 05/31/2016 12:24 AM, Matteo Bruni wrote:
>
>> +struct d3dx_const_copy_state
> I'm not entirely happy with the name, maybe
> d3dx_const_param_eval_output or something in that fashion?
I will rename to d3dx_const_param_eval_output.
>
>>                        unsigned int out;
>> +                        unsigned int *in = (unsigned int *)param->data + set_state.param_offset + j;
>> +                        switch (table_info[set_state.table].type)
>> +                        {
>> +                            case PRES_VT_FLOAT: set_number(&out, D3DXPT_FLOAT, in, param->type); break;
>> +                            case PRES_VT_INT: set_number(&out, D3DXPT_INT, in, param->type); break;
>> +                            case PRES_VT_BOOL: set_number(&out, D3DXPT_BOOL, in, param->type); break;
>> +                            default:
>> +                                FIXME("Unexpected type %#x.\n", table_info[set_state.table].type);
>> +                                break;
>> +                        }
>> +                        regstore_set_values(rs, set_state.table, &out, set_state.table_offset + j, 1);
>> +                    }
>> +                }
>> +                set_state.param_offset += count;
>> +                set_state.table_offset += count;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +}
> This seems overly complicated and fragile.
> Instead of this kind of "5 instructions bytecode" isn't it easier to
> just store and then lookup a struct with all the info you need? 
Keeping a struct with offsets & count altogether is of course more
straightforward, but I intentionally did it this way to save a lot of
extra space and to introduce an easier coalescing. When there is a
matrix transpose involved (which happens quite often in real apps as far
as I have seen) the whole structure would be duplicated for every matrix
element. In the current approach each element for transposed matrix has
just one offset (destination offset changes sequentially and does not
come to the array for each matrix element) and one count of 1. The other
way to optimize this is to introduce a separate operation for transposed
matrix copy, but will it really be simpler or nicer than it is now?
Current approach is indeed a sort of encoding the data copy into a very
simple bytecode. Keeping the whole struct and optimizing as a separate
op for matrix transpose would be the same in principle, but I am afraid
will result in a bit more code and a bit more mess as there will be more
cases to handle. I thought the current approach is simpler as the data
copy is encoded in a generic and simple way so on the actual copy I do
not have to care for any specifics of initial parameter & constant
layout, as well as in process of extending the array & coalescing.
> I
> guess the end result would be something akin to storing an array of
> struct d3dx_const_copy_state in place of struct d3dx_const_param_set.
> I don't know how the code would exactly look like but I would start by
> moving building all the struct d3dx_const_copy_param to init time.
I am not sure I understand what you mean here, could you please clarify?
In the variant currently suggested the whole building of structure
responsible for parameter copy is already done solely at init time.
Shader/preshader full constant structure is not stored anymore. There is
still an array of parameters and constant description left which is
currently used in effect.c code for setting sampler states. This could
easily be optimized further by not storing the parameters other than
samplers and storing just register index instead of the whole constant
descs for them, but this looked as a separate step to me not fully
related to this patch. Or probably I am misunderstanding your point here.

> BTW
> the coalescing you currently do in add_const_set() could be
> potentially done as a separate pass after you generate all the
> "d3dx_const_copy_state-like" structures (but still at init time),
> which might be simpler.
It can be done as a separate pass, but why? In my understanding it will
result just in bigger intermediate array and a few more lines of code. I
will need to do all the same but not at once when getting an "copy
request" but from scanning the pre-stored array. Or am I missing something?

> Whitespace.
>> @@ -931,21 +1107,17 @@ static HRESULT set_constants_param(struct d3dx_regstore *rs, struct d3dx_const_t
>>      major_stride = max(minor, table_info[table].reg_component_count);
>>      n = min(major * major_stride,
>>              desc.RegisterCount * table_info[table].reg_component_count + major_stride - 1) / major_stride;
>> +
>>      for (i = 0; i < n; ++i)
>>      {
>>          for (j = 0; j < minor; ++j)
>>          {
>> -            unsigned int out;
>> -            unsigned int *in;
>>              unsigned int offset;
>>
>>              offset = start_offset + i * major_stride + j;
>> -            if (offset / table_info[table].reg_component_count >= rs->table_sizes[table])
>> -            {
>> -                if (table_info[table].reg_component_count != 1)
>> -                    FIXME("Output offset exceeds table size, name %s, component %u.\n", desc.Name, i);
>> +            if ((offset - start_offset) / table_info[table].reg_component_count >= desc.RegisterCount)
>>                  break;
> Maybe keeping a WARN or something makes sense here?
>
This break is just normal for boolean register set (it is covered by
test case) and should not happen for vec4 register sets. This is because
the inner loop can never be bound by desc.RegisterCount for vec4
registers (as nminor <= 4 and limiting just outer 'n' loop counter is
sufficient), but can be bound for boolean single value registers.

I will get it back in the same way it was before (FIXME for vec4
register sets and nothing for boolean).