[PATCH v2 1/3] vcomp: Implement _vcomp_for_dynamic_init_i8()

Wed Mar 2 23:57:16 CST 2022

On 2/22/22 16:42, Zebediah Figura wrote:
>> Creating new separate 64-bit data structures for the 64-bit calls
>> would necessitate adding a bunch of new and redundant code (to
>> initialize the new structs, etc.), but would be the most performant.
>> Increasing the 32-bit members to 64-bit would waste the most memory
>> and be the least performant for 32-bit calls, but would eliminate
>> redundant code. My choice of unions is a compromise; new data
>> structures are unnecessary and the separate 32-bit and 64-bit calls
>> each access their native data types, which is good for performance.
>> 
>> Using unions in this way is a traditional approach in OpenMP runtime
>> implementations. For example, if you look at the code for libgomp, the
>> most widely used OpenMP implementation, you will see that it uses
>> unions to handle 32/64-bit variants in a similar manner:
>> 
>>    https://github.com/gcc-mirror/gcc/blob/1931cbad498e625b1e24452dcfffe02539b12224/libgomp/libgomp.h#L288-L396
>> 
>
> Increasing the existing fields to 64-bit uses the exact same amount of
> memory as using a union. I don't know why libgomp did that, but it seems
> pointless to me.

The reason that libgomp (and my submitted patches) use unions is so
that the OpenMP runtime functions can operate on either 32-bit or 64-bit
values, as needed. This yields better performance on some architectures;
for example, the 32-bit functions would run slower if they had to do
type conversions and operate on 64-bit data types using 32-bit registers.

> Note that I also asked for tests as this may help determine what the
> correct approach is. E.g. if mixing 32- and 64-bit types works without a
> hitch, that suggests that widening the existing fields is the correct
> approach.

For each dynamically-scheduled loop, OpenMP-aware compilers will emit
*either* a series of calls to the 32-bit functions, *or else* a series
of calls to the 64-bit functions. The same instance of the data structure
will never be accessed by a mix of 32-bit functions and 64-bit functions.
I think it is better to have separate union fields, so that the 32-bit
and 64-bit functions can each access their "native" data types.

>> Yes, if the top priority was to minimize redundant code, then
>> refactoring these functions to handle both the 32-bit and 64-bit cases
>> is possible. However, performance would suffer. If you look at the
>> other previously-implemented vcomp functions, you'll see that earlier
>> Wine contributors implemented separate 32-bit and 64-bit versions of
>> functions for other scheduling modes: _vcomp_for_static_init() and
>> _vcomp_for_static_init_i8(), _vcomp_for_static_simple_init() and
>> _vcomp_for_static_simple_init_i8(), etc. My patches follow this same
>> pre-established pattern.
>
> Just from reading the code, it seems that the "static" initializers are
> something of a different case, since they're writing to fields which are
> by definition either 32- or 64-bit. If we were to use the same fields,
> that concern wouldn't seem to apply here, although I may be missing
> something.

Consider the two Wine functions _vcomp_for_static_simple_init() and
_vcomp_for_static_simple_init_i8(). The code for these functions is
identical, except one operates on 32-bit values, and the other on
64-bit values. I assume the original Wine author used duplicative code
for performance reasons, as did I in my patches that implement the
dynamic schedule routines.  I could re-write my patches to minimize
code duplication (e.g. implement the 32-bit functions as wrappers around
the 64-bit ones), but the result would be lower performance due to the
type conversions, which is important in performance-sensitive code
such as an OpenMP implementation.

To summarize, my vcomp patches use unions and some redundant code so
that both the 32-bit and 64-bit functions have optimal performance, and
are consistent with previously-implemented Wine vcomp routines. As you've
pointed out, the same functionality could be achieved by widening the
data structures to 64-bit and re-factoring the code; this would reduce
duplicate code (DRY), but would also reduce the 32-bit performance.
I prefer things the way I submitted them, but I respect your years of
experience with the Wine codebase and will happily edit and resubmit my
patches if you think DRY is more important than optimal performance. The
main goal is to improve support for the many OpenMP applications that
have been unusable for years due to Wine's incomplete and buggy vcomp
implementation.