[PATCH v2 1/3] vcomp: Implement _vcomp_for_dynamic_init_i8()

Fri Mar 4 11:18:08 CST 2022

On 3/2/22 23:57, David Koller wrote:
> On 2/22/22 16:42, Zebediah Figura wrote:
>>> Creating new separate 64-bit data structures for the 64-bit calls
>>> would necessitate adding a bunch of new and redundant code (to
>>> initialize the new structs, etc.), but would be the most performant.
>>> Increasing the 32-bit members to 64-bit would waste the most memory
>>> and be the least performant for 32-bit calls, but would eliminate
>>> redundant code. My choice of unions is a compromise; new data
>>> structures are unnecessary and the separate 32-bit and 64-bit calls
>>> each access their native data types, which is good for performance.
>>>
>>> Using unions in this way is a traditional approach in OpenMP runtime
>>> implementations. For example, if you look at the code for libgomp, the
>>> most widely used OpenMP implementation, you will see that it uses
>>> unions to handle 32/64-bit variants in a similar manner:
>>>
>>>     https://github.com/gcc-mirror/gcc/blob/1931cbad498e625b1e24452dcfffe02539b12224/libgomp/libgomp.h#L288-L396
>>>
>>
>> Increasing the existing fields to 64-bit uses the exact same amount of
>> memory as using a union. I don't know why libgomp did that, but it seems
>> pointless to me.
> 
> The reason that libgomp (and my submitted patches) use unions is so
> that the OpenMP runtime functions can operate on either 32-bit or 64-bit
> values, as needed. This yields better performance on some architectures;
> for example, the 32-bit functions would run slower if they had to do
> type conversions and operate on 64-bit data types using 32-bit registers.

Okay, I failed to consider that the first time. If it makes a 
difference, I guess...

> 
>> Note that I also asked for tests as this may help determine what the
>> correct approach is. E.g. if mixing 32- and 64-bit types works without a
>> hitch, that suggests that widening the existing fields is the correct
>> approach.
> 
> For each dynamically-scheduled loop, OpenMP-aware compilers will emit
> *either* a series of calls to the 32-bit functions, *or else* a series
> of calls to the 64-bit functions. The same instance of the data structure
> will never be accessed by a mix of 32-bit functions and 64-bit functions.
> I think it is better to have separate union fields, so that the 32-bit
> and 64-bit functions can each access their "native" data types.
> 
>>> Yes, if the top priority was to minimize redundant code, then
>>> refactoring these functions to handle both the 32-bit and 64-bit cases
>>> is possible. However, performance would suffer. If you look at the
>>> other previously-implemented vcomp functions, you'll see that earlier
>>> Wine contributors implemented separate 32-bit and 64-bit versions of
>>> functions for other scheduling modes: _vcomp_for_static_init() and
>>> _vcomp_for_static_init_i8(), _vcomp_for_static_simple_init() and
>>> _vcomp_for_static_simple_init_i8(), etc. My patches follow this same
>>> pre-established pattern.
>>
>> Just from reading the code, it seems that the "static" initializers are
>> something of a different case, since they're writing to fields which are
>> by definition either 32- or 64-bit. If we were to use the same fields,
>> that concern wouldn't seem to apply here, although I may be missing
>> something.
> 
> Consider the two Wine functions _vcomp_for_static_simple_init() and
> _vcomp_for_static_simple_init_i8(). The code for these functions is
> identical, except one operates on 32-bit values, and the other on
> 64-bit values. I assume the original Wine author used duplicative code
> for performance reasons, as did I in my patches that implement the
> dynamic schedule routines.  I could re-write my patches to minimize
> code duplication (e.g. implement the 32-bit functions as wrappers around
> the 64-bit ones), but the result would be lower performance due to the
> type conversions, which is important in performance-sensitive code
> such as an OpenMP implementation.
> 
> To summarize, my vcomp patches use unions and some redundant code so
> that both the 32-bit and 64-bit functions have optimal performance, and
> are consistent with previously-implemented Wine vcomp routines. As you've
> pointed out, the same functionality could be achieved by widening the
> data structures to 64-bit and re-factoring the code; this would reduce
> duplicate code (DRY), but would also reduce the 32-bit performance.
> I prefer things the way I submitted them, but I respect your years of
> experience with the Wine codebase and will happily edit and resubmit my
> patches if you think DRY is more important than optimal performance. The
> main goal is to improve support for the many OpenMP applications that
> have been unusable for years due to Wine's incomplete and buggy vcomp
> implementation.

Well, I don't have experience in vcomp at all, and I don't think I have 
the ability to make anything other than suggestions here. In general we 
tend to take a defensive posture, since applications too frequently 
abuse Windows APIs and do things that aren't documented, but if the 
relevant code is only ever generated by a higher-level compiler, I guess 
that concern wouldn't exist.