[PATCH 1/2] winegstreamer: Add helper for GstCaps <-> IMFMediaType conversion.

Fri Mar 27 11:32:07 CDT 2020

On 3/27/20 10:05 AM, Derek Lesho wrote:
> 
> On 3/26/20 4:56 PM, Zebediah Figura wrote:
>> There's another broad question I have with this approach, actually,
>> which is fundamental enough I have to assume it's at least had some thought
>> put into it, but it would be nice if that discussion happened in a more
>> public place, and was justified in the patches sent.
>>
>> Essentially, the question is: what if we were to use decodebin directly?
>>
>> As I understand (and admittedly Media Foundation is far more complex
>> than I could hope to understand) an application which just calls
>> IMFSourceResolver methods just needs to get back a working
>> IMFMediaSource, and we could wrap decodebin with one of those, similar
>> to the quartz wrapper.
> The most basic applications (games) seem to either use a source reader 
> or simple sample grabber media session to get their raw samples.  If you 
> want to add a hack for using decodebin, you can easily add a special 
> source type, and for the media source of that type, just make a 
> decodebin element instead of searching for a demuxer.  In this case, the 
> source reader wouldn't search for a decoder since the output type set by 
> the application would be natively supported by the source.  Then, as 
> part of the hack, just always yield that source type in the source 
> resolver.  This is completely incorrect and probably shouldn't make its 
> way into mainline, IMO.  Also, I have reason to believe it may break 
> Unity3D, as they do look at the native media types supported by the 
> source, and getting around this would require adding some hackery in the 
> source reader.

My assertion is this isn't really a "hack". This is something that's
reasonable to do, and that fits within the design of Media Foundation.
It's changing the implementation details, not the API contract. We have
the freedom to do that.

>>
>> First of all, this is something I think we want to do anyway. Microsoft
>> has no demuxer for, say, Vorbis (at least, there's not one registered on
>> my Windows 10 machine), but I think that we want to be able to play back
>> Vorbis files anyway (in, say, a Win32 media player application).
> I'm pretty sure our goal is not to extend Windows functionality.

Actually, I'd assert the opposite. Host integration has always been a
feature of Wine, not a bug. That goes beyond just mapping program
launcher entries to .desktop files; it includes things like:

* mapping host devices to DOS drives,
* allowing unix paths to be used in file system functions,
* exposing the unix file system as a shell folder,
* making winebrowser the default browser (instead of explorer),
* exposing public Wine-specific exports from ntdll (those not prefixed
with a double underscore),
* making use of host credentials in advapi32 (on Mac, anyway),
* exposing host GStreamer and QuickTime codecs in DirectShow.

We extend host functionality to integrate with the system, and to make
using Wine easier. Using host codecs from mfplat does both.

>>   Instead
>> of writing yet another source for vorbis,
> You don't "write another source", you just expose a new source object 
> and link it with a new source_desc structure, which specifies the mime 
> type of the container format: 
> https://github.com/Guy1524/wine/blob/mfplat_rebase/dlls/winegstreamer/media_source.c#L25
>>   and for each other obscure
>> format, we just write one generic decodebin wrapper.
> Not to mention, you'd have to perform this step with a decodebin wrapper 
> anyway.

The amount of abstraction, and the amount of actual code you have to
add, is beside the point, but it's also not quite as simple as you make
out there:

* First and foremost, we also need to add caps conversion functions,
since vorbisparse doesn't output raw audio, and we need to be able to
feed it through vorbisdec afterwards.

* Also, I'm guessing you haven't dealt with "always" pads yet;
vorbisparse doesn't send "no-more-pads".

* In the case that elements get added, removed, or changed from upstream
GStreamer, we have to reflect that here.

By contrast, the amount of code we have to add to deal with a new format
when using decodebin is *exactly zero*. We don't actually have to write
"audio/x-vorbis" anywhere in our code. After all, we don't write it
anywhere in quartz, and yet Vorbis still works. (If an application were
to ask what the stream type is—and I doubt any do—we report it as
MEDIATYPE_Stream, MEDIASUBTYPE_Gstreamer).
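
To illustrate the difference, here is a rough sketch of the lookup I have
in mind. It is not the code from either tree: the struct layout, the field
names, and element_for_mime_type() are all made up for the example, and
only the element names (qtdemux, asfdemux, decodebin) are real GStreamer
elements. The point is just that the explicit table grows with every
format, while the fallback does not.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-format source descriptor; the field names
 * are illustrative and not taken from the actual winegstreamer code. */
struct source_desc
{
    const char *mime_type;   /* container MIME type handled by this source */
    const char *gst_element; /* GStreamer element that would parse it */
};

/* Every explicitly supported container needs its own entry here. */
static const struct source_desc source_descs[] =
{
    {"video/mp4",      "qtdemux"},
    {"video/x-ms-asf", "asfdemux"},
};

/* Returns the element to instantiate for a MIME type. Unknown types fall
 * back to decodebin, which autoplugs whatever the host GStreamer supports. */
static const char *element_for_mime_type(const char *mime_type)
{
    size_t i;

    for (i = 0; i < sizeof(source_descs) / sizeof(source_descs[0]); ++i)
    {
        if (!strcmp(source_descs[i].mime_type, mime_type))
            return source_descs[i].gst_element;
    }
    return "decodebin";
}
```

With this shape, an Ogg file needs no entry at all: the lookup falls
through to decodebin, and nothing in our code has to mention Vorbis.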

>>
>> Second of all, the most obvious benefit, at least while looking at these
>> patches, is that you now don't need to write caps <-> IMFMediaType
>> conversion for every type on the planet.
> I don't see this as a problem, most games I've seen will use either 
> H.264 or WMV, and adding new formats isn't that difficult.  You look at 
> the caps exposed by the gstreamer demuxer, find the equivalent 
> attributes in media foundation, and fill in the gaps.  In return you get 
> correct behavior, and a source that can be paired with a correctly 
> written MFT from outside of the wine source.

This is basically true until it isn't. And it already isn't true if we
want to support host codecs. An "add it when we need it" approach is
going to be hell on media players.

I also think you're kind of underestimating the cost here. I don't like
making LoC arguments, but your code to deal with those caps is something
like 370 LoC, maybe 350 LoC with some deduplication. There's also the
developer cost of looking up what GStreamer caps values mean (which
usually requires looking at the source), looking up the Media Foundation
attributes, testing them to ensure that the conversion is correct,
figuring out how to deal with caps that either GStreamer or Media
Foundation can't handle...
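
As a very simplified picture of what that per-format work looks like,
consider the sketch below. It is an assumption-laden toy, not your patch:
struct caps_mapping and subtype_for_caps() are invented for the example,
subtypes are represented by the names of their GUIDs instead of actual
GUIDs, and a single "version" integer stands in for the wmvversion /
mpegversion caps fields. Real conversion code has to look at many more
fields than this.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: real code would set GUIDs and attributes on an
 * IMFMediaType and inspect many more caps fields (profile, stream-format,
 * codec data, ...). */
struct caps_mapping
{
    const char *caps_name;  /* GStreamer caps structure name */
    int version;            /* wmvversion / mpegversion; 0 means "ignore" */
    const char *mf_subtype; /* name of the Media Foundation subtype GUID */
};

/* One row per compressed format we want to expose; this is the part that
 * has to grow every time a new codec needs to be supported. */
static const struct caps_mapping caps_mappings[] =
{
    {"video/x-h264", 0, "MFVideoFormat_H264"},
    {"video/x-wmv",  1, "MFVideoFormat_WMV1"},
    {"video/x-wmv",  3, "MFVideoFormat_WMV3"},
    {"audio/mpeg",   4, "MFAudioFormat_AAC"},
};

/* Returns the subtype name for the given caps, or NULL if we have not
 * written a mapping for that format yet. */
static const char *subtype_for_caps(const char *caps_name, int version)
{
    size_t i;

    for (i = 0; i < sizeof(caps_mappings) / sizeof(caps_mappings[0]); ++i)
    {
        const struct caps_mapping *m = &caps_mappings[i];

        if (!strcmp(m->caps_name, caps_name)
                && (!m->version || m->version == version))
            return m->mf_subtype;
    }
    return NULL;
}
```

Anything not in the table, e.g. "audio/x-vorbis", comes back as NULL, and
that is exactly the case a media player is going to hit.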

>>   Another benefit is that you let
>> all of the decoding happen within a single GStreamer pipeline, which is
>> probably better for performance.
> I have applications working right now with completely acceptable 
> performance, and we are still copying every uncompressed sample an extra 
> time, which we may be able to optimize away.  Copying compressed 
> samples, on the other hand, is not that big of a deal at all.

I don't doubt it works regardless. DirectShow did too, back before I got
rid of the transforms. It's also not the main reason I'm proposing this.

On the other hand, decreasing CPU usage is also nice.

Another thing that occurred to me is, letting everything happen in one
GStreamer pipeline is nice for debugging.

>>   You also can simplify your
>> postprocessing step to adding a single videoconvert and audioconvert,
>> instead of having to manually (or semi-manually) add e.g. an h264 parser
>> element.
> It isn't manual, we find a parser which corrects the caps.  And as I 
> mentioned in an earlier email, we could also use caps negotiation for this, 
> all the setup is in place.

Hence "semi-manually". You still have to manually fix the caps so that
the element will be added.

>>   These are some of the benefits I had in mind when removing the
>> GStreamer quartz transforms.
>>
>> Even in the case where the application manually creates e.g. an MPEG-4
>> source, my understanding is it's still the source's job to automatically
>> append transforms to match the requested type.
> It's not the source's job at all.  On Windows, where sources are 
> purpose-built, they apply no transformations to the types they get, 
> their goal is only to get raw sample data from a container / stream.  
> It's the job of the media session, or source reader to apply transforms 
> when needed.

I see, I confused the media source with the source reader. I guess that
argument isn't valid, but I don't think it really affects my conclusion.

>>   We'd just be moving that
>> from the mfplat level to the gstreamer level—i.e. let decodebin select
>> the 'transforms' needed to convert to raw video and audio.
> The media session and source reader shouldn't be affected by 
> winegstreamer details.  If a user/an application decides to install a 
> third party decoder, we still need the infrastructure in place for this 
> to function.
>>
>> It obviously wouldn't match native structure, but it's not clear to me
>> that it would fail to match native in a way that would cause problems.
>> Judging from my experience with quartz, most applications aren't going
>> to care how their media is decoded as long as they get raw samples out
>> of it.
> Most games, or most applications?  Chromium uses media foundation in a 
> much more granular way.

Yes, most applications.

What does Chromium do?

>>   Only a select few build the graph manually because they don't
>> realize that they can autoplug, or make assumptions about which filters
>> will be present once autoplugging is done, and some of those even fall
>> back to autoplugging if their preferred method fails. Maybe the
>> situation is different with mfplat, but given that there is a way to let
>> mfplat figure out which sources and transforms to use, I'm gonna be
>> really surprised if most applications aren't using it.
>>
>> If you do come across an application that requires we mimic native's
>> specific arrangement of sources and transforms, it seems to me it
>> wouldn't require that much effort to swap a different parser in for
>> decodebin, and to implement the necessary bits in the media type
>> conversion functions. Ultimately I suspect it'd be less work to have a
>> decodebin wrapper + specific sources for applications that require them,
>> than to manually implement every source and transform.
> The current solution isn't very manual, and, as I mentioned earlier in 
> this email, you also can construct a decodebin wrapper source using the 
> infrastructure which is available.  And in general terms, I think it's 
> more work to maintain a solution that doesn't match up to windows, as we 
> now have to think of all these edge cases and how to work around them.

What edge cases do you mean?

>>
> On 3/26/20 8:07 PM, Zebediah Figura wrote:
>> While I await your more complete response, I figure I might as well
>> clarify some things.
>>
>> I don't think that "doing the incorrect thing", i.e. failing to exactly
>> emulate Windows, should necessarily be considered bad in itself, or at
>> least not nearly as bad as all that.
>>
>> My view, and my understanding of the Wine project's view in general as
>> informed by its maintainers, is that emulating Windows is desirable for
>> public documented behaviour (obviously), for undocumented behaviour that
>> applications rely on (also obviously), for undocumented or
>> semi-documented behaviour where there's no difference otherwise and
>> where the native thing to do is obvious (e.g. the name of an internal
>> registry key).
> In my view, when completely incorrect behavior is only a few function 
> calls away, that's not acceptable.  The media source is a well 
> documented public interface, and doing something different instead is 
> just asking for trouble.

The media source is a documented public interface, but *which* media
source is returned from IMFSourceResolver is not documented or
guaranteed, and which transforms are returned from the source reader is
also not guaranteed.

Using decodebin is not "completely incorrect", and emulating Windows'
specific arrangement of sources and transforms is not "a few function
calls away". It's several hundred lines of code to do caps conversion,
the entire transform object (which, to be sure, we might need *anyway*,
but also might not), and it means more work every time we have to deal
with a new codec.

>>
>> But there's not really a reason to emulate Windows otherwise. And in a
>> case like this, where there's a significant benefit to not emulating
>> Windows exactly, the only reason I see is "an application we don't know
>> yet *might* depend on it". When faced with such a risk, I weigh the
>> probability of that happening—and on the evidence of DirectShow
>> applications, I see that as low—with the cost of having to change
>> design—which also seems low to me; I can say from experience (c.f.
>> 5de712b5d) that swapping out a specific demuxer for decodebin isn't very
>> difficult.
> The converse of this is also true, if you want to quickly experiment 
> with some gstreamer codec that we don't support yet, you just perform 
> the hack I mentioned earlier, and then after you get it working you make 
> it correct by adding the necessary gstreamer caps. Another hack we could 
> use is to serialize the compressed caps, and throw them in a 
> MF_MT_USER_DATA attribute, and hope that an application never looks.  

Sure. But I'm willing to assert that one of these things is more likely
than the other. I'm prepared to eat my words if proven wrong.

> But as I mentioned earlier, I don't think the amount of work required 
> for adding a new media type is excessive.  Microsoft only ships a 
> limited number of sources and decoders, they fit on a single page: 
> https://docs.microsoft.com/en-us/windows/win32/medfound/supported-media-formats-in-media-foundation 
> , so it's not like we'll be adding new types for years to come.

That's seven demuxers and sixteen transforms, which is still kind of a
lot. It also, unsurprisingly, isn't every format that Windows supports;
just looking at my Windows 7 VM I see also NSC and LPCM, and a much
longer list of transforms.

And it doesn't take into account host codecs.

>>
>> Not to mention that what we're doing is barely "incorrect". Media
>> Foundation is an API that's specifically meant to be extended in this
>> way.
> I don't think Microsoft ever meant for an application to make a media 
> source that decodes compressed content, the source reader and media 
> session exist for a reason.

I don't think they specifically meant for an application *not* to do
that. It fits within the design of Media Foundation. The reason that
transforms exist—in any media API—is because different containers can
hold the same video or audio codec. GStreamer can already deal with that.

>>   For that matter, some application could easily register its own
>> codec libraries on Windows with a higher priority than the native ones
>> (this happened with DirectShow); that's essentially no different than
>> what I'm suggesting.
> Yes, but even in that case, I assume they will still follow the basic 
> concept of what a source is and is not.

I wouldn't necessarily assert that. A codec library—like GStreamer—might
have its own set of transforms and autoplugging code. Easier to reuse
that internally than to try to integrate it with every new decoding API
that Microsoft releases.

>>
>> I think the linked commit misses the point somewhat. That's partially
>> because I don't think it makes sense to measure simplicity as an
>> absolute metric simply using line count,
> It's not just line count, the code itself is very simple, all we are 
> doing is registering the supported input and output types of the 
> decoder, setting the mime type of the container format for the source, 
> and registering both objects.
>>   and partially because it's
>> missing the cost of adding other media types to the conversion functions
> You can use the MF_MT_USER_DATA serialization hack if you're worried 
> about that.

Unless you're proposing we use that in Wine, that doesn't affect anything.

>> (which is one of the reasons, though not the only reason, I thought to
>> write this mail). But it's mostly because the cost of using decodebin,
>> where it works, is essentially zero:
> Except in the cases where an application does something unexpected.

In which case the cost is still no more than the cost of not using
decodebin.

>>   we write one media source, and it
>> works for everything; no extension for ASF required.
> There already is only one real implementation of the media source, the 
> only "extension" is adding the mime type instead of using typefind.  We 
> will register the necessary byte stream handlers no matter which path we 
> take.

Well, ideally we'd do what quartz does, and register a handler that
catches every file, and returns a subtype that essentially identifies
GStreamer.

>>   If it never becomes
>> necessary to write a source that outputs compressed samples, then we
>> also don't have the cost of abstraction (which is always worth taking
>> seriously!), and if it does, we come out even—we can still use your
>> generic media source, or something like it.
>>
>> Ultimately, I think that a decodebin wrapper is something we want to
>> have anyway, for the sake of host codecs like Theora,
> Where would we use support for Theora, if no Windows applications are 
> able to use it?

Anything which wants to be able to play back an arbitrary media file,
i.e. generic media players, mostly. I see all sorts of bug reports for
these with Quartz, so people are definitely using them.

>>   and once we have
>> it, I see zero cost in using it wherever else we can.