[PATCH 1/2] winegstreamer: Add helper for GstCaps <-> IMFMediaType conversion.

Derek Lesho dlesho at codeweavers.com
Fri Mar 27 13:08:21 CDT 2020


On 3/27/20 11:32 AM, Zebediah Figura wrote:

>
> On 3/27/20 10:05 AM, Derek Lesho wrote:
>> On 3/26/20 4:56 PM, Zebediah Figura wrote:
>>> There's another broad question I have with this approach, actually,
>>> which is fundamental enough I have to assume it's had some thought
>>> put into it, but it would be nice if that discussion happened in a more
>>> public place, and was justified in the patches sent.
>>>
>>> Essentially, the question is: what if we were to use decodebin directly?
>>>
>>> As I understand (and admittedly Media Foundation is far more complex
>>> than I could hope to understand) an application which just calls
>>> IMFSourceResolver methods just needs to get back a working
>>> IMFMediaSource, and we could wrap decodebin with one of those, similar
>>> to the quartz wrapper.
>> The most basic applications (games) seem to either use a source reader
>> or simple sample grabber media session to get their raw samples.  If you
>> want to add a hack for using decodebin, you can easily add a special
>> source type, and for the media source of that type, just make a
>> decodebin element instead of searching for a demuxer.  In this case, the
>> source reader wouldn't search for a decoder since the output type set by
>> the application would be natively supported by the source.  Then, as
>> part of the hack, just always yield that source type in the source
>> resolver.  This is completely incorrect and probably shouldn't make its
>> way into mainline, IMO.  Also, I have reason to believe it may break
>> Unity3D, as they do look at the native media types supported by the
>> source, and getting around this would require adding some hackery in the
>> source reader.
> My assertion is this isn't really a "hack".
I think that if you have to modify media foundation code to work around 
shortcuts in winegstreamer, it can be classified as a hack.  It is 
probable that most games will work with it, but I think it makes more 
sense as a staging enhancement.
>   This is something that's
> reasonable to do, and that fits within the design of Media Foundation.
I have a hard time subscribing to the idea that this is within the 
design of media foundation.  I took a look on GitHub, and a good number 
of applications find the streams they want using the subtype from the 
source reader's GetNativeMediaType.  If we were to output uncompressed 
types, this would break.  To work around this, we'd either have to 
expose incorrect media types on our streams and add an exception to the 
decoder-finding behavior in the source reader and topology loader, or 
expose some private interface for getting the true native types.  And in 
either case, we'd still have to convert caps for compressed media types.
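
For illustration, the pattern I'm talking about looks roughly like this 
(hypothetical application code, not from the patch set):

#define COBJMACROS
#include <windows.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>

/* Hypothetical application-side check: pick a stream by its *compressed*
 * subtype.  If the source exposed only uncompressed types, this would
 * never match. */
static HRESULT stream_is_h264(IMFSourceReader *reader, DWORD stream, BOOL *is_h264)
{
    IMFMediaType *native_type;
    GUID subtype;
    HRESULT hr;

    /* media type index 0: the stream's preferred native type */
    hr = IMFSourceReader_GetNativeMediaType(reader, stream, 0, &native_type);
    if (FAILED(hr)) return hr;

    hr = IMFMediaType_GetGUID(native_type, &MF_MT_SUBTYPE, &subtype);
    if (SUCCEEDED(hr))
        *is_h264 = IsEqualGUID(&subtype, &MFVideoFormat_H264);

    IMFMediaType_Release(native_type);
    return hr;
}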
> It's changing the implementation details, not the API contract. We have
> the freedom to do that.


>
>>> First of all, this is something I think we want to do anyway. Microsoft
>>> has no demuxer for, say, Vorbis (at least, there's not one registered on
>>> my Windows 10 machine), but I think that we want to be able to play back
>>> Vorbis files anyway (in, say, a Win32 media player application).
>> I'm pretty sure our goal is not to extend windows functionality.
> Actually, I'd assert the opposite. Host integration has always been a
> feature of Wine, not a bug. That goes beyond just mapping program
> launcher entries to .desktop files; it includes things like:
>
> * mapping host devices to DOS drives,
> * allowing unix paths to be used in file system functions,
> * exposing the unix file system as a shell folder,
> * making winebrowser the default browser (instead of explorer),
> * exposing public Wine-specific exports from ntdll (those not prefixed
> with a double underscore),
> * making use of host credentials in advapi32 (on Mac, anyway),
> * exposing host GStreamer and QuickTime codecs in DirectShow.
>
> We extend host functionality to integrate with the system, and to make
> using Wine easier. Using host codecs from mfplat does both.
I'm unsure why anyone would want to use a Windows media player over 
something like VLC.  But as I mentioned earlier, it is possible to add a 
hack using decodebin with minimal effort, and we could possibly use this 
hack only as a fallback when the container doesn't have a registered 
byte stream handler.  I think we would get the best of both worlds with 
this solution.
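
Roughly what I have in mind for that fallback (a sketch only; the 
container_mime_type/demuxer_for_mime_type helpers are made-up stand-ins 
for the byte stream handler lookup):

#include <gst/gst.h>

/* Hypothetical helpers standing in for winegstreamer's handler lookup. */
extern const char *container_mime_type(const char *uri);
extern GstElement *demuxer_for_mime_type(const char *mime);

static GstElement *create_demuxer_with_fallback(const char *uri)
{
    const char *mime = container_mime_type(uri);
    GstElement *element = NULL;

    /* Preferred path: a specific demuxer for a container that Windows
     * supports natively, so the source can expose the real compressed
     * types. */
    if (mime)
        element = demuxer_for_mime_type(mime);

    /* Fallback: let decodebin autoplug a demuxer (and decoders) for
     * containers without a registered byte stream handler. */
    if (!element)
        element = gst_element_factory_make("decodebin", NULL);

    return element;
}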
>
>>>    Instead
>>> of writing yet another source for vorbis,
>> You don't "write another source", you just expose a new source object
>> and link it with a new source_desc structure, which specifies the mime
>> type of the container format:
>> https://github.com/Guy1524/wine/blob/mfplat_rebase/dlls/winegstreamer/media_source.c#L25
>>>    and for each other obscure
>>> format, we just write one generic decodebin wrapper.
>> Not to mention, you'd have to perform this step with a decodebin wrapper
>> anyway.
> The amount of abstraction, and the amount of actual code you have to
> add, is beside the point, but it's also not quite as simple as you make
> out there:
>
> * First and foremost, we also need to add caps conversion functions,
> since vorbisparse doesn't output raw video, and we need to be able to
> feed it through theoradec afterwards.
You need that anyway: Chromium manually creates H.264 encoder and 
decoder instances and uses them without anything from the control 
layer.  Because of this, we will at least need to keep the 
mediatype->caps conversion function for compressed types.
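
For a compressed type like H.264, that direction of the helper would 
look something like this (sketch only; the actual patch handles more 
attributes):

#define COBJMACROS
#include <windows.h>
#include <gst/gst.h>
#include <mfapi.h>
#include <mfidl.h>

static GstCaps *caps_from_h264_media_type(IMFMediaType *type)
{
    UINT64 frame_size = 0, frame_rate = 0;
    GstCaps *caps;

    /* MF packs width/height and framerate num/den into single UINT64s. */
    IMFMediaType_GetUINT64(type, &MF_MT_FRAME_SIZE, &frame_size);
    IMFMediaType_GetUINT64(type, &MF_MT_FRAME_RATE, &frame_rate);

    /* stream-format byte-stream is an assumption here; the real helper
     * would have to decide this from the media type. */
    caps = gst_caps_new_simple("video/x-h264",
            "stream-format", G_TYPE_STRING, "byte-stream",
            "width",  G_TYPE_INT, (gint)(frame_size >> 32),
            "height", G_TYPE_INT, (gint)(frame_size & 0xffffffff),
            NULL);

    if (frame_rate)
        gst_caps_set_simple(caps, "framerate", GST_TYPE_FRACTION,
                (gint)(frame_rate >> 32), (gint)(frame_rate & 0xffffffff),
                NULL);

    return caps;
}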
>
> * Also, I'm guessing you haven't dealt with "always" pads yet;
> vorbisparse doesn't send "no-more-pads".
That would be even easier to support.
>
> * In the case that elements get added, removed, or changed from upstream
> GStreamer, we have to reflect that here.
Elaborate?
>
> By contrast, the amount of code we have to add to deal with a new format
> when using decodebin is *exactly zero*. We don't actually have to write
> "audio/x-vorbis" anywhere in our code.
Okay, adding that path as a fallback makes a lot of sense then, since we 
still have full ability to fix compatibility issues with types that are 
natively supported in Windows.
>   After all, we don't write it
> anywhere in quartz, and yet Vorbis still works. (If an application were
> to ask what the stream type is—and I doubt any do—we report it as
> MEDIATYPE_Stream, MEDIASUBTYPE_Gstreamer).
>
>>> Second of all, the most obvious benefit, at least while looking at these
>>> patches, is that you now don't need to write caps <-> IMFMediaType
>>> conversion for every type on the planet.
>> I don't see this as a problem, most games I've seen will use either
>> H.264 or WMV, and adding new formats isn't that difficult.  You look at
>> the caps exposed by the gstreamer demuxer, find the equivalent
>> attributes in media foundation, and fill in the gaps.  In return you get
>> correct behavior, and a source that can be paired with a correctly
>> written MFT from outside of the wine source.
> This is basically true until it isn't. And it already isn't true if we
> want to support host codecs. An "add it when we need it" approach is
> going to be hell on media players.
>
> I also think you're kind of underestimating the cost here. I don't like
> making LoC arguments, but your code to deal with those caps is something
> like 370 LoC, maybe 350 LoC with some deduplication.
As mentioned earlier in the email, the IMFMediaType->caps path will 
always be necessary to support the decoder transforms, which real 
applications do use on their own.
>   There's also the
> developer cost of looking up what GStreamer caps values mean (which
> usually requires looking at the source), looking up the Media Foundation
> attributes, testing them to ensure that the conversion is correct,
> figuring out how to deal with caps that either GStreamer or Media
> Foundation can't handle...
>
>>>    Another benefit is that you let
>>> all of the decoding happen within a single GStreamer pipeline, which is
>>> probably better for performance.
>> I have applications working right now with completely acceptable
>> performance, and we are still copying every uncompressed sample an extra
>> time, which we may be able to optimize away.  Copying compressed
>> samples, on the other hand, is not that big of a deal at all.
> I don't doubt it works regardless. DirectShow did too, back before I got
> rid of the transforms. It's also not the main reason I'm proposing this.
>
> On the other hand, decreasing CPU usage is also nice.
How would this reduce CPU usage?
>
> Another thing that occurred to me is, letting everything happen in one
> GStreamer pipeline is nice for debugging.
I disagree; decodebin adds complexity to the pipeline that isn't 
otherwise necessary, like typefind.
>
>>>    You also can simplify your
>>> postprocessing step to adding a single videoconvert and audioconvert,
>>> instead of having to manually (or semi-manually) add e.g. an h264 parser
>>> element.
>> It isn't manual, we find a parser which corrects the caps.  And as I
>> mentioned in earlier email, we could also use caps negotiation for this,
>> all the setup is in place.
> Hence "semi-manually". You still have to manually fix the caps so that
> the element will be added.
As mentioned, we will need this regardless.
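
For reference, the parser lookup I mean is roughly this (a sketch using 
GStreamer's factory lists; the actual winegstreamer code may differ):

#include <gst/gst.h>

/* Find a parser element whose sink pad can accept the demuxer's caps,
 * so the downstream decoder gets properly parsed/complete caps. */
static GstElement *find_parser_for_caps(const GstCaps *caps)
{
    GList *factories, *filtered;
    GstElement *parser = NULL;

    factories = gst_element_factory_list_get_elements(
            GST_ELEMENT_FACTORY_TYPE_PARSER, GST_RANK_MARGINAL);
    filtered = gst_element_factory_list_filter(factories, caps,
            GST_PAD_SINK, FALSE);

    if (filtered)
        parser = gst_element_factory_create(
                GST_ELEMENT_FACTORY(filtered->data), NULL);

    gst_plugin_feature_list_free(filtered);
    gst_plugin_feature_list_free(factories);
    return parser;
}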
>
>>>    These are some of the benefits I had in mind when removing the
>>> GStreamer quartz transforms.
>>>
>>> Even in the case where the application manually creates e.g. an MPEG-4
>>> source, my understanding is it's still the source's job to automatically
>>> append transforms to match the requested type.
>> It's not the source's job at all.  On windows, where sources are
>> purpose-built, they apply no transformations to the types they get,
>> their goal is only to get raw sample data from a container / stream.
>> It's the job of the media session, or source reader to apply transforms
>> when needed.
> I see, I confused the media source with the source reader. I guess that
> argument isn't valid, but I don't think it really affects my conclusion.
>
>>>    We'd just be moving that
>>> from the mfplat level to the gstreamer level—i.e. let decodebin select
>>> the 'transforms' needed to convert to raw video and audio.
>> The media session and source reader shouldn't be affected by
>> winegstreamer details.  If a user/an application decides to install a
>> third party decoder, we still need the infrastructure in place for this
>> to function.
>>> It obviously wouldn't match native structure, but it's not clear to me
>>> that it would fail to match native in a way that would cause problems.
>>> Judging from my experience with quartz, most applications aren't going
>>> to care how their media is decoded as long as they get raw samples out
>>> of it.
>> Most games, or most applications?  Chromium uses media foundation in a
>> much more granular way.
> Yes, most applications.
>
> What does Chromium do?
As mentioned earlier, it uses decoders and encoders manually, so we'll 
have to fix up/parse the data we get anyway.
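
Something along these lines (hypothetical application code) is what I 
mean by granular use, bypassing the session and source reader entirely:

#define COBJMACROS
#include <windows.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mferror.h>
#include <mftransform.h>

/* Enumerate and activate an H.264 decoder MFT directly; the application
 * then drives it with ProcessInput()/ProcessOutput() itself. */
static HRESULT create_h264_decoder(IMFTransform **decoder)
{
    MFT_REGISTER_TYPE_INFO input;
    IMFActivate **activate;
    UINT32 count = 0;
    HRESULT hr;

    input.guidMajorType = MFMediaType_Video;
    input.guidSubtype = MFVideoFormat_H264;

    hr = MFTEnumEx(MFT_CATEGORY_VIDEO_DECODER, MFT_ENUM_FLAG_SYNCMFT,
            &input, NULL, &activate, &count);
    if (FAILED(hr)) return hr;

    if (count)
        hr = IMFActivate_ActivateObject(activate[0], &IID_IMFTransform,
                (void **)decoder);
    else
        hr = MF_E_TOPO_CODEC_NOT_FOUND;

    while (count--)
        IMFActivate_Release(activate[count]);
    CoTaskMemFree(activate);
    return hr;
}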
>
>>>    Only a select few build the graph manually because they don't
>>> realize that they can autoplug, or make assumptions about which filters
>>> will be present once autoplugging is done, and some of those even fall
>>> back to autoplugging if their preferred method fails. Maybe the
>>> situation is different with mfplat, but given that there is a way to let
>>> mfplat figure out which sources and transforms to use, I'm gonna be
>>> really surprised if most applications aren't using it.
>>>
>>> If you do come across an application that requires we mimic native's
>>> specific arrangement of sources and transforms, it seems to me it
>>> wouldn't require that much effort to swap a different parser in for
>>> decodebin, and to implement the necessary bits in the media type
>>> conversion functions. Ultimately I suspect it'd be less work to have a
>>> decodebin wrapper + specific sources for applications that require them,
>>> than to manually implement every source and transform.
>> The current solution isn't very manual, and, as I mentioned earlier in
>> this email, you also can construct a decodebin wrapper source using the
>> infrastructure which is available.  And in general terms, I think it's
>> more work to maintain a solution that doesn't match up to windows, as we
>> now have to think of all these edge cases and how to work around them.
> What edge cases do you mean?
Cases where applications expect compressed streams from the source.
>
>> On 3/26/20 8:07 PM, Zebediah Figura wrote:
>>> While I await your more complete response, I figure I might as well
>>> clarify some things.
>>>
>>> I don't think that "doing the incorrect thing", i.e. failing to exactly
>>> emulate Windows, should necessarily be considered bad in itself, or at
>>> least not nearly as bad as all that.
>>>
>>> My view, and my understanding of the Wine project's view in general as
>>> informed by its maintainers, is that emulating Windows is desirable for
>>> public documented behaviour (obviously), for undocumented behaviour that
>>> applications rely on (also obviously), for undocumented or
>>> semi-documented behaviour where there's no difference otherwise and
>>> where the native thing to do is obvious (e.g. the name of an internal
>>> registry key).
>> In my view, when completely incorrect behavior is only a few function
>> calls away, that's not acceptable.  The media source is a well
>> documented public interface, and doing something different instead is
>> just asking for trouble.
> The media source is a documented public interface, but *which* media
> source is returned from IMFSourceResolver is not documented or
> guaranteed, and which transforms are returned from the source reader is
> also not guaranteed.
>
> Using decodebin is not "completely incorrect", and emulating Windows'
> specific arrangement of sources and transforms is not "a few function
> calls away".
Finding out the media type of a source is one function call away.
>   It's several hundred lines of code to do caps conversion,
> the entire transform object (which, to be sure, we might need *anyway*,
We will.
> but also might not), and it means more work every time we have to deal
> with a new codec.
Unless we implement the decodebin solution as a fallback for unknown 
types.  Taking the fallback approach means we will only have to go 
through this process for the types natively supported by Windows.
>
>>> But there's not really a reason to emulate Windows otherwise. And in a
>>> case like this, where there's a significant benefit to not emulating
>>> Windows exactly, the only reason I see is "an application we don't know
>>> yet *might* depend on it". When faced with such a risk, I weigh the
>>> probability of that happening—and on the evidence of DirectShow
>>> applications, I see that as low—with the cost of having to change
>>> design—which also seems low to me; I can say from experience (c.f.
>>> 5de712b5d) that swapping out a specific demuxer for decodebin isn't very
>>> difficult.
>> The converse of this is also true, if you want to quickly experiment
>> with some gstreamer codec that we don't support yet, you just perform
>> the hack I mentioned earlier, and then after you get it working you make
>> it correct by adding the necessary gstreamer caps. Another hack we could
>> use is to serialize the compressed caps, and throw them in a
>> MF_MT_USER_DATA attribute, and hope that an application never looks.
> Sure. But I'm willing to assert that one of these things is more likely
> than the other. I'm prepared to eat my words if proven wrong.
What do you mean, that in most cases applications won't care how they 
get their samples?  That may be true, but I still think the edge cases 
are significant enough to warrant the accurate approach.  Unity3D, a 
pretty important user of this work, gets the native media types of the 
source, for instance.  What it uses them for, I'm not sure, but I 
wouldn't take any chances.
>
>> But as I mentioned earlier, I don't think the amount of work required
>> for adding a new media type is excessive.  Microsoft only ships a
>> limited amount of sources and decoders, they fit on a single page:
>> https://docs.microsoft.com/en-us/windows/win32/medfound/supported-media-formats-in-media-foundation
>> , so it's not like we'll be adding new types for years to come.
> That's seven demuxers and sixteen transforms, which is still kind of a
> lot. It also, unsurprisingly, isn't every format that Windows supports;
> just looking at my Windows 7 VM I see also NSC and LPCM, and a much
> longer list of transforms.
>
> And it doesn't take into account host codecs.
Insert fallback argument here :P
>
>>> Not to mention that what we're doing is barely "incorrect". Media
>>> Foundation is an API that's specifically meant to be extended in this
>>> way.
>> I don't think Microsoft ever meant for an application to make a media
>> source that decodes compressed content, the source reader and media
>> session exist for a reason.
> I don't think they specifically meant for an application *not* to do
> that. It fits within the design of Media Foundation. The reason that
> transforms exist—in any media API—is because different containers can
> hold the same video or audio codec. GStreamer can already deal with that.
>
>>>    For that matter, some application could easily register its own
>>> codec libraries on Windows with a higher priority than the native ones
>>> (this happened with DirectShow); that's essentially no different than
>>> what I'm suggesting.
>> Yes, but even in that case, I assume they will still follow the basic
>> concept of what a source is and is not.
> I wouldn't necessarily assert that. A codec library—like GStreamer—might
> have its own set of transforms and autoplugging code. Easier to reuse
> that internally than to try to integrate it with every new decoding API
> that Microsoft releases.
That could potentially break other applications, though, and I don't 
think codec libraries are comparable to GStreamer; they usually just 
handle a specific task and plug into the relevant part of the media API, 
whether it be DirectShow, Media Foundation, or GStreamer.
>
>>> I think the linked commit misses the point somewhat. That's partially
>>> because I don't think it makes sense to measure simplicity as an
>>> absolute metric simply using line count,
>> It's not just line count, the code itself is very simple, all we are
>> doing is registering the supported input and output types of the
>> decoder, setting the mime type of the container format for the source,
>> and registering both objects.
>>>    and partially because it's
>>> missing the cost of adding other media types to the conversion functions
>> You can use the MF_MT_USER_DATA serialization hack if you're worried
>> about that.
> Unless you're proposing we use that in Wine, that doesn't affect anything.
You're right, the decodebin fallback is a much cleaner solution than that.
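
For the record, the hack in question would have looked roughly like this 
(illustrative sketch only):

#define COBJMACROS
#include <windows.h>
#include <string.h>
#include <gst/gst.h>
#include <mfapi.h>
#include <mfidl.h>

/* Stash the serialized caps in MF_MT_USER_DATA... */
static HRESULT stash_caps_in_media_type(IMFMediaType *type, const GstCaps *caps)
{
    gchar *str = gst_caps_to_string(caps);
    HRESULT hr = IMFMediaType_SetBlob(type, &MF_MT_USER_DATA,
            (const UINT8 *)str, strlen(str) + 1);
    g_free(str);
    return hr;
}

/* ...and recover them on the GStreamer side, hoping no application ever
 * looks at the attribute. */
static GstCaps *recover_caps_from_media_type(IMFMediaType *type)
{
    UINT8 *blob;
    UINT32 size;
    GstCaps *caps = NULL;

    if (SUCCEEDED(IMFMediaType_GetAllocatedBlob(type, &MF_MT_USER_DATA,
            &blob, &size)))
    {
        caps = gst_caps_from_string((const char *)blob);
        CoTaskMemFree(blob);
    }
    return caps;
}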
>
>>> (which is one of the reasons, though not the only reason, I thought to
>>> write this mail). But it's mostly because the cost of using decodebin,
>>> where it works, is essentially zero:
>> Except in the cases where an application does something unexpected.
> In which case the cost is still no more than the cost of not using
> decodebin.
>
>>>    we write one media source,
>>>    and it
>>> works for everything; no extension for ASF required.
>> There already is only one real implementation of the media source, the
>> only "extension" is adding the mime type instead of using typefind.  We
>> will register the necessary byte stream handlers no matter which path we
>> take.
> Well, ideally we'd do what quartz does, and register a handler that
> catches every file, and returns a subtype that essentially identifies
> GStreamer.
>
>>>    If it never becomes
>>> necessary to write a source that outputs compressed samples, then we
>>> also don't have the cost of abstraction (which is always worth taking
>>> seriously!), and if it does, we come out even—we can still use your
>>> generic media source, or something like it.
>>>
>>> Ultimately, I think that a decodebin wrapper is something we want to
>>> have anyway, for the sake of host codecs like Theora,
>> Where would we use support for Theora, if no Windows applications are
>> able to use it?
> Anything which wants to be able to play back an arbitrary media file,
> i.e. generic media players, mostly. I see all sorts of bug reports for
> these with Quartz, so people are definitely using them.
Heh.
>
>>>    and once we have
>>> it, I see zero cost in using it wherever else we can.


