Wine GPU decoding

Mon Mar 31 14:14:45 CDT 2014

Hi,

it is probably still a bit early, but nevertheless I would like to
announce a feature I am currently working on and present the first
results. As some of you have already noticed
(http://bugs.winehq.org/show_bug.cgi?id=35868), I've recently submitted
some simple stub patches to add the dxva2 dll. The original purpose was
to get a browser plugin working, which expects this library to be
available and otherwise refuses to run. The library exports some
functions which are used by several applications (like VLC, Flash,
Silverlight, ...) for GPU decoding. I started to work on these
functions and I want to show you a first result, which you can see
here: https://dl.dropboxusercontent.com/u/61413222/dxva2.png

This is actually the Windows version of VLC playing an MPEG2 movie with
GPU acceleration using DXVA2. My implementation of dxva2 uses VAAPI
on Linux to do the actual GPU decoding and should support AMD, Intel and
NVIDIA cards.

Currently only MPEG2 decoding is supported, as it is one of the easier
codecs; others like H264 need a lot more buffers, which need to be
translated from the DXVA format to VAAPI. The second easiest codec to
implement would be MPEG4, but as none of my graphics cards support
MPEG4, I will most probably continue with VC-1. Anyway, I need to clean
up the patches a bit, as they add about 3000 new lines of code, and test
them with some other graphics cards before I can provide them. There are
also some problems, mostly d3d9 related, on which I would like to get
your opinion.

The most difficult part is that DXVA2 is completely based on
IDirect3DDevice9 and IDirect3DSurface9. DXVA2 places the output images
into a surface, and the application locks the surface to get the output
data or simply presents it to the screen. Although it would be much more
efficient to blit the data directly on the graphics card, at least VLC
reads it back into system memory, as its decoding and output pipelines
are separated.

The problem is that I actually need to allocate twice the amount of
memory for decoding, since I need to provide the Direct3D surfaces to
the application and I also need to provide buffers to VAAPI. This is not
a big problem for MPEG2, since it only uses 3 output images, as a
B-frame can only reference the previous and the next frame. For H264,
however, this is getting insane, as it requires storing up to 16 output
images, so that I would need to allocate 16 VAAPI buffers and 16
Direct3D surfaces.
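
To put rough numbers on the doubling, here is a back-of-the-envelope
sketch. It assumes 8-bit NV12 output (12 bits per pixel) at 1920x1080
with no pitch padding or alignment. The resolution and the helper names
are assumptions made for illustration only, not values taken from the
patches.

```python
def nv12_frame_bytes(width, height):
    """Size of one 8-bit NV12 frame: a full-size Y plane plus a
    half-height interleaved UV plane (4:2:0), ignoring padding."""
    return width * height + width * (height // 2)


def decode_memory_bytes(width, height, frames, copies=2):
    """Total memory when every output image exists `copies` times
    (one VAAPI buffer plus one Direct3D surface per image)."""
    return nv12_frame_bytes(width, height) * frames * copies


MIB = 1024 * 1024
for codec, frames in (("MPEG2", 3), ("H264", 16)):
    total = decode_memory_bytes(1920, 1080, frames)
    print(f"{codec}: {frames} images x 2 copies = {total / MIB:.1f} MiB")
```

Under these assumptions the doubled allocation comes to roughly 18 MiB
for MPEG2 and roughly 95 MiB for H264. Real drivers align pitches and
heights, so the actual figures would be somewhat higher.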

Currently I lock both kinds of buffers after rendering a frame and do
the synchronization in system memory, which is kind of inefficient
depending on the surface type. My original idea was to do the copy on
the graphics card, as I can copy the image to a texture after decoding,
but after Sebastian implemented this part we found out that VAAPI
implies a format conversion to RGB when copying data to a texture. This
is actually a no-go, since VLC will refuse to use hardware acceleration
when the output format is RGB. I also think it is kind of stupid to
convert the RGB data back to YUV, so that we end up with 3 color space
conversions (YUV->RGB->YUV->RGB). An Intel developer wrote (see
vaCopySurfaceGLX() at http://markmail.org/message/a3sav6q3dm5qvmat) that
it would be possible to implement a copy in NV12 format for NVIDIA and
Intel but not AMD. We could try to ask them to implement it, so that we
can at least do it efficiently for these two vendors.
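
The system-memory synchronization described above boils down to a
plane-by-plane copy between two locked buffers whose row pitches may
differ. A minimal sketch, assuming an NV12 layout in which the
interleaved UV plane directly follows the Y plane at offset
pitch * height in both buffers, and an even height. The function names
and the flat-buffer model are illustrative assumptions, not the actual
patch code and not the VAAPI or Direct3D API.

```python
def copy_plane(src, src_pitch, dst, dst_pitch, row_bytes, rows,
               src_off=0, dst_off=0):
    """Copy `rows` rows of `row_bytes` bytes between two buffers whose
    pitch (bytes per row, including padding) may differ."""
    for y in range(rows):
        s = src_off + y * src_pitch
        d = dst_off + y * dst_pitch
        dst[d:d + row_bytes] = src[s:s + row_bytes]


def copy_nv12(src, src_pitch, dst, dst_pitch, width, height):
    """Copy an NV12 image: the Y plane (`height` rows), then the
    interleaved UV plane (`height // 2` rows). Both planes carry
    `width` payload bytes per row."""
    copy_plane(src, src_pitch, dst, dst_pitch, width, height)
    copy_plane(src, src_pitch, dst, dst_pitch, width, height // 2,
               src_off=src_pitch * height, dst_off=dst_pitch * height)
```

When both pitches are equal the whole image could be copied in one go.
The per-row loop is only needed when the two sides pad their rows
differently.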

Anyway, if other applications continue to copy the data back to system
memory, it might be better to instead wrap the VAAPI buffers as
Direct3D9 surfaces, so that we can directly map the VAAPI buffers when
LockRect() is called instead of copying the data. This would cause
problems, though, when an application tries to pass this interface to
Present().
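
The difference between the two approaches can be modelled in a few
lines. This is only a toy sketch: VaBuffer, CopiedSurface and
WrappedSurface are hypothetical stand-ins, not Wine, Direct3D or VAAPI
types, and lock_rect() merely mimics the role of LockRect().

```python
class VaBuffer:
    """Hypothetical stand-in for a decoded VAAPI output buffer."""

    def __init__(self, size):
        self.data = bytearray(size)


class CopiedSurface:
    """Copy approach: the surface owns separate memory, so the decoded
    frame has to be copied into it before the application sees it."""

    def __init__(self, va_buffer):
        self.va_buffer = va_buffer
        self.pixels = bytearray(len(va_buffer.data))

    def lock_rect(self):
        self.pixels[:] = self.va_buffer.data  # full-frame copy
        return memoryview(self.pixels)


class WrappedSurface:
    """Wrap approach: the surface is a thin wrapper, and locking maps
    the decoder's buffer directly without copying."""

    def __init__(self, va_buffer):
        self.va_buffer = va_buffer

    def lock_rect(self):
        return memoryview(self.va_buffer.data)  # zero-copy view
```

In this model the wrapped variant needs neither the second allocation
nor the copy, but the surface no longer has storage of its own, which is
where the rendering concern comes from.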

So what do the wined3d guys think? Is it better to convince the Intel
developers to allow a copy in YUV format and copy the data directly into
the texture of a Direct3D9 surface, or to wrap the VAAPI buffers as
Direct3D9 surfaces and add some glue code for when an application tries
to render them? Or do you have any better ideas?

Regards,
Michael