D3D performance debugging report

Sat Apr 30 11:26:04 CDT 2011

Hi Stefan,

What do you think about using inline spinlocks (in asm code maybe) to 
implement locks?
Clearly an optimized spinlock would mean different code for different 
compilers/architectures, but shouldn't it be the best solution?
For your reference, once I commented out the GL locks to see StarCraft 2 
performance, but it crashed straight away.
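
To make the idea concrete, here is a minimal sketch of what I have in mind. It uses C11 atomics rather than hand-written asm, so the compiler picks the instruction for each architecture. The names (spin_lock_t, spin_lock, spin_trylock, spin_unlock) are made up for illustration and are not anything that exists in Wine.

```c
#include <stdatomic.h>

/* Hypothetical names for illustration only; not part of Wine. */
typedef struct { atomic_flag locked; } spin_lock_t;

#define SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once; returns 1 if we took the lock, 0 if it was already held. */
static inline int spin_trylock(spin_lock_t *l)
{
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock is acquired. */
static inline void spin_lock(spin_lock_t *l)
{
    while (!spin_trylock(l))
        ; /* spin */
}

/* Release the lock so another thread's test_and_set can succeed. */
static inline void spin_unlock(spin_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Note that this still does one atomic read-modify-write per acquire, so it would mainly save call overhead, and a pure spinlock burns CPU if the holder keeps the lock for long.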

What do you reckon?

Cheers,

P.S. Keep up this fantastic work! :-)

On 30/04/11 16:18, Stefan Dösinger wrote:
> Hi,
> Here's another update.
>
> First I expanded my performance tests at https://84.112.174.163/~git/perftest
> a bit. The old tests were renamed to streamsrc_d3d and streamsrc_gl, and I
> added another set of tests that just tests the draw overhead without ever
> changing any states: drawprim_d3d and drawprim_gl. Here are the performance
> results from Windows 7:
>
> drawprim_gl:	~1154 fps
> drawprim_d3d:	~1160 fps
>
> In Wine the D3D version gets 165.67 fps. The Linux native GL version gets
> 1791 fps. The GL Windows version in Wine gets about 600 fps (FIXME!). Don't
> worry too much about the GL performance, this is mostly locking overhead. More
> about that later.
>
> I ran my usual d3d performance hacks through the d3d version. The hacks are
> pretty much the same as with the streamsrc test, except that I don't need the
> redundant vertex shader apply hacks. I attached a tarball with the hacks and a
> file listing their performance impact.
>
> The plan forward is still the same: Write more of those tests (especially tests
> that test non-draw stuff like resource loads), improve the tests and hope that
> real apps profit.
>
> The optimistic scenario is that this works out. So far we've seen slow
> movement in real apps with the two fixes we've made (context_validate and FBO
> application, the latter isn't in Wine yet). This is expected to a certain
> extent, because the performance is inversely proportional to the number of
> performance bugs we have. So we'll have to remove a lot of them before we see
> big movement.
>
> The pessimistic scenario is that those tests have nothing in common with the
> performance bugs in real apps and the fixes only end up making the code more
> complex.
>
> To that end I think I'll create a github repo where I try to get the hacks
> into a somewhat usable state - not committable to Wine, but good enough that
> they don't break apps, so they can be tested against real world apps. That way
> we can find out how much they really improve real games without clogging our
> codebase with changes we aren't certain will help.
>
> Here are again some descriptions of the hacks I tested:
>
> 2) End-user business, fairly harmless. Should always be used if performance is
> important
>
> 3, 4) Will break stuff. Can be fixed, but would be rather ugly. Probably
> interesting once we run out of easier fixes
>
> 5) Could go into Wine sooner or later. Does improve real games on its own
> already
>
> 6) Easy to clean up, I'll send a patch today. We can skip validation if FIXMEs
> are off since nobody will see them.
>
> 7) I tried to find out if removing one call level helps, but it doesn't even
> improve this locking overhead sensitive test app. Forget about it
>
> 8) Doable, but pretty uninteresting. I doubt we'll get a noticeable
> improvement in a real app
>
> 9-11) Distributor / End user choice. Note that some compiler flags (especially
> the framepointer one) can break apps and copy protection systems.
>
> 12) Distributor / End user choice too, but harmless. Not much gain compared to
> WINEDEBUG=-all though
>
> 13) Doesn't improve performance a whole lot once debug msgs are compiled out.
>
> 14) We should be able to limit calls to this function to cases where the
> textures were changed or vertex texture fetch is used. We may be able to
> eliminate it entirely when we have enough samplers available
>
> 15, 16) I caution against too much optimism here. We won't be able to get rid
> of the locking anytime soon. Maybe the EnterCriticalSection /
> LeaveCriticalSection performance can be improved. A part of the problem is
> call overhead, but I think the biggest issue is the locked increment and
> decrement operations in RtlEnterCriticalSection / RtlLeaveCriticalSection.
> Orig performance: 178 fps
> Interlocked ops replaced with normal inc/dec: 244 fps
> Lock calls removed from wined3d: 293 fps
> (this is just to give you some idea where the time is spent)
>
> 17) Forget about this one until we run out of other optimizations
>
> 18) It's interesting how much this gives without all the other optimizations.
> My app doesn't use any textures, so this is just the call overhead and looping
> over the fragment samplers.
>
> 19) My app renders to a too small window, so swapchain render_to_fbo triggers.
> It's interesting that getting rid of it makes performance worse
>
> 21) Removing that and other checks in drawPrimitive() barely speeds up the
> test. I got a total of 7-8 fps out of the compatibility or error checks in
> drawPrimitive, this won't show up in any real app.
>
> Stefan
>
>
>