<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    Hi Stefan,<br>

    <br>

    What do you think about using inline spinlocks (maybe in asm code)
    to implement the locks?<br>
    Clearly an optimized spinlock would mean different code for
    different compilers/architectures, but wouldn't it be the best
    solution?<br>
    For reference, I once commented out the GL locks to see how
    StarCraft 2 would perform, but it crashed straight away.<br>
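    <br>
    To make the question concrete, the kind of inline spinlock I have in
    mind might look roughly like the sketch below. It is only an
    illustration: it assumes the GCC/Clang __atomic builtins instead of
    hand-written asm, and the sketch_spin_* names are made up for this
    mail, they are not existing Wine functions.<br>
    <pre>
```c
/* Sketch only: a test-and-set spinlock on top of GCC/Clang __atomic builtins.
   All sketch_* names are hypothetical and do not exist in Wine. */
typedef int sketch_spinlock; /* 0 = free, 1 = held */

static inline void sketch_spin_lock(sketch_spinlock *lock)
{
    /* Atomically write 1 and look at the previous value:
       a previous 0 means the lock was free and is now ours. */
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
    {
        /* Wait with plain loads until the lock looks free, then retry. */
        while (__atomic_load_n(lock, __ATOMIC_RELAXED))
            ;
    }
}

static inline int sketch_spin_trylock(sketch_spinlock *lock)
{
    /* Returns 1 if the lock was acquired, 0 if it was already held. */
    return !__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE);
}

static inline void sketch_spin_unlock(sketch_spinlock *lock)
{
    /* Release ordering makes the protected writes visible before the unlock. */
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```
    </pre>
    Unlike a critical section, this sketch is not recursive and never
    sleeps, so a thread waiting for the lock burns CPU until the lock is
    released.<br>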

    <br>

    What do you reckon?<br>

    <br>

    Cheers,<br>

    <br>

    P.S. Keep up the fantastic work! :-)<br>

    <br>

    On 30/04/11 16:18, Stefan D&ouml;singer wrote:

    <blockquote cite="mid:201104301719.05754.stefandoesinger@gmx.at"

      type="cite">

      <pre wrap="">Hi,

Here's another update.

First I expanded my performance tests at <a class="moz-txt-link-freetext" href="https://84.112.174.163/~git/perftest">https://84.112.174.163/~git/perftest</a>
a bit. The old tests were renamed to streamsrc_d3d and streamsrc_gl, and I
added another set of tests that just measures the draw overhead without ever
changing any states: drawprim_d3d and drawprim_gl. Here are the performance
results from Windows 7:

drawprim_gl:   ~1154 fps
drawprim_d3d:  ~1160 fps

In Wine the D3D version gets 165.67 fps. The Linux native GL version gets
1791 fps. The GL Windows version in Wine gets about 600 fps (FIXME!). Don't
worry too much about the GL performance, this is mostly locking overhead. More
about that later.

I ran my usual d3d performance hacks through the d3d version. The hacks are
pretty much the same as with the streamsrc test, except that I don't need the
redundant vertex shader apply hacks. I attached a tarball with the hacks and a
file listing their performance impact.

The plan forward is still the same: Write more of those tests (especially tests
that test non-draw stuff like resource loads), improve the tests and hope that
real apps profit.

The optimistic scenario is that this works out. So far we've seen slow
movement in real apps with the two fixes we've made (context_validate and FBO
application, the latter isn't in Wine yet). This is expected to a certain
extent, because the performance is inversely proportional to the number of
performance bugs we have. So we'll have to remove a lot of them before we see
big movement.

The pessimistic scenario is that those tests have nothing in common with the
performance bugs in real apps and the fixes only end up making the code more
complex.

To that end I think I'll create a github repo where I try to get the hacks
into a somewhat usable state - not committable to Wine, but good enough that
they don't break apps, so they can be tested against real world apps. That way
we can find out how much they really improve real games without clogging our
codebase with changes that we aren't certain help.

Here are again some descriptions of the hacks I tested:

2) End-user business, fairly harmless. Should always be used if performance is
important

3, 4) Will break stuff. Can be fixed, but would be rather ugly. Probably
interesting once we run out of easier fixes

5) Could go into Wine sooner or later. Does improve real games on its own
already

6) Easy to clean up, I'll send a patch today. We can skip validation if FIXMEs
are off since nobody will see them.

7) I tried to find out if removing one call level helps, but it doesn't even
improve this locking-overhead-sensitive test app. Forget about it

8) Doable, but pretty uninteresting. I doubt we'll get a noticeable
improvement in a real app

9-11) Distributor / End user choice. Note that some compiler flags (especially
the framepointer one) can break apps and copy protection systems.

12) Distributor / End user choice too, but harmless. Not much gain compared to
WINEDEBUG=-all though

13) Doesn't improve performance a whole lot once debug msgs are compiled out.

14) We should be able to limit calls to this function to cases where the
textures were changed or vertex texture fetch is used. We may be able to
eliminate it entirely when we have enough samplers available

15, 16) I caution against too much optimism here. We won't be able to get rid
of the locking anytime soon. Maybe the EnterCriticalSection /
LeaveCriticalSection performance can be improved. A part of the problem is
call overhead, but I think the biggest issue is the locked increment and
decrement operations in RtlEnterCriticalSection / RtlLeaveCriticalSection.
Orig performance: 178 fps
Interlocked ops replaced with normal inc/dec: 244 fps
Lock calls removed from wined3d: 293 fps
(this is just to give you some idea where the time is spent)

17) Forget about this one until we run out of other optimizations

18) It's interesting how much this gives without all the other optimizations.
My app doesn't use any textures, so this is just the call overhead and looping
over the fragment samplers.

19) My app renders to a too small window, so swapchain render_to_fbo triggers.
It's interesting that getting rid of it makes performance worse

21) Removing that and other checks in drawPrimitive() barely speeds up the
test. I got a total of 7-8 fps out of the compatibility or error checks in
drawPrimitive; this won't show up in any real app.

Stefan

</pre>


    </blockquote>
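    <br>
    Regarding 15, 16) in the quoted mail: the difference between the
    locked increment/decrement and the "normal inc/dec" replacement can be
    illustrated with the sketch below. This is not the
    RtlEnterCriticalSection / RtlLeaveCriticalSection code, only a minimal
    illustration that assumes the GCC/Clang __atomic builtins; the
    sketch_* names are made up.<br>
    <pre>
```c
/* Sketch only: hypothetical helpers, not taken from ntdll or wined3d. */

/* Atomic read-modify-write, comparable in spirit to an interlocked
   increment (a lock-prefixed instruction on x86). Returns the new value. */
static inline long sketch_locked_inc(long *count)
{
    return __atomic_add_fetch(count, 1, __ATOMIC_SEQ_CST);
}

/* Atomic counterpart for the decrement. Returns the new value. */
static inline long sketch_locked_dec(long *count)
{
    return __atomic_sub_fetch(count, 1, __ATOMIC_SEQ_CST);
}

/* Plain increment: cheaper, but only correct if no other thread
   touches the counter at the same time. Returns the new value. */
static inline long sketch_plain_inc(long *count)
{
    return ++*count;
}

/* Plain decrement, with the same single-thread restriction. */
static inline long sketch_plain_dec(long *count)
{
    return --*count;
}
```
    </pre>
    Both variants compute the same result on a single thread; the plain
    versions are only safe there, which fits the numbers above being
    meant as a hint where the time goes rather than as a fix.<br>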

  </body>

</html>