"optimized" assembly functions in wine

Tue Sep 21 07:57:39 CDT 2004

Hi,

Just did not feel like chasing bugs the other day. I decided to have
some fun with something that I have been wondering about for a long
time: the usefulness of inline x86 assembly in string functions.

This is the test program as.c:

---------------------------------8<-------------------------------------
#include <stdlib.h>  /* malloc */
#include <string.h>  /* memset */
typedef unsigned short  WCHAR,      *PWCHAR;

static inline WCHAR *strcpyW( WCHAR *dst, const WCHAR *src )
{
#ifdef ASM
    int dummy1, dummy2, dummy3;
    __asm__ __volatile__( "cld\n"
                          "1:\tlodsw\n\t"
                          "stosw\n\t"
                          "testw %%ax,%%ax\n\t"
                          "jne 1b"
                          : "=&S" (dummy1), "=&D" (dummy2), "=&a" (dummy3)
                          : "0" (src), "1" (dst)
                          : "memory" );
#else
    WCHAR *p = dst;
    while ((*p++ = *src++));
#endif
    return dst;
}

#define SZ 3000
int main(void)
{
    int i;
    PWCHAR s,d;
    s=malloc(SZ*sizeof(WCHAR));
    d=malloc(SZ*sizeof(WCHAR));
    /* memset counts bytes: this fills only the first SZ/2 WCHARs */
    memset(s,'x',SZ);
    s[SZ-1]=0;
    for(i=0;i<1000000;i++)
        strcpyW(d,s);
    return 0;
}
---------------------------------8<-------------------------------------

The function strcpyW is copied from Wine, with the #ifdef condition
modified.

I used the following commands:

gcc-3.3 -O2  as.c -o as -DASM ; time ./as;time ./as; time ./as

and 

gcc-3.3 -O2  as.c -o as ; time ./as;time ./as; time ./as

The resulting times are (all user time, in seconds):

test#   asm     C
-----------------------
1       15.970  15.899
2       15.966  15.943
3       15.959  15.941
        ------  ------
ave     15.964  15.928
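
The same kind of measurement can also be taken from inside the program
instead of with the shell's time. What follows is only a minimal sketch,
assuming the standard clock() function is precise enough for this
purpose; time_copies and average_time are hypothetical helper names and
are not part of Wine or of as.c. Only the plain C variant is shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned short WCHAR;

/* Plain C variant of strcpyW, as in the non-ASM branch of as.c. */
static WCHAR *strcpyW(WCHAR *dst, const WCHAR *src)
{
    WCHAR *p = dst;
    while ((*p++ = *src++));
    return dst;
}

/* Hypothetical helper: processor time, in seconds, spent copying
   src to dst `iterations` times. */
static double time_copies(WCHAR *dst, const WCHAR *src, long iterations)
{
    clock_t start, end;
    long i;

    assert(dst && src);
    start = clock();
    for (i = 0; i < iterations; i++)
        strcpyW(dst, src);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: average over `runs` timings, like the "ave" row. */
static double average_time(WCHAR *dst, const WCHAR *src,
                           long iterations, int runs)
{
    double total = 0.0;
    int r;

    assert(runs > 0);
    for (r = 0; r < runs; r++)
        total += time_copies(dst, src, iterations);
    return total / runs;
}
```

Calling average_time(d, s, 1000000, 3) from main() would roughly mirror
the three time ./as runs, although clock() reports processor time for the
whole process and may not match the user time printed by time exactly.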

Notes:
- tested on a PII 450 MHz;
- I tested with gcc 2.95 and 3.4.2 as well; the results are essentially
the same;
- size of main() is 0x7a (assembly) vs 0x82 (C-code) bytes;
- I experimented with longer strings to see whether there were any
memory cache hit/miss effects and found none.
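
One way to look for cache effects is to keep the total amount of copied
data roughly constant while varying the string length, and compare the
cost per character. This is a sketch under assumptions, not the
procedure used for the notes above: make_string, ns_per_char and copy_fn
are hypothetical names, and the call goes through a volatile function
pointer so that the compiler cannot discard the copies, which adds call
overhead that the inlined original does not have.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned short WCHAR;

/* Plain C variant of strcpyW, as in the non-ASM branch of as.c. */
static WCHAR *strcpyW(WCHAR *dst, const WCHAR *src)
{
    WCHAR *p = dst;
    while ((*p++ = *src++));
    return dst;
}

/* Volatile pointer: forces a real call on every iteration. */
static WCHAR *(*volatile copy_fn)(WCHAR *, const WCHAR *) = strcpyW;

/* Hypothetical helper: string of `len` 'x' characters plus terminator. */
static WCHAR *make_string(size_t len)
{
    WCHAR *s = malloc((len + 1) * sizeof(WCHAR));
    size_t i;

    if (!s) return NULL;
    for (i = 0; i < len; i++) s[i] = 'x';
    s[len] = 0;
    return s;
}

/* Hypothetical helper: nanoseconds of processor time per copied WCHAR
   for strings of length `len`, copying about `total_chars` WCHARs in
   total. Returns a negative value if allocation fails. */
static double ns_per_char(size_t len, size_t total_chars)
{
    WCHAR *s = make_string(len);
    WCHAR *d = malloc((len + 1) * sizeof(WCHAR));
    size_t reps = total_chars / (len + 1) + 1;
    size_t i;
    clock_t start;
    double secs;

    if (!s || !d) { free(s); free(d); return -1.0; }
    start = clock();
    for (i = 0; i < reps; i++)
        copy_fn(d, s);
    secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    free(s);
    free(d);
    return secs * 1e9 / ((double)reps * (double)(len + 1));
}
```

Calling ns_per_char with lengths ranging from a few hundred to a few
million characters, so that the buffers cross the cache sizes of the
machine under test, should show a jump in the per-character cost if
cache misses matter for this loop.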

Conclusions:

1. these routines are so fast that it is hard to imagine them becoming a
bottleneck that would justify such optimization;
2. nothing here shows that inline assembly brings any advantage: the
averages differ by roughly 0.2%, with the C version marginally faster.

Rein.
-- 
Rein Klazes
rklazes at xs4all.nl