Dimitrie Paun was rather unhappy with Wine's current string
support. As you may already know, most Win32 APIs come in two
flavors: ANSI and Unicode. APIs suffixed with 'A' are ANSI, and the
ones suffixed with 'W' are Unicode. Being ANSI (resp. Unicode)
determines how the function must handle any string input or output
parameter. So the same function, say CreateWindow, comes in two
flavors: CreateWindowA and CreateWindowW.
Microsoft's headers use the same convention: the unsuffixed name is a
macro, and a #define UNICODE at compile time makes it resolve to the W
version instead of the A version.
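As a rough illustration of this convention, the sketch below mimics what the headers do for each such function. TitleLengthA, TitleLengthW and the WCHAR16 type are hypothetical names made up for this example; they are not Win32 or Wine identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 16-bit character type, standing in for Windows' WCHAR. */
typedef unsigned short WCHAR16;

/* ANSI flavor: the string is made of one-byte units. */
size_t TitleLengthA(const char *title)
{
    size_t n = 0;
    assert(title != NULL);
    while (title[n] != '\0') n++;
    return n;
}

/* Unicode flavor: the string is made of 16-bit units. */
size_t TitleLengthW(const WCHAR16 *title)
{
    size_t n = 0;
    assert(title != NULL);
    while (title[n] != 0) n++;
    return n;
}

/* The unsuffixed name is only a macro; UNICODE selects the flavor. */
#ifdef UNICODE
#define TitleLength TitleLengthW
#else
#define TitleLength TitleLengthA
#endif
```

Application code then calls TitleLength() and gets one flavor or the other depending on how it was compiled.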
ANSI means a one-byte-per-character coding, whereas Unicode implies
several bytes (at least two, but some are escapes to longer
sequences). Even if Unicode consumes more memory, it also allows
strings to be stored for various languages: most languages that do not
use the Latin alphabet (Japanese, at least in Kanji, Chinese, most
languages written in Cyrillic, such as Russian...), but also some other
European languages with specific diacritics.
Ove Kåven gave an overview of the different encodings:
ASCII: 7-bit, one byte per character
ISO 8859 encodings, ordinary SBCS codepages: 8-bit (often extended
ASCII), one byte per character. (Note: all the ISO Latin 1, 2, ...
encodings follow this scheme.)
Asian languages, DBCS codepages: 8-bit; either one or two bytes
per character (if the first byte is a "lead byte", it's a two-byte
character).
UTF-16: Unicode encoding, two bytes per character (preferably
big-endian but I doubt MS cares). May employ surrogate pairs (two
UTF16 characters in reserved ranges) to encode Unicode characters
beyond the first 64K; the surrogate pairs allow access to 1M more
characters (may be necessary for very exotic Asian languages, but no
such characters are defined yet).
UCS2: Unicode encoding, two bytes per character, but no
surrogate pairs.
UCS4: Unicode encoding, four bytes per character, easily and
conveniently encodes the full Unicode set. This is what GNU systems
prefer, since they don't want to deal with surrogate pairs.
UTF-32: Same as UCS4, just defined by different organizations
(UCS4 is ISO, UTF-32 is Unicode Consortium, plus the added restriction
that no more than 64K+1M different characters may exist in UTF-32).
UTF-8 (UTF-FSS): Unicode encoding useful for compatibility with
software written for 8-bit C strings. Variable-width (between 1 and 6
bytes per character). Lower 128 characters are encoded as plain ASCII.
UTF-7: Unicode encoding for compatibility with software written
for 7-bit characters (email, news, etc). A hybrid of Base64 and
Quoted-Printable.
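To make the variable-width encodings above more concrete, here is a minimal sketch of how a single Unicode code point could be turned into UTF-8 bytes and into UTF-16 units (with a surrogate pair beyond the first 64K). The function names utf8_encode and utf16_encode are made up for this example. The sketch only handles code points up to U+10FFFF, which need at most 4 bytes in UTF-8; the longer 5 and 6 byte forms of the original UTF-8 design are not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is invalid. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)   /* reserved for surrogates */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if invalid. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)   /* reserved for surrogates */
            return 0;
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                      /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

For instance, U+00E9 (e with acute accent) takes two bytes in UTF-8 and one unit in UTF-16, while a code point beyond the first 64K takes four bytes in UTF-8 and a surrogate pair in UTF-16.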
In the rest of this article, W will refer to UTF-16 strings or
functions, and U to UTF-8 strings or functions.
Currently, as Dimitrie points out, most of the Wine code is poorly
written with regard to Unicode: most of the W functions convert the
string into an ANSI one, and then call the A function, implying a loss
of information, and some potential bugs.
Dimitrie proposed to change Wine's coding style by providing a
single function (let's say suffixed by 'X') which would be the
workhorse for both A and W functions.
Dmitry Timoshkov didn't like this proposal, and suggested instead:
Wine should have only one functional implementation
indeed. I think, it should be implemented like in NT: all actual work
does Unicode version, ANSI version simply converts ANSI to Unicode and
then calls Unicode workhorse. But this transition will consume a lot
of time and efforts.
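The NT-style layering Dmitry describes might be sketched as follows. This is only an illustration under simplifying assumptions: CountCharA, CountCharW, ansi_to_wide and WCHAR16 are hypothetical names, and the ANSI code page is assumed to be ISO 8859-1 so that the conversion is a plain widening of each byte. Real code would go through a code-page-aware conversion such as MultiByteToWideChar.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical 16-bit character type, standing in for Windows' WCHAR. */
typedef unsigned short WCHAR16;

/* Hypothetical helper: convert an ANSI string to UTF-16.
 * Assumes ISO 8859-1, where each byte equals its Unicode code point.
 * The caller must free() the result. Returns NULL on allocation failure. */
static WCHAR16 *ansi_to_wide(const char *src)
{
    size_t len = 0, i;
    WCHAR16 *dst;

    while (src[len] != '\0') len++;
    dst = malloc((len + 1) * sizeof(*dst));
    if (dst == NULL) return NULL;
    for (i = 0; i <= len; i++)          /* copies the terminator too */
        dst[i] = (unsigned char)src[i];
    return dst;
}

/* Unicode workhorse: all the actual work happens here. */
int CountCharW(const WCHAR16 *str, WCHAR16 ch)
{
    int n = 0;
    assert(str != NULL);
    for (; *str != 0; str++)
        if (*str == ch) n++;
    return n;
}

/* ANSI entry point: converts its arguments, then calls the workhorse.
 * Returns -1 if the temporary buffer cannot be allocated. */
int CountCharA(const char *str, char ch)
{
    WCHAR16 *wide;
    int n;

    assert(str != NULL);
    wide = ansi_to_wide(str);
    if (wide == NULL) return -1;
    n = CountCharW(wide, (WCHAR16)(unsigned char)ch);
    free(wide);
    return n;
}
```

The cost of this design is visible in CountCharA: every ANSI call pays for an allocation and a conversion before any real work starts.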
Dimitrie Paun went further with:
Somehow, I don't think working with W is the right thing to do in
Unix. We have the following situation: we receive strings as
arguments; their encoding is not explicit with every string, but
rather is implicit by the entry point. Now, we can do two things:
1. [eager] convert at the entry point into one common format, and
carry on internally with that format
2. [lazy] remember the encoding that the strings are in, and pass
that around until we actually need a specific encoding
Anyway, I like 2 better than 1. Not committing to an encoding early in
the game is good -- sometimes we need UTF8 (file systems, X), in other
cases we need UTF16 (pure Win stuff). Moreover, the thing is scalable
-- if another encoding comes along, we could easily support it. And,
on top of it all, it should be more efficient.
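The lazy option could look roughly like the sketch below, where the A and W entry points only tag the string with its encoding and hand it to a common X function. The xstring structure, the encoding_t enumeration and the StrLenA/StrLenW/StrLenX names are all assumptions made for this example; nothing like this exists in Wine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 16-bit character type, standing in for Windows' WCHAR. */
typedef unsigned short WCHAR16;

/* Hypothetical tag saying which encoding a string is in. */
typedef enum { ENC_ANSI, ENC_UTF16 } encoding_t;

/* Hypothetical "X" string: the raw pointer plus its encoding. */
typedef struct {
    encoding_t  enc;
    const void *data;
} xstring;

/* Read the unit at index i, whatever the encoding is. */
static unsigned int xstr_unit(const xstring *s, size_t i)
{
    if (s->enc == ENC_UTF16)
        return ((const WCHAR16 *)s->data)[i];
    return ((const unsigned char *)s->data)[i];
}

/* X workhorse: length in units, with no conversion at all. */
size_t StrLenX(const xstring *s)
{
    size_t n = 0;
    assert(s != NULL && s->data != NULL);
    while (xstr_unit(s, n) != 0) n++;
    return n;
}

/* The A and W entry points only record the encoding and forward. */
size_t StrLenA(const char *str)
{
    xstring s = { ENC_ANSI, str };
    return StrLenX(&s);
}

size_t StrLenW(const WCHAR16 *str)
{
    xstring s = { ENC_UTF16, str };
    return StrLenX(&s);
}
```

Supporting a further encoding would mean adding one enumeration value and one branch in xstr_unit, but every X function then has to be tested once per encoding.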
With lots of discussions and contributions from many people, the
following table has been built:
| # | Description | Pros | Cons |
|---|-------------|------|------|
| 1 | W->A conversion, work internally with A | best option for debugging; fast for A (common case today); use of std. Win API | we do NOT support Unicode, we just pretend we do (1); a lot of work, a lot of clutter, close to no gain; inefficient for the W case |
| 2 | A->W conversion, work internally with W | full Unicode support; fast for W; use of std. Win API; part of Wine is already written this way | a lot of clutter; very inefficient in the A case (A->W->U usually) (2) |
| 3 | A, W call onto an X function which carries the encoding around | full Unicode support; as fast as 1 for A, and as 2 for W (for common code paths like display); support for new encodings is trivial; not much worse than 2 for debugging; maybe a bit less clutter than in 1 or 2 (debatable); easy transition from what we have to this | use of non std. Win API: this doesn't work across DLLs (would require new APIs); it is not used in Wine currently; test coverage of all possible paths can be huge |
| 4 | Write all functions independent of the encoding and recompile to get all encodings (same .c file would generate .Ansi.o, .w.o object files) | fastest option for A, W; easy to support future encodings; use of std. API; less clutter (in theory) | huge bloat; it is not used in Wine currently; (maybe) difficult transition path |
Notes:
(1) Patrik Stridvall modified his winapi_check tool to list the
cases where W->A conversion was used. At least 172 suspect
functions have been reported.
(2) Alexandre Julliard pointed out that converting
A->W->U for file I/O may seem wasteful but it isn't really since
we need to support code pages; you can only do A->U directly for
7-bit ASCII, which is not enough. And supporting code pages without the
Unicode step means N^2 conversion tables instead of 2*N (where N is
the number of code pages).
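Alexandre's point about table counts can be illustrated with a small sketch that converts between two code pages by going through Unicode. Each code page only needs its own "to Unicode" and "from Unicode" mapping (2*N in total), and any pair of pages can then be converted without a dedicated table. The names below are made up for this example, and the mappings are deliberately partial: only 7-bit ASCII and the 64 basic Cyrillic letters (U+0410 to U+044F) of ISO 8859-5 and Windows code page 1251 are handled.

```c
#include <assert.h>

/* Hypothetical identifiers for the two code pages of this example. */
typedef enum { PAGE_ISO8859_5, PAGE_CP1251 } codepage_t;

/* First byte of the basic Cyrillic block (U+0410) in each code page. */
static unsigned int cyrillic_base(codepage_t page)
{
    assert(page == PAGE_ISO8859_5 || page == PAGE_CP1251);
    return page == PAGE_ISO8859_5 ? 0xB0 : 0xC0;
}

/* One mapping per code page: byte -> Unicode code point.
 * Bytes outside the covered ranges give U+FFFD (replacement character). */
unsigned int to_unicode(codepage_t page, unsigned char byte)
{
    unsigned int base = cyrillic_base(page);
    if (byte < 0x80)
        return byte;
    if (byte >= base && byte < base + 0x40)
        return 0x0410 + (byte - base);
    return 0xFFFD;
}

/* One mapping per code page: Unicode code point -> byte.
 * Code points outside the covered ranges give '?'. */
unsigned char from_unicode(codepage_t page, unsigned int cp)
{
    if (cp < 0x80)
        return (unsigned char)cp;
    if (cp >= 0x0410 && cp <= 0x044F)
        return (unsigned char)(cyrillic_base(page) + (cp - 0x0410));
    return '?';
}

/* Any page to any page, with Unicode as the pivot. */
unsigned char convert_byte(codepage_t from, codepage_t to, unsigned char byte)
{
    return from_unicode(to, to_unicode(from, byte));
}
```

Adding a third code page to this scheme means writing two more mappings, whereas direct conversion would need a new table for every existing page in each direction.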
Since Alexandre's preferred approach is #2, it was the chosen
one. However, lots of arguments, mainly between Dimitrie Paun and
Patrik Stridvall, flooded wine-devel to such an extent that some
readers thought they were reading the linux-kernel mailing list.
Patrik also proposed to automate some of the A->W or W->A
conversions so that stubs for some functions could be generated from
the .spec file. This didn't work out, because there are many different
cases to take care of:
strings can be input, output, or input/output parameters
a NULL string can be an error or a normal parameter
strings can be 0 terminated, of fixed length...
in some cases (like resources), strings represent IDs (if
HiWord is 0)
Semantics seemed too complex to really provide a robust framework.
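The last case in the list above is the least obvious one, so here is a small sketch of it. When the high-order part of a "string" pointer is zero, the value is not an address at all but a 16-bit integer ID; Win32 expresses this with the MAKEINTRESOURCE and IS_INTRESOURCE macros. The helpers below are hypothetical portable equivalents written for this example, not the Win32 definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical equivalent of MAKEINTRESOURCE: hide an ID in a pointer. */
static const void *make_int_resource(uint16_t id)
{
    return (const void *)(uintptr_t)id;
}

/* Hypothetical equivalent of IS_INTRESOURCE: everything above the low
 * 16 bits is zero, so this cannot be the address of a real string. */
static int is_int_resource(const void *name)
{
    return ((uintptr_t)name >> 16) == 0;
}

/* Extract the ID. The caller must have checked is_int_resource() first. */
static unsigned int resource_id(const void *name)
{
    assert(is_int_resource(name));
    return (unsigned int)(uintptr_t)name;
}
```

An automatically generated A->W stub that blindly converted such a parameter would dereference the ID as if it were a string pointer, which is one reason the semantics could not be derived from the .spec file alone.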
In conclusion, Wine's internal string encoding shall (slowly) shift
from ANSI to Unicode (UTF-16).