Unicode question

Sat Dec 7 07:17:43 CST 2002

Rolf Kalbermatter wrote:

>Hello,
>
>In trying to get shell32 a little bit more Unicodified I came across this
>function ParseFieldA which is taken from shellord.c. I'm quite unfamiliar
>with Unicode so I still have to learn a lot.
>
>I have finally found most of the string manipulation functions which work
>for Unicode but when it comes down to simple character comparison I'm a
>little bit in the dark here.
>
>Some code snippets elsewhere in wine make me believe that for the english
>charset WCHAR == char is actually mostly true. However I wonder if this
>can be relied on in code. For instance the Unicode version of ParseField
>would in that case look like this but I really want the opinion of someone
>else on, if the code
>
>if (*src++ == ',') nField--;
>
>is actually working as expected on all systems independent of the actually
>used charsets for the local languages.
>

It's ok to compare a WCHAR with a known character constant ('A'), but 
not, in general, two arbitrary WCHARs with each other.
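
As a rough sketch of the first case (not the actual shell32 code): the 
comparison against ',' works because the WCHAR is promoted to int and 
U+002C has the same value as ASCII ','. This assumes an ASCII-compatible 
source character set. The WCHAR typedef is a stand-in for the one in the 
Wine headers, and skip_to_field is a hypothetical helper name.

```c
#include <stddef.h>

typedef unsigned short WCHAR;  /* stand-in for the 16-bit WCHAR from the Wine headers */

/* Hypothetical helper: return a pointer to the start of field nField
 * (1-based) in a comma-separated WCHAR string, or NULL if the string
 * has fewer fields than that. */
static const WCHAR *skip_to_field(const WCHAR *src, int nField)
{
    while (nField > 1)
    {
        if (*src == 0) return NULL;      /* ran out of string */
        if (*src++ == ',') nField--;     /* WCHAR compared with a known char */
    }
    return src;
}
```

A comma can never be half of a surrogate pair, so this loop is safe 
whatever else the string contains.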

Explanation: we (like Windows) use UTF-16 to represent characters 
(UCS-2 is the older fixed-width form, which has no surrogates). Most 
common Unicode characters in Europe, Africa, America, Australia and the 
Middle East fit nicely into a single 16-bit WCHAR, and there are no 
problems. Some characters, including a number of the rarer East Asian 
ones, do not.

The characters that don't fit are represented using surrogates, i.e. 
each such character takes two WCHARs to represent. The Unicode standard 
has been very wise in selecting the surrogates, however. Both the first 
and the second WCHAR of any surrogate pair are taken from ranges that 
are not allocated to any other Unicode character. This means that if you 
are looking for a Hebrew "Aleph", scanning with a piece of code that 
looks something like:
while (*str && *str != 0x5d0) str++;
is guaranteed not to stop on anything except "Aleph" (or the end of the 
string). So if it's a specific character you are looking for, and you 
know it's not a surrogate, your code will work.
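
A minimal sketch of that scan, assuming a NUL-terminated UTF-16 string. 
The ranges 0xD800-0xDBFF (first half) and 0xDC00-0xDFFF (second half) 
are the surrogate ranges from the Unicode standard. The WCHAR typedef 
and the function names are made up for illustration.

```c
#include <stddef.h>

typedef unsigned short WCHAR;  /* stand-in for the 16-bit WCHAR from the Wine headers */

/* Nonzero if ch is one half of a surrogate pair (0xD800..0xDFFF). */
static int is_surrogate(WCHAR ch)
{
    return ch >= 0xD800 && ch <= 0xDFFF;
}

/* Hypothetical helper: find the first occurrence of a known,
 * non-surrogate character in a NUL-terminated WCHAR string.
 * Returns NULL if it is absent or if ch is a surrogate half. */
static const WCHAR *find_bmp_char(const WCHAR *str, WCHAR ch)
{
    if (is_surrogate(ch)) return NULL;   /* not a whole character on its own */
    for (; *str; str++)
        if (*str == ch) return str;      /* cannot land inside a surrogate pair */
    return NULL;
}
```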

However!
If you are trying to look for an occurrence of one character inside a 
string, and neither the string nor the character is known to you at the 
time of writing the code, this technique may fail miserably. The reason 
is that if the character you are looking for is encoded as a surrogate 
pair, its first and second WCHARs may each appear, separately, in other 
characters (all of them surrogate pairs themselves, but still).
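
One way around this, sketched under the assumption that the string is 
well-formed, NUL-terminated UTF-16, is to search for the character as a 
whole code point and require both halves of the pair to match in order. 
The WCHAR typedef and find_code_point are hypothetical names, not 
existing Wine functions.

```c
#include <stddef.h>

typedef unsigned short WCHAR;  /* stand-in for the 16-bit WCHAR from the Wine headers */

/* Hypothetical helper: find code point cp in a NUL-terminated UTF-16
 * string. Returns a pointer to its first WCHAR, or NULL if not found
 * or if cp is not a valid character. */
static const WCHAR *find_code_point(const WCHAR *str, unsigned long cp)
{
    WCHAR hi, lo;

    /* Out of range, or a lone surrogate half: not a character. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return NULL;

    if (cp < 0x10000)
    {
        /* Fits in one WCHAR: a plain scan is enough. */
        for (; *str; str++)
            if (*str == cp) return str;
        return NULL;
    }

    /* Split into a surrogate pair as UTF-16 defines it. */
    cp -= 0x10000;
    hi = (WCHAR)(0xD800 + (cp >> 10));
    lo = (WCHAR)(0xDC00 + (cp & 0x3FF));

    /* Both halves must match, adjacent and in order. str[1] is at
     * worst the terminating NUL, so reading it is safe. */
    for (; *str; str++)
        if (str[0] == hi && str[1] == lo) return str;
    return NULL;
}
```

Matching only the first WCHAR of the pair would stop on any other 
character that happens to share it, which is exactly the failure 
described above.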

Bear that in mind, and everything will be ok.

                Shachar