UCS2 vs. UTF16 question

Mon Oct 18 10:00:26 CDT 2004

--- Shachar Shemesh <wine-devel at shemesh.biz> wrote:
> I took the liberty of answering to the list. I hope
> you don't mind.

Not at all.

> I most certainly didn't say that. I may have
> mentioned UCS4, but to the 
> best of my knowledge at the time, Windows uses
> UTF-16.

Ah.  It might have been Chris Hertel who said that,
then.  The Samba folks may see that on the wire.

> >WCHARs are in fact fixed-width in Windows?
> As far as I know, they are not. Sorry.

Okay.  That's fine.  I'm just trying to understand the
encodings correctly.

> >I'm planning to write a tool to detect the
> following
> >problematic bit of code:
> >char str[] = "hi", *p = str + sizeof(str) - 1;
> >p--;
> >At least, it's problematic when str contains
> >double-byte characters.
> >
> I'm not sure what you are aiming at achieving. Are
> you trying to hit the 
> beginning of the last character of the string? If
> so, then you do, 
> indeed, have a problem here.

Yes, that's what the code's doing.  I'm actually doing
a research project for a class.  My project partner
and I are thinking of using static analysis to detect
this sort of bug.  We can probably just use lexical
analysis to flag other suspect calls, like strchr and
strrchr.  We're thinking tools like this might help
catch some internationalization bugs.
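
To make the failure concrete, here is a rough sketch of why
byte-wise code goes wrong on a double-byte string.  It assumes
Shift-JIS, where a trail byte can be 0x5C (the same value as a
backslash), so a plain strrchr() for a backslash can land in the
middle of a character, just as "p--" can.  The helper names
(is_sjis_lead, sjis_strrchr, sjis_last_char) are made up for
illustration and are not from Wine or any library.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: Shift-JIS (code page 932), where lead bytes fall in
   0x81-0x9F or 0xE0-0xFC.  Other DBCS code pages use other ranges. */
static int is_sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical DBCS-aware stand-in for strrchr(): scans forward so it
   always knows whether a byte is a trail byte, and only matches ch
   when ch stands alone as a single-byte character. */
static const char *sjis_strrchr(const char *s, char ch)
{
    const char *last = NULL;
    while (*s != '\0') {
        if (is_sjis_lead((unsigned char)*s) && s[1] != '\0') {
            s += 2;                 /* skip lead byte and trail byte */
        } else {
            if (*s == ch)
                last = s;
            s++;
        }
    }
    return last;
}

/* Returns a pointer to the start of the last character of s, which is
   what the "p--" idiom is trying (and failing) to compute.  For an
   empty string it returns s itself. */
static const char *sjis_last_char(const char *s)
{
    const char *last = s;
    while (*s != '\0') {
        last = s;
        s += (is_sjis_lead((unsigned char)*s) && s[1] != '\0') ? 2 : 1;
    }
    return last;
}
```

With the bytes 'C' ':' '\' 0x95 0x5C 'a' '.' 't' 'x' 't', strrchr()
for a backslash returns the trail byte at offset 4, while the
forward-scanning version returns the real separator at offset 2.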

> In the past I have written programs that had to do
> MBCS (the non-unicode 
> Japanese encoding). This is an encoding in which
> some characters are one 
> byte, and some two. The best I could come up with
> was to build a wrapper 
> around std::string that had two bytes per character
> internally. When you 
> loaded a string, it would check character by
> character for whether it's 
> a double byte, and then have each string location
> contain exactly one 
> character. This allowed random access, as well as
> both forward AND 
> backwards scanning.

That seems reasonable.
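
For my own notes, the widening idea might look roughly like this in
plain C.  This is only a sketch of the approach as I understand it,
not your actual wrapper: it assumes Shift-JIS lead-byte ranges, and
sjis_lead_byte and sjis_widen are names I invented.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC. */
static int sjis_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Expands the MBCS string s into out[], storing exactly one character
   per 16-bit element: a single-byte character keeps its byte value,
   a double-byte character becomes (lead << 8) | trail.
   Returns the number of characters stored (never more than cap). */
static size_t sjis_widen(const char *s, uint16_t *out, size_t cap)
{
    size_t n = 0;
    while (*s != '\0' && n < cap) {
        unsigned char b = (unsigned char)*s++;
        if (sjis_lead_byte(b) && *s != '\0') {
            unsigned char t = (unsigned char)*s++;
            out[n++] = (uint16_t)((b << 8) | t);
        } else {
            out[n++] = b;
        }
    }
    return n;
}
```

Once the string is in that form, out[i] is always a whole character,
so indexing and stepping backwards are ordinary array operations.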

> Fortunately, UTF is much better than MBCS. Given a
> byte in either UTF-8 
> or UTF-16, it's fairly easy to figure out whether
> it's part of a 
> surrogate, and what part. If you have assurance that
> the string you are 
> handling is a well formed one, you can do backward
> scans of a UTF string 
> fairly easily.

Indeed.  Like you said, it's the MBCS/DBCS encodings
that are particularly bad in this respect.
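
Here is a minimal sketch of the backward step for both encodings,
assuming well-formed input as you said.  In UTF-8 every continuation
byte has the bit pattern 10xxxxxx, and in UTF-16 the second unit of
a surrogate pair is in 0xDC00-0xDFFF, so one look at the current
unit is enough to decide whether to keep backing up.  utf8_prev and
utf16_prev are hypothetical names, not existing APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Steps back from p to the start of the previous UTF-8 character,
   never going before start.  Assumes well-formed UTF-8. */
static const char *utf8_prev(const char *start, const char *p)
{
    if (p <= start)
        return start;
    do {
        --p;                        /* skip 10xxxxxx continuation bytes */
    } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
    return p;
}

/* Steps back from p to the start of the previous UTF-16 character.
   If the unit before p is a low surrogate preceded by a high
   surrogate, both units are skipped as one character. */
static const uint16_t *utf16_prev(const uint16_t *start, const uint16_t *p)
{
    if (p <= start)
        return start;
    --p;
    if (p > start && (*p & 0xFC00) == 0xDC00 && (p[-1] & 0xFC00) == 0xD800)
        --p;
    return p;
}
```

Neither function needs to rescan from the beginning of the string,
which is exactly what the DBCS encodings cannot offer.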

> Do you want a gmail account?

Got one, haven't used it much yet.

Thanks,
--Juan
