[PATCH] jscript: Ignore BOM mark in next_token. (try 5)

Thu Oct 2 05:48:56 CDT 2014

On 10/02/14 08:29, Nikolay Sivov wrote:
>>   static BOOL skip_spaces(parser_ctx_t *ctx)
>>   {
>> -    while(ctx->ptr < ctx->end && isspaceW(*ctx->ptr)) {
>> +    while(ctx->ptr < ctx->end && (isspaceW(*ctx->ptr) || *ctx->ptr == 0xFEFF /* UTF16 BOM */)) {
>>           if(is_endline(*ctx->ptr++))
>>               ctx->nl = TRUE;
>>       }
> This looks correct according to ECMA-262 section 7.2 - all of the
> following are whitespace:
>
> - tab and vertical tab, 0x9 and 0xb;
> - form feed 0xc
> - space 0x20
> - NBSP 0xa0
> - UTF-16 BOM 0xfeff
> - any other Unicode "space separator"
>
> Hopefully isspaceW() covers everything but the BOM. What worries me is
> that isspaceW() itself is used in numerous places in code on its own.
> So probably we need more tests to cover more cases where space
> separators could be used, and later have our own is_space() call that
> will conform to the standard.

FWIW, ECMA-262 (which I usually use for jscript development) doesn't
mention the UTF-16 BOM as white space. Anyway, I agree that it would be
interesting to see if it's considered white space in other places as
well. (I'm also fine with the patch in its current form, but an extended
version would obviously be better).
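
For illustration only, here is a minimal sketch of what a standalone
is_space()-style helper could look like if it followed the white space
list quoted above. It is not part of the patch. The names WCHAR16,
is_js_space, is_js_endline and skip_js_spaces are hypothetical, the
"space separator" entries are an assumed reading of Unicode category Zs,
and whether native jscript accepts each of these characters is exactly
what the extra tests would need to establish.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 16-bit code unit type (not Wine's WCHAR). */
typedef unsigned short WCHAR16;

/* Line terminators: LF, CR, LINE SEPARATOR, PARAGRAPH SEPARATOR. */
static int is_js_endline(WCHAR16 c)
{
    return c == 0x000A || c == 0x000D || c == 0x2028 || c == 0x2029;
}

/* White space per the quoted list; line terminators are handled separately. */
static int is_js_space(WCHAR16 c)
{
    switch(c) {
    case 0x0009: /* tab */
    case 0x000B: /* vertical tab */
    case 0x000C: /* form feed */
    case 0x0020: /* space */
    case 0x00A0: /* NBSP */
    case 0xFEFF: /* UTF-16 BOM */
        return 1;
    /* Assumed set of other Unicode "space separator" (Zs) characters. */
    case 0x1680:
    case 0x202F:
    case 0x205F:
    case 0x3000:
        return 1;
    default:
        return c >= 0x2000 && c <= 0x200A;
    }
}

/* Skips white space and line terminators; sets *nl if a line terminator
 * was crossed. Returns a pointer to the first character not skipped. */
static const WCHAR16 *skip_js_spaces(const WCHAR16 *ptr, const WCHAR16 *end, int *nl)
{
    while(ptr < end && (is_js_space(*ptr) || is_js_endline(*ptr))) {
        if(is_js_endline(*ptr++))
            *nl = 1;
    }
    return ptr;
}
```

Keeping the line terminator check separate from the white space check
mirrors the shape of skip_spaces() in the patch, where the newline flag
is set as a side effect of skipping.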

Jacek