AUTHORS list and the C locale on Mac OS X

Tue Nov 9 21:02:31 CST 2010

On 11/9/10 7:58 PM, James McKenzie wrote:
> On 11/9/10 3:29 PM, Reece Dunn wrote:
>> On 9 November 2010 22:13, Charles Davis <cdavis at mymail.mines.edu> wrote:
>>> On 11/9/10 1:58 PM, James Mckenzie wrote:
>>>> Charles Davis <cdavis at mymail.mines.edu> wrote:
>>>>> On 11/9/10 12:13 PM, James Mckenzie wrote:
>>>>>> No, it is not a bug in GNU sed.  Shouldn't the characters in the
>>>>>> authors.c file that are invalid for the language used by Mac OS X
>>>>>> be changed to acceptable ones?
>>>>> That ain't gonna fly. I think we should explicitly use a UTF-8 locale
>>>>> (like en_US.UTF-8 or some such) instead of the C locale when sed goes
>>>>> over the AUTHORS file.
>>>> Don't shoot the messenger.
>>> Sorry.
>>>
>>> The problem with your first idea--removing the bad characters directly
>>> from the authors.c file--is that we'd need to use a utility like sed or
>>> awk to implement it automatically--which puts us right back where we
>>> started. (We could use diff/patch, but is it worth the effort to
>>> maintain a patch for this? And would AJ let us put the patch file in
>>> Wine? And if not, where would we put it?)
>>>>   Maybe we can force the use of the sed in /usr/bin, if it exists
>>>> there, to get around the 'brokenness' of GNU sed on the Mac?
>>> Maybe. But that seems like a hack. A better way might be to detect if
>>> we're on Mac OS and using GNU sed; in that case, we use /usr/bin/sed.
>>> That's less of a hack, but still a hack.
>>>>   If not, it is a real bear to set the language on a Mac per
>>>> previous discussions on the Users list.
>>> That was about setting LANG. Wine always obeys LC_*, and so does sed.
>>>
>>> It's not the language that's the problem. It's the encoding. The AUTHORS
>>> file is encoded in UTF-8, but GNU sed isn't using UTF-8 because we told
>>> it not to (i.e. we told it to use MacRoman because that's the default
>>> encoding for the C locale). If we tell it to use UTF-8 (by setting
>>> LC_ALL to, for example, 'en_US.UTF-8'), it will process the file
>>> correctly.
>>>
>>> Unfortunately, I just remembered that the name of the UTF-8 encoding is
>>> different on Mac OS ('UTF-8') and Linux ('utf8'). That might prevent us
>>> from setting LC_ALL differently. We might end up having to hack around
>>> this the way either you or I described.
>> You could use autoconf to detect:
>>    1/  broken handling of UTF-8 characters by sed;
>>    2/  name of LC_ALL flag that handles UTF-8
>>
>> NOTE: You will need to enumerate available locales as the user may not
>> have en_US present with UTF-8 encoding (e.g. a Spanish-only or
>> Chinese-only system).
>>
>> Something like:
>>
>> cat > get_locale.sh << 'EOF'
>> # Print the first locale under which sed reads authors.c without error.
>> locale -a | while read -r locale ; do
>>     if LC_ALL=$locale sed -n p < authors.c > /dev/null 2>&1 ; then
>>        echo "$locale"
>>        exit
>>     fi
>> done
>> EOF
>>
>> This should print a locale that can process the UTF-8 file. It needs
>> cleaning up a bit, but that is the basis of it.
>>
> Thanks Reece.
> 
> Charles:  You want to do this?
I'm on it.

If you have a patch ready, though, go for it.

Chip
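
For reference, here is a rough sketch of the locale probe discussed above,
written in plain sh. It is only an illustration under some assumptions, not
the patch itself: the function names first_utf8_locale and sed_handles_file
are made up for this example, and the pattern simply accepts both spellings
of the codeset mentioned in the thread ('UTF-8' and 'utf8') regardless of
language, so a system without en_US would still yield a usable locale.

```shell
#!/bin/sh

# Hypothetical helper: read locale names on stdin (one per line, as printed
# by `locale -a`) and print the first one whose codeset is spelled either
# "UTF-8" or "utf8", ignoring case. Prints nothing if there is no match.
first_utf8_locale() {
    awk 'tolower($0) ~ /\.utf-?8$/ && !found { print; found = 1 }'
}

# Hypothetical helper: succeed if sed can read the file named by $2 from
# start to end without error while LC_ALL is set to the locale named by $1.
sed_handles_file() {
    LC_ALL=$1 sed -n p "$2" > /dev/null 2>&1
}
```

It could be used along these lines, assuming the AUTHORS file is in the
current directory:

    loc=$(locale -a | first_utf8_locale)
    if [ -n "$loc" ] && sed_handles_file "$loc" AUTHORS ; then
        echo "$loc"
    fi

Keeping the name matching separate from the sed check means a configure test
could fall back to /usr/bin/sed when no UTF-8 locale passes.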