[Snowball-discuss] Unicode

From: Martin Porter (martin_porter@softhome.net)
Date: Fri Feb 22 2002 - 17:35:49 GMT

Unicode is on its way, and I'll outline the proposed solution:

For input I think I'll alter the syntax of hex strings so that internal
spaces separate characters. This is not upwards-compatible, but I can't
imagine anyone will be upset.

So hex '0D0A'
needs to be written hex '0D 0A'
or even hex 'D A' // leading zeroes can be omitted

A new style 'decimal' will be introduced:

    decimal '13 10' // cr lf

Then we allow all values from 0 to 64K-1. Values >= 64K produce an error.
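
As a rough sketch of how the new string syntax might be read, assuming a
hypothetical helper parse_char_list (not part of the existing Snowball
compiler): it splits the string on spaces, reads each item in base 16 or 10,
and rejects anything at or above 64K.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from the Snowball sources: parse a string
   of space-separated numbers in the given base (16 for hex, 10 for decimal)
   into out[]. Returns the number of characters read, or -1 on a malformed
   number, a value >= 64K, or more than max_out characters.
   Note: with base 16, strtoul also tolerates an optional 0x prefix. */
static int parse_char_list(const char *s, int base,
                           unsigned short *out, int max_out)
{
    int n = 0;
    for (;;) {
        char *end;
        unsigned long v;

        while (*s == ' ') s++;                 /* skip separating spaces */
        if (*s == '\0') return n;              /* end of string */
        if (!isxdigit((unsigned char)*s)) return -1; /* no signs or junk */

        v = strtoul(s, &end, base);
        if (end == s) return -1;               /* no digit valid in this base */
        if (*end != ' ' && *end != '\0') return -1;  /* trailing junk */
        if (v > 0xFFFFUL) return -1;           /* values >= 64K are an error */
        if (n >= max_out) return -1;           /* output buffer is full */

        out[n++] = (unsigned short)v;
        s = end;
    }
}
```

Under this reading '0D 0A' and 'D A' both give the two characters 13 and 10,
while '0D0A' without the space gives the single character 0x0D0A.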

For output, the Java case is not a problem, since strings are made up of
16-bit items anyway.

For ANSI C we'll have 3 output styles:

1) 8 bit characters, where a reference to a character > 255 is an error. This
is the default style for output in the ANSI C case, and is what we have at
the moment.

2) 16 bit characters. The way to get this is to declare all strings in the form

    static symbol string_37[] = {'f','r','e','d'};
    static symbol string_38[] = {'h','a','r','r','y'};

and we typedef symbol to 'unsigned short'. When it is typedeffed to
'unsigned char' we get case (1) again. Of course any of the characters 'f'
etc may be replaced by a number > 255 to get Unicode characters.
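
To make the typedef switch concrete, here is a minimal sketch. SYMBOL_BITS is
a hypothetical build-time macro invented for this illustration, and the string
names are made up; the real generated names and options may well differ.

```c
#include <stddef.h>

/* SYMBOL_BITS is a hypothetical switch, not an existing Snowball option. */
#ifndef SYMBOL_BITS
#define SYMBOL_BITS 16
#endif

#if SYMBOL_BITS == 16
typedef unsigned short symbol;   /* output style (2): 16 bit characters */
#else
typedef unsigned char symbol;    /* output style (1): 8 bit characters */
#endif

/* Number of symbols in a statically declared string. */
#define SYMBOL_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Plain ASCII works unchanged under either typedef. */
static const symbol string_fred[] = {'f','r','e','d'};

#if SYMBOL_BITS == 16
/* U+0434 U+0430 (Cyrillic "da") only fits when symbol is 16 bits wide. */
static const symbol string_da[] = {0x0434, 0x0430};
#endif
```

The same source then compiles to either style by changing only the typedef,
as long as no string contains a value > 255 in the 8 bit case.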

3) UTF-8 encoded 8 bit characters. I believe the only change to the
generated C is that cursor movements of the form z->c++; and z->c--; need to
be replaced by function calls that move over 1, 2 or 3 bytes to get to the
next character.
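
A sketch of what such cursor functions might look like, assuming the text is
valid UTF-8. The names utf8_next and utf8_prev and their parameters are
hypothetical: p is the byte buffer, c the cursor as a byte offset, and l and
lb the forward and backward limits. Since values are kept below 64K, each
character occupies at most 3 bytes.

```c
/* Hypothetical replacement for z->c++: move the cursor forward over one
   UTF-8 encoded character. Returns the new cursor, or -1 if the cursor is
   already at the forward limit l. */
static int utf8_next(const unsigned char *p, int c, int l)
{
    if (c >= l) return -1;
    c++;                                    /* step over the lead byte */
    while (c < l && (p[c] & 0xC0) == 0x80)  /* skip 10xxxxxx continuation bytes */
        c++;
    return c;
}

/* Hypothetical replacement for z->c--: move the cursor backward over one
   UTF-8 encoded character. Returns the new cursor, or -1 if the cursor is
   already at the backward limit lb. */
static int utf8_prev(const unsigned char *p, int c, int lb)
{
    if (c <= lb) return -1;
    c--;                                    /* step onto the last byte */
    while (c > lb && (p[c] & 0xC0) == 0x80) /* back up to the lead byte */
        c--;
    return c;
}
```

Both functions rely only on the fact that continuation bytes have the bit
pattern 10xxxxxx, so they do not need to decode the character itself.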

- - - - - -

Does anyone know of a program of the form

     convert <input >output -option

where option could be ISOLatin1_to_Unicode, Unicode_to_Windows etc etc? I'll
have to put something together like this for test purposes.

