Re: [Snowball-discuss] Regarding a unicode version of Snowball

From: Martin Porter (martin_porter@softhome.net)
Date: Wed Nov 28 2001 - 08:18:40 GMT


Archie,

I should point out that we have something rather less than a "development
team". We have me, and Richard Boulton who mainly helps with the Web site :-)

Yes, it says in the manual that "at some point Unicode characters will have
to be supported". I have given this some thought since receiving your email,
but before going further I would like to ask you: which do you think is the more
convenient representation (not just for you but for Unicode users
generally)? (a) Two bytes per character, so that 'char *' is replaced by
'short *', and you are still handling an array of characters, although the
size of the elements in the array has changed, or (b) a UTF-8 encoded form,
where characters below 128 are held in 1 byte, and other characters are held
in a variable number of bytes?
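
To make option (b) concrete, here is a rough sketch in C of how a single
character would be turned into its UTF-8 bytes. The helper name utf8_encode
is made up for illustration and is not part of Snowball, and the sketch only
covers characters below 0x10000, i.e. those that would also fit in the two
bytes of option (a).

```c
/* Hypothetical helper, not part of Snowball: encode one character
   (code point below 0x10000) as UTF-8. Writes 1 to 3 bytes to out
   and returns the number of bytes written. */
static int utf8_encode(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80) {
        /* characters below 128 are held in a single byte */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
}
```

So 'a' (0x61) stays as one byte, while e-acute (0xE9) becomes the two bytes
0xC3 0xA9. Under option (a) both would simply occupy one 'short' each.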

In the case of (a), which way round are the bytes? I assume the more
significant byte comes first, so "ab" would be stored as the four bytes
"\0" "a" "\0" "b".
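
As a sketch of what that byte order means in practice (assuming the more
significant byte really is first, which is exactly what I am asking about),
something like the following would lay out an array of two-byte characters.
The name put_be16 is hypothetical and used only for this example.

```c
#include <stddef.h>

/* Hypothetical helper: write n two-byte characters to out with the
   more significant byte first. out must have room for 2 * n bytes. */
static void put_be16(const unsigned short *chars, size_t n,
                     unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char) ((chars[i] >> 8) & 0xFF); /* high byte */
        out[2 * i + 1] = (unsigned char) (chars[i] & 0xFF);        /* low byte */
    }
}
```

Applied to the two characters 'a' and 'b', this produces the four bytes
0x00 'a' 0x00 'b'. With the opposite byte order the pairs would be swapped.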

Martin
