Re: [Snowball-discuss] Regarding a unicode version of Snowball

From: Martin Porter
Date: Wed Nov 28 2001 - 08:18:40 GMT


I should point out that we have something rather less than a "development
team". We have me, and Richard Boulton who mainly helps with the Web site :-)

Yes, it says in the manual that "at some point Unicode characters will have
to be supported". I have given this some thought since receiving your email,
but before going further would like to ask you: which do you think is a more
convenient representation (not just for you but for Unicode users
generally)? (a) Two bytes per character, so that 'char *' is replaced by
'short *', and you are still handling an array of characters, although the
size of the elements in the array has changed, or (b) a UTF-8 encoded form,
where characters below 128 are held in 1 byte, and other characters are held
in a variable number of bytes?
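
To make the difference between (a) and (b) concrete, here is a minimal sketch in C of how one 16-bit character from representation (a) could be turned into the UTF-8 form of (b). The function name utf8_encode16 is hypothetical and is not part of Snowball. It assumes plain 16-bit values and ignores surrogate pairs.

```c
/* Hypothetical helper, not part of Snowball: encode one 16-bit
   character value as UTF-8. Writes 1 to 3 bytes into out and
   returns the number of bytes written. Surrogates are not handled. */
int utf8_encode16(unsigned short c, unsigned char *out)
{
    if (c < 0x80) {
        /* characters below 128 are held in 1 byte */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

Under (a) every character costs two bytes and can be indexed directly. Under (b) ASCII text stays byte-for-byte unchanged, but stepping from one character to the next means inspecting the lead byte.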

In the case of (a), which way round are the bytes? I assume the more
significant is first, so "ab" would become "\0" "a" "\0" "b".
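
As an illustration of the byte-order question, the following sketch writes 16-bit characters out with the more significant byte first. The name store_be16 is hypothetical. Note that an in-memory 'short *' array simply uses the machine's native order, so the question only arises when the characters are read or written as a stream of bytes.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Snowball: write n 16-bit
   characters to out, more significant byte first. out must have
   room for 2 * n bytes. */
void store_be16(const unsigned short *s, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(s[i] >> 8);   /* high byte */
        out[2 * i + 1] = (unsigned char)(s[i] & 0xFF); /* low byte  */
    }
}
```

Applied to the two characters 'a' and 'b', this produces the four bytes "\0" "a" "\0" "b" assumed above.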


