Re: [Snowball-discuss] Unicode representation (formerly: Regarding a unicode version of Snowball)

From: Martin Porter
Date: Thu Dec 06 2001 - 12:06:17 GMT

James Aylett was asking whether the Unicode issue for Snowball is one of
internal representation, and the answer, I think, is that it is not.
Snowball's internal representation is currently one byte per character, and
you can define your own coding scheme. The question is really one for the API:
if people want to hand over strings for stemming in a Unicode form, how
exactly will these strings be encoded? I imagine Snowball can be extended to
handle any kind of encoding, but it would clearly be silly to make it handle
strings in UTF-16 form if everyone is using strings in UTF-8 form, or vice
versa.

Of course, people may want to hand over strings in all sorts of forms, in
which case you would need to convert before processing, but I wasn't
expecting that to be necessary.
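
As an illustration of what "convert before processing" could look like, here
is a minimal Python sketch. It assumes the stemmer works on a
one-byte-per-character coding scheme and picks KOI8-R for Russian purely as an
example; the function name to_single_byte is hypothetical and is not part of
Snowball.

```python
def to_single_byte(data: bytes, source_encoding: str,
                   target_encoding: str = "koi8_r") -> bytes:
    """Re-encode incoming text into a one-byte-per-character scheme.

    Hypothetical helper: KOI8-R is only an assumed example of a
    user-defined single-byte coding scheme.
    """
    text = data.decode(source_encoding)   # bytes in the caller's encoding -> code points
    return text.encode(target_encoding)   # code points -> one byte per character


word = "книга"  # Russian for "book", 5 letters

# The same word handed over in two different Unicode encodings.
from_utf8 = to_single_byte(word.encode("utf-8"), "utf-8")
from_utf16 = to_single_byte(word.encode("utf-16-le"), "utf-16-le")

# Both routes produce the same 5-byte string for the stemmer to work on.
print(from_utf8 == from_utf16, len(from_utf8))
```

Whichever encoding the caller uses, the stemmer in this sketch only ever sees
the single-byte form, so the choice of API encoding stays separate from the
internal representation.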

In any case there is perhaps no need to continue the discussion at present:
there is no immediate demand for Unicode support.

(Generally, it seems to me that Unicode is fine if you are based in the
Latin alphabet and want to make occasional excursions into exotic
characters, and that it is also appropriate for Chinese and Japanese, but
representing, say, Cyrillic in the UTF-8 form of Unicode seems terribly
clumsy. Do Oleg and Andrei have a view on this?)
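
To put rough numbers on the Cyrillic point, the following Python sketch
compares how many bytes the same word takes in a single-byte encoding, in
UTF-8, and in UTF-16. KOI8-R and the sample words are assumptions chosen for
illustration only, and encoded_sizes is a hypothetical helper.

```python
def encoded_sizes(word: str) -> dict:
    """Return the encoded length in bytes of `word` under each encoding."""
    encodings = ("koi8_r", "utf-8", "utf-16-le")
    return {enc: len(word.encode(enc)) for enc in encodings}


latin = encoded_sizes("stemming")     # 8 Latin letters
cyrillic = encoded_sizes("стемминг")  # 8 Cyrillic letters

print(latin)     # {'koi8_r': 8, 'utf-8': 8, 'utf-16-le': 16}
print(cyrillic)  # {'koi8_r': 8, 'utf-8': 16, 'utf-16-le': 16}
```

Each Cyrillic letter takes two bytes in UTF-8, so a Cyrillic word doubles in
size relative to a single-byte scheme, while plain Latin text does not grow at
all.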


