Re: [Snowball-discuss] 8-bit and 16-bit characters support

From: Martin Porter (
Date: Wed Jun 04 2003 - 09:21:02 BST


I was really thinking aloud. I would need to rewrite the Snowball scripts to
use 'among's rather than character groups. 'goto vowel' was just a way of
illustrating the problem.

The way to make it work with UTF-8 encoded data is to put the Unicode
Russian characters into 2-byte form before calling Snowball, and then repack
as UTF-8 afterwards. Tedious, I know.
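
A rough sketch of that unpack/repack step, in Python rather than C for
brevity. The names utf8_to_units, units_to_utf8 and stem_utf8 are made up
for illustration, and stem_units stands in for whatever stemmer works on
16-bit character values; none of this is part of Snowball itself.

```python
from typing import Callable, List


def utf8_to_units(data: bytes) -> List[int]:
    """Unpack UTF-8 bytes into a list of 16-bit character values."""
    units = [ord(ch) for ch in data.decode("utf-8")]
    # Assumption: the stemmer only handles values that fit in 2 bytes.
    if any(u > 0xFFFF for u in units):
        raise ValueError("character outside the 16-bit range")
    return units


def units_to_utf8(units: List[int]) -> bytes:
    """Repack a list of 16-bit character values as UTF-8 bytes."""
    return "".join(chr(u) for u in units).encode("utf-8")


def stem_utf8(data: bytes, stem_units: Callable[[List[int]], List[int]]) -> bytes:
    """Unpack, call a (hypothetical) 16-bit stemmer, then repack."""
    return units_to_utf8(stem_units(utf8_to_units(data)))
```

Each Russian letter arrives as two UTF-8 bytes and becomes a single value
in the Cyrillic block (0x0400 to 0x04FF), so character groups can be tested
one value at a time.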

I said 2 or 3 byte characters because in UTF-8, a 16-bit character value
above 127 packs into either 2 or 3 bytes. Is that not so?
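
For what it is worth, the packing rule can be written down directly. This
is only a sketch, and utf8_length is a name invented here. Values above
0xFFFF take 4 bytes, but those fall outside the 16-bit range under
discussion.

```python
def utf8_length(value: int) -> int:
    """Number of bytes UTF-8 uses to pack one character value."""
    if value < 0x80:        # 0..127: plain ASCII, 1 byte
        return 1
    if value < 0x800:       # 128..2047: 2 bytes (includes Cyrillic)
        return 2
    if value < 0x10000:     # 2048..65535: 3 bytes
        return 3
    return 4                # beyond 16 bits: 4 bytes
```

So a Russian letter such as 0x0416 packs into 2 bytes, while a 16-bit value
such as 0x20AC (the euro sign) packs into 3.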

I will look at the http address you sent.

