Re: [Snowball-discuss] 8-bit and 16-bit characters support

From: Martin Porter
Date: Wed Jun 04 2003 - 07:02:04 BST


No, Snowball is set up either for 1-byte character use or for 2-byte
character use. But it has occurred to me that running the stemmers on utf-8
data may not be so difficult, even with no changes to the Snowball compiler.

If you treat utf-8 data as a pure byte stream (so that one non-ASCII
character corresponds to 2 or more bytes) the stemmers almost work. The
thing that goes wrong is the single-character test for membership of a
character class. So one would have to replace

    goto vowel // vowel defined by: define vowel '...'

by

    goto among ('a' 'e' 'i' 'o' 'u')

or more precisely

    goto among ('[a]' '[e]' ... )

where [a] etc. are macros defining the vowels as utf-8 encoded byte sequences.
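
The difference between the two forms can be sketched outside Snowball. The
Python fragment below is only an illustration under stated assumptions, not
Snowball code and not what the compiler generates: the vowel list and the
names RUSSIAN_VOWELS, goto_vowel_bytewise and goto_among are hypothetical.
It contrasts a class test on single bytes with a match on whole utf-8 byte
sequences.

```python
# Illustrative sketch only: Python standing in for a byte-oriented stemmer run.
# All names here are hypothetical and chosen for this example.

RUSSIAN_VOWELS = "аеиоуыэюя"

# Single-byte view: every byte that occurs in the encoding of any vowel.
# This is what a one-character class test sees on a raw byte stream.
VOWEL_BYTES = {b for ch in RUSSIAN_VOWELS for b in ch.encode("utf-8")}

# Byte-sequence view: one complete utf-8 sequence per vowel,
# playing the role of among ('[a]' '[e]' ...).
VOWEL_SEQUENCES = [ch.encode("utf-8") for ch in RUSSIAN_VOWELS]


def goto_vowel_bytewise(data: bytes, start: int = 0) -> int:
    """Stop at the first single byte that belongs to the byte class."""
    for i in range(start, len(data)):
        if data[i] in VOWEL_BYTES:
            return i
    return -1


def goto_among(data: bytes, start: int = 0) -> int:
    """Stop at the first offset where a whole vowel byte sequence begins."""
    for i in range(start, len(data)):
        if any(data.startswith(seq, i) for seq in VOWEL_SEQUENCES):
            return i
    return -1


word = "стол".encode("utf-8")
print(goto_vowel_bytewise(word), goto_among(word))
```

For the word "стол" the byte-wise test stops at offset 0, because the lead
byte of the consonant "с" is shared with several vowels, while the sequence
match stops at offset 4, where "о" begins. The sequence match cannot fire in
the middle of a character, since a utf-8 lead byte never occurs as a
continuation byte.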

Perhaps that is how all the stemmers should have been written.

Can you point me to some plain text somewhere on the web that gives a bit of
Russian in utf-8 encoded Unicode? I might play around with this idea.

