Re: [Snowball-discuss] a simple algorithm problem

From: Martin Porter
Date: Mon Dec 13 2004 - 17:41:23 GMT


Well, your example word is


where * is the two-byte sequence C4 B1 (hex),
                              or (110)00100 (10)110001 (binary)

which is the UTF-8 encoding of 00100110001 (binary) or 131 (hex), which is
the Unicode character for a dotless i.
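
To make the bit arithmetic above concrete, here is a minimal sketch in Python (not Snowball) that strips the UTF-8 marker bits from the two bytes and joins the remaining payload bits. The function name decode_two_byte_utf8 is made up for this illustration, and the sketch deliberately handles only the two-byte form, without checking for overlong encodings.

```python
import unicodedata


def decode_two_byte_utf8(b1, b2):
    """Decode a two-byte UTF-8 sequence into a Unicode code point (illustrative helper)."""
    # The lead byte must look like 110xxxxx and the continuation byte like 10xxxxxx.
    if b1 & 0xE0 != 0xC0 or b2 & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the low 5 bits of the lead byte and the low 6 bits of the
    # continuation byte, then concatenate them into an 11-bit value.
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


code_point = decode_two_byte_utf8(0xC4, 0xB1)
print(format(code_point, "011b"))         # 00100110001
print(format(code_point, "X"))            # 131
print(unicodedata.name(chr(code_point)))  # LATIN SMALL LETTER DOTLESS I
```

Python's built-in decoder gives the same answer: bytes([0xC4, 0xB1]).decode("utf-8") is the single character U+0131.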

In other words, you think of it as one character, which in Unicode it is,
but Snowball thinks it is two characters, because it occupies two bytes.
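
The mismatch can be seen outside Snowball too. The following Python sketch only illustrates the one-character-versus-two-bytes distinction and says nothing about Snowball's internals: the same dotless i has length 1 as a Unicode string and length 2 once it is encoded as UTF-8 bytes.

```python
dotless_i = "\u0131"                 # one Unicode character
encoded = dotless_i.encode("utf-8")  # the same character as UTF-8 bytes

print(len(dotless_i))  # 1
print(len(encoded))    # 2

# A program that treats each byte as a character sees two separate units.
units = [format(b, "02X") for b in encoded]
print(units)           # ['C4', 'B1']
```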

You can run Snowball in 16-bit character mode and so represent the Turkish
alphabet in Unicode. But the special characters you are defining suggest
that you might be trying to get the stemmer working in 8-bit ASCII with
ISO Latin-1 extensions.

My inclination would be to get it going as an 8-bit-per-character program
and worry about Unicode later.
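
As a rough illustration of why the 8-bit route is simpler, the Python sketch below encodes a sample word both ways. Two assumptions here are not from this thread: the word "kız" is a made-up sample (not the word from the original question), and ISO-8859-9 is used as the 8-bit encoding because it contains the Turkish letters, including dotless i at byte FD. In the 8-bit encoding every letter is exactly one byte, so byte-at-a-time processing lines up with characters.

```python
# Assumption: "k\u0131z" is a hypothetical sample word containing a dotless i.
word = "k\u0131z"

as_utf8 = word.encode("utf-8")
# Assumption: ISO-8859-9 stands in for "an 8-bit encoding"; the thread names none.
as_8bit = word.encode("iso8859_9")

print(len(word))      # 3 characters
print(len(as_utf8))   # 4 bytes: the dotless i takes two
print(len(as_8bit))   # 3 bytes: one per character
print(as_8bit.hex())  # 6bfd7a
```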

