[Snowball-discuss] UTF-8 support

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Mon May 23 2005 - 16:02:47 BST


At last, some developments in Snowball!

I've put in a switch, -u or -utf8, to generate stemmers that handle UTF-8
encoded Unicode characters. Full documentation will follow, although from
the outside little is different. The ISO-Latin-1 sources of the Roman
alphabet stemmers are the same; the Russian stemmer has a stem-Unicode.sbl
variant.

Some of the stemmers needed small adjustments. If p marks a position in the
string,

... setmark p ...

the old test

    $p > 3

to see if p is beyond the first three characters no longer works in UTF-8
mode, since the number in p is a byte offset from the start of the string,
not a character offset. Instead you need something like

... hop 3 setmark x ...
... setmark p ...

and later

    $p > x

So marks should be tested relative to other marks, and not against absolute
numeric values. 'size' still measures the byte size of a string, not the
character size.
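
To see why the absolute test breaks, here is a rough sketch in Python (not
Snowball) of the byte arithmetic involved. The helper byte_offset_after_hop
is hypothetical and only imitates what 'hop n setmark' would record for
UTF-8 text; the word and the positions are made up for illustration.

```python
# Hypothetical helper, not part of Snowball: the byte offset reached after
# moving n characters forward from the start of UTF-8 encoded text.
def byte_offset_after_hop(text: str, n: int) -> int:
    return len(text[:n].encode("utf-8"))

word = "мама"                  # four Cyrillic letters, two bytes each in UTF-8
encoded = word.encode("utf-8")

x = byte_offset_after_hop(word, 3)   # imitates: hop 3 setmark x
p = byte_offset_after_hop(word, 2)   # a mark p set after only two characters

print(len(word), len(encoded))  # character count vs byte count ('size')
print(p > 3)   # the old absolute test: passes although p is not past 3 characters
print(p > x)   # the relative test: correctly fails
```

Here p is the byte offset 4 and x is the byte offset 6, so comparing p
against the literal 3 gives the wrong answer while comparing it against x
does not.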

The same sources can be used to generate UTF-8 and ISO-Latin-1 encodings so
long as code values are defined in hex, e.g.

    stringdef a^ hex 'E2' // a-circumflex

but obviously if UTF-8 sequences occur inside literal strings in the
snowball source scripts, you can only use them to generate stemmers for
UTF-8 encoded text.
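
As an illustration of why a hex code value can serve both encodings, the
following Python sketch (not Snowball) takes a-circumflex, Unicode code
point U+00E2, and encodes it both ways. It assumes the generator treats the
hex value as a character code and emits whatever byte sequence the target
encoding needs; the variable names are made up for the example.

```python
# Code value of a-circumflex: the same number in ISO-Latin-1 and in Unicode.
A_CIRCUMFLEX = 0xE2

latin1_bytes = chr(A_CIRCUMFLEX).encode("latin-1")  # one byte
utf8_bytes = chr(A_CIRCUMFLEX).encode("utf-8")      # two bytes

print(latin1_bytes.hex())  # e2
print(utf8_bytes.hex())    # c3a2
```

A literal string typed into the source as the two bytes C3 A2 would only
ever match UTF-8 text, which is the restriction described above.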

 


