At last, some developments in Snowball!
I've put in a switch, -u or -utf8, to generate stemmers that handle UTF-8
encoded Unicode characters. Full documentation will follow, although from
the outside little is different. The ISO-Latin-1 sources of the Roman
alphabet stemmers are unchanged; the Russian stemmer has a stem-Unicode.sbl
variant.
Some of the stemmers needed small adjustments. If p marks a position in the
string,

    ... setmark p ...

the old test

    $p > 3

to see if p is beyond the first three characters no longer works, since
the number in p is now a byte offset from the start of the string, not a
character offset, and in UTF-8 a character may occupy more than one byte.
Instead you need something like

    ... hop 3 setmark x ...
    ... setmark p ...

and later

    $p > x

So marks should be tested relative to other marks, and not against absolute
numeric values. Likewise, 'size' still measures the size of a string in
bytes, not in characters.
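
To see why the absolute test goes wrong, here is a rough sketch in Python
rather than Snowball. It only imitates the byte arithmetic; byte_offset is a
made-up helper for illustration and is not part of Snowball, and the word
chosen is just an example.

```python
# Illustration only: Python stands in for the generated stemmer here.
# byte_offset is a hypothetical helper, not part of Snowball.

def byte_offset(text: str, char_offset: int) -> int:
    """Byte position in the UTF-8 encoding after `char_offset` characters."""
    return len(text[:char_offset].encode("utf-8"))

word = "\u00e9l\u00e8ve"      # "eleve" with accents: 5 characters, 7 UTF-8 bytes

x = byte_offset(word, 3)      # plays the role of `hop 3 setmark x`
p = byte_offset(word, 3)      # a mark p placed after exactly 3 characters

absolute_test = p > 3         # old style: compares a byte offset with a character count
relative_test = p > x         # new style: compares one mark with another

print(len(word), len(word.encode("utf-8")), p, absolute_test, relative_test)
```

Here p sits after exactly three characters, so "p is beyond the first three
characters" ought to be false. The absolute comparison says true, because
the two accented letters take two bytes each; the mark-to-mark comparison
gives the intended answer.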
The same sources can be used to generate UTF-8 and ISO-Latin-1 encodings so
long as code values are defined in hex, e.g.

    stringdef a^ hex 'E2' // a-circumflex

but obviously if UTF-8 sequences occur inside literal strings in the
snowball source scripts, you can only use them to generate stemmers for
UTF-8 encoded text.
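
As a rough illustration of the difference (in Python, not Snowball, and
assuming the hex value is read as a character code that is then written out
in whichever encoding is being generated), take a-circumflex, which is
character code E2 in ISO-Latin-1 and Unicode:

```python
# Illustration only: how one character code yields different byte
# sequences in the two encodings. The variable names are hypothetical.

A_CIRCUMFLEX = 0xE2                                    # a-circumflex, U+00E2

latin1_bytes = chr(A_CIRCUMFLEX).encode("latin-1")     # one byte: E2
utf8_bytes = chr(A_CIRCUMFLEX).encode("utf-8")         # two bytes: C3 A2

# A literal that already contains the UTF-8 byte sequence, if read as
# ISO-Latin-1, turns into two unrelated characters.
misread = utf8_bytes.decode("latin-1")

print(latin1_bytes.hex(), utf8_bytes.hex(), len(misread))
```

A code value can be re-encoded either way, but a literal UTF-8 sequence is
already committed to one encoding, which is why such sources only make
sense for UTF-8 stemmers.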