Re: [Snowball-discuss] a simple algorithm problem

From: Martin Porter (
Date: Tue Jan 04 2005 - 10:35:13 GMT


Thank you for the sample Snowball script. (I would have defined the utf-8
characters through stringdefs, following section 3 of the Snowball manual).

You ask if there is any way around 'size' being wrong when utf-8 characters
are included. This is all part of the problem of character representation.
The best solution would be an additional compilation mode for utf-8, but of
course that would mean yet another elaboration of Snowball itself ...
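
To make the 'size' discrepancy concrete, here is a minimal C sketch that compares the byte count of a utf-8 string with its character count. The function name utf8_length is hypothetical and is not part of the Snowball runtime; the sketch assumes valid, NUL-terminated utf-8 and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count utf-8 characters by skipping continuation bytes (bit pattern
   10xxxxxx). Hypothetical helper, not a Snowball runtime function.
   Assumes `s` is valid, NUL-terminated utf-8. */
static size_t utf8_length(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;                /* a lead byte or an ASCII byte */
    }
    return n;
}
```

For the Turkish suffix 'müş', written in C as "m\xC3\xBC\xC5\x9F", strlen reports 5 bytes while utf8_length reports 3 characters, which is the gap that 8-bit working cannot see.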

Right now there is 8-bit working, and 16-bit working. 16-bit working fits in
well with the Java codegenerator scheme, and can be used with the ANSI C
codegenerator, although it has given rise to confusion on at least one
occasion.

In Snowball the concept of 'character' only turns up in a few contexts:

a) hop N - to hop forward N characters
b) next = hop 1
c) goto C
d) gopast C - where you keep doing a 'next' until C is successful
e) size - counts the number of characters

and in 'groupings'. In retrospect, I occasionally wish groupings were not in
the language. Instead of

A) define vowel 'aeiou'

one could have

B) define vowel as among('a' 'e' 'i' 'o' 'u')

(A) is implemented as a bitmap, and (B) as a fast table lookup, and (A) is
faster than (B), but optimisation in the codegenerator could turn (B) into a
bitmap as well. There are other differences however: (B) needs to be defined
in a 'forward' or 'backward' context; non-vowel is a neat test that works
with style (A) but not (B).
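
The difference between the two styles can be sketched in C roughly as follows. This is only an illustration under simplifying assumptions: 8-bit characters, a 256-bit map for style (A), and a binary search over a sorted character table standing in for the among() lookup of style (B). The names grouping, grouping_define, in_grouping and in_among are hypothetical and are not what the codegenerator emits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Style (A): one bit per 8-bit character code. */
typedef struct { unsigned char bits[32]; } grouping;

/* Set the bit for every character listed in `chars`. */
static void grouping_define(grouping *g, const char *chars)
{
    assert(g != NULL && chars != NULL);
    memset(g->bits, 0, sizeof g->bits);
    for (; *chars != '\0'; chars++) {
        unsigned char c = (unsigned char)*chars;
        g->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
    }
}

/* Membership is a shift and a mask; 'non-vowel' is just the negation. */
static int in_grouping(const grouping *g, unsigned char c)
{
    assert(g != NULL);
    return (g->bits[c >> 3] >> (c & 7)) & 1;
}

/* Style (B), simplified: binary search in a sorted table of characters. */
static int in_among(const char *sorted, size_t n, unsigned char c)
{
    size_t lo = 0, hi = n;
    assert(sorted != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned char m = (unsigned char)sorted[mid];
        if (m == c) return 1;
        if (m < c) lo = mid + 1; else hi = mid;
    }
    return 0;
}
```

With the bitmap, a test such as non-vowel costs a single negated lookup, whereas the table search takes several comparisons per character.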

If groupings were NOT in the language, you could reduce the difference
between utf-8 and single character working to the definition of a couple of
macros PREV and NEXT (thinking of ANSI C codegeneration) that move the
character cursor left or right by one place, and that only turn up in the
definitions of (a) to (e) above.
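
As a rough sketch of what such NEXT and PREV operations might look like for utf-8, here they are written as C functions rather than macros, for readability. The names utf8_next, utf8_prev and utf8_hop, the integer cursor, and the -1 failure value are all assumptions made for this illustration; they are not taken from the existing codegenerator. The input is assumed to be valid utf-8.

```c
#include <assert.h>
#include <stddef.h>

/* True for a utf-8 continuation byte (bit pattern 10xxxxxx). */
#define IS_CONT(b) (((unsigned char)(b) & 0xC0) == 0x80)

/* Hypothetical NEXT: move the cursor right by one character.
   Returns the new cursor, or -1 if already at the limit. */
static int utf8_next(const char *p, int cursor, int limit)
{
    assert(p != NULL);
    if (cursor >= limit) return -1;
    cursor++;
    while (cursor < limit && IS_CONT(p[cursor])) cursor++;
    return cursor;
}

/* Hypothetical PREV: move the cursor left by one character.
   Returns the new cursor, or -1 if already at the lower limit. */
static int utf8_prev(const char *p, int cursor, int lower_limit)
{
    assert(p != NULL);
    if (cursor <= lower_limit) return -1;
    cursor--;
    while (cursor > lower_limit && IS_CONT(p[cursor])) cursor--;
    return cursor;
}

/* 'hop N' expressed as N applications of NEXT; fails with -1 if the
   limit is reached first. */
static int utf8_hop(const char *p, int cursor, int limit, int n)
{
    while (n-- > 0) {
        cursor = utf8_next(p, cursor, limit);
        if (cursor < 0) return -1;
    }
    return cursor;
}
```

In single character working the two loops never execute, so NEXT and PREV reduce to cursor + 1 and cursor - 1, and the rest of (a) to (e) can stay unchanged.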


I have rather mixed feelings about utf-8. It is of course in widespread use.
It is especially convenient for languages using Roman letters with certain
extensions (for example, your Turkish following the 1928 reforms of Ataturk).
But it seems to me to be singularly clumsy for languages based on other
alphabets (Russian, Greek, Arabic).


At 21:31 31/12/2004 +0000, ayhan peker wrote:
>Martin hi,
>I have made some changes.
>It looks like if everything is in utf-8 you dont need to do string
>definitions at all.
>The algorithm works as it is except that size is wrong. As you said
>"Snowball thinks it is two characters".
>Is there a way round it?
>About turkish stemming in mtu. I knew they were working on it. I wish
>they put something up more concrete (the code). It might very well be
>all in theory.
>
>Ayhan
>btw. Happy Christmas and happy new year.
>
>The code:
>
>
>routines (
>           mark_regions
>           R1
>           common_suffix
>
>)
>externals ( stem )
>integers ( p1 p3)
>groupings ( v all )
>stringescapes {}
>
>/* special characters (in turkish) */
>
>stringdef u"   hex 'FC'  // u with diaeresis
>stringdef i^   hex 'FD'  //
>stringdef o"   hex 'F6'  //
>stringdef s,   hex 'FE'  //
>stringdef c,   hex 'E7'  //
>stringdef g^   hex 'F0'  //
>
>define v 'aeiouüöı'//{u"}{o"}{i^}'
>define all
>'aeiouüöışçğqwrtyplkjhgfdszxcvbnm1234567890!£$%^&*()-_=+[]@~;:/?><#.'
>define mark_regions as (
>    $p1 = limit
>
>    $p3=size
>    do (
>        ( gopast v  gopast non-v)  setmark p1
>
>
>    )
>
>)
>backwardmode (
>    define R1 as $p1 <= cursor
>
>
>
>    define common_suffix as (
>        [substring] among(
>            'ler' 'lar' 'diler' 'dular' 'dılar' 'düler'
>            'tiler' 'tular' 'tılar' 'tüler' 'dir' 'dır' 'miş' 'mış' 'müş'
>'muş' 'mişler' 'mışlar' 'müşler' 'muşlar'
>            (R1 delete)
>        )
>    )
>)
