Re: [Snowball-discuss] a simple algorithm problem

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Thu Jan 06 2005 - 09:12:34 GMT


Olly,

I would answer 'yes' to all the points made in your last email. Thanks for
reporting the error in the Snowball manual. It will be fixed in the next CVS
commit.

There are various ways to proceed ...

In the 2-byte character version of Snowball (standard for the Java
code generator) you can define characters as decimal or hex numbers in the
range 257 to 64K. These characters can go into character tables, which are
implemented as bitmaps. Of course, working with 256 characters, the bitmap
never exceeds 32 bytes -- and will frequently be less, since the bitmap is
truncated at both ends by removing runs of zeros.

Working with 64K characters, a bitmap might go up to 8K in size, which is
not an intolerable overhead. In practice they are much smaller, since the
codes we need in the stemmers do not have high Unicode values.
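
To make the size argument concrete, here is a minimal sketch of a bitmap that is trimmed at both ends. It is an illustration only, not Snowball's actual table layout: the function names make_bitmap and in_bitmap, and the choice to trim in whole bytes, are assumptions.

```python
def make_bitmap(codes):
    """Build a bitmap for a non-empty set of character codes.

    Whole zero bytes below the lowest code and above the highest code
    are not stored. Returns (index of first stored byte, bitmap bytes).
    """
    lo, hi = min(codes) // 8, max(codes) // 8
    bits = bytearray(hi - lo + 1)
    for c in codes:
        bits[c // 8 - lo] |= 1 << (c % 8)
    return lo, bytes(bits)


def in_bitmap(code, lo, bits):
    """Test whether a character code is a member of a trimmed bitmap."""
    index = code // 8 - lo
    return 0 <= index < len(bits) and bool(bits[index] & (1 << (code % 8)))


# Latin vowels 'a'..'y' span codes 97..121, so only 4 bytes are stored.
latin_lo, latin_bits = make_bitmap([ord(c) for c in "aeiouy"])

# Greek vowels span U+03B1..U+03C9: higher codes, but still only 4 bytes.
greek_lo, greek_bits = make_bitmap([0x3B1, 0x3B5, 0x3B7, 0x3B9, 0x3BF, 0x3C5, 0x3C9])

# Worst case: every code used, so nothing can be trimmed.
full_256 = make_bitmap(range(256))[1]      # 32 bytes
full_64k = make_bitmap(range(65536))[1]    # 8192 bytes
```

The stored size depends on the spread between the lowest and highest code in the table, not on how large the codes are.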

So one idea is to declare 'utf8' in the Snowball script, allowing character
defs in the range 0-64K, as in the 2-byte character version. Characters
could be written with their Unicode values, and encoded in utf-8 form in
strings.
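
As a sketch of what "encoded in utf-8 form" means for a character given by its Unicode value, the following covers the range 0 to 64K (one to three bytes). The function name utf8_encode is hypothetical, and the sketch does not reject surrogate values.

```python
def utf8_encode(code):
    """Encode a Unicode value below 0x10000 as UTF-8 bytes."""
    if code < 0x80:
        # One byte: 0xxxxxxx
        return bytes([code])
    if code < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    if code < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code >> 12),
            0x80 | ((code >> 6) & 0x3F),
            0x80 | (code & 0x3F),
        ])
    raise ValueError("code outside the 0-64K range")


omicron = utf8_encode(0x039F)   # b'\xce\x9f'
```

So a character defined as hex 039F would appear in strings as the two bytes CE 9F.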

Looking again at the cursor movement issues:

If I've got my thinking right: in goto and gopast, cursor movement can be
done one byte at a time (cursor++; or cursor--;). If expression C is made
up of well-formed utf-8 strings, 'goto C' must either fail or end on a
valid character boundary.
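
A small sketch of why byte-at-a-time movement is safe here. In utf-8 the first byte of a character and the continuation bytes (10xxxxxx) come from disjoint ranges, so a well-formed string can only match starting at the first byte of a character. The function goto below is a simplified stand-in for the Snowball command with a literal string as C, not the generated code, and is_boundary is a hypothetical helper.

```python
def goto(data, cursor, pattern):
    """Advance cursor one byte at a time until pattern matches at cursor.

    Returns the new cursor, or None if the pattern is never found.
    """
    while cursor + len(pattern) <= len(data):
        if data.startswith(pattern, cursor):
            return cursor
        cursor += 1          # byte step, not character step
    return None


def is_boundary(data, pos):
    """True if pos does not point at a utf-8 continuation byte."""
    return pos == len(data) or (data[pos] & 0xC0) != 0x80


word = "\u03bb\u03cc\u03b3\u03bf\u03c2".encode("utf-8")   # Greek 'logos', 10 bytes
gamma = "\u03b3".encode("utf-8")                          # CE B3

found = goto(word, 0, gamma)          # byte offset 4
missing = goto(word, 0, b"x")         # None: the command fails
```

For gopast the cursor would end at found + len(gamma), which is also a character boundary when the data is well formed.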

'next' requires its own implementation, and as you say, backward movement is
not a problem.

'next' is implicit in character tests (vowel, non-vowel), and hop N (=
'next' done N times).
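
A minimal sketch of character-wise movement over utf-8 bytes, assuming well-formed input and a cursor that starts on a character boundary. The names next_char, prev_char and hop are illustrative, not the names used in the generated code.

```python
def next_char(data, cursor):
    """Move forward one character; return the new cursor, or None at the end."""
    if cursor >= len(data):
        return None
    cursor += 1
    # Skip continuation bytes (10xxxxxx) belonging to the same character.
    while cursor < len(data) and (data[cursor] & 0xC0) == 0x80:
        cursor += 1
    return cursor


def prev_char(data, cursor):
    """Move backward one character; return the new cursor, or None at the start."""
    if cursor <= 0:
        return None
    cursor -= 1
    # Step back over continuation bytes to the first byte of the character.
    while cursor > 0 and (data[cursor] & 0xC0) == 0x80:
        cursor -= 1
    return cursor


def hop(data, cursor, n):
    """'next' done n times; fails (None) if fewer than n characters remain."""
    for _ in range(n):
        cursor = next_char(data, cursor)
        if cursor is None:
            return None
    return cursor


text = "a\u03bbb".encode("utf-8")   # bytes 61 CE BB 62
after_a = next_char(text, 0)        # 1
after_lambda = next_char(text, 1)   # 3, two bytes consumed
```

A character test such as vowel or non-vowel would decode the character at the cursor, look it up in the table, and then move the cursor in the same way.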

I'm not sure 'size' is used in the Snowball scripts: it could be defined to
give an approximate answer in the utf-8 case, or implemented exactly.
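
The two options for 'size' might look like this, as a sketch under the assumption of well-formed utf-8; the function names are hypothetical. The exact count is the number of bytes that are not continuation bytes, and the approximate answer is simply the byte length, which never underestimates.

```python
def size_exact(data):
    """Number of characters: count the bytes that start a character."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def size_approx(data):
    """Byte length: an upper bound on the number of characters."""
    return len(data)


greek = "\u03bb\u03cc\u03b3\u03bf\u03c2".encode("utf-8")
exact = size_exact(greek)      # 5 characters
approx = size_approx(greek)    # 10 bytes
```

For plain ASCII words the two answers coincide.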

Presumably a 'utf8' declaration would simply be ignored by the Java
code generator (Richard B to confirm).

Obviously we are working towards a standard header:

utf8
define GREEK_CAPITAL_LETTER_OMICRON hex 039F
. . . .

of Unicode characters, and it would be nice to use the Unicode names, were
they not (as this example shows) so very long.

- - -

Another idea I had was just to create modified versions of the existing
scripts so they will work with utf-8 encoded strings, even while Snowball
knows nothing about utf-8. That could be done with no further changes to
Snowball.

Incidentally, do you have a view on the use of free-floating accents?
(Unicode 0300-036F)

Martin


