[Snowball-discuss] Re: snowball

From: Olly Betts (olly@survex.com)
Date: Sun Oct 06 2002 - 13:58:01 BST

On Sun, Oct 06, 2002 at 02:02:54AM -0600, Martin Porter wrote:
> Hello, Olly - nice to hear from you after so long. Hope all is well.

Sorry I've not kept in touch - I always seem to be busy these days.
Things are well, though a bit more paid work would be useful...

> At 02:18 AM 10/6/02 +0100, Olly Betts wrote:
> >I wrote to Richard:
> > > BTW, I've just rewritten omstem.cc to use snowball. I need to think about
> > > how the build systems will fit together now, but I thought it worth
> > > telling you just in case you were about to embark on this too.
> This is quite pleasing, since the stemmers in Omsee (=Xapian) are certainly
> one generation behind the Snowball stemmers, and slightly inferior. One
> thing I'm conscious of (since I keep hitting it in search engine lists) is
> the documentation about using and writing stemmers which I put into Xapian.
> I don't agree with all I said at the time! I'd be inclined to eliminate all
> that and have pointers from Xapian to the Snowball pages as a way of
> covering the use of stemmers.

Then you'll be glad to hear that I've already removed the "how to write
a stemming algorithm" document and replaced it with:

   We'd like to add stemmers for other languages too - see the Snowball
   site for information on how to contribute.

The document describing what a stemming algorithm is and your original
paper are still there, though it may make sense to remove those too
as you suggest. I'm not quite sure how the two projects slot together
yet - there's something to be said for a consistent set of
documentation, especially for the novice user, so perhaps there should
be at least part of the introduction to stemming left.

Incidentally, what is the Snowball project's scope? Do you intend to
supply other related stuff, such as stopword lists, accent normalisation
code (e.g. conflating ä and ae in German), language recognition, etc.?
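
To make the accent normalisation idea concrete, here is a minimal sketch
of conflating German umlauts with their two-letter spellings. The table
and the function name fold_german are assumptions made up for this
illustration; they are not part of Snowball or Xapian.

```python
import unicodedata

# Hypothetical folding table for illustration only.
GERMAN_FOLDS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def fold_german(word: str) -> str:
    """Rewrite each umlaut as its two-letter spelling."""
    # Compose "a" + combining diaeresis into a single "ä" first,
    # so both encodings of the same letter are folded alike.
    composed = unicodedata.normalize("NFC", word)
    return "".join(GERMAN_FOLDS.get(ch, ch) for ch in composed)
```

With this, "Mädchen" and "Maedchen" both come out as "Maedchen", so the
two spellings would index to the same term. Whether the folding belongs
before the stemmer or inside it is left open here.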

> >> Good news: I now have Xapian with snowball stemmers.
> >>
> >> Bad news: It fails all the dictionary tests.
> Well the stemmers were significantly redone between Xapian and Snowball.
> This is a nuisance, but the Snowball stemmers are certainly superior.

OK, the changes are intentional then. This isn't a problem as such, but
it means I need to think carefully about when to slot the snowball
stemmers in.

> The problem of database support came up a lot when we were working for the
> Company, acting as a strong pressure to keep the stemmers "frozen". In
> Snowball the stemmers change more often as native speakers spot problems. I
> think all you can do is incorporate new stemmers en bloc from time to time
> with new release numbers for the system as a whole. So a change of stemmer
> is equivalent to a change of underlying datastructure.

We can indeed be more dynamic now, but we still want to avoid causing
confusion. I'll check to see if anything else is likely to change which
will force a database rebuild - that would be a good time to change the
stemmers as well.
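
One way to treat a stemmer change like a data structure change is to
record a stemmer revision next to the database format version and force
a rebuild when either differs. The sketch below only illustrates that
idea; DatabaseMetadata, the revision string and the version number are
hypothetical and do not describe how Xapian stores anything.

```python
from dataclasses import dataclass

# Hypothetical values describing what the running code expects.
CURRENT_FORMAT_VERSION = 2
CURRENT_STEMMER_REVISION = "snowball-2002-10"


@dataclass
class DatabaseMetadata:
    """What an existing database says it was built with (assumed layout)."""
    format_version: int
    stemmer_revision: str


def needs_rebuild(meta: DatabaseMetadata) -> bool:
    """A rebuild is needed if the format or the stemmers have changed."""
    return (
        meta.format_version != CURRENT_FORMAT_VERSION
        or meta.stemmer_revision != CURRENT_STEMMER_REVISION
    )
```

Bundling both checks means users see one "please rebuild" event per
release rather than one for the format and another for the stemmers.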

> >I also notice that in Xapian, the french voc.txt contains lines with
> >invalid accents:
> >
> >french/voc.txt:pas^a
> >french/voc.txt:s^a
> >french/voc.txt:son^a
> >french/voc.txt:twas^a
> >
> >Those would be s-acute and n-acute! I wonder if these are meant to be
> >a-circumflexes? So "pasâ", "sâ", "sonâ", "twasâ", though I can't find
> >any of these in a french dictionary.
> >
> >Interestingly, the snowball french voc.txt also has these 4 in, but all
> >the other (valid) accents have been translated!
> Some of the vocabulary lists are left "scruffy" with non-words and weird
> characters - it helps test the stemmers for obscure bugs. (Or it used to: I
> don't think the newer work benefits much by this approach.)

Note that Xapian already has specific tests which feed the stemmers
random input (one test uses word-like garbage, another random binary
garbage) to check they don't crash or hang. The snowball stemmers
all pass those tests.
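
For anyone unfamiliar with this style of test, the following is a rough
sketch of the two kinds of random input described above. It is not the
Xapian test code: toy_stem is a stand-in stemmer invented for the
example, and the helper names are assumptions. It only checks that no
exception is raised; detecting a hang would need a timeout around each
call, which is omitted here.

```python
import random
import string


def toy_stem(word: bytes) -> bytes:
    """Stand-in stemmer (hypothetical): strips one trailing b"s"."""
    return word[:-1] if word.endswith(b"s") else word


def random_wordlike(rng: random.Random, max_len: int = 12) -> bytes:
    """Word-like garbage: a random run of lowercase ASCII letters."""
    length = rng.randint(1, max_len)
    letters = (rng.choice(string.ascii_lowercase) for _ in range(length))
    return "".join(letters).encode("ascii")


def random_binary(rng: random.Random, max_len: int = 12) -> bytes:
    """Binary garbage: random bytes, possibly an empty string."""
    length = rng.randint(0, max_len)
    return bytes(rng.randrange(256) for _ in range(length))


def fuzz(stem, generator, rounds: int = 1000, seed: int = 0) -> int:
    """Feed generated inputs to stem; any exception fails the run."""
    rng = random.Random(seed)
    for _ in range(rounds):
        stem(generator(rng))
    return rounds
```

A fixed seed keeps a failing input reproducible, e.g.
fuzz(toy_stem, random_binary, rounds=50, seed=1).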

