[Snowball-discuss] Re: snowball

From: Martin Porter (martin_porter@softhome.net)
Date: Sun Oct 06 2002 - 09:04:01 BST


Hello, Olly - nice to hear from you after so long. Hope all is well.

At 02:18 AM 10/6/02 +0100, Olly Betts wrote:
>Hi Martin!
>
>I wrote to Richard:
>> > BTW, I've just rewritten omstem.cc to use snowball. I need to think about
>> > how the build systems will fit together now, but I thought it worth
>> > telling you just in case you were about to embark on this too.
>>

This is quite pleasing, since the stemmers in Omsee (=Xapian) are certainly
one generation behind the Snowball stemmers, and slightly inferior. One
thing I'm conscious of (since I keep hitting it in search engine lists) is
the documentation about using and writing stemmers which I put into Xapian.
I don't agree with all I said at the time! I'd be inclined to eliminate all
that and have pointers from Xapian to the Snowball pages as a way of
covering the use of stemmers.

>> Good news: I now have Xapian with snowball stemmers.
>>
>> Bad news: It fails all the dictionary tests.

Well, the stemmers were significantly redone between Xapian and Snowball.
This is a nuisance, but the Snowball stemmers are certainly superior.

>part of the problem is that the Xapian test data has accents
>written as "e^a", etc.

Yes, e^a for e-acute was a Muscat leftover. In Snowball, of course, funnies
are expected to appear as single codes, with the assignment of codes easy to
alter, but with ISO Latin I (or Unicode) assumed. Personally, I've never
liked the idea of a composite character not being represented by composite
codes, but you have to go with the trend on this one!
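
To make the conversion concrete, here is a minimal sketch of turning the old "e^a" notation into single character codes. It is not the Muscat or Snowball code: the function names and the table are hypothetical, and only the acute code ('a') is taken from this thread; any other accent letters would have to come from the real Muscat tables. The second helper checks whether the result still fits ISO Latin I.

```python
import re
import unicodedata

# Assumption: only 'a' = acute is confirmed by this thread ("e^a" = e-acute).
# Other accent letters are deliberately left out rather than guessed.
COMBINING = {"a": "\u0301"}  # COMBINING ACUTE ACCENT

_ACCENT = re.compile(r"([A-Za-z])\^([a-z])")


def decode_muscat_accents(word):
    """Replace each letter^code pair with a single composed character."""
    def repl(match):
        base, code = match.group(1), match.group(2)
        mark = COMBINING.get(code)
        if mark is None:
            return match.group(0)  # unknown accent code: leave untouched
        # NFC composes base + combining mark into one code where one exists.
        return unicodedata.normalize("NFC", base + mark)
    return _ACCENT.sub(repl, word)


def fits_latin1(word):
    """True if every character of word has an ISO 8859-1 code."""
    try:
        word.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for raw in ["e^ate^a", "s^a", "son^a"]:
        decoded = decode_muscat_accents(raw)
        print(raw, decoded, fits_latin1(decoded))
```

A check like fits_latin1 is one way to flag vocabulary entries whose accent makes no sense for the language, since s-acute and n-acute have no Latin I code.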

>Fixing that though, I find *all* the Snowball stemmers (apart from
>finnish, russian, and lovins which are new) give slightly different
>results to the corresponding Xapian ones (and IIRC, the Xapian ones gave
>the same result as the Muscat 3.6 ones, except for one or two very minor
>differences in the "porter" stemmer - "yacht" comes to mind).

[IIRC = if I recall correctly?? took me a few minutes to guess that one]

Russian was in Muscat 3.6, but was never used by the Company. In fact the
Muscat 3.6 and Xapian stemmers all differed.

>Is there a reason why the Snowball ones are different? I can well
>believe the differences are arbitrary, but it means that we can't
>drop them in without breaking databases people have built with the
>current stemmers...

Essentially, developing Snowball involved describing all the stemmers
algorithmically - like the Porter stemmer. This led to bug discoveries and a
better understanding of how the stemmers should be designed; the
internal exception lists (irregular verbs) were removed; the codesets were
changed etc. For example, in the Xapian French stemmer, all French accents
are removed. In Snowball, only those accents are removed which change
according to the ending attached to the word.
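
As an illustration of the "remove all accents" approach only, here is a minimal sketch using Unicode decomposition. It is not the Xapian French stemmer, and strip_all_accents is a hypothetical name; the selective rule used in Snowball is not shown here, because it depends on the endings handled by the stemmer itself.

```python
import unicodedata


def strip_all_accents(word):
    """Drop every combining mark, e.g. e-acute and e-grave both become e."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    for word in ["\u00e9l\u00e8ve", "c\u00f4te", "c\u00f4t\u00e9", "cote"]:
        print(word, strip_all_accents(word))
```

The cost of the blanket approach is that distinct words such as "côte", "côté" and "cote" all collapse to the same string before any suffix is removed.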

The problem of database support came up a lot when we were working for the
Company, acting as a strong pressure to keep the stemmers "frozen". In
Snowball the stemmers change more often as native speakers spot problems. I
think all you can do is incorporate new stemmers en bloc from time to time
with new release numbers for the system as a whole. So a change of stemmer
is equivalent to a change of underlying datastructure.
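
One way to treat a stemmer change like a datastructure change is to record which stemmer release built a database and refuse to search it with a different one. The sketch below is only that idea in miniature: StemmerStamp, StemmerMismatch and check_stemmer are hypothetical names, not part of Xapian or Snowball, and the integer release number is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemmerStamp:
    """Hypothetical record stored alongside the index when it is built."""
    language: str
    release: int  # assumed to be bumped whenever the stemmer's output changes


class StemmerMismatch(Exception):
    """Raised when a database was built with a different stemmer release."""


def check_stemmer(stored, current):
    """Refuse to use a database whose terms came from another stemmer."""
    if stored != current:
        raise StemmerMismatch(
            f"database built with {stored}, but this build uses {current}; "
            "reindex before searching"
        )


if __name__ == "__main__":
    check_stemmer(StemmerStamp("french", 1), StemmerStamp("french", 1))
    try:
        check_stemmer(StemmerStamp("french", 1), StemmerStamp("french", 2))
    except StemmerMismatch as err:
        print("refused:", err)
```

Failing loudly at open time is preferable to silently mixing terms from two stemmer releases, which would make some documents unfindable.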

   
>I also notice that in Xapian, the french voc.txt contains lines with
>invalid accents:
>
>french/voc.txt:pas^a
>french/voc.txt:s^a
>french/voc.txt:son^a
>french/voc.txt:twas^a
>
>Those would be s-acute and n-acute! I wonder if these are meant to be
>a-circumflexes? So "pasâ", "sâ", "sonâ", "twasâ", though I can't find
>any of these in a french dictionary.
>
>Interestingly, the snowball french voc.txt also has these 4 in, but all
>the other (valid) accents have been translated!

Some of the vocabulary lists are left "scruffy" with non-words and weird
characters - it helps test the stemmers for obscure bugs. (Or it used to: I
don't think the newer work benefits much from this approach.) But I think I'll
eliminate those oddities in the French vocab at some point so thanks for
pointing them out.

The a-circumflex is some punctuation character: closing quote or something.
(pas and son are French words, and 'twas is an English word - I recall the
original texts had a few poetic quotes in English.)

Martin


