[Snowball-discuss] Out of date diffs?

From: J Smith (jsmith@tutorbuddy.com)
Date: Wed Nov 27 2002 - 00:14:01 GMT


I finally got around to updating the stem-php extension for PHP (which is
based on Snowball) and noticed that two of the stemmed vocabulary files on
the web site seem to be out of date.

Specifically, the English (Porter2) files and the Norwegian files.

I ran the latest Snowball ANSI stemmers on the voc.txt files and in both
cases, the output didn't match the expected output.txt file available on the
Snowball web site.

In the case of the English stemmer, 176 words produced the wrong output. It
seems they're all words with either one or two letters, such as "a", "ac",
"ap", etc. In each case, the stemmed output is an empty string.

In the Norwegian stemmer, nearly half of the output doesn't match up at all,
with 10215 of the 20628 words failing.

Is this a case of the output.txt/diff.txt files being out of date, or of the
stemmers themselves being out of date?

If anybody would like to see the output I'm getting, I can post it to a
web site...
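
For anyone who wants to repeat the check, here is a rough sketch of how such a
line-by-line comparison could be done. It is not the script behind the numbers
above, just an illustration. It assumes voc.txt and output.txt have one word
per line in the same order, that the stemmer's own output has been saved to a
file I'm calling actual.txt (a made-up name), and that the files are Latin-1
encoded (also an assumption).

```python
from itertools import zip_longest


def read_lines(path, encoding="latin-1"):
    """Read a word list, one entry per line, without line endings.

    The Latin-1 default is an assumption, not something the Snowball
    site is known to specify.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\r\n") for line in handle]


def find_mismatches(words, expected, actual):
    """Return (word, expected_stem, actual_stem) for every line that differs.

    Missing lines in any of the three lists are treated as empty strings,
    so a truncated file shows up as mismatches rather than being ignored.
    """
    mismatches = []
    for word, want, got in zip_longest(words, expected, actual, fillvalue=""):
        if want != got:
            mismatches.append((word, want, got))
    return mismatches


def summarize(mismatches, total):
    """Build a one-line summary, counting how many stems came back empty."""
    empty = sum(1 for _, _, got in mismatches if got == "")
    return f"{len(mismatches)} of {total} words differ ({empty} stemmed to an empty string)"


if __name__ == "__main__":
    words = read_lines("voc.txt")
    expected = read_lines("output.txt")
    actual = read_lines("actual.txt")  # hypothetical file holding the stemmer's output
    mismatches = find_mismatches(words, expected, actual)
    print(summarize(mismatches, len(words)))
    for word, want, got in mismatches[:20]:
        print(f"{word!r}: expected {want!r}, got {got!r}")
```

Counting the empty stems separately makes it easy to tell whether the failures
are all of the short-word kind seen with the English stemmer or something
broader, as with the Norwegian one.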

Cheers,

J


