Re: [Snowball-discuss] More patches

From: Richard Boulton
Date: Thu Feb 15 2007 - 14:16:18 GMT

Olly Betts wrote:
> On Mon, Feb 12, 2007 at 08:30:28AM +0000, Olly Betts wrote:
>> This adds a "make check" rule which verifies that the UTF-8 and
>> ISO-8859-1 versions of the stemmers actually produce the expected
>> output on the test vocabulary.
> This patch extends the rules so that "make check" will print a warning
> for algorithm/encoding combinations for which there's no test data.
> This isn't used by the sources as shipped, but if you enable other
> algorithms, it's useful:
> Alternatively, perhaps we should just generate test data by running a
> suitable vocabulary through the stemming algorithm - that will at least
> allow checking that no regressions are introduced by changes to the
> snowball compiler and runtime. The missing data is for lovins, german2,
> and romanian2, and we have english, german, and romanian vocabulary for
> other stemmers. If that seems a better approach, I'm happy to provide
> a patch to do that instead.

Since we have suitable sample data for each of these, perhaps we should
just add the current output of each of these stemmers to SVN in some
appropriate place, and test against it. Something like
"data/english/output-lovins.txt" for the lovins stemmer, for example.

There's a great deal of convenience value in having the expected output
checked into SVN, I believe.

A warning for any stemmers for which we haven't supplied an expected
output file would be a good thing, so your patch is certainly on the
right lines.
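
To make the idea concrete, here is a rough sketch of the kind of check
being discussed: compare a stemmer's output on a vocabulary file with a
checked-in expected-output file, and report "missing" (so the caller can
print a warning) when no expected file exists. This is only an
illustration in Python, not the actual "make check" rule from the patch.
The function name check_stemmer, the stem callable and the
one-word-per-line file layout are assumptions made for the example.

```python
import os


def check_stemmer(stem, vocab_path, expected_path):
    """Compare stem(word) for every vocabulary word with the expected file.

    `stem` is any callable mapping a word to its stem (hypothetical; the
    real stemmers are generated by the snowball compiler).  Both files are
    assumed to hold one entry per line, in the same order.

    Returns ("missing", []) if there is no expected-output file,
    otherwise ("ok", []) or ("fail", [(word, expected, actual), ...]).
    """
    if not os.path.exists(expected_path):
        # No test data for this stemmer: the caller should warn, not fail.
        return "missing", []

    with open(vocab_path, encoding="utf-8") as f:
        words = f.read().splitlines()
    with open(expected_path, encoding="utf-8") as f:
        expected = f.read().splitlines()

    if len(words) != len(expected):
        # The files are out of step, so a line-by-line comparison is meaningless.
        return "fail", [("<line count>", str(len(expected)), str(len(words)))]

    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return ("fail" if mismatches else "ok"), mismatches
```

Generating the expected file in the first place would just be a matter of
writing stem(word) for each vocabulary word to something like
"data/english/output-lovins.txt" and checking that in, so that later runs
catch regressions from changes to the compiler or runtime.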

