Re: [Snowball-discuss] More patches

From: Olly Betts (olly@survex.com)
Date: Fri Feb 16 2007 - 12:06:04 GMT


On Fri, Feb 16, 2007 at 11:35:09AM +0000, Richard Boulton wrote:
> Olly Betts wrote:
> >A related issue - there are a small number of examples in the Hungarian
> >vocabulary which contain upper case ASCII letters. Would it make sense
> >to just change these to lower case for consistency with the other test
> >vocabularies?
>
> I think it would make sense to change these to lower case, so I've done
> so. It doesn't change the output.txt file at all (as expected).

Indeed, since stemwords lowercases the input words. I noticed them when
Xapian's stemtest failed on these words, since it didn't lowercase the
test vocabulary.

I wonder if the algorithms should perform lowercasing for you. In
general it's a required preprocessing step for the stemmers to work
correctly, so most users will need to implement the lowercasing
themselves (except perhaps for applications where the input is always
lowercase already).

The problem I can see is that doing it correctly for all non-ASCII
characters requires fairly large tables, and doing it just for ASCII
letters probably isn't really sufficient. Perhaps it's only necessary
for the characters the stemmers check for, though. Thoughts?
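
To illustrate the gap between the two approaches, here is a rough sketch
in Python (not Snowball code; the function names and the sample word are
made up for the example). It compares an ASCII-only lowercase with
Python's built-in Unicode-aware one:

```python
def ascii_lower(word: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave everything else alone."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in word
    )


def unicode_lower(word: str) -> str:
    """Lowercase using Python's Unicode case mapping (backed by large tables)."""
    return word.lower()


if __name__ == "__main__":
    sample = "ÁLLAM"  # hypothetical input with a non-ASCII capital
    print(ascii_lower(sample))    # the accented capital is left unchanged
    print(unicode_lower(sample))  # every letter is lowercased
```

With the ASCII-only version the accented capital survives, so a stemmer
that only knows the lowercase form would not recognise it.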

Cheers,
    Olly


