[Snowball-discuss] Re: snowball

From: Olly Betts (olly@survex.com)
Date: Sun Oct 06 2002 - 18:17:01 BST


On Sun, Oct 06, 2002 at 10:20:34AM -0600, Martin Porter wrote:
>
> > what is the Snowball project's scope? Do you intend to supply other
> > related stuff? Such as stopword lists, accent normalisation code
> > (e.g. conflating ä and ae in German), language recognition, etc.
>
> I was not thinking of adding language recognition or accent normalisation
> work, although it seemed to me that lists of stopwords would be useful. But
> the lists are available in Xapian, and interestingly no-one has requested
> them via Snowball. And my lists are incomplete - I don't have a Finnish stop
> word list for example (the stemmer was developed from a Finnish vocab list,
> not a sample of text.) And they need some reworking and annotation.

I'm fairly happy whatever the answer is, but it would be good to decide
so we know how they fit into Xapian.

That said, my feeling is that stopword lists probably do belong with the
stemmers. And accent normalisation too - it's fairly simple code, and
closely related to stemming.
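
As a rough illustration of how small such code can be, here is a minimal
Python sketch of accent normalisation for German. It is not taken from
Snowball or Xapian. The function name normalise_german and the mapping
table GERMAN_DIGRAPHS are made up for this example, and the choice to
strip all remaining accents after the digraph step is an assumption.

```python
import unicodedata

# Hypothetical mapping (an assumption for this sketch): German umlauts
# and sharp s rewritten as their conventional ASCII digraphs.
GERMAN_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def normalise_german(word: str) -> str:
    """Conflate accented and digraph spellings, e.g. 'ä' and 'ae'."""
    # Lowercase and compose first, so that 'a' + combining diaeresis
    # is seen as the single character 'ä'.
    word = unicodedata.normalize("NFC", word.lower())
    for accented, digraph in GERMAN_DIGRAPHS.items():
        word = word.replace(accented, digraph)
    # Decompose whatever accents remain and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

With this, normalise_german("Mädchen") and normalise_german("Maedchen")
give the same string, so both spellings would reach a stemmer in the
same form.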

Language identification is less closely connected, and more
complicated, so probably better to keep separate.

> What I hoped initially was that the Snowball site would attract other
> contributions. Not necessarily using Snowball itself, but covering stemming.

Incidentally, I found this site yesterday, which has a good collection
of information about English stemming:

http://www.comp.lancs.ac.uk/computing/research/stemming/

> For example, I offered to put up a PhD thesis in html form that was about
> stemming evaluation. The offer was turned down, with the result that the
> work is still relatively inaccessible.
>
> Of course the main intention was to see stemmers for other languages
> developed by other people but along similar lines, but that has not happened.

It's still early days - the Snowball site has been around for less than a
year, hasn't it?

Also, the site doesn't actually seem to say anywhere that general
stemming-related contributions are welcome...

Cheers,
    Olly
