Re: [Snowball-discuss] Polish stemmer?

From: Dawid Weiss (dawid.weiss@cs.put.poznan.pl)
Date: Thu Aug 30 2007 - 08:25:31 BST


> Hi everyone, thank you for your replies! The way I would like to use the
> stemmer is as an additional tool along with an inflection dictionary, to get
> base forms of words unknown in the dictionary.

Note that stemming isn't meant to accomplish this task perfectly. I like the
distinction between lemmatization and stemming: lemmatization yields an accurate
base form (a lemma), whereas stemming yields a distinct token denoting a concept
(not necessarily a lemma, but unique).
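
To make the distinction concrete, here is a rough Python sketch of the setup
described in the quoted message: look the word up in an inflection dictionary
first, and fall back to plain suffix stripping for unknown words. The dictionary
entries, the suffix list and the function names are all made up for
illustration; this is not a real Polish stemmer.

```python
# Hypothetical data: a tiny inflection dictionary and a handful of suffixes.
LEMMAS = {"kotami": "kot", "kotem": "kot", "domu": "dom"}
SUFFIXES = ("ami", "ach", "owi", "em", "om", "u", "y", "a")  # illustrative only


def stem(word, min_stem=3):
    """Strip the longest listed suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def index_token(word):
    """Return the lemma if the dictionary knows the word, otherwise a stem."""
    return LEMMAS.get(word, stem(word))
```

With this toy data, the unknown forms "blogami", "blogach" and "blogu" all map
to the same token, which is all the index needs, even though nothing guarantees
that the token is a proper lemma.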

> enough to reduce the problem of multiple forms of unknown words in the
> collection index. I noticed the Stempelator stemmer has problems with such
> words, so I wonder whether a simpler suffix stripper wouldn't suffice.

That's basically what Stempelator (and in fact Stempel) does -- it is a kind of
trained decision tree for suffix stripping. You may want to read Andrzej Bialecki's
description of Stempel and Leo Galambos' PhD thesis where the algorithm is
described in detail. I agree it doesn't work very efficiently, which only makes
the problem more interesting.
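
As a loose illustration of what "trained" suffix stripping can mean, the Python
sketch below learns suffix rewrites from (inflected form, lemma) pairs and
applies the longest matching one to unseen words. It is a deliberately
simplified stand-in and not Stempel's actual algorithm; the function names and
the training pairs are assumptions made up for the example.

```python
from collections import Counter, defaultdict


def learn_rules(pairs):
    """Count suffix rewrites (form suffix -> lemma suffix) seen in training pairs."""
    votes = defaultdict(Counter)
    for form, lemma in pairs:
        # Length of the common prefix shared by the form and its lemma.
        k = 0
        while k < min(len(form), len(lemma)) and form[k] == lemma[k]:
            k += 1
        votes[form[k:]][lemma[k:]] += 1
    # Keep the most frequent rewrite for each observed suffix.
    return {suffix: c.most_common(1)[0][0] for suffix, c in votes.items()}


def apply_rules(word, rules):
    """Rewrite the longest known suffix of word; return word unchanged if none matches."""
    for start in range(1, len(word)):
        suffix = word[start:]
        if suffix in rules:
            return word[:start] + rules[suffix]
    return word


# Hypothetical training data.
rules = learn_rules([("kotami", "kot"), ("domami", "dom"),
                     ("kotem", "kot"), ("domu", "dom")])
```

Here apply_rules("blogami", rules) and apply_rules("blogu", rules) both give
"blog", although neither form was in the training data. A real trained stemmer
needs far more data and a smarter way to resolve conflicting rules.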

Dawid


