[Snowball-discuss] Re: Bug report?

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Tue Oct 07 2003 - 19:45:02 BST


Alexander,

Yes, I was aware of this, and should explain:

The Porter stemmer, as originally defined, reduces "s" to null, and is
faithfully implemented in the stemmer at

   http://snowball.tartarus.org/porter/stemmer.html

The version of the Porter stemmer which I distributed for many years stems
"s" to "s", however. This is because it has a couple of improvements (points
of DEPARTURE) from the published algorithm which everyone has come to accept.
These improvements are in the slightly different version of the stemmer at

   http://www.tartarus.org/~martin/PorterStemmer/

and are clearly marked DEPARTURE in the comments in the ANSI C version of the
stemmer, as well as being described in the accompanying text.
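
To make the difference concrete, here is a minimal sketch in Python (not the
ANSI C of the distributed version) of step 1a of the published algorithm,
next to a variant with a length guard. The function names are invented for
illustration, and the threshold (words of one or two letters are left
untouched) is an assumption that should be checked against the DEPARTURE
comments in the C source. Only step 1a is shown, since that is where "s"
loses its only letter.

```python
def step_1a(word: str) -> str:
    """Step 1a of the published Porter algorithm (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]        # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]        # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word             # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]        # S    -> null (cats -> cat, but also s -> "")
    return word


def step_1a_guarded(word: str) -> str:
    """Same step with a length guard (assumed threshold: length <= 2)."""
    if len(word) <= 2:
        return word             # very short words skip stemming entirely
    return step_1a(word)


if __name__ == "__main__":
    for w in ("cats", "s", "as"):
        print(repr(w), "->", repr(step_1a(w)), "/", repr(step_1a_guarded(w)))
```

With the guard in place "s" and "as" come back unchanged, while "cats" is
still reduced to "cat" by either function.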

I can't alter this now, bugs or not, because of the status of the Porter
stemmer as a described algorithm, but the Snowball Porter2 stemmer fixes these
problems and many others besides.

I would agree that it is not helpful to stem "s" to null, but would not agree
that stemming to null is invariably bad (although none of the Snowball
stemmers on current release do so). See the notes introducing the Russian
stemmer.

I can't explain the problems you had with email, I'm afraid. I've certainly
received executables, and files containing viruses, as unwanted attachments
within the past few months.

Martin

> I found a phrase
>
> "In any case a string of length 1 will be unchanged if passed
>through the algorithm".
>
>Indeed, I always thought a stemmer should NOT produce empty stems, no? This
>is very inconvenient in practice since it changes file formats, word counts,
>etc.
>
>However, it seems the algorithm does strip "s" -> "". (This is the only rule
>producing empty strings.) In effect, the program at
>http://snowball.tartarus.org/porter/stemmer.html does it; I attach the
>corresponding files (I found no way to send the executable due to paranoid
>antivirus software at Tartarus).
>
>Is this correct? Wouldn't you rather change the unconditional rule
>
> S -> (null)        cats -> cat
>
>to
>
> (*v or *c) S -> (null)        cats -> cat
>
>Thank you!
>Alexander


