[Snowball-discuss] The Great Stemmer Enumeration Challenge

From: Allan Fields (afieldsml@idirect.ca)
Date: Wed Apr 10 2002 - 19:44:28 BST


Forwarding to the list... Modified version of my original email.

---------- Forwarded Message ----------
From: Allan Fields <afieldscom@idirect.ca>
Subject: The Great Stemmer Enumeration Challenge
Date: Fri, 5 Apr 2002 06:55:38 -0500
To: martin@tartarus.org

Hi,

The following is a list of stemmers I spotted... It's unbelievably hard to
keep track of them all (and these are just the Perl stemmers). [snip]
I guess it would come as no surprise, then, that I was actually planning to
implement YAS (Yet Another Stemmer) to add to the mass. If possible, I would
like to avoid this unnecessary branching, so I'll try to find existing
implementations to contribute to. However, with so many to pick from, I've
been at a loss as to which one to use.

So the challenge begins... Join The Great Stemmer Enumeration
Challenge! Come one, come all, bring your stemmer spottings and a magnifying
glass. [snip - bad joke]

** Perl Stemmers:

1.
- Filename: perl.txt
- URL: http://www.tartarus.org/~martin/PorterStemmer/
- Package: (undef)
- Description: Martin Porter's official/reference Perl stemmer
- Date: 1990 onward?
- Commentary: [snip]
- Strength: Accuracy.

2.
- Filename: porter.pm
- URL: http://www.ldc.usb.ve/~vdaniel/porter.pm
- Package: (undef)
- Description: Daniel van Balen's Perl version
- Date: October-1999
- Commentary: Conditionals for speed improvement, lots of repetition.
- Strength: Speed?

3.
- Filename: stem.pl
- URL: http://www.cpan.org
- Package: (undef)
- Description: Ian Phillipps' WAIS stemmer.c derivative.
- Date: ?
- Commentary: .
- Strength: Simplicity, Flowingness.

4.
- Filename: English.pm
- URL: http://www.cpan.org
- Package: Text::English
- Description: Modularized version of stem.pl by Ulrich Pfeifer.
- Date: Thu Feb 1 13:47:58 1996
- Commentary: Bad placement -- belongs in Lingua::En on CPAN
- Strength: Simple. It actually has a package name and is CPAN-friendly.

5.
- Filename: Stem.pm, En.pm
- URL: http://www.cpan.org
- Package: Lingua::Stem
- Description: A more complete approach to Perl stemming. May have been
branched off of Text::English then moved to Text::Stem before being moved to
Lingua::Stem. Jim Richardson, University of Sydney <imr@maths.usyd.edu.au>
and Benjamin Franz <snowhare@nihongo.org>.
- Date: 1999, 2000 fixed missing rules
- Commentary: Uses strange symbolic reference subroutine calls. Is this
really necessary? (Why?) Assumes US English?
- Strength: Caches results, allows exceptions, OO interface. Multiple
languages may be supported in future.

6.
- Filename: ?
- URL: http://www.cpan.org
- Package: ROADS::Porter
- Description: A class to perform stemming using the Porter algorithm. (UK -
eLib/ROADS/DESIRE Library Project)
- Date: 1988??
- Commentary: Haven't even looked yet.
- Strength: Hell if I know. :)

7.
- Filename: perl.tgz
- URL: http://snowball.sourceforge.net
- Package: Lingua::Stem::Snowball
- Description: A Perl wrapper for the Snowball stemmer
- Date: 2002
- Commentary: Haven't tested yet.
- Strength: Probably a more logical approach than porting the stemmer
directly to Perl. Uses XS? The performance gains may be significant.

8.
- Filename: Stemmer.pm
- URL: Not yet released
- Package: Yet another stemmer, not decided
- Description: My attempt at implementing the Porter Stemming algorithm.. YET
AGAIN for my first time :)... Also including other custom features for
different types of word stemming.
- Date: 2002
- Commentary: Oh no! Not another one...
- Strength: I know the author. The author might attempt to make it into a GUR
(Grand Unified Regex) in Perl, if that is even possible, just for the sake of
obfuscation. Or I'll leave that to japhy. (hehe... He'll do anything if it
involves a regex challenge.)

** Other:

1.
- URL: http://www.cogsci.princeton.edu/~wn/
- Commentary: What about Wordnet at Princeton? Do they use it too in their
morphy thing? Is Wordnet cool or what? :)

How does Wordnet fit into the idea of dictionary-based stemming? It might
make sense to supplement the stemmer with Wordnet-like lexical information
derived from current-day English usage (and/or multiple dialects) to avoid
mis-stemming or over-stemming. Martin, your paper covers this idea
thoroughly in section 3, "Stemming errors, and the use of dictionaries". It
serves to clarify that no algorithm is going to be perfect. I was going
to raise that point in another email with regard to all the -ing exceptions: it
doesn't seem that a purely algorithmic stemmer can tackle those without
an exception list.

Has anyone created a comprehensive exceptions dictionary for stemming, or a
starting point for one that could perhaps be interfaced with the stemmer? If
not, perhaps the fellows at Princeton would be interested in integrating this
with Wordnet somehow.
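
To make the "exceptions dictionary interfaced with the stemmer" idea concrete,
here is a minimal sketch in Python. Everything in it is an assumption for
illustration: naive_stem is a toy suffix stripper (not the Porter algorithm),
and the entries in EXCEPTIONS are hypothetical examples, not taken from any
existing list.

```python
# Toy stand-in for an algorithmic stemmer. This is NOT the Porter
# algorithm; it only strips a trailing "-ing" so the effect of an
# exception list is easy to see.
def naive_stem(word):
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


# Hypothetical exception entries: words that a blind "-ing" rule gets wrong.
EXCEPTIONS = {
    "morning": "morning",
    "evening": "evening",
    "ceiling": "ceiling",
}


def stem(word, exceptions=EXCEPTIONS):
    """Consult the exception dictionary first, then fall back to the rules."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return naive_stem(word)


print(naive_stem("morning"))  # the rule alone mis-stems the word
print(stem("morning"))        # the exception entry overrides the rule
print(stem("walking"))        # ordinary words still go through the rule
```

The lookup happens before the algorithmic step, so the stemmer itself does not
need to change; only the table grows as new exceptions are found.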

Another idea is to simply create a new database structure with relational
aspects of words and particles, creating a list of capabilities or composition
rules. It would make sense to tie this into all Snowball target languages
whenever possible. The Perl Lingua::Stem module currently maintains its own
configurable exception list, for instance. If this were centralized into a
proper database (Berkeley DB probably) that integrated exceptions/relations,
it would improve overall cross-platform accessibility. But maybe this is not
directly the stemmer's responsibility?
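
A rough sketch of what a shared, on-disk exception store could look like, in
Python. It uses the standard-library dbm module as a stand-in for Berkeley DB
(an assumption; it is a simple key-value store, not Berkeley DB itself), and
the file name, helper name, and sample entries are all hypothetical.

```python
import dbm
import os
import tempfile


def lookup(db, word):
    """Return the stored stem for a word, or None if it is not an exception."""
    key = word.lower().encode("utf-8")
    if key in db:
        return db[key].decode("utf-8")
    return None


# Hypothetical location for the shared exception file.
db_path = os.path.join(tempfile.mkdtemp(), "stem_exceptions")

# One process (or one language binding) populates the store...
with dbm.open(db_path, "c") as db:
    db[b"morning"] = b"morning"
    db[b"geese"] = b"goose"

# ...and any other stemmer implementation can open the same file read-only.
with dbm.open(db_path, "r") as db:
    hit = lookup(db, "Geese")
    miss = lookup(db, "walking")

print(hit, miss)
```

A plain key-value file only covers the exception list; the "relational
aspects" mentioned above would need a richer schema than this.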

With the stemmer so far, the principle has been to stay as close as possible
to simple (dictionary) common words, and to avoid chopping off too much or
too little in the process of coming up with a stem. But maybe in the future
this can be presented as an option that allows a gradient of decomposition.
At some point a stemmer could potentially cross over into the role of a word
root finder, but that would require prefix stripping, which wouldn't
necessarily fit into the idea of a stemmer. Then it gets into the
etymological scope as is mentioned in the paper.
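
One possible reading of a "gradient of decomposition" is a depth option that
limits how many suffix layers get removed. The Python sketch below is only an
illustration under that assumption: the suffix list, the MIN_STEM threshold,
and the function names are made up for the example and are not part of
Snowball or the Porter algorithm.

```python
# Hypothetical suffix list and minimum stem length, chosen for illustration.
SUFFIXES = ("ness", "ful", "ing", "ly", "ed", "s")
MIN_STEM = 3


def strip_once(word):
    """Remove the longest matching suffix, if the remainder is long enough."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def decompose(word, depth):
    """Strip at most `depth` suffix layers; depth 0 leaves the word alone."""
    for _ in range(depth):
        shorter = strip_once(word)
        if shorter == word:
            break  # nothing left to strip
        word = shorter
    return word


for depth in range(4):
    print(depth, decompose("hopefulness", depth))
```

With this toy list, "hopefulness" goes to "hopeful" at depth 1 and "hope" at
depth 2, and stays there at greater depths. Prefix stripping, as noted above,
would be a separate step.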

Also, I acknowledge the imperfections of stemming and, for that matter, of all
other forms of natural language processing. The language just wasn't
designed to be easily processed by computers. After all, the whole idea of
language in computer applications is relatively new (in the grand scheme of
things). =)

-- Allan Fields
