Thank you for your email. To answer your last question first: you should not
imagine stemming algorithms are the special preserve of experts in IR and
linguistics. You need some understanding of the needs of IR systems, and a
knowledge of grammar, or at least of morphology, but no great expertise is
required. This is especially true of linguistics. Too much theoretical
linguistics will not help much in designing stemmers, which are after all
just practical tools to improve IR performance.
You may be surprised to learn that you are the first Finnish speaker ever to
have commented on the algorithm. While doing the work, I communicated with
Kalervo Jarvelin of Tampere University (email@example.com), who was most
helpful, but I have not communicated with him since its completion. I suspect
therefore the algorithm has not been used much (if at all) in Finland, but I
know that all the Snowball algorithms are sometimes incorporated *en bloc*
into other work.
If you want to work on the algorithm it might be useful to contact Kalervo.
I see that Nutch is an open source search engine, or will be when written.
(Perhaps you have come across Xapian - no point in reinventing the wheel ;-) )
I think stemmers should be part of Nutch's armoury, even if it
incorporates morphological analysis work.
I am aware of understemming issues in the Finnish stemmer. Your idea for
handling the iensä etc. endings looks well worth investigating. (The way
plurals work is the trickiest part of the whole ending structure.) I seem to
recall that the iaan and iään endings cannot be removed safely, because of the
many overstemming cases. You are right that the algorithm was done by a
non-native speaker (me), and so it certainly is due for further
investigation. If you would like to try it that would be great. Otherwise
I'll test out your ideas next year. (I'm rather busy with other things at
the moment.)
A word of warning: the generated Java version of the Finnish stemmer does not
work, because the stemmer uses a Snowball feature not used by the other
stemmers on which the Java code generator was tested. Neither Richard Boulton
nor I currently
have the resources to fix this.