[Snowball-discuss] Finnish stemmer: some suggestions and some doubts

From: Vili Lehdonvirta (vili.lehdonvirta@hut.fi)
Date: Sat Nov 29 2003 - 12:11:01 GMT


Hi all,

First of all, let me say that the Finnish stemmer is an impressive piece
of work by someone who presumably does not speak the language. However, a
quick glance at the sample vocabulary immediately reveals instances of
what seems to me like understemming. I spent some time looking at this,
and here's what I found (if you're not interested in reading about minor
improvements to the algorithm, please skip to my questions, which are of a
more general nature).

In one class of instances I think the understemming is due to a shortcut
taken by the algorithm. In Finnish, some possessive suffixes (nsa nsä mme
nne) may absorb the genitive case suffix n. For example:
edeltäjä predecessor
edeltäjän predecessor's
edeltäjänsä his predecessor, his predecessor's (polyseme)

The algorithm stems these by first removing the possessive suffix (step
1), if any, and then the genitive case suffix (step 3), if any. Finally,
the trailing ä is removed. For all of the above words the resulting stem
is edeltäj, which seems fine.

However, if the genitive suffix is added to a plural, the plural marker
takes various forms before the suffix. For example:
edeltäjät predecessors
edeltäjien predecessors'

The algorithm accounts for some plurals in step 6 (b-d), and for the
particular type in the example above in the last rule of step 3. Thus,
both words are stemmed to edeltäj.

So, finally, here comes the problem:
edeltäjiensä his predecessors'

For the above word, step 1 correctly recognizes the possessive suffix and
proceeds to delete it. However, the remaining word edeltäjie does not
trigger the genitive suffix rule in step 3. The suffix n has been removed,
but the plural identifier ie remains. Thus, the word is stemmed to
edeltäjie, not edeltäj.
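
To make the failure concrete, here is a rough Python sketch. It is a toy
model and not the Snowball code: the function names are made up, only the
rules mentioned in this message are included, and the final "tidy" rule
(drop a trailing a, ä, e or i that follows a consonant) is my own
simplification of the last step.

```python
# Toy model of the three steps discussed above. All names are hypothetical
# and the rule set is deliberately reduced; this is not the real stemmer.
POSSESSIVE_SUFFIXES = ("nsa", "nsä", "mme", "nne")
VOWELS = set("aeiouyäö")


def remove_possessive(word):
    """Step 1 (toy): delete a possessive suffix, if any."""
    for suffix in POSSESSIVE_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def remove_genitive(word):
    """Step 3 (toy): delete plural genitive 'ien', else genitive 'n'."""
    if word.endswith("ien"):
        return word[:-3]
    if word.endswith("n"):
        return word[:-1]
    return word


def tidy(word):
    """Final step (simplified assumption): drop a trailing a/ä/e/i
    that directly follows a consonant."""
    if len(word) >= 2 and word[-1] in "aäei" and word[-2] not in VOWELS:
        return word[:-1]
    return word


def toy_stem(word):
    return tidy(remove_genitive(remove_possessive(word)))


for w in ("edeltäjä", "edeltäjän", "edeltäjänsä", "edeltäjien", "edeltäjiensä"):
    print(w, "->", toy_stem(w))
# The first four print "edeltäj"; the last one prints "edeltäjie".
```

In this sketch edeltäjiensä loses nsä in step 1, after which neither ien
nor n matches in step 3, so the ie survives.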

I think this could be fixed by modifying step 1 so that the possessive
suffixes (nsa nsä mme nne) are not deleted but replaced by n. That n is
then removed later in step 3, along with the preceding ie if present. I
can't think of any side effects, though I have not run any tests against
the vocabulary.
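
As a rough illustration of the proposal, the Python sketch below replaces
the possessive suffix with n instead of deleting it. Again this is only a
toy under stated assumptions: the function names are hypothetical, only
the rules named in this message are modelled, and the final vowel rule is
a simplification of mine. It says nothing about side effects on other
case endings.

```python
# Toy model of the proposed change to step 1. Hypothetical names,
# reduced rule set; not the real Snowball implementation.
POSSESSIVE_SUFFIXES = ("nsa", "nsä", "mme", "nne")
VOWELS = set("aeiouyäö")


def possessive_to_n(word):
    """Proposed step 1 (toy): turn a possessive suffix into a bare 'n'."""
    for suffix in POSSESSIVE_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + "n"
    return word


def remove_genitive(word):
    """Step 3 (toy): delete plural genitive 'ien', else genitive 'n'."""
    if word.endswith("ien"):
        return word[:-3]
    if word.endswith("n"):
        return word[:-1]
    return word


def tidy(word):
    """Final step (simplified assumption): drop a trailing a/ä/e/i
    that directly follows a consonant."""
    if len(word) >= 2 and word[-1] in "aäei" and word[-2] not in VOWELS:
        return word[:-1]
    return word


def proposed_stem(word):
    return tidy(remove_genitive(possessive_to_n(word)))


words = ("edeltäjä", "edeltäjän", "edeltäjänsä", "edeltäjien", "edeltäjiensä")
stems = {w: proposed_stem(w) for w in words}
print(stems)
# Every word maps to "edeltäj" in this toy model.
```

Here edeltäjiensä becomes edeltäjien after step 1, so the existing ien
rule in step 3 applies and no new rule is needed there.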

Another type of understemming, of which there seems to be a lot in the
sample vocabulary, is due to the possessive suffixes iaan, iään not being
recognized by the algorithm at all. However, this is not so
straightforward to fix, as those endings may also indicate something
else, particularly for imported words like akatemia, Austraalia. This
would need more looking into before any fixes can be suggested.

Now to the doubts part. I was looking for something meaningful to do for
the Nutch project, which led me to Lucene, which led me to wonder whether
there are good algorithms for normalizing Finnish, which led me here. Is
this algorithm being used in any applications? Would it be worth spending
some time on it? Is the algorithmic stemming approach for normalizing
Finnish the best choice for projects like Nutch, in your opinion? How
about "morphological analysis" [1]?

Finally, I'm beginning to wonder whether I should just leave stemming
and normalization to IR experts and linguists. I've done basic university
CS studies and have a few years of working experience, but that's as far
as it goes.

[1] http://www.linguistlist.org/issues/4/4-862.html

Cheers,

-- 
	Vili Lehdonvirta
	vili.lehdonvirta@hut.fi
	+65 94367590


