[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?

From: Edwin de Jonge (ejne@rnd.vb.cbs.nl)
Date: Thu Jan 01 2004 - 13:35:02 GMT


Hi,
 
First, I want to thank Martin Porter (and everyone else working on
Snowball) for their work on Snowball.
 
As the search engine for a research project, we are using a .NET port of
Lucene (Lucene.NET, not to be confused with nlucene).
Because this port doesn't have a Dutch stemmer, I've implemented the
Dutch Snowball stemming algorithm in C#
(my implementation will be available in a future version of Lucene.NET).
It stems the Dutch Snowball vocabulary exactly as Snowball does.
 
I think I have found a small improvement to the Dutch stemming algorithm
(beware, I'm not a linguist).
The routine
 
    define undouble as (
        test among('kk' 'dd' 'tt') [next] delete
    )
 
would be improved if a doubled "nn", "mm" or "ff" ending were also undoubled:
 
    define undouble as (
        test among('kk' 'dd' 'tt' 'nn' 'mm' 'ff') [next] delete
    )
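
To make the effect of the extra pairs concrete, here is a minimal Python sketch of the undouble step in isolation. It is not the Snowball source or my C# port; the names ORIGINAL_PAIRS, EXTENDED_PAIRS and undouble are only illustrative, and the sketch assumes the step is applied to a word from which a suffix such as "en" has already been removed.

```python
# Doubled consonant endings handled by the current routine,
# and the extended set proposed in this message.
ORIGINAL_PAIRS = ("kk", "dd", "tt")
EXTENDED_PAIRS = ORIGINAL_PAIRS + ("nn", "mm", "ff")


def undouble(word, pairs=ORIGINAL_PAIRS):
    """Drop the final letter if the word ends in one of the doubled pairs."""
    # str.endswith accepts a tuple and matches any of its members.
    if word.endswith(pairs):
        return word[:-1]
    return word


# "mann" is "mannen" with the "en" ending already stripped.
print(undouble("mann"))                  # unchanged with the current pairs
print(undouble("mann", EXTENDED_PAIRS))  # undoubled with the extended pairs
```

With the current pairs "mann" is left alone, while with the extended pairs it becomes "man"; "bedd" becomes "bed" in both cases.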
 
After this change, the stemmed Dutch Snowball vocabulary differs from the
old stemmed vocabulary in 494 entries (in my implementation, at least).
Almost all of these differences are improvements:
    plurals are now stemmed the same as their singulars:
"mannen" -> "man" (=men, man)
"stoffen" -> "stof" (=substance)
"vlammen" -> "vlam" (=flame)
    infinitives are now stemmed to the verb stem:
"kennen" -> "ken" (=know)
"treffen" -> "tref" (=hit)
"zwemmen" -> "zwem" (=swim)
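
A comparison like the one above can be sketched in Python as follows. This is only an illustration of the procedure: toy_stem is a hypothetical stand-in that strips a final "en" and then undoubles, and it ignores every other rule and condition of the real Dutch algorithm, so its output is not the real stemmer's output. The helper changed_stems is likewise an illustrative name.

```python
ORIGINAL_PAIRS = ("kk", "dd", "tt")
EXTENDED_PAIRS = ORIGINAL_PAIRS + ("nn", "mm", "ff")


def toy_stem(word, pairs):
    """Toy stand-in for a stemmer: strip a final "en", then undouble."""
    if word.endswith("en"):
        word = word[:-2]
        if word.endswith(pairs):
            word = word[:-1]
    return word


def changed_stems(vocabulary, old_stem, new_stem):
    """Return (word, old stem, new stem) for every word whose stem differs."""
    changes = []
    for word in vocabulary:
        old, new = old_stem(word), new_stem(word)
        if old != new:
            changes.append((word, old, new))
    return changes


def old_stem(word):
    return toy_stem(word, ORIGINAL_PAIRS)


def new_stem(word):
    return toy_stem(word, EXTENDED_PAIRS)


vocabulary = ["man", "mannen", "kennen", "bedden"]
for word, old, new in changed_stems(vocabulary, old_stem, new_stem):
    print(f"{word}: {old} -> {new}")
```

Running the real old and new stemmers over the full vocabulary in the same way yields the list of differences to inspect by hand.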
 
The only strange difference (of the 494) I've found is that "binnen"
(=inside) used to be stemmed to "binnen" and is now stemmed to "bin". This
should not be a problem, since the new stem is not taken by another word.
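
Whether a new stem is "taken" by another word can be checked by grouping the vocabulary by stem. The Python sketch below shows the idea on a tiny hand-written table. The function name words_sharing_stem is illustrative, and the entry "man" -> "man" is an assumption added for the example; the other two pairs are taken from this message.

```python
from collections import defaultdict


def words_sharing_stem(stems):
    """Map each stem to the sorted list of words that produce it."""
    groups = defaultdict(list)
    for word, stem in stems.items():
        groups[stem].append(word)
    return {stem: sorted(words) for stem, words in groups.items()}


# Illustrative word -> new stem table (not the full vocabulary).
new_stems = {"man": "man", "mannen": "man", "binnen": "bin"}

for stem, words in words_sharing_stem(new_stems).items():
    print(stem, words)
```

A stem listed with several unrelated words would indicate an unwanted conflation; here "bin" is produced by "binnen" alone, while "man" groups the singular with its plural as intended.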
 
Regards,
 
Edwin de Jonge


