Hi,
First I want to thank Martin Porter (and everyone else working on
Snowball) for their work on Snowball.
As the search engine for a research project we are using a .NET port of
Lucene (Lucene.NET, not to be confused with nlucene).
Because this port doesn't have a Dutch stemmer, I've implemented the
Dutch Snowball stemming algorithm in C#
(my implementation will be available in a future version of Lucene.NET).
It stems the Dutch Snowball vocabulary exactly as Snowball does.
I think I have found a small improvement to the Dutch stemming algorithm
(beware, I'm not a linguist).
The routine
define undouble as (
test among('kk' 'dd' 'tt') [next] delete
)
could be improved by also undoubling the "nn", "mm" and "ff" endings:
define undouble as (
test among('kk' 'dd' 'tt' 'nn' 'mm' 'ff') [next] delete
)
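
For readers who don't know Snowball, here is a minimal Python sketch of
what the two variants of undouble do. It covers only this one routine,
not the full stemmer, and it assumes the input is the word after an
earlier step has already stripped a suffix such as "en". The names
ORIGINAL_DOUBLES, EXTENDED_DOUBLES and the function signature are made up
for illustration and are not taken from Snowball or Lucene.NET.

```python
# Doubled endings handled by the current routine.
ORIGINAL_DOUBLES = ("kk", "dd", "tt")
# Proposed extension: also undouble "nn", "mm" and "ff".
EXTENDED_DOUBLES = ORIGINAL_DOUBLES + ("nn", "mm", "ff")


def undouble(word, doubles=EXTENDED_DOUBLES):
    """Drop the last letter if the word ends in one of the doubled pairs."""
    # str.endswith accepts a tuple and returns True if any suffix matches.
    if word.endswith(doubles):
        return word[:-1]
    return word
```

With the extended list, undouble("mann") gives "man", while the original
list leaves "mann" unchanged.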
After this change, the stemmed Dutch Snowball vocabulary has
494 differences from the old stemmed vocabulary (in my
implementation, that is).
Almost all of these differences are improvements.
Plurals are now stemmed to the same stem as their singulars:
"mannen" -> "man" (=men, man)
"stoffen" -> "stof" (=substance)
"vlammen" -> "vlam" (=flame)
Infinitives are now stemmed to the verb stem:
"kennen" -> "ken" (=know)
"treffen" -> "tref" (=hit)
"zwemmen" -> "zwem" (=swim)
The only strange difference (of the 494) I've found is that "binnen"
(=inside) was stemmed to "binnen" and is now stemmed to "bin". This is
not a problem, since the new stem is not taken by another word.
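
The comparison described above (counting the differences and checking
whether a new stem is already taken) could be sketched roughly as follows
in Python. This is not the code I used. The function names are made up
for illustration, and each vocabulary is assumed to be a plain dict
mapping a word to its stem.

```python
def changed_stems(old, new):
    """Return {word: (old_stem, new_stem)} for words whose stem differs."""
    return {
        word: (old[word], new[word])
        for word in old
        if word in new and old[word] != new[word]
    }


def new_collisions(old, new):
    """Return {word: [other words]} where a changed word's new stem was
    already the old stem of some other word."""
    # Index the old vocabulary by stem.
    words_by_stem = {}
    for word, stem in old.items():
        words_by_stem.setdefault(stem, set()).add(word)

    collisions = {}
    for word, (_, new_stem) in changed_stems(old, new).items():
        others = words_by_stem.get(new_stem, set()) - {word}
        if others:
            collisions[word] = sorted(others)
    return collisions


# Toy data for illustration only, not actual stemmer output.
old_vocab = {"binnen": "binnen", "mannen": "mann", "man": "man"}
new_vocab = {"binnen": "bin", "mannen": "man", "man": "man"}
```

On this toy data, changed_stems reports two differences, and
new_collisions reports only "mannen" (now sharing a stem with "man",
which is the intended merge). "binnen" does not appear, because "bin" was
not the stem of any other word.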
Regards,
Edwin de Jonge