Re: [Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?

From: Martin Porter
Date: Sun Jan 04 2004 - 13:49:02 GMT


>Would that stem words like these Dutch words:
>cd'tje -> cd
>tv'tje -> tv
>a4'tje -> a4
>baby'tje -> baby

- a good point. They are not of course removed at present (see the note in )
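
For illustration only, here is a minimal Python sketch of what removing the quoted ending could look like as a separate pre-step. The function name strip_apostrophe_diminutive is hypothetical and is not part of the Snowball Dutch stemmer, which at present leaves these forms alone; the sketch handles only the literal ending "'tje" from the examples above.

```python
def strip_apostrophe_diminutive(word: str) -> str:
    """Hypothetical pre-step: drop a trailing "'tje" if something precedes it.

    This is an assumption made for illustration, not a rule taken from
    the Snowball Dutch stemmer.
    """
    suffix = "'tje"
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # The four examples from the quoted message, plus one ordinary
    # diminutive without an apostrophe, which this pre-step leaves alone.
    for w in ("cd'tje", "tv'tje", "a4'tje", "baby'tje", "huisje"):
        print(w, "->", strip_apostrophe_diminutive(w))
```

Words without the apostrophe form, such as "huisje", pass through unchanged and would still be left to whatever the stemmer itself does.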

We'll have to look into it a bit further, but I would not expect it to
affect things much. Switching to English, the main advantage of handling
the apostrophe is that one can distinguish a final -ss (where the last -s
would not be removed) from a final -s's (where the last s should be
removed). But this is obviously a minor point, given the rarity of -s's
endings (the cyclops's cave, etc.).
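
The -ss versus -s's distinction can be shown with a toy Python sketch. It is a deliberately simplified rule written for this example, not the actual Snowball English stemmer; the function strip_s and its handle_apostrophe flag are hypothetical names. With the flag off, the apostrophe is simply discarded before the suffix rule runs, so "cyclops's" looks like a word ending in -ss.

```python
def strip_s(word: str, handle_apostrophe: bool = True) -> str:
    """Toy final-s removal (an illustration, not the Snowball algorithm).

    With handle_apostrophe=True a trailing "'s" is removed first.
    With handle_apostrophe=False every apostrophe is just dropped,
    as a tokenizer that treats it as punctuation might do.
    """
    if handle_apostrophe:
        if word.endswith("'s"):
            word = word[:-2]
    else:
        word = word.replace("'", "")

    if word.endswith("ss"):
        return word          # keep a final -ss intact
    if word.endswith("s"):
        return word[:-1]     # remove a single final -s
    return word


if __name__ == "__main__":
    for w in ("caress", "cats", "cyclops", "cyclops's"):
        print(w, "->", strip_s(w), "|", strip_s(w, handle_apostrophe=False))
```

In this sketch "cyclops" and "cyclops's" reduce to the same form only when the apostrophe is handled; when it is discarded, "cyclops's" becomes "cyclopss" and the -ss rule leaves it untouched.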

I think you are right about the difficulty of stemming Dutch. The problem
is that the language assimilates foreign inflections as readily as foreign
words. This is especially noticeable in modern technical writing.

