Re: [Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?

From: Arjen van der Meijden (arjen@glas.its.tudelft.nl)
Date: Mon Jan 05 2004 - 15:54:01 GMT


Edwin,

I have a few counter-remarks, but I think you're mostly right :)

Edwin de Jonge wrote:

> Just like you, I don't think that there is/will be a perfect stemming
> algorithm for Dutch (with the exception of the absurd Snowball program
> where the stem is listed for every word: the dictionary approach).

That exception won't work either, because of the disambiguation problem
you mention. Anyway, it's something we just have to accept: if you tried
to fix it, you'd probably end up with a massive application with an
enormous number of rules that still makes mistakes ;)

> Nice that you did a search! Not trying to be a wise guy, but the words
> you have found all are of foreign origin:
> mazzel(en) = Yiddish/Hebrew
> puzzle/puzzelen = English
> quizzen = English (plural of quiz)
> But you are right: if they are in "van Dale" then they are Dutch words
> by definition.

Not entirely: they are accepted into the Dutch language and are therefore
included in "Van Dale", not the other way around :)
But then again, once a word is in Van Dale, you can be pretty sure it's
Dutch (or accepted into the Dutch language), although a word that is not
in Van Dale might still be a real Dutch word.

> Our 'strong verbs' are indeed a real pain in the butt for Snowball.
> Luckily, in modern Dutch more and more "strong" verbs are turning into
> "weak" verbs (but this is a slow process; for example, before 1930 the
> past tense of "wassen" (= wash) was "wies" instead of "waste(n)").

I didn't know that :)

> True, but as said before, Snowball doesn't do disambiguation.
> But it still is desirable that "manen" (in its different senses)
> maps to "maan" (in the same different senses).

Yeah, your approach has the advantage of producing more distinct stems.
Although mannen and manen can each have a few meanings, it is indeed a
win if manen and maan stay distinct from mannen and man.

Even if that means that both the horse's and Saturn's manen get stemmed
to maan.
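
To make the mannen/manen distinction concrete, here is a rough sketch in
Python (not Snowball script, and not the actual Dutch stemmer). The
function name toy_plural_stem and both rules are assumptions made up for
illustration: a doubled final consonant is undoubled, and a single vowel
before a single final consonant is doubled.

```python
VOWELS = set("aeiou")


def toy_plural_stem(word: str) -> str:
    """Toy rule for Dutch '-en' plurals; NOT the Snowball algorithm."""
    # Only handle '-en' words long enough to leave a 3-letter stem.
    if not word.endswith("en") or len(word) < 5:
        return word
    stem = word[:-2]
    # A doubled final consonant is undoubled: mannen -> mann -> man.
    if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]
    # A single vowel before a single final consonant is doubled:
    # manen -> man -> maan.
    if (stem[-1] not in VOWELS
            and stem[-2] in "aeou"
            and stem[-3] not in VOWELS):
        return stem[:-1] + stem[-2] + stem[-1]
    return stem


print(toy_plural_stem("mannen"))  # man
print(toy_plural_stem("manen"))   # maan
```

This keeps the two word families apart, but it does no disambiguation,
and the vowel-doubling rule is far too eager: it would also turn
puzzelen into "puzzeel".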

>>Make sure you don't strip 'ig' if it was 'tig', like dertig, gretig,
>>nattig, etc.
>>Actually, perhaps you shouldn't strip 'ig' at all; bazig means
>>something different than baas. And most, if not all, -ig versions of
>>nouns and verbs have a (sometimes slightly) different, derived meaning.
>
> For this one I'm neutral. I think the "ig" suffix is used the same way
> as the "y" suffix in English (e.g. boss, bossy, wet, wetty).
> You are right about the shift in meaning, but I'm not sure if it is
> enough.

I don't really know whether the shift in meaning is very bad or whether
it'll result in weird clashes.
But the change is larger than the change you get from simply stemming a
plural to its singular form and stemming verbs to their stem.
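
For reference, the 'tig' guard mentioned above could look roughly like
this in Python. This is only a sketch: strip_ig is a hypothetical helper,
not a step taken from the Snowball Dutch stemmer, and the minimum-length
check is an assumption.

```python
def strip_ig(word: str) -> str:
    """Remove a trailing 'ig' unless it is part of 'tig' (toy helper)."""
    # Assumed guard: keep very short words and anything ending in 'tig'.
    if len(word) > 3 and word.endswith("ig") and not word.endswith("tig"):
        return word[:-2]
    return word


for w in ("dertig", "gretig", "nattig", "bazig"):
    print(w, "->", strip_ig(w))
```

With this guard dertig, gretig and nattig are left alone, while bazig
becomes "baz", which still needs further steps before it matches baas
(and then carries the shift in meaning discussed above).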

> Same issue here I think: disambiguation. The change proposed stems these
> words to the correct stem (only they are still ambiguous).

Yep, it really appears as if we just loved making our language as
ambiguous as possible.

Best regards,

Arjen


