I have looked through your note, and tried out -n- removal by modifying the
stemmer. I can't get a satisfactory result, and suspect this is one of the
things I tried and rejected when first developing it, although I can't be
too certain of this. (Although later revised, the Russian stemmer was
developed more than ten years ago, so I forget certain details.)
The problem I find is that -n- is too frequently removed erroneously. In
your email you refer to "a dictionary" a couple of times, but it is
important to remember of course that the Snowball stemmers are purely
algorithmic and do not use dictionaries.
For example, vzyskanie (penalty) is a noun, with noun ending -ie, although
-ie is equally an adjectival ending. Removing -n- after removing -ie is
over-stemming, and the result then fails to conflate with vzyskan+X, where X
is a valid noun ending that is not also an adjectival ending. There are many
cases like this. One could try to compensate by stripping off all final n's,
but that is liable to result in many false conflations.
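
To make the vzyskanie case concrete, here is a rough sketch in Python rather than Snowball. It is not the Russian stemmer itself: the ending lists are tiny hypothetical subsets in Latin transliteration, and toy_stem and its strip_n flag are names invented purely for this illustration. It only shows how removing -n- after an adjectival ending breaks a conflation that otherwise holds.

```python
# Toy illustration only: hypothetical ending lists in Latin transliteration,
# not the rules of the real Snowball Russian stemmer.
ADJECTIVAL_ENDINGS = ("ie",)    # -ie can end an adjective (and also a noun)
NOUN_ONLY_ENDINGS = ("iyam",)   # assumed here to be a noun ending only


def toy_stem(word, strip_n=False):
    """Strip one ending; optionally drop a final -n- after an adjectival ending."""
    for ending in ADJECTIVAL_ENDINGS:
        if word.endswith(ending):
            stem = word[:-len(ending)]
            # The experimental rule: remove -n- once an adjectival ending is gone.
            if strip_n and stem.endswith("n"):
                stem = stem[:-1]
            return stem
    for ending in NOUN_ONLY_ENDINGS:
        if word.endswith(ending):
            # No adjectival ending was removed, so -n- is left alone.
            return word[:-len(ending)]
    return word


# Without -n- removal the two noun forms share a stem.
print(toy_stem("vzyskanie"), toy_stem("vzyskaniyam"))
# With -n- removal the -ie form is over-stemmed and the stems differ.
print(toy_stem("vzyskanie", strip_n=True), toy_stem("vzyskaniyam", strip_n=True))
```

With strip_n switched on, only the form that happens to carry the ambiguous -ie ending loses its -n-, so the two forms of the same noun end up with different stems.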
Perhaps with your knowledge of the language you could make it work. If so,
Snowball would be ideal for your experiments.
P.S. It occurs to me you must know Eibe Frank.