Re[6]: [Snowball-discuss] an inconsistency with Russian stemmer

From: Andrew Aksyonoff
Date: Sun Nov 18 2001 - 13:17:07 GMT

Hello Martin!

Saturday, November 17, 2001, 8:22:18 PM, you wrote:
MP> I can see how to fix the problem, but I am waiting to get in touch with Pat
MP> Miles (our Russian consultant) before doing so. I want to make sure I get it
MP> right.

Meanwhile, I've generated a word-forms dictionary from the latest
Russian Ispell dictionary (the largest that I could find so far).
The result contains 99208 original forms and 923478 derived forms.

Let OF() be the mapping from derived forms to their corresponding
original forms. I supposed that an ideal stemming function S() would
satisfy the following criteria:

1) if OF(A) = OF(B), then S(A) = S(B)
2) if S(A) = S(B), then OF(A) = OF(B)

So I checked the current Russian stemmer against both, counting
the number of errors. There were 26335 errors of the first kind (i.e.,
different stems for the same original form) and 44157 errors of the
second kind (i.e., the same stem for different original forms).
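
For reference, here is a minimal sketch in Python of how such a check
could be run. It is not the script that produced the numbers above, and
the counting rule is an assumption: an error of the first kind is counted
once per original form whose derived forms get more than one stem, and an
error of the second kind once per stem shared by more than one original
form. The names count_errors, TOY_OF and toy_stem are hypothetical, and
the toy English data only stands in for the real Ispell-derived dictionary.

```python
from collections import defaultdict


def count_errors(of, stem):
    """Count violations of the two criteria for a stemming function.

    of   -- dict mapping each derived form to its original form, i.e. OF()
    stem -- callable implementing the stemming function S()

    Returns (first_kind, second_kind) under the counting rule assumed
    in the text above.
    """
    stems_by_original = defaultdict(set)   # original form -> stems seen
    originals_by_stem = defaultdict(set)   # stem -> original forms seen
    for form, original in of.items():
        s = stem(form)
        stems_by_original[original].add(s)
        originals_by_stem[s].add(original)

    # Criterion 1 violated: one original form, several different stems.
    first_kind = sum(1 for stems in stems_by_original.values()
                     if len(stems) > 1)
    # Criterion 2 violated: one stem, several different original forms.
    second_kind = sum(1 for originals in originals_by_stem.values()
                      if len(originals) > 1)
    return first_kind, second_kind


# Toy data (hypothetical): derived form -> original form.
TOY_OF = {
    "cat": "cat", "cats": "cat",
    "ran": "run", "runs": "run", "running": "run",
    "new": "new", "news": "news",
}


def toy_stem(word):
    """Deliberately naive suffix stripper, for illustration only."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    print(count_errors(TOY_OF, toy_stem))
```

Other counting rules are just as plausible (for example, counting every
mismatching pair of forms), and they would give different totals, so the
absolute numbers only make sense when compared under one fixed rule.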

After some hacking, I was able to get this down to 12252 errors of the
first kind and 34632 errors of the second kind. The changes I made
seem to stem participles (in the Russian sense of the word)
much better, and stemming of verbs and verbals is also improved,
though it's still far from perfect. Some improvement
was also achieved by adding support for endearment suffixes.

So, are you interested in reviewing my changes and potentially
incorporating them into the algorithm?

- Andrew
