Saturday, November 17, 2001, 8:22:18 PM, you wrote:
MP> I can see how to fix the problem, but I am waiting to get in touch with Pat
MP> Miles (our Russian consultant) before doing so. I want to make sure I get it
Meanwhile, I've generated a word-forms dictionary from the latest
Russian Ispell dictionary (the largest that I could find so far).
The result contains 99208 original forms and 923478 derived forms.
Let OF() be the mapping from derived forms to their corresponding original
forms. I supposed that the ideal stemming function S() would satisfy
the following criteria:
1) if OF(A) = OF(B), then S(A) = S(B)
2) if S(A) = S(B), then OF(A) = OF(B)
So I checked the current Russian stemmer against both criteria, counting
the number of errors. There were 26335 errors of the first kind (i.e.,
different stems for the same original form) and 44157 errors of the
second kind (i.e., the same stem for different original forms).
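
For illustration, here is a minimal Python sketch of one way such a check
could be run. It is not the script used for the numbers above. The counting
convention is an assumption: every extra stem beyond the first for one
original form counts as one error of the first kind, and every extra
original form sharing one stem counts as one error of the second kind.
It also assumes each derived form maps to exactly one original form.
The names count_errors, toy_stem and TOY_OF are hypothetical, and the toy
data is English only to keep the example readable.

```python
from collections import defaultdict


def count_errors(original_form, stem):
    """Count violations of the two criteria for a stemmer.

    original_form: dict mapping each derived form to its original form (OF).
    stem: callable implementing the stemming function S().
    Returns (errors_of_first_kind, errors_of_second_kind) under the
    assumed counting convention described in the lead-in.
    """
    stems_by_original = defaultdict(set)   # OF(w) -> set of S(w)
    originals_by_stem = defaultdict(set)   # S(w)  -> set of OF(w)
    for word, orig in original_form.items():
        s = stem(word)
        stems_by_original[orig].add(s)
        originals_by_stem[s].add(orig)
    # Criterion 1 is violated when one original form yields several stems.
    first_kind = sum(len(stems) - 1 for stems in stems_by_original.values())
    # Criterion 2 is violated when one stem covers several original forms.
    second_kind = sum(len(origs) - 1 for origs in originals_by_stem.values())
    return first_kind, second_kind


def toy_stem(word):
    """Hypothetical toy stemmer: strip a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# Hypothetical toy dictionary: derived form -> original form.
TOY_OF = {
    "cat": "cat", "cats": "cat",
    "running": "run", "runs": "run", "ran": "run",
    "caring": "care",
    "car": "car", "cars": "car",
}

if __name__ == "__main__":
    print(count_errors(TOY_OF, toy_stem))
```

On the toy data, "run" ends up with three different stems (two errors of
the first kind), and "care" and "car" share the stem "car" (one error of
the second kind). A stemmer that returned OF(w) itself would score zero on
both counts.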
After some hacking, I was able to get this down to 12252 errors of the
first kind and 34632 errors of the second kind. The changes I made
seem to stem participles (in the Russian sense of the word)
much better, and stemming of verbs and verbals is also improved,
though it's still far from being perfect. Some improvement
was also achieved by adding support for endearment suffixes.
So, are you interested in reviewing my changes and potentially
incorporating them in the algorithm?
Snowball-discuss mailing list