Re[6]: [Snowball-discuss] an inconsistency with Russian stemmer

From: Andrew Aksyonoff (shodan@chat.ru)
Date: Sun Nov 18 2001 - 13:17:07 GMT


Hello Martin!

Saturday, November 17, 2001, 8:22:18 PM, you wrote:
MP> I can see how to fix the problem, but I am waiting to get in touch with Pat
MP> Miles (our Russian consultant) before doing so. I want to make sure I get it
MP> right.
Okay.

Meanwhile, I've generated a word-form dictionary from the latest
Russian Ispell dictionary (the largest that I could find so far).
The result contains 99208 original forms and 923478 derived forms.

Let OF() be the mapping from derived forms to their corresponding
original forms. I assumed that an ideal stemming function S() would
satisfy the following criteria:

1) if OF(A) = OF(B), then S(A) = S(B)
2) if S(A) = S(B), then OF(A) = OF(B)

So I checked the current Russian stemmer against both criteria,
counting the number of errors. There were 26335 errors of the first
kind (i.e., different stems for the same original form) and 44157
errors of the second kind (i.e., the same stem for different
original forms).
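
For reference, here is a minimal sketch of how such a check could be
written. It is not the script used for the numbers above. The names
count_errors, of and stem are made up for illustration, and the
counting convention is an assumption: each extra stem per original
form counts as one error of the first kind, and each extra original
form per stem counts as one error of the second kind. It also treats
OF() as a plain function, so a derived form belonging to several
original forms is not handled.

```python
from collections import defaultdict


def count_errors(of, stem):
    """Count violations of the two criteria.

    of   -- dict mapping each derived form to its original form (OF)
    stem -- callable implementing the stemming function (S)

    Assumed convention: an original form with k distinct stems adds
    k - 1 errors of the first kind; a stem shared by k distinct
    original forms adds k - 1 errors of the second kind.
    """
    stems_by_original = defaultdict(set)
    originals_by_stem = defaultdict(set)

    for form, original in of.items():
        s = stem(form)
        stems_by_original[original].add(s)
        originals_by_stem[s].add(original)

    first_kind = sum(len(v) - 1 for v in stems_by_original.values())
    second_kind = sum(len(v) - 1 for v in originals_by_stem.values())
    return first_kind, second_kind


# Toy data (English, purely illustrative) and a deliberately crude
# "stemmer" that keeps only the first three letters.
toy_of = {
    "walk": "walk", "walks": "walk", "walked": "walk",
    "wall": "wall", "walls": "wall",
    "went": "go", "goes": "go",
}


def toy_stem(word):
    return word[:3]


first, second = count_errors(toy_of, toy_stem)
print(first, second)
```

In the toy data, "went" and "goes" get different stems (one error of
the first kind), while "walk" and "wall" collapse to the same stem
(one error of the second kind).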

After some hacking, I was able to get this down to 12252 errors of
the first kind and 34632 errors of the second kind. The changes I
made seem to stem participles (in the Russian sense of the word)
much better, and stemming of verbs and verbals is also improved,
though it's still far from perfect. Some improvement also came
from adding support for endearment suffixes.

So, are you interested in reviewing my changes and potentially
incorporating them in the algorithm?

- Andrew
