[Snowball-discuss] Sample Russian vocabulary

From: smp29 (smp29@cs.waikato.ac.nz)
Date: Sun Feb 09 2003 - 19:21:02 GMT


I am making some changes to the Russian stemmer, and want to evaluate
the effectiveness of these modifications.
I am concerned that the sample file supplied with the Russian stemmer
(voc.txt) is not very typical. Its vocabulary reads as if it were taken
from Chekhov's books. It contains many words that are not in frequent
use in the documents we normally stem (scientific articles, news,
technical reports).

Does anybody, by chance, have a sample vocabulary of the most common
words of the modern Russian language? It does not matter whether it
includes stop-words. The file may be similar in size to the current
voc.txt, or smaller. Alternatively, could you recommend somewhere such
a file can be downloaded?
More generally, my question is: has anybody done this type of
evaluation? Would you recommend being very selective with the test
vocabulary, or does the choice of vocabulary have little effect on the
overall stemmer evaluation?

I would be grateful for any references and advice.

Kind regards

