[Snowball-discuss] RE: Russian stemmer - adj suffix -n-

From: Martin Porter (martin_porter@softhome.net)
Date: Tue Sep 17 2002 - 08:20:02 BST


Svetlana,

It looks as if you are going to investigate -n- removal in the future, and
it would be interesting to hear how things progress, so please let us know!
Meanwhile, I don't really have the time or language knowledge to be able to
make much more progress here.

Your comments on vzysk- are interesting, but that was of course one word
picked out of many that illustrates the problem. Clearly one needs to look
at the pattern through the whole vocabulary.

Martin

At 08:45 PM 9/14/02 +1200, Svetlana Pereyaslavets wrote:
>Martin
>Thank you for the email.
>
>
>>The problem I find is that -n- is too frequently removed erroneously.
>
>Yes, true. The main problem with this particular suffix is that it must not
be present in stemmed words. Very often there are other suffixes (like -sk-,
-och-) between the root and this suffix, and these suffixes must also be
removed from the stems.
>
>I have not yet tested how well the stemming algorithm can accommodate
rules with -n-. I would expect that removing -n- may give better overall
performance of the stemming procedure, although many words could be
overstemmed. This can only be determined by testing the stemmer.
>
>
>> For example, vzyskanie (penalty) is a noun, with noun ending -ie, although
>> -ie is equally an adjectival ending. Removing -n- after removing -ie is
>> over-stemming, and the result then fails to conflate with vzyskan+X, where X
>> is a valid noun ending that is not also an adjectival ending. There are many
>> cases like this.
>
>This form is actually one of the luckiest cases for -n- removal. If -n- is
removed and then the following suffix -an- is removed, the word would be
stemmed perfectly.
>
>These are the entries for this word from the Russian stemmer output page. If
the suffix -n- in adjectives is removed jointly with the "obvious" verb
suffixes -iv- and -a-, all these words could be stemmed to the desirable stem
"vzysk-".
>взыскан
>взыскан
>взыскан
>взыскан
>взыскан
>взыскан
>взыскательн (*)
>взыскательн
>взыскательн
>взыска
>взыскива
>взыскива
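
The joint removal described above can be sketched as follows. This is only an illustration and not part of the Snowball Russian stemmer: the suffix list, its order, the use of -tel'- (to reduce взыскательн) and -iva-, and the `min_len` guard are all assumptions made for this example.

```python
# Hypothetical post-processing of stems, not the Snowball Russian stemmer itself.
# The suffix list and its order are assumptions made for this illustration.
SUFFIXES = ["н", "тель", "ива", "а"]  # -n-, -tel'-, -iva-, -a-


def strip_suffixes(stem, suffixes=SUFFIXES, min_len=3):
    """Remove each suffix at most once, in order, keeping at least min_len letters."""
    for suffix in suffixes:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
            stem = stem[: -len(suffix)]
    return stem


# The four distinct stems from the list above.
stems = ["взыскан", "взыскательн", "взыска", "взыскива"]
reduced = {s: strip_suffixes(s) for s in stems}
for original, result in reduced.items():
    print(original, "->", result)
```

Under these assumptions all four stems reduce to взыск. An unconditioned rule of this kind would also cut a stem such as стран (from страна, "country") down to стр, which is why the effect has to be checked across the whole vocabulary.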
>
>There is a small lexical problem with this particular word, though.
>The form marked with (*), together with an adjective or adverb ending,
means "exacting, exigent, strict, demanding". Strictly speaking, it should be
stemmed as it is now. On the other hand, this word is mostly an epithet
describing human traits, and is "rare" in professional literature.
So this word can be overstemmed to vzysk- without a significant loss of
"information".
>
>All other forms have the same meaning: "1) levy, collecting 2) punishment,
penalty, reprimand" (ref. http://www.lingvo.ru/lingvo/). I believe that
this second meaning is the "more valuable" term in the literature that
usually undergoes text retrieval procedures.
>For all other words that take the same morphological forms, this suffix
-el'- should not cause any problems.
>
>I hope this addresses your concern a little, and I will definitely run
and test the Russian stemmer later.
>
>> It occurs to me you must know Eibe Frank.
>
>Yes, of course. He is a lecturer at our department. I took his Machine
Learning course last semester. He was pleased to hear that you mentioned him.
>
>Kind regards
>Svetlana
>
>


