Russian stemming algorithm

Links to resources

The Snowball stemmer represents the Cyrillic alphabet with ASCII characters, following the standard Library of Congress transliteration scheme. The sample vocabulary and its stemmed equivalents can also be viewed in this transliterated form.

Here is a sample of Russian vocabulary, with the stemmed forms that will be generated with this algorithm.

word          stem          word          stem

в             в             п             п
вавиловка     вавиловк      па            па
вагнера       вагнер        пава          пав
вагон         вагон         павел         павел
вагона        вагон         павильон      павильон
вагоне        вагон         павильонам    павильон
вагонов       вагон         павла         павл
вагоном       вагон         павлиний      павлин
вагоны        вагон         павлиньи      павлин
важная        важн          павлиньим     павлин
важнее        важн          павлович      павлович
важнейшие     важн          павловна      павловн
важнейшими    важн          павловне      павловн
важничал      важнича       павловной     павловн
важно         важн          павловну      павловн
важного       важн          павловны      павловн
важное        важн          павловцы      павловц
важной        важн          павлыч        павлыч
важном        важн          павлыча       павлыч
важному       важн          пагубная      пагубн
важности      важност       падает        пада
важностию     важност       падай         пада
важность      важност       падал         пада
важностью     важност       падала        пада
важную        важн          падаль        падал
важны         важн          падать        пада
важные        важн          падаю         пада
важный        важн          падают        пада
важным        важн          падающего     пада
важных        важн          падающие      пада
вазах         ваз           падеж         падеж
вазы          ваз           падение       паден
вакса         вакс          падением      паден
вакханка      вакханк       падении       паден
вал           вал           падений       паден
валандался    валанда       падения       паден
валентина     валентин      паденье       паден
валериановых  валерианов    паденья       паден
валерию       валер         падет         падет
валетами      валет         падут         падут
вали          вал           падучая       падуч
валил         вал           падчерицей    падчериц
валился       вал           падчерицы     падчериц
валится       вал           падшая        падш
валов         вал           падшей        падш
вальдшнепа    вальдшнеп     падшему       падш
вальс         вальс         падший        падш
вальса        вальс         падшим        падш
вальсе        вальс         падших        падш
вальсишку     вальсишк      падшую        падш
вальтера      вальтер       паек          паек
валяется      валя          пазухи        пазух
валялась      валя          пазуху        пазух
валялись      валя          пай           па
валялось      валя          пакет         пакет
валялся       валя          пакетом       пакет
валять        валя          пакеты        пакет
валяются      валя          пакостей      пакост
вам           вам           пакостно      пакостн
вами          вам           пал           пал

The stemming algorithm

The i-suffixes (inflectional suffixes) of Russian tend to be quite regular, with irregularities of declension involving a change to the stem. Irregular forms therefore usually just generate two or more possible stems. Stems in Russian can be very short, and many of the suffixes are also particle words that make ‘natural stopwords’, so a tempting way of running the stemmer is to set a minimum stem length of zero, and thereby reduce to null all words which are made up entirely of suffix parts. We have been a little more cautious, and have insisted that a minimum stem contains one vowel.

The 32 letters of the Russian alphabet are as follows, with the transliterated forms that we will use here shown in brackets:

    а (a)    б (b)       в (v)    г (g)    д (d)    е (e)     ж (zh)    з (z)
    и (i)    й (ì)       к (k)    л (l)    м (m)    н (n)     о (o)     п (p)
    р (r)    с (s)       т (t)    у (u)    ф (f)    х (kh)    ц (ts)    ч (ch)
    ш (sh)   щ (shch)    ъ (")    ы (y)    ь (')    э (è)     ю (iu)    я (ia)

There is a 33rd letter, ё (ë), but it is rarely used, and we assume it is mapped into е (e).
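The transliteration above is a simple letter-by-letter mapping. As a minimal sketch in Python (the names `TRANSLIT` and `transliterate` are illustrative and not part of the Snowball distribution), assuming lowercase input:

```python
# Transliteration table copied from the alphabet listing above.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh",
    "з": "z", "и": "i", "й": "ì", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch", "ъ": '"',
    "ы": "y", "ь": "'", "э": "è", "ю": "iu", "я": "ia",
}


def transliterate(word: str) -> str:
    """Map a lowercase Russian word to the transliterated form used here."""
    word = word.replace("ё", "е")  # the 33rd letter is folded into е (e)
    # Assumption: characters outside the table are passed through unchanged.
    return "".join(TRANSLIT.get(ch, ch) for ch in word)


print(transliterate("бегавшая"))  # begavshaia
```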

The following are vowels:

а (a) е (e) и (i) о (o) у (u) ы (y) э (è) ю (iu) я (ia)

In any word, RV is the region after the first vowel, or the end of the word if it contains no vowel.

R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.

For example:

    p r o t i v o e s t e s t v e n n o m
         |<------       RV        ------>|
           |<-----       R1       ------>|
               |<-----     R2     ------>|

(See note on R1 and R2.)
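The three region definitions can be sketched in Python as follows. This is an illustration only, working directly on lowercase Cyrillic text; the function name `regions` and its return convention (start offsets into the word) are assumptions made for the example.

```python
VOWELS = "аеиоуыэюя"


def regions(word: str) -> tuple[int, int, int]:
    """Return the start offsets of RV, R1 and R2 within word."""
    n = len(word)

    # RV starts just after the first vowel (or at the end of the word).
    rv = next((i + 1 for i, ch in enumerate(word) if ch in VOWELS), n)

    def after_vowel_then_non_vowel(start: int) -> int:
        # Offset just past the first non-vowel that follows a vowel,
        # looking only at letters from `start` onwards.
        for i in range(start + 1, n):
            if word[i] not in VOWELS and word[i - 1] in VOWELS:
                return i + 1
        return n

    r1 = after_vowel_then_non_vowel(0)
    r2 = after_vowel_then_non_vowel(r1)
    return rv, r1, r2


word = "противоестественном"          # protivoestestvennom
rv, r1, r2 = regions(word)
print(rv, r1, r2)                      # 3 4 6
print(word[rv:], word[r1:], word[r2:])
```

The offsets 3, 4 and 6 correspond to the three bars in the diagram above: RV begins at т (t), R1 at и (i), and R2 at the second о (o).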

We now define the following classes of ending:

PERFECTIVE GERUND:

group 1: в (v) вши (vshi) вшись (vshis')

group 2: ив (iv) ивши (ivshi) ившись (ivshis') ыв (yv) ывши (yvshi) ывшись (yvshis')

group 1 endings must follow а (a) or я (ia)

ADJECTIVE:

ее (ee) ие (ie) ые (ye) ое (oe) ими (imi) ыми (ymi) ей (eì) ий (iì) ый (yì) ой (oì) ем (em) им (im) ым (ym) ом (om) его (ego) ого (ogo) ему (emu) ому (omu) их (ikh) ых (ykh) ую (uiu) юю (iuiu) ая (aia) яя (iaia) ою (oiu) ею (eiu)

PARTICIPLE:

group 1: ем (em) нн (nn) вш (vsh) ющ (iushch) щ (shch)

group 2: ивш (ivsh) ывш (yvsh) ующ (uiushch)

group 1 endings must follow а (a) or я (ia)

REFLEXIVE:

ся (sia) сь (s')

VERB:

group 1: ла (la) на (na) ете (ete) йте (ìte) ли (li) й (ì) л (l) ем (em) н (n) ло (lo) но (no) ет (et) ют (iut) ны (ny) ть (t') ешь (esh') нно (nno)

group 2: ила (ila) ыла (yla) ена (ena) ейте (eìte) уйте (uìte) ите (ite) или (ili) ыли (yli) ей (eì) уй (uì) ил (il) ыл (yl) им (im) ым (ym) ен (en) ило (ilo) ыло (ylo) ено (eno) ят (iat) ует (uet) уют (uiut) ит (it) ыт (yt) ены (eny) ить (it') ыть (yt') ишь (ish') ую (uiu) ю (iu)

group 1 endings must follow а (a) or я (ia)

NOUN:

а (a) ев (ev) ов (ov) ие (ie) ье ('e) е (e) иями (iiami) ями (iami) ами (ami) еи (ei) ии (ii) и (i) ией (ieì) ей (eì) ой (oì) ий (iì) й (ì) иям (iiam) ям (iam) ием (iem) ем (em) ам (am) ом (om) о (o) у (u) ах (akh) иях (iiakh) ях (iakh) ы (y) ь (') ию (iiu) ью ('iu) ю (iu) ия (iia) ья ('ia) я (ia)

SUPERLATIVE:

ейш (eìsh) ейше (eìshe)

These are all i-suffixes. The list of d-suffixes is very short:

DERIVATIONAL:

ост (ost) ость (ost')

Define an ADJECTIVAL ending as an ADJECTIVE ending optionally preceded by a PARTICIPLE ending.

For example, in

    бегавшая = бега + вш + ая
    (begavshaia = bega + vsh + aia)

ая (aia) is an adjective ending, and вш (vsh) a participle ending of group 1 (preceded by the final а (a) of бега (bega)), so вшая (vshaia) is an adjectival ending.
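A small Python sketch of the ADJECTIVAL definition may make the composition clearer. It is only an illustration: the helper names are invented for the example, the ending is searched for after the first vowel of the word (the RV region), and the ending lists are copied from the classes above.

```python
VOWELS = "аеиоуыэюя"
ADJECTIVE = ("ее ие ые ое ими ыми ей ий ый ой ем им ым ом его ого ему ому "
             "их ых ую юю ая яя ою ею").split()
PARTICIPLE_1 = "ем нн вш ющ щ".split()   # group 1: must follow а or я
PARTICIPLE_2 = "ивш ывш ующ".split()     # group 2: no condition


def longest(text, endings):
    """Longest ending from `endings` that `text` ends with, or None."""
    return max((e for e in endings if text.endswith(e)), key=len, default=None)


def strip_adjectival(word):
    """Remove an ADJECTIVAL ending, or return None if there is none."""
    rv_start = next((i + 1 for i, c in enumerate(word) if c in VOWELS), len(word))
    head, rv = word[:rv_start], word[rv_start:]

    adjective = longest(rv, ADJECTIVE)
    if adjective is None:
        return None
    rv = rv[:-len(adjective)]

    # The PARTICIPLE ending in front of the adjective ending is optional.
    participle = longest(rv, PARTICIPLE_1 + PARTICIPLE_2)
    if participle is not None:
        rest = rv[:-len(participle)]
        if participle in PARTICIPLE_2 or rest.endswith(("а", "я")):
            rv = rest
    return head + rv


print(strip_adjectival("бегавшая"))  # бега
```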

In searching for an ending in a class, always choose the longest one from the class.

So in searching for a NOUN ending for величие (velichie), choose ие (ie) rather than е (e).
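The longest-match rule is easy to get wrong if endings are simply tried in list order. A hedged sketch in Python (the function name `longest_ending` is hypothetical; the NOUN list is copied from above):

```python
NOUN = ("а ев ов ие ье е иями ями ами еи ии и ией ей ой ий й иям ям ием ем "
        "ам ом о у ах иях ях ы ь ию ью ю ия ья я").split()


def longest_ending(word, endings):
    """Return the longest ending of the class that the word ends with."""
    candidates = [e for e in endings if word.endswith(e)]
    return max(candidates, key=len) if candidates else None


print(longest_ending("величие", NOUN))   # ие  (not е)
print(longest_ending("валетами", NOUN))  # ами (not и)
```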

Undouble н (n) means, if the word ends нн (nn), remove the last letter.

Here now are the stemming rules.

All tests take place in the RV part of the word.

So in the test for perfective gerund, the а (a) or я (ia) which the group 1 endings must follow must itself be in RV. In other words the letters before the RV region are never examined in the stemming process.

Do each of steps 1, 2, 3 and 4.

Step 1: Search for a PERFECTIVE GERUND ending. If one is found, remove it, and that is then the end of step 1. Otherwise try to remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a NOUN ending. As soon as one of the endings (1) to (3) is found, remove it and terminate step 1.

Step 2: If the word ends with и (i), remove it.

Step 3: Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.

Step 4: (1) Undouble н (n), or, (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends ь (') (soft sign) remove it.
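For readers who do not know Snowball, steps 1 to 4 can be sketched in Python. This is an illustrative transcription of the rules above, not the reference implementation; all function and variable names are invented for the sketch, and input is assumed to be a single lowercase word.

```python
VOWELS = "аеиоуыэюя"

# Each class is (group 1, group 2); group 1 endings must follow а or я.
PERFECTIVE_GERUND = ("в вши вшись".split(), "ив ивши ившись ыв ывши ывшись".split())
ADJECTIVE = ([], ("ее ие ые ое ими ыми ей ий ый ой ем им ым ом его ого ему "
                  "ому их ых ую юю ая яя ою ею").split())
PARTICIPLE = ("ем нн вш ющ щ".split(), "ивш ывш ующ".split())
REFLEXIVE = ([], "ся сь".split())
VERB = ("ла на ете йте ли й л ем н ло но ет ют ны ть ешь нно".split(),
        ("ила ыла ена ейте уйте ите или ыли ей уй ил ыл им ым ен ило ыло "
         "ено ят ует уют ит ыт ены ить ыть ишь ую ю").split())
NOUN = ([], ("а ев ов ие ье е иями ями ами еи ии и ией ей ой ий й иям ям ием "
             "ем ам ом о у ах иях ях ы ь ию ью ю ия ья я").split())


def regions(word):
    """Return the start offsets of RV and R2."""
    n = len(word)
    rv = next((i + 1 for i, c in enumerate(word) if c in VOWELS), n)

    def after_vc(start):
        for i in range(start + 1, n):
            if word[i] not in VOWELS and word[i - 1] in VOWELS:
                return i + 1
        return n

    return rv, after_vc(after_vc(0))


def strip_ending(rv, ending_class):
    """Remove the longest ending of the class from rv; None if none applies."""
    group1, group2 = ending_class
    found = [e for e in group1 + group2 if rv.endswith(e)]
    if not found:
        return None
    ending = max(found, key=len)
    rest = rv[:len(rv) - len(ending)]
    if ending in group1 and not rest.endswith(("а", "я")):
        return None
    return rest


def step1(rv):
    rest = strip_ending(rv, PERFECTIVE_GERUND)
    if rest is not None:
        return rest
    rest = strip_ending(rv, REFLEXIVE)
    if rest is not None:
        rv = rest
    rest = strip_ending(rv, ADJECTIVE)
    if rest is not None:  # ADJECTIVAL: optional PARTICIPLE before ADJECTIVE
        inner = strip_ending(rest, PARTICIPLE)
        return rest if inner is None else inner
    for ending_class in (VERB, NOUN):
        rest = strip_ending(rv, ending_class)
        if rest is not None:
            return rest
    return rv


def stem(word):
    word = word.replace("ё", "е")
    rv_start, r2_start = regions(word)
    head, rv = word[:rv_start], word[rv_start:]  # only RV is ever examined
    rv = step1(rv)
    if rv.endswith("и"):                          # step 2
        rv = rv[:-1]
    for ending in ("ость", "ост"):                # step 3: must lie in R2
        if rv.endswith(ending) and rv_start + len(rv) - len(ending) >= r2_start:
            rv = rv[:-len(ending)]
            break
    if rv.endswith("нн"):                         # step 4 (1)
        rv = rv[:-1]
    elif rv.endswith(("ейше", "ейш")):            # step 4 (2)
        rv = rv[:-4] if rv.endswith("ейше") else rv[:-3]
        if rv.endswith("нн"):
            rv = rv[:-1]
    elif rv.endswith("ь"):                        # step 4 (3)
        rv = rv[:-1]
    return head + rv


print(stem("важнейшими"), stem("валяются"), stem("падающего"))
```

Running it on entries from the sample vocabulary, for instance важностью, валов or пай, is a quick way to check a reading of the rules against the published stems.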

The same algorithm in Snowball


stringescapes {}

/* the 32 Cyrillic letters in the KOI8-R coding scheme, and represented
   in Latin characters following the conventions of the standard Library
   of Congress transliteration: */

stringdef a    hex 'C1'
stringdef b    hex 'C2'
stringdef v    hex 'D7'
stringdef g    hex 'C7'
stringdef d    hex 'C4'
stringdef e    hex 'C5'
stringdef zh   hex 'D6'
stringdef z    hex 'DA'
stringdef i    hex 'C9'
stringdef i`   hex 'CA'
stringdef k    hex 'CB'
stringdef l    hex 'CC'
stringdef m    hex 'CD'
stringdef n    hex 'CE'
stringdef o    hex 'CF'
stringdef p    hex 'D0'
stringdef r    hex 'D2'
stringdef s    hex 'D3'
stringdef t    hex 'D4'
stringdef u    hex 'D5'
stringdef f    hex 'C6'
stringdef kh   hex 'C8'
stringdef ts   hex 'C3'
stringdef ch   hex 'DE'
stringdef sh   hex 'DB'
stringdef shch hex 'DD'
stringdef "    hex 'DF'
stringdef y    hex 'D9'
stringdef '    hex 'D8'
stringdef e`   hex 'DC'
stringdef iu   hex 'C0'
stringdef ia   hex 'D1'

routines ( mark_regions R2
           perfective_gerund
           adjective
           adjectival
           reflexive
           verb
           noun
           derivational
           tidy_up
)

externals ( stem )

integers ( pV p2 )

groupings ( v )

define v '{a}{e}{i}{o}{u}{y}{e`}{iu}{ia}'

define mark_regions as (

    $pV = limit
    $p2 = limit
    do (
        gopast v  setmark pV  gopast non-v
        gopast v  gopast non-v  setmark p2
       )
)

backwardmode (

    define R2 as $p2 <= cursor

    define perfective_gerund as (
        [substring] among (
            '{v}'
            '{v}{sh}{i}'
            '{v}{sh}{i}{s}{'}'
                ('{a}' or '{ia}' delete)
            '{i}{v}'
            '{i}{v}{sh}{i}'
            '{i}{v}{sh}{i}{s}{'}'
            '{y}{v}'
            '{y}{v}{sh}{i}'
            '{y}{v}{sh}{i}{s}{'}'
                (delete)
        )
    )

    define adjective as (
        [substring] among (
            '{e}{e}' '{i}{e}' '{y}{e}' '{o}{e}' '{i}{m}{i}' '{y}{m}{i}'
            '{e}{i`}' '{i}{i`}' '{y}{i`}' '{o}{i`}' '{e}{m}' '{i}{m}'
            '{y}{m}' '{o}{m}' '{e}{g}{o}' '{o}{g}{o}' '{e}{m}{u}'
            '{o}{m}{u}' '{i}{kh}' '{y}{kh}' '{u}{iu}' '{iu}{iu}' '{a}{ia}'
            '{ia}{ia}'
                        // and -
            '{o}{iu}'   // - which is somewhat archaic
            '{e}{iu}'   // - soft form of {o}{iu}
                (delete)
        )
    )

    define adjectival as (
        adjective

        /* of the participle forms, em, vsh, ivsh, yvsh are readily removable.
           nn, {iu}shch, shch, u{iu}shch can be removed, with a small proportion of
           errors. Removing im, uem, enn creates too many errors.
        */

        try (
            [substring] among (
                '{e}{m}'                  // present passive participle
                '{n}{n}'                  // adjective from past passive participle
                '{v}{sh}'                 // past active participle
                '{iu}{shch}' '{shch}'     // present active participle
                    ('{a}' or '{ia}' delete)

     //but not  '{i}{m}' '{u}{e}{m}'      // present passive participle
     //or       '{e}{n}{n}'               // adjective from past passive participle

                '{i}{v}{sh}' '{y}{v}{sh}'// past active participle
                '{u}{iu}{shch}'          // present active participle
                    (delete)
            )
        )

    )

    define reflexive as (
        [substring] among (
            '{s}{ia}'
            '{s}{'}'
                (delete)
        )
    )

    define verb as (
        [substring] among (
            '{l}{a}' '{n}{a}' '{e}{t}{e}' '{i`}{t}{e}' '{l}{i}' '{i`}'
            '{l}' '{e}{m}' '{n}' '{l}{o}' '{n}{o}' '{e}{t}' '{iu}{t}'
            '{n}{y}' '{t}{'}' '{e}{sh}{'}'

            '{n}{n}{o}'
                ('{a}' or '{ia}' delete)

            '{i}{l}{a}' '{y}{l}{a}' '{e}{n}{a}' '{e}{i`}{t}{e}'
            '{u}{i`}{t}{e}' '{i}{t}{e}' '{i}{l}{i}' '{y}{l}{i}' '{e}{i`}'
            '{u}{i`}' '{i}{l}' '{y}{l}' '{i}{m}' '{y}{m}' '{e}{n}'
            '{i}{l}{o}' '{y}{l}{o}' '{e}{n}{o}' '{ia}{t}' '{u}{e}{t}'
            '{u}{iu}{t}' '{i}{t}' '{y}{t}' '{e}{n}{y}' '{i}{t}{'}'
            '{y}{t}{'}' '{i}{sh}{'}' '{u}{iu}' '{iu}'
                (delete)
            /* note the short passive participle tests:
               '{n}{a}' '{n}' '{n}{o}' '{n}{y}'
               '{e}{n}{a}' '{e}{n}' '{e}{n}{o}' '{e}{n}{y}'
            */
        )
    )

    define noun as (
        [substring] among (
            '{a}' '{e}{v}' '{o}{v}' '{i}{e}' '{'}{e}' '{e}'
            '{i}{ia}{m}{i}' '{ia}{m}{i}' '{a}{m}{i}' '{e}{i}' '{i}{i}'
            '{i}' '{i}{e}{i`}' '{e}{i`}' '{o}{i`}' '{i}{i`}' '{i`}'
            '{i}{ia}{m}' '{ia}{m}' '{i}{e}{m}' '{e}{m}' '{a}{m}' '{o}{m}'
            '{o}' '{u}' '{a}{kh}' '{i}{ia}{kh}' '{ia}{kh}' '{y}' '{'}'
            '{i}{iu}' '{'}{iu}' '{iu}' '{i}{ia}' '{'}{ia}' '{ia}'
                (delete)
            /* the small class of neuter forms '{e}{n}{i}' '{e}{n}{e}{m}'
               '{e}{n}{a}' '{e}{n}' '{e}{n}{a}{m}' '{e}{n}{a}{m}{i}' '{e}{n}{a}{kh}'
               omitted - they only occur on 12 words.
            */
        )
    )

    define derivational as (
        [substring] R2 among (
            '{o}{s}{t}'
            '{o}{s}{t}{'}'
                (delete)
        )
    )

    define tidy_up as (
        [substring] among (

            '{e}{i`}{sh}'
            '{e}{i`}{sh}{e}'  // superlative forms
               (delete
                ['{n}'] '{n}' delete
               )
            '{n}'
               ('{n}' delete) // e.g. -nno endings
            '{'}'
               (delete)  // with some slight false conflations
        )
    )
)

define stem as (

    do mark_regions
    backwards setlimit tomark pV for (
        do (
             perfective_gerund or
             ( try reflexive
               adjectival or verb or noun
             )
        )
        try([ '{i}' ] delete)
        // because noun ending -i{iu} is being treated as verb ending -{iu}

        do derivational
        do tidy_up
    )
)
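
The stringdef table at the top of the program can be cross-checked against Python's built-in KOI8-R codec. The following is a small sketch for that purpose only; the dictionary `KOI8R` and the helper `cyrillic` are names invented here, with the byte values copied from the stringdef lines.

```python
# Byte values copied from the stringdef table in the Snowball program above.
KOI8R = {
    "a": 0xC1, "b": 0xC2, "v": 0xD7, "g": 0xC7, "d": 0xC4, "e": 0xC5,
    "zh": 0xD6, "z": 0xDA, "i": 0xC9, "i`": 0xCA, "k": 0xCB, "l": 0xCC,
    "m": 0xCD, "n": 0xCE, "o": 0xCF, "p": 0xD0, "r": 0xD2, "s": 0xD3,
    "t": 0xD4, "u": 0xD5, "f": 0xC6, "kh": 0xC8, "ts": 0xC3, "ch": 0xDE,
    "sh": 0xDB, "shch": 0xDD, '"': 0xDF, "y": 0xD9, "'": 0xD8, "e`": 0xDC,
    "iu": 0xC0, "ia": 0xD1,
}


def cyrillic(*names):
    """Decode a sequence of stringdef names, e.g. the parts of '{v}{sh}{i}'."""
    return bytes(KOI8R[name] for name in names).decode("koi8_r")


print(cyrillic("v", "sh", "i", "s", "'"))  # вшись
```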