Russian stemming algorithm


 

Links to resources

Snowball main page
The stemmer in Snowball (KOI8-R encoding)
The stemmer in Snowball (Unicode encoding)
The ANSI C stemmer
— and its header
Sample Russian vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent
Vocabulary + stemmed equivalent (transliterated)
Tar-gzipped file of all of the above

Russian stop word list


The Snowball stemmer represents the Cyrillic alphabet with ASCII characters, following the standard Library of Congress transliteration scheme. The vocabulary + stemmed equivalent is also viewable in this transliterated form.


Here is a sample of Russian vocabulary, with the stemmed forms that will be generated with this algorithm.

word             stem           word             stem
в             => в              п             => п
вавиловка     => вавиловк       па            => па
вагнера       => вагнер         пава          => пав
вагон         => вагон          павел         => павел
вагона        => вагон          павильон      => павильон
вагоне        => вагон          павильонам    => павильон
вагонов       => вагон          павла         => павл
вагоном       => вагон          павлиний      => павлин
вагоны        => вагон          павлиньи      => павлин
важная        => важн           павлиньим     => павлин
важнее        => важн           павлович      => павлович
важнейшие     => важн           павловна      => павловн
важнейшими    => важн           павловне      => павловн
важничал      => важнича        павловной     => павловн
важно         => важн           павловну      => павловн
важного       => важн           павловны      => павловн
важное        => важн           павловцы      => павловц
важной        => важн           павлыч        => павлыч
важном        => важн           павлыча       => павлыч
важному       => важн           пагубная      => пагубн
важности      => важност        падает        => пада
важностию     => важност        падай         => пада
важность      => важност        падал         => пада
важностью     => важност        падала        => пада
важную        => важн           падаль        => падал
важны         => важн           падать        => пада
важные        => важн           падаю         => пада
важный        => важн           падают        => пада
важным        => важн           падающего     => пада
важных        => важн           падающие      => пада
вазах         => ваз            падеж         => падеж
вазы          => ваз            падение       => паден
вакса         => вакс           падением      => паден
вакханка      => вакханк        падении       => паден
вал           => вал            падений       => паден
валандался    => валанда        падения       => паден
валентина     => валентин       паденье       => паден
валериановых  => валерианов     паденья       => паден
валерию       => валер          падет         => падет
валетами      => валет          падут         => падут
вали          => вал            падучая       => падуч
валил         => вал            падчерицей    => падчериц
валился       => вал            падчерицы     => падчериц
валится       => вал            падшая        => падш
валов         => вал            падшей        => падш
вальдшнепа    => вальдшнеп      падшему       => падш
вальс         => вальс          падший        => падш
вальса        => вальс          падшим        => падш
вальсе        => вальс          падших        => падш
вальсишку     => вальсишк       падшую        => падш
вальтера      => вальтер        паек          => паек
валяется      => валя           пазухи        => пазух
валялась      => валя           пазуху        => пазух
валялись      => валя           пай           => па
валялось      => валя           пакет         => пакет
валялся       => валя           пакетом       => пакет
валять        => валя           пакеты        => пакет
валяются      => валя           пакостей      => пакост
вам           => вам            пакостно      => пакостн
вами          => вам            пал           => пал



 

The stemming algorithm

i-suffixes (*) of Russian tend to be quite regular, with irregularities of declension involving a change to the stem. Irregular forms therefore usually just generate two or more possible stems. Stems in Russian can be very short, and many of the suffixes are also particle words that make ‘natural stopwords’, so a tempting way of running the stemmer is to set a minimum stem length of zero, and thereby reduce to null all words which are made up entirely of suffix parts. We have been a little more cautious, and have insisted that a minimum stem contains one vowel.

The 32 letters of the Russian alphabet are as follows, with the transliterated forms that we will use here shown in brackets:
а (a) б (b) в (v) г (g) д (d) е (e) ж (zh) з (z)
и (i) й (ì) к (k) л (l) м (m) н (n) о (o) п (p)
р (r) с (s) т (t) у (u) ф (f) х (kh) ц (ts) ч (ch)
ш (sh) щ (shch) ъ (") ы (y) ь (') э (è) ю (iu) я (ia)
There is a 33rd letter, ё (ë), but it is rarely used, and we assume it is mapped into е (e).
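
The transliteration table above can be written down as a simple lookup. The following is a minimal Python sketch, not part of the Snowball distribution: the names TRANSLIT and transliterate are hypothetical, and lower-casing the input is an assumption made here for convenience. It also applies the mapping of ё (ë) to е (e) described above.

```python
# Transliteration used on this page (Library of Congress style).
# TRANSLIT and transliterate are illustrative names, not Snowball identifiers.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh", "з": "z",
    "и": "i", "й": "ì", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "shch", "ъ": '"', "ы": "y", "ь": "'", "э": "è", "ю": "iu", "я": "ia",
}


def transliterate(word):
    """Map a Russian word to the transliterated form used in this document."""
    word = word.lower().replace("ё", "е")  # the 33rd letter is folded into е
    # Characters outside the table (digits, punctuation) pass through unchanged.
    return "".join(TRANSLIT.get(ch, ch) for ch in word)


print(transliterate("важность"))
```

Because several letters map to more than one Latin character (for example щ to shch), the mapping is meant for display only and is not guaranteed to be reversible.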

The following are vowels:
а (a)   е (e)   и (i)   о (o)   у (u)   ы (y)   э (è)   ю (iu)   я (ia)
In any word, RV is the region after the first vowel, or the end of the word if it contains no vowel.

R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.

For example:
    p r o t i v o e s t e s t v e n n o m
         |<------       RV        ------>|
           |<-----       R1       ------>|
               |<-----     R2     ------>|
(See note on R1 and R2.)
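
As an illustration, the three region definitions can be computed directly from the prose. This is a hedged Python sketch working on Cyrillic text rather than the transliteration; the function names are hypothetical, and each region is represented by the index at which it starts.

```python
VOWELS = set("аеиоуыэюя")


def after_vowel_then_non_vowel(word, start=0):
    """Index just past the first non-vowel that follows a vowel, searching from start.

    Returns len(word) when there is no such non-vowel.
    """
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def regions(word):
    """Return the start indexes of RV, R1 and R2 for a lower-case word."""
    rv = next((i + 1 for i, ch in enumerate(word) if ch in VOWELS), len(word))
    r1 = after_vowel_then_non_vowel(word, 0)
    r2 = after_vowel_then_non_vowel(word, r1)  # same rule, applied inside R1
    return rv, r1, r2


word = "противоестественном"
rv, r1, r2 = regions(word)
print(word[rv:])  # тивоестественном
print(word[r1:])  # ивоестественном
print(word[r2:])  # оестественном
```

The three printed regions correspond to the RV, R1 and R2 lines of the diagram above.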

We now define the following classes of ending:

PERFECTIVE GERUND:
group 1:   в (v)   вши (vshi)   вшись (vshis')

group 2:   ив (iv)   ивши (ivshi)   ившись (ivshis')   ыв (yv)   ывши (yvshi)   ывшись (yvshis')
group 1 endings must follow а (a) or я (ia)

ADJECTIVE:
ее (ee)   ие (ie)   ые (ye)   ое (oe)   ими (imi)   ыми (ymi)   ей (eì)   ий (iì)   ый (yì)   ой (oì)   ем (em)   им (im)   ым (ym)   ом (om)   его (ego)   ого (ogo)   ему (emu)   ому (omu)   их (ikh)   ых (ykh)   ую (uiu)   юю (iuiu)   ая (aia)   яя (iaia)   ою (oiu)   ею (eiu)
PARTICIPLE:
group 1:   ем (em)   нн (nn)   вш (vsh)   ющ (iushch)   щ (shch)

group 2:   ивш (ivsh)   ывш (yvsh)   ующ (uiushch)
group 1 endings must follow а (a) or я (ia)

REFLEXIVE:
ся (sia)   сь (s')
VERB:
group 1: ла (la)   на (na)   ете (ete)   йте (ìte)   ли (li)   й (ì)   л (l)   ем (em)   н (n)   ло (lo)   но (no)   ет (et)   ют (iut)   ны (ny)   ть (t')   ешь (esh')   нно (nno)

group 2: ила (ila)   ыла (yla)   ена (ena)   ейте (eìte)   уйте (uìte)   ите (ite)   или (ili)   ыли (yli)   ей (eì)   уй (uì)   ил (il)   ыл (yl)   им (im)   ым (ym)   ен (en)   ило (ilo)   ыло (ylo)   ено (eno)   ят (iat)   ует (uet)   уют (uiut)   ит (it)   ыт (yt)   ены (eny)   ить (it')   ыть (yt')   ишь (ish')   ую (uiu)   ю (iu)
group 1 endings must follow а (a) or я (ia)

NOUN:
а (a)   ев (ev)   ов (ov)   ие (ie)   ье ('e)   е (e)   иями (iiami)   ями (iami)   ами (ami)   еи (ei)   ии (ii)   и (i)   ией (ieì)   ей (eì)   ой (oì)   ий (iì)   й (ì)   иям (iiam)   ям (iam)   ием (iem)   ем (em)   ам (am)   ом (om)   о (o)   у (u)   ах (akh)   иях (iiakh)   ях (iakh)   ы (y)   ь (')   ию (iiu)   ью ('iu)   ю (iu)   ия (iia)   ья ('ia)   я (ia)
SUPERLATIVE:
ейш (eìsh)   ейше (eìshe)
These are all i-suffixes. The list of d-suffixes is very short:

DERIVATIONAL:
ост (ost)   ость (ost')
Define an ADJECTIVAL ending as an ADJECTIVE ending optionally preceded by a PARTICIPLE ending.
For example, in
бегавшая = бега + вш + ая
(begavshaia = bega + vsh + aia)
ая (aia) is an adjective ending, and вш (vsh) a participle ending of group 1 (preceded by the final а (a) of бега (bega)), so вшая (vshaia) is an adjectival ending.
In searching for an ending in a class, always choose the longest one from the class.
So in searching for a NOUN ending for величие (velichie), choose ие (ie) rather than е (e).
Undouble н (n) means, if the word ends нн (nn), remove the last letter.
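
The three conventions just described (the ADJECTIVAL ending, longest-match selection, and undoubling н) can be sketched in Python as follows. The helper names are hypothetical, only the classes needed for the examples are listed, and for simplicity the sketch ignores the RV restriction that the stemming rules impose.

```python
ADJECTIVE = ("ее ие ые ое ими ыми ей ий ый ой ем им ым ом его ого ему ому "
             "их ых ую юю ая яя ою ею").split()
PARTICIPLE_1 = "ем нн вш ющ щ".split()   # group 1: must follow а or я
PARTICIPLE_2 = "ивш ывш ующ".split()     # group 2: no condition
NOUN = ("а ев ов ие ье е иями ями ами еи ии и ией ей ой ий й иям ям ием ем "
        "ам ом о у ах иях ях ы ь ию ью ю ия ья я").split()


def longest_ending(text, endings):
    """Return the longest ending from the class that text ends with, or ''."""
    return max((e for e in endings if text.endswith(e)), key=len, default="")


def adjectival_ending(text):
    """Return the ADJECTIVAL ending of text: [PARTICIPLE +] ADJECTIVE, or ''."""
    adj = longest_ending(text, ADJECTIVE)
    if not adj:
        return ""
    rest = text[:-len(adj)]
    part = longest_ending(rest, PARTICIPLE_1 + PARTICIPLE_2)
    # A group 1 participle ending only counts when preceded by а or я.
    if part in PARTICIPLE_1 and not rest[:-len(part)].endswith(("а", "я")):
        part = ""
    return part + adj


def undouble_n(text):
    """If text ends with нн, drop the last letter."""
    return text[:-1] if text.endswith("нн") else text


print(adjectival_ending("бегавшая"))      # вшая
print(longest_ending("величие", NOUN))    # ие
```

Both printed values match the worked examples in the text: вшая (vshaia) for бегавшая, and ие (ie) rather than е (e) for величие.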

Here now are the stemming rules.

All tests take place in the RV part of the word.
So in the test for perfective gerund, the а (a) or я (ia) which the group 1 endings must follow must itself be in RV. In other words the letters before the RV region are never examined in the stemming process.
Do each of steps 1, 2, 3 and 4.

Step 1: Search for a PERFECTIVE GERUND ending. If one is found remove it, and that is then the end of step 1. Otherwise try to remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a NOUN ending. As soon as one of the endings (1) to (3) is found remove it, and terminate step 1.

Step 2: If the word ends with и (i), remove it.

Step 3: Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.

Step 4: (1) Undouble н (n), or, (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends ь (') (soft sign) remove it.
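
Steps 1 to 4 can be put together as a compact Python sketch. This is an illustrative transcription of the rules above, not the Snowball or ANSI C implementation; all function and variable names are hypothetical. It operates on the RV part only and checks the R2 condition in step 3 by index.

```python
VOWELS = "аеиоуыэюя"
GERUND_1 = "в вши вшись".split()
GERUND_2 = "ив ивши ившись ыв ывши ывшись".split()
ADJECTIVE = ("ее ие ые ое ими ыми ей ий ый ой ем им ым ом его ого ему ому "
             "их ых ую юю ая яя ою ею").split()
PARTICIPLE_1 = "ем нн вш ющ щ".split()
PARTICIPLE_2 = "ивш ывш ующ".split()
REFLEXIVE = "ся сь".split()
VERB_1 = "ла на ете йте ли й л ем н ло но ет ют ны ть ешь нно".split()
VERB_2 = ("ила ыла ена ейте уйте ите или ыли ей уй ил ыл им ым ен ило ыло "
          "ено ят ует уют ит ыт ены ить ыть ишь ую ю").split()
NOUN = ("а ев ов ие ье е иями ями ами еи ии и ией ей ой ий й иям ям ием ем "
        "ам ом о у ах иях ях ы ь ию ью ю ия ья я").split()
SUPERLATIVE = "ейш ейше".split()
DERIVATIONAL = "ост ость".split()


def longest(text, endings):
    """Longest ending from the class that text ends with, or ''."""
    return max((e for e in endings if text.endswith(e)), key=len, default="")


def strip(text, group2, group1=()):
    """Remove the longest ending of a class; return None if none applies.

    Group 1 endings are only removed when they follow а or я.
    """
    e = longest(text, list(group1) + list(group2))
    if not e:
        return None
    rest = text[:-len(e)]
    if e in group1 and not rest.endswith(("а", "я")):
        return None
    return rest


def strip_adjectival(text):
    """Remove an ADJECTIVE ending and, if present, a preceding PARTICIPLE ending."""
    rest = strip(text, ADJECTIVE)
    if rest is None:
        return None
    shorter = strip(rest, PARTICIPLE_2, PARTICIPLE_1)
    return rest if shorter is None else shorter


def region_start(word, start):
    """Index just past the first non-vowel following a vowel, from start."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def stem(word):
    word = word.lower().replace("ё", "е")
    rv_at = next((i + 1 for i, c in enumerate(word) if c in VOWELS), len(word))
    r2_at = region_start(word, region_start(word, 0))
    head, rv = word[:rv_at], word[rv_at:]  # head is never examined

    # Step 1: perfective gerund, else optional reflexive then one of three classes.
    rest = strip(rv, GERUND_2, GERUND_1)
    if rest is None:
        no_reflexive = strip(rv, REFLEXIVE)
        rest = rv if no_reflexive is None else no_reflexive
        for attempt in (strip_adjectival(rest),
                        strip(rest, VERB_2, VERB_1),
                        strip(rest, NOUN)):
            if attempt is not None:
                rest = attempt
                break
    rv = rest

    # Step 2: remove a final и.
    if rv.endswith("и"):
        rv = rv[:-1]

    # Step 3: remove a DERIVATIONAL ending that lies entirely in R2.
    d = longest(rv, DERIVATIONAL)
    if d and rv_at + len(rv) - len(d) >= r2_at:
        rv = rv[:-len(d)]

    # Step 4: superlative plus undoubling, or undoubling, or final soft sign.
    sup = longest(rv, SUPERLATIVE)
    if sup:
        rv = rv[:-len(sup)]
    if rv.endswith("нн"):
        rv = rv[:-1]
    elif not sup and rv.endswith("ь"):
        rv = rv[:-1]
    return head + rv


# A few pairs taken from the sample vocabulary on this page.
for word, expected in [("важнейшими", "важн"), ("валандался", "валанда"),
                       ("падающего", "пада"), ("важностью", "важност")]:
    assert stem(word) == expected, (word, stem(word))
```

The sketch treats each class as a single longest-match search across both groups, so a group 1 match that fails the а/я test makes the whole class fail rather than falling back to a shorter ending. This mirrors the behaviour of the Snowball listing.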


 

The same algorithm in Snowball


stringescapes {}

/* the 32 Cyrillic letters in the KOI8-R coding scheme, and represented
   in Latin characters following the conventions of the standard Library
   of Congress transliteration: */

stringdef a    hex 'C1'
stringdef b    hex 'C2'
stringdef v    hex 'D7'
stringdef g    hex 'C7'
stringdef d    hex 'C4'
stringdef e    hex 'C5'
stringdef zh   hex 'D6'
stringdef z    hex 'DA'
stringdef i    hex 'C9'
stringdef i`   hex 'CA'
stringdef k    hex 'CB'
stringdef l    hex 'CC'
stringdef m    hex 'CD'
stringdef n    hex 'CE'
stringdef o    hex 'CF'
stringdef p    hex 'D0'
stringdef r    hex 'D2'
stringdef s    hex 'D3'
stringdef t    hex 'D4'
stringdef u    hex 'D5'
stringdef f    hex 'C6'
stringdef kh   hex 'C8'
stringdef ts   hex 'C3'
stringdef ch   hex 'DE'
stringdef sh   hex 'DB'
stringdef shch hex 'DD'
stringdef "    hex 'DF'
stringdef y    hex 'D9'
stringdef '    hex 'D8'
stringdef e`   hex 'DC'
stringdef iu   hex 'C0'
stringdef ia   hex 'D1'

routines ( mark_regions R2
           perfective_gerund
           adjective
           adjectival
           reflexive
           verb
           noun
           derivational
           tidy_up
)

externals ( stem )

integers ( pV p2 )

groupings ( v )

define v '{a}{e}{i}{o}{u}{y}{e`}{iu}{ia}'

define mark_regions as (

    $pV = limit
    $p2 = limit
    do (
        gopast v  setmark pV  gopast non-v
        gopast v  gopast non-v  setmark p2
    )
)

backwardmode (

    define R2 as $p2 <= cursor

    define perfective_gerund as (
        [substring] among (
            '{v}'
            '{v}{sh}{i}'
            '{v}{sh}{i}{s}{'}'
                ('{a}' or '{ia}' delete)
            '{i}{v}'
            '{i}{v}{sh}{i}'
            '{i}{v}{sh}{i}{s}{'}'
            '{y}{v}'
            '{y}{v}{sh}{i}'
            '{y}{v}{sh}{i}{s}{'}'
                (delete)
        )
    )

    define adjective as (
        [substring] among (
            '{e}{e}' '{i}{e}' '{y}{e}' '{o}{e}' '{i}{m}{i}' '{y}{m}{i}'
            '{e}{i`}' '{i}{i`}' '{y}{i`}' '{o}{i`}' '{e}{m}' '{i}{m}'
            '{y}{m}' '{o}{m}' '{e}{g}{o}' '{o}{g}{o}' '{e}{m}{u}'
            '{o}{m}{u}' '{i}{kh}' '{y}{kh}' '{u}{iu}' '{iu}{iu}' '{a}{ia}'
            '{ia}{ia}'
                        // and -
            '{o}{iu}'   // - which is somewhat archaic
            '{e}{iu}'   // - soft form of {o}{iu}
                (delete)
        )
    )

    define adjectival as (
        adjective

        /* of the participle forms, em, vsh, ivsh, yvsh are readily removable.
           nn, {iu}shch, shch, u{iu}shch can be removed, with a small proportion of
           errors. Removing im, uem, enn creates too many errors.
        */

        try (
            [substring] among (
                '{e}{m}'                  // present passive participle
                '{n}{n}'                  // adjective from past passive participle
                '{v}{sh}'                 // past active participle
                '{iu}{shch}' '{shch}'     // present active participle
                    ('{a}' or '{ia}' delete)

     //but not  '{i}{m}' '{u}{e}{m}'      // present passive participle
     //or       '{e}{n}{n}'               // adjective from past passive participle

                '{i}{v}{sh}' '{y}{v}{sh}' // past active participle
                '{u}{iu}{shch}'           // present active participle
                    (delete)
            )
        )
    )

    define reflexive as (
        [substring] among (
            '{s}{ia}'
            '{s}{'}'
                (delete)
        )
    )

    define verb as (
        [substring] among (
            '{l}{a}' '{n}{a}' '{e}{t}{e}' '{i`}{t}{e}' '{l}{i}' '{i`}'
            '{l}' '{e}{m}' '{n}' '{l}{o}' '{n}{o}' '{e}{t}' '{iu}{t}'
            '{n}{y}' '{t}{'}' '{e}{sh}{'}'

            '{n}{n}{o}'
                ('{a}' or '{ia}' delete)

            '{i}{l}{a}' '{y}{l}{a}' '{e}{n}{a}' '{e}{i`}{t}{e}'
            '{u}{i`}{t}{e}' '{i}{t}{e}' '{i}{l}{i}' '{y}{l}{i}' '{e}{i`}'
            '{u}{i`}' '{i}{l}' '{y}{l}' '{i}{m}' '{y}{m}' '{e}{n}'
            '{i}{l}{o}' '{y}{l}{o}' '{e}{n}{o}' '{ia}{t}' '{u}{e}{t}'
            '{u}{iu}{t}' '{i}{t}' '{y}{t}' '{e}{n}{y}' '{i}{t}{'}'
            '{y}{t}{'}' '{i}{sh}{'}' '{u}{iu}' '{iu}'
                (delete)
            /* note the short passive participle tests:
               '{n}{a}' '{n}' '{n}{o}' '{n}{y}'
               '{e}{n}{a}' '{e}{n}' '{e}{n}{o}' '{e}{n}{y}'
            */
        )
    )

    define noun as (
        [substring] among (
            '{a}' '{e}{v}' '{o}{v}' '{i}{e}' '{'}{e}' '{e}'
            '{i}{ia}{m}{i}' '{ia}{m}{i}' '{a}{m}{i}' '{e}{i}' '{i}{i}'
            '{i}' '{i}{e}{i`}' '{e}{i`}' '{o}{i`}' '{i}{i`}' '{i`}'
            '{i}{ia}{m}' '{ia}{m}' '{i}{e}{m}' '{e}{m}' '{a}{m}' '{o}{m}'
            '{o}' '{u}' '{a}{kh}' '{i}{ia}{kh}' '{ia}{kh}' '{y}' '{'}'
            '{i}{iu}' '{'}{iu}' '{iu}' '{i}{ia}' '{'}{ia}' '{ia}'
                (delete)
            /* the small class of neuter forms '{e}{n}{i}' '{e}{n}{e}{m}'
               '{e}{n}{a}' '{e}{n}' '{e}{n}{a}{m}' '{e}{n}{a}{m}{i}' '{e}{n}{a}{x}'
               omitted - they only occur on 12 words.
            */
        )
    )

    define derivational as (
        [substring] R2 among (
            '{o}{s}{t}'
            '{o}{s}{t}{'}'
                (delete)
        )
    )

    define tidy_up as (
        [substring] among (

            '{e}{i`}{sh}'
            '{e}{i`}{sh}{e}'       // superlative forms
                (delete
                 ['{n}'] '{n}' delete
                )
            '{n}'
                ('{n}' delete)     // e.g. -nno endings
            '{'}'
                (delete)           // with some slight false conflations
        )
    )
)

define stem as (

    do mark_regions
    backwards setlimit tomark pV for (
        do (
            perfective_gerund or
            ( try reflexive
              adjectival or verb or noun
            )
        )
        try([ '{i}' ] delete)
        // because noun ending -i{iu} is being treated as verb ending -{iu}

        do derivational
        do tidy_up
    )
)