The stemming algorithm
i-suffixes (*) of Russian tend to be quite regular, with irregularities of
declension involving a change to the stem. Irregular forms therefore
usually just generate two or more possible stems. Stems in Russian can
be very short, and many of the suffixes are also particle words that make
‘natural stopwords’, so a tempting way of running the stemmer is to set a
minimum stem length of zero, and thereby reduce to null all words which
are made up entirely of suffix parts. We have been a little more cautious,
and have insisted that a minimum stem contains one vowel.
The 32 letters of the Russian alphabet are as follows, with the
transliterated forms that we will use here shown in brackets:
а (a)   б (b)   в (v)   г (g)   д (d)   е (e)   ж (zh)   з (z)
и (i)   й (ì)   к (k)   л (l)   м (m)   н (n)   о (o)   п (p)
р (r)   с (s)   т (t)   у (u)   ф (f)   х (kh)   ц (ts)   ч (ch)
ш (sh)   щ (shch)   ъ (")   ы (y)   ь (')   э (è)   ю (iu)   я (ia)
There is a 33rd letter, ё (ë), but it is rarely used, and we assume it is mapped into е (e).
The following are vowels:
а (a)   е (e)   и (i)   о (o)   у (u)   ы (y)   э (è)   ю (iu)   я (ia)
In any word, RV is the region after the first vowel, or the end of the word
if it contains no vowel.
R1 is the region after the first non-vowel following a vowel, or the end of
the word if there is no such non-vowel.
R2 is the region after the first non-vowel following a vowel in R1, or the
end of the word if there is no such non-vowel.
For example:
p r o t i v o e s t e s t v e n n o m
      |<----------- RV ------------>|
        |<---------- R1 ----------->|
            |<-------- R2 --------->|
(See note on R1 and R2.)
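The region definitions can be sketched in a few lines of Python. This is only an illustration of the three definitions above, working on Cyrillic text rather than the transliteration; the names `go_past` and `mark_regions` are made up for this sketch and are not part of any Snowball API.

```python
VOWELS = set("аеиоуыэюя")  # the nine vowels listed above


def go_past(word, start, want_vowel):
    """Return the offset just after the first vowel (or non-vowel) at or
    after `start`, or len(word) if there is none."""
    for i in range(start, len(word)):
        if (word[i] in VOWELS) == want_vowel:
            return i + 1
    return len(word)


def mark_regions(word):
    """Return the start offsets (rv, r1, r2) of the regions RV, R1 and R2."""
    rv = go_past(word, 0, True)          # after the first vowel
    r1 = go_past(word, rv, False)        # after the next non-vowel
    r2 = go_past(word, go_past(word, r1, True), False)  # same again, inside R1
    return rv, r1, r2


word = "противоестественном"
rv, r1, r2 = mark_regions(word)
print(word[rv:], word[r1:], word[r2:])
```

For противоестественном (protivoestestvennom) the offsets are 3, 4 and 6, so RV, R1 and R2 begin at т (t), и (i) and о (o) respectively, as in the diagram.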
We now define the following classes of ending:
PERFECTIVE GERUND:
group 1:   в (v)   вши (vshi)   вшись (vshis')
group 2:   ив (iv)   ивши (ivshi)   ившись (ivshis')
           ыв (yv)   ывши (yvshi)   ывшись (yvshis')
group 1 endings must follow а (a) or я (ia)
ADJECTIVE:
ее (ee)   ие (ie)   ые (ye)   ое (oe)   ими (imi)   ыми (ymi)
ей (eì)   ий (iì)   ый (yì)   ой (oì)   ем (em)   им (im)   ым (ym)   ом (om)
его (ego)   ого (ogo)   ему (emu)   ому (omu)   их (ikh)   ых (ykh)
ую (uiu)   юю (iuiu)   ая (aia)   яя (iaia)   ою (oiu)   ею (eiu)
PARTICIPLE:
group 1: ем (em) нн (nn) вш (vsh) ющ (iushch) щ (shch)
group 2: ивш (ivsh) ывш (yvsh) ующ (uiushch)
group 1 endings must follow а (a) or я (ia)
REFLEXIVE:
ся (sia) сь (s')
VERB:
group 1:   ла (la)   на (na)   ете (ete)   йте (ìte)   ли (li)   й (ì)   л (l)   ем (em)   н (n)
           ло (lo)   но (no)   ет (et)   ют (iut)   ны (ny)   ть (t')   ешь (esh')   нно (nno)
group 2:   ила (ila)   ыла (yla)   ена (ena)   ейте (eìte)   уйте (uìte)   ите (ite)   или (ili)
           ыли (yli)   ей (eì)   уй (uì)   ил (il)   ыл (yl)   им (im)   ым (ym)   ен (en)
           ило (ilo)   ыло (ylo)   ено (eno)   ят (iat)   ует (uet)   уют (uiut)   ит (it)
           ыт (yt)   ены (eny)   ить (it')   ыть (yt')   ишь (ish')   ую (uiu)   ю (iu)
group 1 endings must follow а (a) or я (ia)
NOUN:
а (a)   ев (ev)   ов (ov)   ие (ie)   ье ('e)   е (e)   иями (iiami)   ями (iami)   ами (ami)
еи (ei)   ии (ii)   и (i)   ией (ieì)   ей (eì)   ой (oì)   ий (iì)   й (ì)
иям (iiam)   ям (iam)   ием (iem)   ем (em)   ам (am)   ом (om)   о (o)   у (u)
ах (akh)   иях (iiakh)   ях (iakh)   ы (y)   ь (')   ию (iiu)   ью ('iu)   ю (iu)
ия (iia)   ья ('ia)   я (ia)
SUPERLATIVE:
ейш (eìsh) ейше (eìshe)
These are all i-suffixes. The list of d-suffixes is very short:
DERIVATIONAL:
ост (ost) ость (ost')
Define an ADJECTIVAL ending as an ADJECTIVE ending optionally preceded
by a PARTICIPLE ending.
For example, in
бегавшая = бега + вш + ая
(begavshaia = bega + vsh + aia)
ая (aia) is an adjective ending, and вш (vsh) a participle ending of group 1
(preceded by the final а (a) of бега (bega)), so вшая (vshaia) is an
adjectival ending.
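As a rough illustration, the removal of an ADJECTIVAL ending might be sketched as follows. The function takes the RV part of a word as input and picks the longest match in each class, both of which are rules this section states; the helper names `longest` and `remove_adjectival` are invented for the sketch.

```python
ADJECTIVE = ("ее ие ые ое ими ыми ей ий ый ой ем им ым ом его ого ему ому "
             "их ых ую юю ая яя ою ею").split()
PARTICIPLE_1 = "ем нн вш ющ щ".split()   # group 1: must follow а or я
PARTICIPLE_2 = "ивш ывш ующ".split()     # group 2: no condition


def longest(text, endings):
    """Longest ending from `endings` that `text` ends with, or ''."""
    return max((e for e in endings if text.endswith(e)), key=len, default="")


def remove_adjectival(rv):
    """Strip an ADJECTIVAL ending from rv; return None if there is none."""
    adj = longest(rv, ADJECTIVE)
    if not adj:
        return None
    rv = rv[:-len(adj)]
    # The participle ending in front of the adjective ending is optional.
    part = longest(rv, PARTICIPLE_1 + PARTICIPLE_2)
    if part:
        rest = rv[:-len(part)]
        if part in PARTICIPLE_2 or rest.endswith(("а", "я")):
            rv = rest
    return rv


# RV of бегавшая is гавшая; removing вшая leaves га, i.e. the stem бега.
print(remove_adjectival("гавшая"))
```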
In searching for an ending in a class, always choose the longest one
from the class.
So in searching for a NOUN ending for величие (velichie), choose ие (ie) rather than
е (e).
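A minimal sketch of the longest-match rule, using the NOUN class as data; `longest_ending` is a hypothetical helper name used only here.

```python
NOUN = ("а ев ов ие ье е иями ями ами еи ии и ией ей ой ий й иям ям ием ем "
        "ам ом о у ах иях ях ы ь ию ью ю ия ья я").split()


def longest_ending(word, endings):
    """Return the longest ending in `endings` that `word` ends with, or None."""
    matches = [e for e in endings if word.endswith(e)]
    return max(matches, key=len) if matches else None


# Both ие and е match величие; the longer one wins.
print(longest_ending("величие", NOUN))
```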
Undouble н (n) means: if the word ends with нн (nn), remove the last letter.
Here now are the stemming rules.
All tests take place in the RV part of the word.
So in the test for perfective gerund, the а (a) or я (ia) which the group 1
endings must follow must itself be in RV. In other words the letters
before the RV region are never examined in the stemming process.
Do each of steps 1, 2, 3 and 4.
Step 1:
Search for a PERFECTIVE GERUND ending. If one is found remove it, and that
is then the end of step 1. Otherwise try to remove a REFLEXIVE ending,
and then search in turn for (1) an ADJECTIVAL, (2) a VERB or (3) a
NOUN ending. As soon as one of the endings (1) to (3) is found remove it,
and terminate step 1.
Step 2: If the word ends with и (i), remove it.
Step 3: Search for a DERIVATIONAL ending in R2 (i.e. the entire ending
must lie in R2), and if one is found, remove it.
Step 4: (1) Undouble н (n), or, (2) if the word ends with a SUPERLATIVE ending,
remove it and undouble н (n), or (3) if the word ends with ь (') (soft sign), remove it.
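The four steps can be put together in a compact Python sketch. It is an illustrative reading of the rules above, not a replacement for the Snowball definition: all function names are invented here, input is assumed to be lower-case Cyrillic, and a group 1 ending is only accepted when the longest match in its class is preceded by а (a) or я (ia) inside RV.

```python
VOWELS = set("аеиоуыэюя")
# Classes with two groups are stored as (group 1, group 2).
PERFECTIVE_GERUND = ("в вши вшись".split(),
                     "ив ивши ившись ыв ывши ывшись".split())
ADJECTIVE = ("ее ие ые ое ими ыми ей ий ый ой ем им ым ом его ого ему ому "
             "их ых ую юю ая яя ою ею").split()
PARTICIPLE = ("ем нн вш ющ щ".split(), "ивш ывш ующ".split())
REFLEXIVE = "ся сь".split()
VERB = ("ла на ете йте ли й л ем н ло но ет ют ны ть ешь нно".split(),
        ("ила ыла ена ейте уйте ите или ыли ей уй ил ыл им ым ен ило ыло "
         "ено ят ует уют ит ыт ены ить ыть ишь ую ю").split())
NOUN = ("а ев ов ие ье е иями ями ами еи ии и ией ей ой ий й иям ям ием ем "
        "ам ом о у ах иях ях ы ь ию ью ю ия ья я").split()
SUPERLATIVE = "ейш ейше".split()
DERIVATIONAL = "ост ость".split()


def go_past(word, start, want_vowel):
    """Offset just after the first (non-)vowel at or after start, else len(word)."""
    for i in range(start, len(word)):
        if (word[i] in VOWELS) == want_vowel:
            return i + 1
    return len(word)


def strip(rv, endings, group1=()):
    """Remove the longest matching ending from rv. Returns (text, removed)."""
    match = max((e for e in endings if rv.endswith(e)), key=len, default=None)
    if match is None:
        return rv, False
    rest = rv[:-len(match)]
    if match in group1 and not rest.endswith(("а", "я")):
        return rv, False
    return rest, True


def grouped(rv, cls):
    return strip(rv, cls[0] + cls[1], cls[0])


def step1(rv):
    out, found = grouped(rv, PERFECTIVE_GERUND)
    if found:
        return out
    rv, _ = strip(rv, REFLEXIVE)
    out, found = strip(rv, ADJECTIVE)
    if found:                       # ADJECTIVAL: optional participle in front
        return grouped(out, PARTICIPLE)[0]
    out, found = grouped(rv, VERB)
    if found:
        return out
    return strip(rv, NOUN)[0]


def stem(word):
    word = word.replace("ё", "е")
    rv_start = go_past(word, 0, True)
    r1_start = go_past(word, rv_start, False)
    r2_start = go_past(word, go_past(word, r1_start, True), False)
    head, rv = word[:rv_start], word[rv_start:]   # head is never examined
    rv = step1(rv)
    if rv.endswith("и"):                          # step 2
        rv = rv[:-1]
    rest, found = strip(rv, DERIVATIONAL)         # step 3: ending must lie in R2
    if found and len(head) + len(rest) >= r2_start:
        rv = rest
    rest, found = strip(rv, SUPERLATIVE)          # step 4
    if found:
        rv = rest[:-1] if rest.endswith("нн") else rest
    elif rv.endswith("нн") or rv.endswith("ь"):
        rv = rv[:-1]
    return head + rv


for w in ["бегавшая", "величие", "возможность", "длинный"]:
    print(w, "->", stem(w))
```

Working through the rules by hand gives бега, велич, возможн and длин for these four words. In the third, ость survives step 1 as ост and is removed in step 3 because it lies inside R2; in скорость the same ending starts before R2, so only the soft sign is removed.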
The same algorithm in Snowball
stringescapes {}
/* the 32 Cyrillic letters in the KOI8-R coding scheme, and represented
in Latin characters following the conventions of the standard Library
of Congress transliteration: */
stringdef a hex 'C1'
stringdef b hex 'C2'
stringdef v hex 'D7'
stringdef g hex 'C7'
stringdef d hex 'C4'
stringdef e hex 'C5'
stringdef zh hex 'D6'
stringdef z hex 'DA'
stringdef i hex 'C9'
stringdef i` hex 'CA'
stringdef k hex 'CB'
stringdef l hex 'CC'
stringdef m hex 'CD'
stringdef n hex 'CE'
stringdef o hex 'CF'
stringdef p hex 'D0'
stringdef r hex 'D2'
stringdef s hex 'D3'
stringdef t hex 'D4'
stringdef u hex 'D5'
stringdef f hex 'C6'
stringdef kh hex 'C8'
stringdef ts hex 'C3'
stringdef ch hex 'DE'
stringdef sh hex 'DB'
stringdef shch hex 'DD'
stringdef " hex 'DF'
stringdef y hex 'D9'
stringdef ' hex 'D8'
stringdef e` hex 'DC'
stringdef iu hex 'C0'
stringdef ia hex 'D1'
routines ( mark_regions R2
           perfective_gerund
           adjective
           adjectival
           reflexive
           verb
           noun
           derivational
           tidy_up
)
externals ( stem )
integers ( pV p2 )
groupings ( v )
define v '{a}{e}{i}{o}{u}{y}{e`}{iu}{ia}'
define mark_regions as (
    $pV = limit
    $p2 = limit
    do (
        gopast v  setmark pV  gopast non-v
        gopast v  gopast non-v  setmark p2
    )
)
backwardmode (
    define R2 as $p2 <= cursor
    define perfective_gerund as (
        [substring] among (
            '{v}'
            '{v}{sh}{i}'
            '{v}{sh}{i}{s}{'}'
                ('{a}' or '{ia}' delete)
            '{i}{v}'
            '{i}{v}{sh}{i}'
            '{i}{v}{sh}{i}{s}{'}'
            '{y}{v}'
            '{y}{v}{sh}{i}'
            '{y}{v}{sh}{i}{s}{'}'
                (delete)
        )
    )
    define adjective as (
        [substring] among (
            '{e}{e}' '{i}{e}' '{y}{e}' '{o}{e}' '{i}{m}{i}' '{y}{m}{i}'
            '{e}{i`}' '{i}{i`}' '{y}{i`}' '{o}{i`}' '{e}{m}' '{i}{m}'
            '{y}{m}' '{o}{m}' '{e}{g}{o}' '{o}{g}{o}' '{e}{m}{u}'
            '{o}{m}{u}' '{i}{kh}' '{y}{kh}' '{u}{iu}' '{iu}{iu}' '{a}{ia}'
            '{ia}{ia}'
                        // and -
            '{o}{iu}'   // - which is somewhat archaic
            '{e}{iu}'   // - soft form of {o}{iu}
                (delete)
        )
    )
    define adjectival as (
        adjective
        /* of the participle forms, em, vsh, ivsh, yvsh are readily removable.
           nn, {iu}shch, shch, u{iu}shch can be removed, with a small proportion of
           errors. Removing im, uem, enn creates too many errors.
        */
        try (
            [substring] among (
                '{e}{m}'                  // present passive participle
                '{n}{n}'                  // adjective from past passive participle
                '{v}{sh}'                 // past active participle
                '{iu}{shch}' '{shch}'     // present active participle
                    ('{a}' or '{ia}' delete)
                //but not '{i}{m}' '{u}{e}{m}'  // present passive participle
                //or '{e}{n}{n}'                // adjective from past passive participle
                '{i}{v}{sh}' '{y}{v}{sh}' // past active participle
                '{u}{iu}{shch}'           // present active participle
                    (delete)
            )
        )
    )
    define reflexive as (
        [substring] among (
            '{s}{ia}'
            '{s}{'}'
                (delete)
        )
    )
    define verb as (
        [substring] among (
            '{l}{a}' '{n}{a}' '{e}{t}{e}' '{i`}{t}{e}' '{l}{i}' '{i`}'
            '{l}' '{e}{m}' '{n}' '{l}{o}' '{n}{o}' '{e}{t}' '{iu}{t}'
            '{n}{y}' '{t}{'}' '{e}{sh}{'}'
            '{n}{n}{o}'
                ('{a}' or '{ia}' delete)
            '{i}{l}{a}' '{y}{l}{a}' '{e}{n}{a}' '{e}{i`}{t}{e}'
            '{u}{i`}{t}{e}' '{i}{t}{e}' '{i}{l}{i}' '{y}{l}{i}' '{e}{i`}'
            '{u}{i`}' '{i}{l}' '{y}{l}' '{i}{m}' '{y}{m}' '{e}{n}'
            '{i}{l}{o}' '{y}{l}{o}' '{e}{n}{o}' '{ia}{t}' '{u}{e}{t}'
            '{u}{iu}{t}' '{i}{t}' '{y}{t}' '{e}{n}{y}' '{i}{t}{'}'
            '{y}{t}{'}' '{i}{sh}{'}' '{u}{iu}' '{iu}'
                (delete)
            /* note the short passive participle tests:
               '{n}{a}' '{n}' '{n}{o}' '{n}{y}'
               '{e}{n}{a}' '{e}{n}' '{e}{n}{o}' '{e}{n}{y}'
            */
        )
    )
    define noun as (
        [substring] among (
            '{a}' '{e}{v}' '{o}{v}' '{i}{e}' '{'}{e}' '{e}'
            '{i}{ia}{m}{i}' '{ia}{m}{i}' '{a}{m}{i}' '{e}{i}' '{i}{i}'
            '{i}' '{i}{e}{i`}' '{e}{i`}' '{o}{i`}' '{i}{i`}' '{i`}'
            '{i}{ia}{m}' '{ia}{m}' '{i}{e}{m}' '{e}{m}' '{a}{m}' '{o}{m}'
            '{o}' '{u}' '{a}{kh}' '{i}{ia}{kh}' '{ia}{kh}' '{y}' '{'}'
            '{i}{iu}' '{'}{iu}' '{iu}' '{i}{ia}' '{'}{ia}' '{ia}'
                (delete)
            /* the small class of neuter forms '{e}{n}{i}' '{e}{n}{e}{m}'
               '{e}{n}{a}' '{e}{n}' '{e}{n}{a}{m}' '{e}{n}{a}{m}{i}' '{e}{n}{a}{kh}'
               omitted - they only occur on 12 words.
            */
        )
    )
    define derivational as (
        [substring] R2 among (
            '{o}{s}{t}'
            '{o}{s}{t}{'}'
                (delete)
        )
    )
    define tidy_up as (
        [substring] among (
            '{e}{i`}{sh}'
            '{e}{i`}{sh}{e}'   // superlative forms
                (delete
                 ['{n}'] '{n}' delete
                )
            '{n}'
                ('{n}' delete) // e.g. -nno endings
            '{'}'
                (delete)       // with some slight false conflations
        )
    )
)
define stem as (
    do mark_regions
    backwards setlimit tomark pV for (
        do (
            perfective_gerund or
            ( try reflexive
              adjectival or verb or noun
            )
        )
        try([ '{i}' ] delete)
        // because noun ending -i{iu} is being treated as verb ending -{iu}
        do derivational
        do tidy_up
    )
)