Swedish stemming algorithm


 



Here is a sample of Swedish vocabulary, with the stemmed forms that will be generated with this algorithm.

word             stem          word           stem
jakt             jakt          klo            klo
jaktbössa        jaktböss      kloaken        kloak
jakten           jakt          klock          klock
jakthund         jakthund      klocka         klock
jaktkarl         jaktkarl      klockan        klockan
jaktkarlar       jaktkarl      klockans       klockan
jaktkarlarne     jaktkarl      klockare       klock
jaktkarlens      jaktkarl      klockaren      klock
jaktlöjtnant     jaktlöjtnant  klockarens     klock
jaktlöjtnanten   jaktlöjtnant  klockarfar     klockarf
jaktlöjtnantens  jaktlöjtnant  klockarn       klockarn
jalusi           jalusi        klockarsonen   klockarson
jalusien         jalusi        klockas        klock
jalusier         jalusi        klockkedjan    klockkedjan
jalusierna       jalusi        klocklikt      klocklik
jamaika          jamaik        klockor        klock
jamat            jam           klockorna      klock
jamrande         jamr          klockornas     klock
jamt             jamt          klockors       klockor
jande            jand          klockringning  klockringning
januari          januari       kloekornas     kloek
japanska         japansk       klok           klok
jaquette         jaquet        kloka          klok
jaquettekappa    jaquettekapp  klokare        klok
jargong          jargong       klokast        klok
jasmin           jasmin        klokaste       klok
jasminen         jasmin        kloke          klok
jasminer         jasmin        klokhet        klok
jasminhäck       jasminhäck    klokheten      klok
jaspis           jaspis        klokt          klokt
jaså             jaså          kloliknande    klolikn
javäl            javäl         klor           klor
jazzvindens      jazzvind      klorna         klorn
jcrn             jcrn          kloroform      kloroform
jcsus            jcsus         kloster        klost
je               je            klostergården  klostergård
jemföra          jemför        klosterlik     klosterlik
jemföras         jemför        klot           klot
jemförelse       jemför        klotb          klotb
jemförelser      jemför        klotrund       klotrund



 

The stemming algorithm

The Swedish alphabet includes the following additional letters,
ä   å   ö
The following letters are vowels:
a   e   i   o   u   y   ä   å   ö
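R2 is not used. R1 is defined in the same way as in the German stemmer: it is the region after the first non-vowel following a vowel (or the null region at the end of the word if there is no such non-vowel), adjusted so that the region before it contains at least 3 letters. (See the note on R1 and R2.)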
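
As an illustration, here is a minimal Python sketch of locating R1, following the mark_regions routine in the Snowball source on this page. The name r1_start is hypothetical (it is not part of any Snowball API), and the input is assumed to be a lowercase word with precomposed ä, å and ö.

```python
VOWELS = set("aeiouyäåö")


def r1_start(word: str) -> int:
    """Return the index where R1 begins (len(word) means R1 is empty).

    Hypothetical helper, modelled on mark_regions in the Snowball source.
    """
    p1 = len(word)  # default: null region at the end of the word
    for i in range(1, len(word)):
        # first non-vowel that follows a vowel; R1 starts just after it
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            p1 = i + 1
            break
    # adjustment: at least 3 letters must precede R1
    # (words shorter than 3 letters simply get an empty R1)
    return max(p1, min(3, len(word)))


if __name__ == "__main__":
    for w in ("klockan", "jamat", "alla", "je"):
        start = r1_start(w)
        print(w, start, repr(w[start:]))
```

For klockan the first vowel is o and the next non-vowel is c, so R1 is "kan". For alla the unadjusted start would be 2, and the adjustment moves it to 3, leaving R1 as "a".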

Define a valid s-ending as one of
b   c   d   f   g   h   j   k   l   m   n   o   p   r   t   v   y
Do each of steps 1, 2 and 3.

Step 1:
Search for the longest among the following suffixes in R1, and perform the action indicated.

(a) a   arna   erna   heterna   orna   ad   e   ade   ande   arne   are   aste   en   anden   aren   heten   ern   ar   er   heter   or   as   arnas   ernas   ornas   es   ades   andes   ens   arens   hetens   erns   at   andet   het   ast
delete

(b) s
delete if preceded by a valid s-ending

(The letter forming the valid s-ending need not itself be in R1.)
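
Step 1 can be sketched in Python as follows. This is an illustrative reading, not the reference implementation: the function name step1 and the parameter p1 (the index where R1 starts, assumed to be computed beforehand) are hypothetical. The longest-match search covers the (a) suffixes and s together, as the among(...) list in the Snowball source does.

```python
STEP1_SUFFIXES = (
    "a arna erna heterna orna ad e ade ande arne are aste en anden aren "
    "heten ern ar er heter or as arnas ernas ornas es ades andes ens arens "
    "hetens erns at andet het ast"
).split()
S_ENDINGS = set("bcdfghjklmnoprtvy")


def step1(word: str, p1: int) -> str:
    """Apply Step 1 to a lowercase word whose R1 starts at index p1."""
    r1 = word[p1:]
    # the suffix itself must lie entirely inside R1
    candidates = [s for s in STEP1_SUFFIXES + ["s"] if r1.endswith(s)]
    if not candidates:
        return word
    suffix = max(candidates, key=len)
    if suffix == "s":
        # (b): the preceding letter may lie outside R1
        if len(word) > 1 and word[-2] in S_ENDINGS:
            return word[:-1]
        return word
    # (a): delete the suffix
    return word[: -len(suffix)]


if __name__ == "__main__":
    print(step1("klockarens", 4))  # 'arens' is the longest match in R1
    print(step1("klockans", 4))    # 's' preceded by 'n', a valid s-ending
    print(step1("jaspis", 3))      # 's' preceded by 'i', so nothing is removed
```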
Step 2:
Search for one of the following suffixes in R1, and if found delete the last letter.

dd   gd   nn   dt   gt   kt   tt

(For example, friskt -> frisk, fröknarnn -> fröknarn)
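
A small Python sketch of Step 2, under the same assumptions as the rule above: both letters of the pair must lie in R1, and only the final letter is removed. The name step2 and the parameter p1 (start index of R1, assumed to be known already) are hypothetical.

```python
PAIRS = ("dd", "gd", "nn", "dt", "gt", "kt", "tt")


def step2(word: str, p1: int) -> str:
    """Drop the last letter if R1 ends in one of the listed pairs."""
    if word[p1:].endswith(PAIRS):  # str.endswith accepts a tuple of suffixes
        return word[:-1]
    return word


if __name__ == "__main__":
    print(step2("friskt", 4))     # R1 is 'kt', so the final t is removed
    print(step2("fröknarnn", 4))  # R1 is 'narnn', so the final n is removed
    print(step2("klokt", 4))      # R1 is only 't', so the word is unchanged
```

The last call shows why klokt keeps its t in the sample vocabulary: the k falls outside R1, so the pair kt is not found in R1.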
Step 3:
Search for the longest among the following suffixes in R1, and perform the action indicated.

lig   ig   els
delete

löst
replace with lös

fullt
replace with full
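
The three steps can be combined into one self-contained Python sketch. It is a hedged transliteration of the description above, not the official Snowball output: the names stem_swedish, r1_start and longest_suffix are hypothetical, and the input is assumed to be a lowercase word with precomposed ä, å and ö.

```python
VOWELS = set("aeiouyäåö")
S_ENDINGS = set("bcdfghjklmnoprtvy")
STEP1 = (
    "a arna erna heterna orna ad e ade ande arne are aste en anden aren "
    "heten ern ar er heter or as arnas ernas ornas es ades andes ens arens "
    "hetens erns at andet het ast"
).split()
PAIRS = ("dd", "gd", "nn", "dt", "gt", "kt", "tt")
# Step 3 suffixes mapped to their replacements ("" means delete)
STEP3 = {"lig": "", "ig": "", "els": "", "löst": "lös", "fullt": "full"}


def r1_start(word):
    """Index where R1 begins; at least 3 letters must precede it."""
    p1 = len(word)
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            p1 = i + 1
            break
    return max(p1, min(3, len(word)))


def longest_suffix(region, suffixes):
    """Longest listed suffix that ends the region, or None."""
    return max((s for s in suffixes if region.endswith(s)), key=len, default=None)


def stem_swedish(word):
    p1 = r1_start(word)  # R1 is fixed once, before any suffix is removed

    # Step 1: longest main suffix in R1; 's' needs a valid s-ending before it
    s = longest_suffix(word[p1:], STEP1 + ["s"])
    if s == "s":
        if word[-2] in S_ENDINGS:
            word = word[:-1]
    elif s:
        word = word[: -len(s)]

    # Step 2: a listed consonant pair in R1 loses its last letter
    if word[p1:].endswith(PAIRS):
        word = word[:-1]

    # Step 3: delete lig / ig / els, or shorten löst and fullt
    s = longest_suffix(word[p1:], STEP3)
    if s:
        word = word[: -len(s)] + STEP3[s]
    return word


if __name__ == "__main__":
    for w in ("jaktbössa", "jaquette", "jemförelser", "klockans", "klokt"):
        print(w, "=>", stem_swedish(w))
```

The sample words in the main block are taken from the vocabulary table earlier on this page, so the printed stems can be compared against it directly.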

 

The same algorithm in Snowball


routines (
           mark_regions
           main_suffix
           consonant_pair
           other_suffix
)

externals ( stem )

integers ( p1 x )

groupings ( v s_ending )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a"   hex 'E4'
stringdef ao   hex 'E5'
stringdef o"   hex 'F6'

define v 'aeiouy{a"}{ao}{o"}'

define s_ending  'bcdfghjklmnoprtvy'

define mark_regions as (

    $p1 = limit
    test ( hop 3 setmark x )
    goto v gopast non-v  setmark p1
    try ( $p1 < x  $p1 = x )
)

backwardmode (

    define main_suffix as (
        setlimit tomark p1 for ([substring])
        among(

            'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
            'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
            'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
            'hetens' 'erns' 'at' 'andet' 'het' 'ast'
                (delete)
            's'
                (s_ending delete)
        )
    )

    define consonant_pair as setlimit tomark p1 for (
        among('dd' 'gd' 'nn' 'dt' 'gt' 'kt' 'tt')
        and ([next] delete)
    )

    define other_suffix as setlimit tomark p1 for (
        [substring] among(
            'lig' 'ig' 'els'
                (delete)
            'l{o"}st'
                (<-'l{o"}s')
            'fullt'
                (<-'full')
        )
    )
)

define stem as (

    do mark_regions
    backwards (
        do main_suffix
        do consonant_pair
        do other_suffix
    )
)