Dutch stemming algorithm


 



Here is a sample of Dutch vocabulary, with the stemmed forms that will be generated with this algorithm.

word                   stem                word              stem
lichaamsziek        => lichaamsziek        opgingen       => opging
lichamelijk         => licham              opglanzing     => opglanz
lichamelijke        => licham              opglanzingen   => opglanz
lichamelijkheden    => licham              opglimlachten  => opglimlacht
lichamen            => licham              opglimpen      => opglimp
lichere             => licher              opglimpende    => opglimp
licht               => licht               opglimping     => opglimp
lichtbeeld          => lichtbeeld          opglimpingen   => opglimp
lichtbruin          => lichtbruin          opgraven       => opgrav
lichtdoorlatende    => lichtdoorlat        opgrijnzen     => opgrijnz
lichte              => licht               opgrijzende    => opgrijz
lichten             => licht               opgroeien      => opgroei
lichtende           => lichtend            opgroeiende    => opgroei
lichtenvoorde       => lichtenvoord        opgroeiplaats  => opgroeiplat
lichter             => lichter             ophaal         => ophal
lichtere            => lichter             ophaaldienst   => ophaaldienst
lichters            => lichter             ophaalkosten   => ophaalkost
lichtgevoeligheid   => lichtgevoel         ophaalsystemen => ophaalsystem
lichtgewicht        => lichtgewicht        ophaalt        => ophaalt
lichtgrijs          => lichtgrijs          ophaaltruck    => ophaaltruck
lichthoeveelheid    => lichthoevel         ophalen        => ophal
lichtintensiteit    => lichtintensiteit    ophalend       => ophal
lichtje             => lichtj              ophalers       => ophaler
lichtjes            => lichtjes            ophef          => ophef
lichtkranten        => lichtkrant          opheffen       => opheff
lichtkring          => lichtkring          opheffende     => opheff
lichtkringen        => lichtkring          opheffing      => opheff
lichtregelsystemen  => lichtregelsystem    opheldering    => ophelder
lichtste            => lichtst             ophemelde      => ophemeld
lichtstromende      => lichtstrom          ophemelen      => ophemel
lichtte             => licht               opheusden      => opheusd
lichtten            => licht               ophief         => ophief
lichttoetreding     => lichttoetred        ophield        => ophield
lichtverontreinigde => lichtverontreinigd  ophieven       => ophiev
lichtzinnige        => lichtzinn           ophoepelt      => ophoepelt
lid                 => lid                 ophoog         => ophog
lidia               => lidia               ophoogzand     => ophoogzand
lidmaatschap        => lidmaatschap        ophopen        => ophop
lidstaten           => lidstat             ophoping       => ophop
lidvereniging       => lidveren            ophouden       => ophoud



 

The stemming algorithm

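Dutch includes the following accented forms:
ä   ë   ï   ö   ü   á   é   í   ó   ú   è
First, remove all umlaut and acute accents. A vowel is then one of
a   e   i   o   u   y   è
Put initial y, y after a vowel, and i between vowels into upper case, so that they are treated as non-vowels. R1 and R2 (see the note on R1 and R2) are then defined as in German: R1 is the region after the first non-vowel following a vowel, R2 is the region after the first non-vowel following a vowel in R1, and R1 is then adjusted so that the region before it contains at least 3 letters.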
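
The preprocessing above can be sketched in Python as follows. This is an illustrative sketch, not the reference implementation: the names `prelude`, `mark_regions`, `VOWELS` and `ACCENT_MAP` are assumptions made for this example, the input is assumed to be lower case already, and the left-to-right scan is a reading of the prose rule that may differ from the Snowball script in unusual letter sequences.

```python
VOWELS = "aeiouyè"

# Umlaut and acute accents are removed; è is left alone and counts as a vowel.
ACCENT_MAP = str.maketrans("äëïöüáéíóú", "aeiouaeiou")


def prelude(word):
    """Strip accents, then put consonant-like y and i into upper case."""
    chars = list(word.translate(ACCENT_MAP))
    if chars and chars[0] == "y":
        chars[0] = "Y"  # initial y
    for k in range(len(chars) - 1):
        if chars[k] not in VOWELS:
            continue
        nxt = chars[k + 1]
        if nxt == "i" and k + 2 < len(chars) and chars[k + 2] in VOWELS:
            chars[k + 1] = "I"  # i between vowels
        elif nxt == "y":
            chars[k + 1] = "Y"  # y after a vowel
    return "".join(chars)


def _after_vowel_then_non_vowel(word, start):
    """Index just past the first non-vowel that follows a vowel, or None."""
    k = start
    while k < len(word) and word[k] not in VOWELS:
        k += 1
    k += 1  # step past the vowel
    while k < len(word) and word[k] in VOWELS:
        k += 1
    return k + 1 if k < len(word) else None


def mark_regions(word):
    """Return (p1, p2), the offsets at which R1 and R2 begin."""
    p1 = p2 = len(word)
    cursor = _after_vowel_then_non_vowel(word, 0)
    if cursor is None:
        return p1, p2
    p1 = max(cursor, 3)  # at least 3 letters before R1
    second = _after_vowel_then_non_vowel(word, cursor)
    if second is not None:
        p2 = second
    return p1, p2
```

With these definitions, a suffix is "in R1" when it starts at or after offset `p1`, and "in R2" when it starts at or after `p2`. For lichamelijk the sketch gives `p1 = 3` and `p2 = 6`, so R1 is hamelijk and R2 is elijk.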

Define a valid s-ending as a non-vowel other than j.

Define a valid en-ending as a non-vowel, and not gem.

Define undoubling the ending as removing the last letter if the word ends kk, dd or tt.
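
As a minimal sketch, the three definitions above could be written as small Python helpers. The function names are hypothetical, and each predicate is assumed to receive the part of the word that precedes the candidate suffix.

```python
VOWELS = "aeiouyè"


def is_valid_s_ending(stem):
    """True if stem ends in a non-vowel other than j."""
    return bool(stem) and stem[-1] not in VOWELS and stem[-1] != "j"


def is_valid_en_ending(stem):
    """True if stem ends in a non-vowel and does not end in gem."""
    return bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")


def undouble(word):
    """Remove the last letter if the word ends kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word
```

Note that the upper-case I and Y produced by the preprocessing are not in `VOWELS`, so they count as non-vowels here.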

Do each of steps 1, 2, 3 and 4.

Step 1:
Search for the longest among the following suffixes, and perform the action indicated.

(a) heden
replace with heid if in R1

(b) en   ene
delete if in R1 and preceded by a valid en-ending, and then undouble the ending

(c) s   se
delete if in R1 and preceded by a valid s-ending
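
Step 1 might be sketched in Python as below. The function names and the convention of passing `p1` (the offset where R1 begins) as a parameter are assumptions made for this illustration. Only the longest matching suffix is considered: if its condition fails, no shorter suffix is tried.

```python
VOWELS = "aeiouyè"


def undouble(word):
    """Remove the last letter if the word ends kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def step1(word, p1):
    """Apply step 1 to a preprocessed word; p1 is the start offset of R1."""
    # Longest suffix first, so the first hit is the longest match.
    for suffix in ("heden", "ene", "en", "se", "s"):
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        in_r1 = len(stem) >= p1
        ends_non_vowel = bool(stem) and stem[-1] not in VOWELS
        if suffix == "heden":
            return stem + "heid" if in_r1 else word
        if suffix in ("ene", "en"):
            valid_en = ends_non_vowel and not stem.endswith("gem")
            return undouble(stem) if in_r1 and valid_en else word
        # Remaining cases: s and se.
        valid_s = ends_non_vowel and stem[-1] != "j"
        return stem if in_r1 and valid_s else word
    return word
```

For example, with `p1 = 3`, lichtten loses en and is then undoubled to licht, while lichtjes is left unchanged because its s is preceded by a vowel.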
Step 2:
Delete suffix e if in R1 and preceded by a non-vowel, and then undouble the ending
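
A small Python sketch of step 2 follows. The name `step2` and the returned flag are assumptions for this example. The flag records whether an e was actually removed, which the bar rule in step 3b needs later.

```python
VOWELS = "aeiouyè"


def undouble(word):
    """Remove the last letter if the word ends kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def step2(word, p1):
    """Return (new_word, e_removed); p1 is the start offset of R1."""
    stem = word[:-1]
    if (
        word.endswith("e")
        and len(stem) >= p1          # the e lies in R1
        and stem
        and stem[-1] not in VOWELS   # preceded by a non-vowel
    ):
        return undouble(stem), True
    return word, False
```
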
Step 3a: heid
delete heid if in R2 and not preceded by c, and treat a preceding en as in step 1(b)
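
Step 3a can be illustrated with the Python sketch below. The names `step3a` and `en_ending`, and the explicit `p1` and `p2` offsets, are hypothetical choices for this example. Once heid has been removed, a trailing en is handled by the same rule as in step 1(b).

```python
VOWELS = "aeiouyè"


def undouble(word):
    """Remove the last letter if the word ends kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def en_ending(word, p1):
    """Step 1(b) action for a word already known to end in en."""
    stem = word[:-2]
    in_r1 = len(stem) >= p1
    valid = bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")
    return undouble(stem) if in_r1 and valid else word


def step3a(word, p1, p2):
    """Delete heid if in R2 and not preceded by c, then treat a trailing en."""
    if not word.endswith("heid"):
        return word
    stem = word[:-4]
    if len(stem) < p2 or stem.endswith("c"):
        return word
    if stem.endswith("en"):
        return en_ending(stem, p1)
    return stem
```
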
Step 3b: d-suffixes (*)
Search for the longest among the following suffixes, and perform the action indicated.

end   ing
delete if in R2; then, if the remaining word ends in ig, delete the ig if it is in R2 and not preceded by e, otherwise undouble the ending

ig
delete if in R2 and not preceded by e

lijk
delete if in R2, and then repeat step 2

baar
delete if in R2

bar
delete if in R2 and if step 2 actually removed an e
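
The d-suffix rules of step 3b could be sketched in Python as follows. This is a hedged illustration: the function names are invented for the example, `p1` and `p2` are the start offsets of R1 and R2, and `e_found` is assumed to be the flag reported by step 2.

```python
VOWELS = "aeiouyè"


def undouble(word):
    """Remove the last letter if the word ends kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def e_ending(word, p1):
    """Step 2 action: drop a final e in R1 after a non-vowel, then undouble."""
    stem = word[:-1]
    if word.endswith("e") and len(stem) >= p1 and stem and stem[-1] not in VOWELS:
        return undouble(stem)
    return word


def step3b(word, p1, p2, e_found):
    """Apply the longest matching d-suffix rule, if its condition holds."""
    for suffix in ("baar", "lijk", "end", "ing", "bar", "ig"):
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if len(stem) < p2:  # suffix is not in R2: nothing happens
            return word
        if suffix in ("end", "ing"):
            ig_in_r2 = stem.endswith("ig") and len(stem) - 2 >= p2
            if ig_in_r2 and not stem.endswith("eig"):
                return stem[:-2]
            return undouble(stem)
        if suffix == "ig":
            return word if stem.endswith("e") else stem
        if suffix == "lijk":
            return e_ending(stem, p1)  # repeat step 2
        if suffix == "baar":
            return stem
        return stem if e_found else word  # bar
    return word
```

For instance, with `p1 = 3` and `p2 = 6`, lidvereniging first loses ing and then ig, giving lidveren, which agrees with the sample vocabulary above.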
Step 4: undouble vowel
If the word ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, maan -> man, brood -> brod).
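
As a minimal sketch, the vowel undoubling of step 4 might look like this in Python. The function name `undouble_vowel` is an assumption for the example.

```python
VOWELS = "aeiouyè"


def undouble_vowel(word):
    """If the word ends C + doubled a/e/o/u + D, drop one of the two vowels."""
    if (
        len(word) >= 4
        and word[-1] not in VOWELS + "I"                 # D: non-vowel other than I
        and word[-3:-1] in ("aa", "ee", "oo", "uu")      # V: doubled vowel
        and word[-4] not in VOWELS                       # C: non-vowel
    ):
        return word[:-2] + word[-1]
    return word
```

A word such as aal is left alone because there is no non-vowel C in front of the doubled vowel.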
Finally,
Turn I and Y back into lower case.

 

The same algorithm in Snowball


routines (
           prelude postlude
           e_ending
           en_ending
           mark_regions
           R1 R2
           undouble
           standard_suffix
)

externals ( stem )

booleans ( e_found )

integers ( p1 p2 )

groupings ( v v_I v_j )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a"   hex 'E4'
stringdef e"   hex 'EB'
stringdef i"   hex 'EF'
stringdef o"   hex 'F6'
stringdef u"   hex 'FC'

stringdef a'   hex 'E1'
stringdef e'   hex 'E9'
stringdef i'   hex 'ED'
stringdef o'   hex 'F3'
stringdef u'   hex 'FA'

stringdef e`   hex 'E8'

define v       'aeiouy{e`}'
define v_I     v + 'I'
define v_j     v + 'j'

define prelude as (
    test repeat (
        [substring] among(
            '{a"}' '{a'}'
                (<- 'a')
            '{e"}' '{e'}'
                (<- 'e')
            '{i"}' '{i'}'
                (<- 'i')
            '{o"}' '{o'}'
                (<- 'o')
            '{u"}' '{u'}'
                (<- 'u')
            ''  (next)
        ) //or next
    )
    try(['y'] <- 'Y')
    repeat goto (
        v [('i'] v <- 'I') or
           ('y']   <- 'Y')
    )
)

define mark_regions as (

    $p1 = limit
    $p2 = limit

    gopast v  gopast non-v  setmark p1
    try($p1 < 3  $p1 = 3)  // at least 3
    gopast v  gopast non-v  setmark p2

)

define postlude as repeat (

    [substring] among(
        'Y'  (<- 'y')
        'I'  (<- 'i')
        ''   (next)
    ) //or next

)

backwardmode (

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define undouble as (
        test among('kk' 'dd' 'tt') [next] delete
    )

    define e_ending as (
        unset e_found
        ['e'] R1 test non-v delete
        set e_found
        undouble
    )

    define en_ending as (
        R1 non-v and not 'gem' delete
        undouble
    )

    define standard_suffix as (
        do (
            [substring] among(
                'heden'
                (   R1 <- 'heid'
                )
                'en' 'ene'
                (   en_ending
                )
                's' 'se'
                (   R1 non-v_j delete
                )
            )
        )
        do e_ending

        do ( ['heid'] R2 not 'c' delete
             ['en'] en_ending
           )

        do (
            [substring] among(
                'end' 'ing'
                (   R2 delete
                    (['ig'] R2 not 'e' delete) or undouble
                )
                'ig'
                (   R2 not 'e' delete
                )
                'lijk'
                (   R2 delete e_ending
                )
                'baar'
                (   R2 delete
                )
                'bar'
                (   R2 e_found delete
                )
            )
        )
        do (
            non-v_I
            test (
                among ('aa' 'ee' 'oo' 'uu')
                non-v
            )
            [next] delete
        )
    )
)

define stem as (

    do prelude
    do mark_regions
    backwards
        do standard_suffix
    do postlude
)