The stemming algorithm
Dutch includes the following accented forms:

ä ë ï ö ü á é í ó ú è

First, remove all umlaut and acute accents. A vowel is then one of:

a e i o u y è
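
As a minimal sketch of this first normalisation, the accent removal can be written as a character translation table. The names `ACCENT_MAP`, `VOWELS`, `remove_accents` and `is_vowel` are illustrative only and do not come from the algorithm description. Note that è is left untouched, since only umlauts and acute accents are removed and è is itself listed as a vowel.

```python
# Illustrative names; only the letter lists come from the text above.
# è is deliberately kept: only umlaut and acute accents are removed.
ACCENT_MAP = str.maketrans("äëïöüáéíóú", "aeiouaeiou")
VOWELS = frozenset("aeiouyè")


def remove_accents(word: str) -> str:
    """Replace umlaut and acute-accented vowels with their plain forms."""
    return word.translate(ACCENT_MAP)


def is_vowel(ch: str) -> bool:
    """True for the seven vowels of the definition (lower case only)."""
    return ch in VOWELS
```

Because the test is on lower-case letters only, any letter later put into upper case automatically stops counting as a vowel.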
Put initial y, y after a vowel, and i between vowels into upper case. R1 and R2 (see the note on R1 and R2) are then defined as in German.
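
The marking and the two regions might be sketched as follows. This assumes the usual definition of R1 (the region after the first non-vowel that follows a vowel) and R2 (the same rule applied inside R1), plus the German adjustment that R1 is moved so that at least 3 letters precede it. The function names are hypothetical, and the left-to-right marking is a simplified reading that may differ from a reference implementation when several candidate letters are adjacent.

```python
VOWELS = frozenset("aeiouyè")


def mark_capitals(word: str) -> str:
    """Upper-case initial y, y after a vowel, and i between vowels."""
    chars = list(word)
    for i, ch in enumerate(chars):
        prev_is_vowel = i > 0 and chars[i - 1] in VOWELS
        next_is_vowel = i + 1 < len(chars) and chars[i + 1] in VOWELS
        if ch == "y" and (i == 0 or prev_is_vowel):
            chars[i] = "Y"
        elif ch == "i" and prev_is_vowel and next_is_vowel:
            chars[i] = "I"
    return "".join(chars)


def region_start(word: str, start: int) -> int:
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; the word length if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def regions(word: str) -> tuple:
    """Return the start offsets (r1, r2) of R1 and R2."""
    r1 = region_start(word, 0)
    r2 = region_start(word, r1)
    # Assumed German-style adjustment: at least 3 letters before R1.
    r1 = max(r1, 3)
    return r1, r2
```

With these offsets, a suffix is "in R1" when the part of the word before it has length at least `r1`, and likewise for R2.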
Define a valid s-ending as a non-vowel other than j.
Define a valid en-ending as a non-vowel, and not gem.
Define undoubling the ending as removing the last letter if the word ends
kk, dd or tt.
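
These three definitions translate almost directly into small predicates. The sketch below uses hypothetical function names and takes as its argument the part of the word that precedes the suffix being tested.

```python
VOWELS = frozenset("aeiouyè")


def valid_s_ending(stem: str) -> bool:
    """The stem ends in a non-vowel other than j."""
    return bool(stem) and stem[-1] not in VOWELS and stem[-1] != "j"


def valid_en_ending(stem: str) -> bool:
    """The stem ends in a non-vowel and does not end in gem."""
    return bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")


def undouble(word: str) -> str:
    """Remove the last letter if the word ends kk, dd or tt."""
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word
```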
Do each of steps 1, 2, 3 and 4.
Step 1:
Search for the longest among the following suffixes, and perform the action indicated:
- (a) heden
  - replace with heid if in R1
- (b) en ene
  - delete if in R1 and preceded by a valid en-ending, and then undouble the ending
- (c) s se
  - delete if in R1 and preceded by a valid s-ending
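
Step 1 could be sketched as below. The function name `step1` and the convention of passing `r1` as the start offset of R1 are assumptions made for illustration. The sketch reads "longest among" as: once the longest matching suffix is found, its condition either applies or the word is left unchanged, with no fallback to a shorter suffix.

```python
VOWELS = frozenset("aeiouyè")


def undouble(word: str) -> str:
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def valid_en_ending(stem: str) -> bool:
    return bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")


def valid_s_ending(stem: str) -> bool:
    return bool(stem) and stem[-1] not in VOWELS and stem[-1] != "j"


def step1(word: str, r1: int) -> str:
    """Apply step 1; r1 is the offset at which region R1 starts."""
    # (a) heden is tried first because it is the longest suffix.
    if word.endswith("heden"):
        return word[:-5] + "heid" if len(word) - 5 >= r1 else word
    # (b) ene before en, again longest first.
    for suffix in ("ene", "en"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) >= r1 and valid_en_ending(stem):
                return undouble(stem)
            return word
    # (c) se before s.
    for suffix in ("se", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) >= r1 and valid_s_ending(stem):
                return stem
            return word
    return word
```

For instance, `step1("bakken", 3)` removes en and then undoubles the resulting kk, giving `"bak"`.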
Step 2:
Delete suffix e if in R1 and preceded by a non-vowel, and then undouble the ending.
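
A possible sketch of step 2 follows. Because step 3b later needs to know whether an e was actually removed here, this hypothetical `step2` returns that fact alongside the word; the name and the return shape are assumptions, not part of the description.

```python
VOWELS = frozenset("aeiouyè")


def undouble(word: str) -> str:
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def step2(word: str, r1: int) -> tuple:
    """Return (new_word, e_removed); r1 is the start offset of R1."""
    if word.endswith("e"):
        stem = word[:-1]
        # The e must lie in R1 and follow a non-vowel.
        if len(stem) >= r1 and stem and stem[-1] not in VOWELS:
            return undouble(stem), True
    return word, False
```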
Step 3a: heid
Delete heid if in R2 and not preceded by c, and treat a preceding en as in step 1(b).
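
Step 3a might look like this in code. The sketch assumes that "treat a preceding en as in step 1(b)" means: after heid has been deleted, an en now at the end of the word is removed under the step 1(b) conditions and the ending is then undoubled. `step3a`, `r1` and `r2` are illustrative names for the function and the region offsets.

```python
VOWELS = frozenset("aeiouyè")


def undouble(word: str) -> str:
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def valid_en_ending(stem: str) -> bool:
    return bool(stem) and stem[-1] not in VOWELS and not stem.endswith("gem")


def step3a(word: str, r1: int, r2: int) -> str:
    """Delete heid (in R2, not after c), then handle a preceding en."""
    if not word.endswith("heid"):
        return word
    stem = word[:-4]
    if len(stem) < r2 or stem.endswith("c"):
        return word
    word = stem
    # A preceding en is handled as in step 1(b).
    if word.endswith("en"):
        inner = word[:-2]
        if len(inner) >= r1 and valid_en_ending(inner):
            word = undouble(inner)
    return word
```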
Step 3b: d-suffixes (*)
Search for the longest among the following suffixes, and perform the action indicated.
- end ing
  - delete if in R2
  - if preceded by ig, delete if in R2 and not preceded by e, otherwise undouble the ending
- ig
  - delete if in R2 and not preceded by e
- lijk
  - delete if in R2, and then repeat step 2
- baar
  - delete if in R2
- bar
  - delete if in R2 and if step 2 actually removed an e
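
The d-suffix rules can be sketched as one function. This is a reading of the list above rather than a definitive implementation: it assumes that the ig rule under end/ing applies only after end or ing has been deleted, and that `e_found` carries the "step 2 actually removed an e" flag. All names are hypothetical.

```python
VOWELS = frozenset("aeiouyè")


def undouble(word: str) -> str:
    return word[:-1] if word.endswith(("kk", "dd", "tt")) else word


def repeat_step2(word: str, r1: int) -> str:
    """Step 2 again: drop e in R1 after a non-vowel, then undouble."""
    stem = word[:-1]
    if word.endswith("e") and len(stem) >= r1 and stem and stem[-1] not in VOWELS:
        return undouble(stem)
    return word


def step3b(word: str, r1: int, r2: int, e_found: bool) -> str:
    """r1, r2 are region start offsets; e_found comes from step 2."""
    if word.endswith(("end", "ing")):
        stem = word[:-3]
        if len(stem) < r2:
            return word
        if stem.endswith("ig"):
            inner = stem[:-2]
            if len(inner) >= r2 and not inner.endswith("e"):
                return inner
        return undouble(stem)
    if word.endswith("ig"):
        stem = word[:-2]
        return stem if len(stem) >= r2 and not stem.endswith("e") else word
    if word.endswith("lijk"):
        stem = word[:-4]
        return repeat_step2(stem, r1) if len(stem) >= r2 else word
    if word.endswith("baar"):
        stem = word[:-4]
        return stem if len(stem) >= r2 else word
    if word.endswith("bar"):
        stem = word[:-3]
        return stem if len(stem) >= r2 and e_found else word
    return word
```

The suffixes in the list never compete for the same word ending (for example, no word ends in both ing and ig), so the order of the checks does not affect which suffix is chosen.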
Step 4: undouble vowel
If the word ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, maan -> man, brood -> brod).
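
The vowel undoubling of step 4 is a purely local test on the last four letters, which the following sketch makes explicit. `step4` is an illustrative name. Upper-case Y needs no special case, because it is not in the vowel set and therefore already counts as a non-vowel.

```python
VOWELS = frozenset("aeiouyè")
DOUBLE_VOWELS = ("aa", "ee", "oo", "uu")


def step4(word: str) -> str:
    """If the word ends C + double vowel + D, drop one of the vowels."""
    if len(word) < 4:
        return word
    c, v, d = word[-4], word[-3:-1], word[-1]
    if (
        c not in VOWELS
        and v in DOUBLE_VOWELS
        and d not in VOWELS
        and d != "I"
    ):
        # Keep the first vowel of the pair, then re-attach D.
        return word[:-2] + d
    return word
```

Running it on the two examples from the text gives `step4("maan") == "man"` and `step4("brood") == "brod"`.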
Finally, turn I and Y back into lower case.
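
This last step is a simple character replacement. A minimal sketch, using the hypothetical name `restore_case` and assuming the word is otherwise entirely lower case:

```python
def restore_case(word: str) -> str:
    """Undo the upper-case marking applied before the regions were set."""
    return word.replace("I", "i").replace("Y", "y")
```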