The stemming algorithm
Letters in Spanish include the following accented forms:
á é í ó ú ü ñ
The following letters are vowels:
a e i o u á é í ó ú ü
R1 and R2 are defined in the usual way; see the note on R1 and R2.
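The note on R1 and R2 is not reproduced in this section. As a minimal sketch, assuming the standard Snowball definition (R1 is the region after the first non-vowel following a vowel, and R2 is the same rule applied inside R1), the two regions might be computed as follows. The names `after_vowel_consonant` and `r1_r2` are illustrative only, and words are assumed to be lowercase with precomposed accented letters.

```python
VOWELS = set("aeiouáéíóúü")


def after_vowel_consonant(word, start=0):
    """Index just past the first non-vowel that follows a vowel, scanning
    from `start`; len(word) (the empty region) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the start indices of R1 and R2 (assumed standard definition)."""
    r1 = after_vowel_consonant(word)
    r2 = after_vowel_consonant(word, r1)
    return r1, r2


r1, r2 = r1_r2("trabajo")
print("trabajo"[r1:], "trabajo"[r2:])
```

A suffix is "in R2" when the index at which it starts is at least the R2 start index, and likewise for R1.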
RV is defined as follows (and this is not the same as the French stemmer definition):
- If the second letter is a consonant, RV is the region after the next following vowel.
- If the first two letters are vowels, RV is the region after the next consonant.
- Otherwise (consonant-vowel case), RV is the region after the third letter.
But RV is the end of the word if these positions cannot be found.
For example,
    m a c h o     o l i v a     t r a b a j o     á u r e o
         |...|         |...|         |.......|         |...|
so the RV regions are ho, va, bajo and eo respectively.
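The RV rule can be sketched directly from the three cases described above. This is an illustrative reading rather than a reference implementation: `rv_start` is a made-up name, it returns the index at which RV begins, and it assumes lowercase input with precomposed accented letters.

```python
VOWELS = set("aeiouáéíóúü")


def rv_start(word):
    """Start index of RV; len(word) when the position cannot be found."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV follows the next vowel.
        for i in range(2, n):
            if word[i] in VOWELS:
                return i + 1
        return n
    if word[0] in VOWELS:
        # First two letters are vowels: RV follows the next consonant.
        for i in range(2, n):
            if word[i] not in VOWELS:
                return i + 1
        return n
    # Consonant-vowel case: RV follows the third letter.
    return 3 if n >= 3 else n


for w in ("macho", "oliva", "trabajo", "áureo"):
    print(w, w[rv_start(w):])
```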
Always do steps 0 and 1.
Step 0: Attached pronoun
Search for the longest among the following suffixes
me se sela selo selas selos la le lo las les los nos
and delete it, if it comes after one of
(a) iéndo ándo ár ér ír
(b) ando iendo ar er ir
(c) yendo following u
in RV. In the case of (c), yendo must lie in RV, but the preceding u can be outside it.
In the case of (a), deletion is followed by removing the acute accent
(for example, haciéndola -> haciendo).
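A hedged sketch of step 0 follows. It assumes one particular reading of the text: the longest pronoun lying in RV is selected first, and if the letters before it do not match (a), (b) or (c) the word is left unchanged (no shorter pronoun is tried). The names `rv_start` and `step0` are illustrative.

```python
VOWELS = set("aeiouáéíóúü")
PRONOUNS = sorted(
    "me se sela selo selas selos la le lo las les los nos".split(),
    key=len, reverse=True)
ACCENTED = {"iéndo": "iendo", "ándo": "ando", "ár": "ar", "ér": "er", "ír": "ir"}
PLAIN = ("ando", "iendo", "ar", "er", "ir")


def rv_start(word):
    """Start index of RV, following the three cases of the RV definition."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] in VOWELS), n)
    if word[0] in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] not in VOWELS), n)
    return 3 if n >= 3 else n


def step0(word):
    rv = rv_start(word)
    # Longest pronoun suffix that lies entirely in RV.
    pron = next((p for p in PRONOUNS
                 if word.endswith(p) and len(word) - len(p) >= rv), None)
    if pron is None:
        return word
    base = word[:-len(pron)]

    def in_rv(suffix):
        return base.endswith(suffix) and len(base) - len(suffix) >= rv

    for accented, plain in ACCENTED.items():      # case (a)
        if in_rv(accented):
            return base[:-len(accented)] + plain  # also drops the accent
    if any(in_rv(s) for s in PLAIN):              # case (b)
        return base
    if in_rv("yendo") and base[:-5].endswith("u"):  # case (c)
        return base
    return word


print(step0("haciéndola"))
```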
Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the
action indicated.
- anza anzas ico ica icos icas ismo ismos able ables ible ibles ista istas oso osa osos osas amiento amientos imiento imientos
  - delete if in R2
- adora ador ación adoras adores aciones ante antes ancia ancias
  - delete if in R2
  - if preceded by ic, delete if in R2
- logía logías
  - replace with log if in R2
- ución uciones
  - replace with u if in R2
- encia encias
  - replace with ente if in R2
- amente
  - delete if in R1
  - if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
  - if preceded by os, ic or ad, delete if in R2
- mente
  - delete if in R2
  - if preceded by ante, able or ible, delete if in R2
- idad idades
  - delete if in R2
  - if preceded by abil, ic or iv, delete if in R2
- iva ivo ivas ivos
  - delete if in R2
  - if preceded by at, delete if in R2
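The step 1 rules lend themselves to a table-driven sketch. The version below is one possible reading, not a reference implementation: it assumes the standard Snowball definitions of R1 and R2, picks the longest matching suffix over the whole word before testing its region, and gives up (rather than trying a shorter suffix) when the region test fails. `RULES`, `_after_vc` and `step1` are hypothetical names.

```python
VOWELS = set("aeiouáéíóúü")


def _after_vc(word, start):
    # Index just past the first non-vowel following a vowel, from `start`.
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


# (suffixes, required region, replacement, follow-up chains). A chain lists
# further suffixes removed one after another, each only if it is in R2.
RULES = [
    ("anza anzas ico ica icos icas ismo ismos able ables ible ibles ista "
     "istas oso osa osos osas amiento amientos imiento imientos", 2, "", []),
    ("adora ador ación adoras adores aciones ante antes ancia ancias",
     2, "", [["ic"]]),
    ("logía logías", 2, "log", []),
    ("ución uciones", 2, "u", []),
    ("encia encias", 2, "ente", []),
    ("amente", 1, "", [["iv", "at"], ["os"], ["ic"], ["ad"]]),
    ("mente", 2, "", [["ante"], ["able"], ["ible"]]),
    ("idad idades", 2, "", [["abil"], ["ic"], ["iv"]]),
    ("iva ivo ivas ivos", 2, "", [["at"]]),
]


def step1(word):
    r1 = _after_vc(word, 0)
    r2 = _after_vc(word, r1)
    start = {1: r1, 2: r2}
    best = None
    for suffixes, region, repl, chains in RULES:
        for s in suffixes.split():
            if word.endswith(s) and (best is None or len(s) > len(best[0])):
                best = (s, region, repl, chains)
    if best is None:
        return word
    s, region, repl, chains = best
    cut = len(word) - len(s)
    if cut < start[region]:
        return word                      # suffix not in its region
    word = word[:cut] + repl
    for chain in chains:
        if word.endswith(chain[0]):      # at most one chain can apply
            for extra in chain:
                if word.endswith(extra) and len(word) - len(extra) >= r2:
                    word = word[:-len(extra)]
                else:
                    break
            break
    return word


print(step1("significativamente"))
```

A caller can detect whether an ending was removed by comparing the result with the input, which is what the condition for running step 2a needs.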
Do step 2a if no ending was removed by step 1.
Step 2a: Verb suffixes beginning y
Search for the longest among the following suffixes in RV, and if found, delete if preceded by u.
ya ye yan yen yeron yendo yo yó yas yes yais yamos
(Note that the preceding u need not be in RV.)
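Step 2a is small enough to sketch in full. The code below assumes the RV definition given earlier; `rv_start` and `step2a` are illustrative names, and the suffix must lie in RV while the preceding u is checked anywhere in the word.

```python
VOWELS = set("aeiouáéíóúü")
Y_SUFFIXES = sorted(
    "ya ye yan yen yeron yendo yo yó yas yes yais yamos".split(),
    key=len, reverse=True)


def rv_start(word):
    """Start index of RV, following the three cases of the RV definition."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] in VOWELS), n)
    if word[0] in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] not in VOWELS), n)
    return 3 if n >= 3 else n


def step2a(word):
    rv = rv_start(word)
    for s in Y_SUFFIXES:                 # longest first
        cut = len(word) - len(s)
        if cut >= rv and word.endswith(s):
            # Longest suffix in RV found; delete only if preceded by u.
            if cut > 0 and word[cut - 1] == "u":
                return word[:cut]
            return word
    return word


print(step2a("construyeron"))
```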
Do step 2b if step 2a was done, but failed to remove a suffix.
Step 2b: Other verb suffixes
Search for the longest among the following suffixes in RV, and perform the
action indicated.
- en es éis emos
  - delete, and if preceded by gu delete the u (the gu need not be in RV)
- arían arías arán arás aríais aría aréis aríamos aremos ará aré
  erían erías erán erás eríais ería eréis eríamos eremos erá eré
  irían irías irán irás iríais iría iréis iríamos iremos irá iré
  aba ada ida ía ara iera ad ed id ase iese aste iste an aban ían
  aran ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as
  abas adas idas ías aras ieras ases ieses ís áis abais íais
  arais ierais aseis ieseis asteis isteis ados idos amos ábamos
  íamos imos áramos iéramos iésemos ásemos
  - delete
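As a sketch of step 2b, the two suffix groups can be merged into one list sorted by length so that the longest suffix in RV wins, with the gu adjustment applied only to the first group. `GU_GROUP`, `DELETE_GROUP`, `rv_start` and `step2b` are names chosen here for illustration.

```python
VOWELS = set("aeiouáéíóúü")
GU_GROUP = "en es éis emos".split()
DELETE_GROUP = """
arían arías arán arás aríais aría aréis aríamos aremos ará aré
erían erías erán erás eríais ería eréis eríamos eremos erá eré
irían irías irán irás iríais iría iréis iríamos iremos irá iré
aba ada ida ía ara iera ad ed id ase iese aste iste an aban ían
aran ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as
abas adas idas ías aras ieras ases ieses ís áis abais íais
arais ierais aseis ieseis asteis isteis ados idos amos ábamos
íamos imos áramos iéramos iésemos ásemos
""".split()
ALL_SUFFIXES = sorted(GU_GROUP + DELETE_GROUP, key=len, reverse=True)


def rv_start(word):
    """Start index of RV, following the three cases of the RV definition."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] in VOWELS), n)
    if word[0] in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] not in VOWELS), n)
    return 3 if n >= 3 else n


def step2b(word):
    rv = rv_start(word)
    for s in ALL_SUFFIXES:               # longest first
        cut = len(word) - len(s)
        if cut >= rv and word.endswith(s):
            stem = word[:cut]
            # For en/es/éis/emos, a preceding gu loses its u; the gu
            # itself is not required to be in RV.
            if s in GU_GROUP and stem.endswith("gu"):
                stem = stem[:-1]
            return stem
    return word


print(step2b("paguen"))
```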
Always do step 3.
Step 3: Residual suffix
Search for the longest among the following suffixes in RV, and perform the
action indicated.
- os a o á í ó
  - delete if in RV
- e é
  - delete if in RV, and if preceded by gu with the u in RV delete the u
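Step 3 might be sketched as below. Note the difference from step 2b: here the u of gu must itself be in RV before it is removed. `rv_start` and `step3` are illustrative names, and the RV computation follows the definition given earlier.

```python
VOWELS = set("aeiouáéíóúü")


def rv_start(word):
    """Start index of RV, following the three cases of the RV definition."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] in VOWELS), n)
    if word[0] in VOWELS:
        return next((i + 1 for i in range(2, n) if word[i] not in VOWELS), n)
    return 3 if n >= 3 else n


def step3(word):
    rv = rv_start(word)
    # "os" is listed first so the longest suffix is tried first.
    for s in ("os", "a", "o", "á", "í", "ó"):
        cut = len(word) - len(s)
        if cut >= rv and word.endswith(s):
            return word[:cut]
    for s in ("e", "é"):
        cut = len(word) - 1
        if cut >= rv and word.endswith(s):
            stem = word[:cut]
            # Remove the u of a preceding gu only if that u is in RV.
            if stem.endswith("gu") and cut - 1 >= rv:
                stem = stem[:-1]
            return stem
    return word


print(step3("libros"), step3("pague"))
```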
And finally:
Remove acute accents
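The final accent removal, together with the order in which the steps are applied, can be sketched as follows. The driver `stem` is hypothetical: it takes the five step functions as arguments (any implementations with a word-in, word-out signature) and treats "a suffix was removed" as "the word changed", which is an assumption of this sketch. Only acute accents are stripped; ü and ñ are left alone.

```python
ACUTE = str.maketrans("áéíóú", "aeiou")


def remove_acute_accents(word):
    return word.translate(ACUTE)


def stem(word, step0, step1, step2a, step2b, step3):
    """Apply the steps in the order the text prescribes."""
    word = step0(word)                 # always
    after = step1(word)                # always
    if after == word:                  # step 1 removed nothing
        after = step2a(word)
        if after == word:              # step 2a removed nothing
            after = step2b(word)
    return remove_acute_accents(step3(after))  # step 3 always, then accents


# Usage with placeholder steps that leave the word unchanged.
identity = lambda w: w
print(stem("canción", identity, identity, identity, identity, identity))
```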