The stemming algorithm
Letters in Portuguese include the following accented forms,
á é í ó ú â ê ô ç ã õ ü ñ
The following letters are vowels:
a e i o u á é í ó ú â ê ô
And the two nasalised vowel forms,
ã õ
should be treated as a vowel followed by a consonant.
ã and õ are therefore replaced by a~ and o~ in the word, where ~ is a
separate character to be treated as a consonant. After this replacement,
R2 (see the note on R1 and R2) and RV have the same definition as in the
Spanish stemmer.
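As a rough illustration of the preprocessing and the regions, the sketch below marks the nasal vowels and computes the start indices of R1, R2 and RV. The function names are hypothetical. The RV rule is written from the Spanish stemmer's definition, which is not restated in this section, so treat it as an assumption to check against that description.

```python
VOWELS = frozenset("aeiouáéíóúâêô")


def mark_nasals(word: str) -> str:
    """Rewrite ã, õ as a~, o~ so that ~ behaves as a consonant."""
    return word.replace("ã", "a~").replace("õ", "o~")


def region_after(word: str, start: int = 0) -> int:
    """Index just past the first non-vowel that follows a vowel, scanning from start.

    Returns len(word), i.e. an empty region, if no such position exists.
    """
    for i in range(start + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def regions(word: str) -> tuple:
    """Return the start indices (R1, R2, RV) for an already-marked word."""
    r1 = region_after(word)
    r2 = region_after(word, r1)
    rv = len(word)  # default: RV is empty
    if len(word) >= 2:
        if word[1] not in VOWELS:
            # Second letter is a consonant: RV follows the next vowel.
            for i in range(2, len(word)):
                if word[i] in VOWELS:
                    rv = i + 1
                    break
        elif word[0] in VOWELS:
            # First two letters are vowels: RV follows the next consonant.
            for i in range(2, len(word)):
                if word[i] not in VOWELS:
                    rv = i + 1
                    break
        elif len(word) >= 3:
            # Consonant then vowel: RV follows the third letter.
            rv = 3
    return r1, r2, rv
```

Under these assumptions, `regions("nacional")` puts R1 at "ional", R2 at "al" and RV at "ional". Because ~ is not in the vowel set, it is treated as a consonant without any special casing.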
Always do step 1.
Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the
action indicated.
- eza ezas ico ica icos icas ismo ismos
  ável ível ista istas oso osa
  osos osas amento amentos imento imentos
  adora ador aça~o adoras adores aço~es
  ante antes ância
  - delete if in R2
- logia logias
  - replace with log if in R2
- uça~o uço~es
  - replace with u if in R2
- ência ências
  - replace with ente if in R2
- amente
  - delete if in R1
  - if preceded by iv, delete if in R2 (and if further preceded by at,
    delete if in R2), otherwise,
  - if preceded by os, ic or ad, delete if in R2
- mente
  - delete if in R2
  - if preceded by ante, avel or ível, delete if in R2
- idade idades
  - delete if in R2
  - if preceded by abil, ic or iv, delete if in R2
- iva ivo ivas ivos
  - delete if in R2
  - if preceded by at, delete if in R2
- ira iras
  - replace with ir if in RV and preceded by e
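The "find the longest suffix, then test its region" pattern of step 1 can be sketched for two of the rules above. This is a minimal illustration, not a full step 1: it covers only idade/idades and ência/ências, and the helper names `region_after`, `strip_if_in` and `step1_subset` are hypothetical.

```python
VOWELS = frozenset("aeiouáéíóúâêô")


def region_after(word, start=0):
    """Index just past the first non-vowel following a vowel, from start."""
    for i in range(start + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def strip_if_in(word, suffix, region_start, replacement=""):
    """Replace suffix if it ends the word and lies wholly inside the region."""
    if word.endswith(suffix) and len(word) - len(suffix) >= region_start:
        return word[: len(word) - len(suffix)] + replacement, True
    return word, False


def step1_subset(word):
    """Apply only the idade/idades and ência/ências rules of step 1."""
    r2 = region_after(word, region_after(word))
    # Candidates are listed longest first; the first match is the longest.
    for suffix in ("idades", "ências", "idade", "ência"):
        if word.endswith(suffix):
            break
    else:
        return word  # none of the covered suffixes is present
    if suffix.startswith("ência"):
        return strip_if_in(word, suffix, r2, "ente")[0]
    word, removed = strip_if_in(word, suffix, r2)
    if removed:
        # Follow-up deletion: abil, ic or iv, again only if in R2.
        for inner in ("abil", "ic", "iv"):
            word, gone = strip_if_in(word, inner, r2)
            if gone:
                break
    return word
```

For example, `step1_subset("responsabilidade")` first removes idade and then abil, while `step1_subset("idade")` leaves the word alone because the suffix is not inside R2. The region index computed before the deletion stays valid afterwards, since only the end of the word changes.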
Do step 2 if no ending was removed by step 1.
Step 2: Verb suffixes
Search for the longest among the following suffixes in RV, and if found,
delete.
ada ida ia aria eria iria ará ara erá era irá ava asse esse
isse aste este iste ei arei erei irei am iam ariam eriam iriam
aram eram iram avam em arem erem irem assem essem issem ado ido
ando endo indo ara~o era~o ira~o ar er ir as adas idas ias arias
erias irias arás aras erás eras irás avas es ardes erdes
irdes ares eres ires asses esses isses astes estes istes is ais
eis íeis aríeis eríeis iríeis áreis areis éreis ereis
íreis ireis ásseis ésseis ísseis áveis ados idos ámos
amos íamos aríamos eríamos iríamos áramos éramos
íramos ávamos emos aremos eremos iremos ássemos êssemos
íssemos imos armos ermos irmos eu iu ou ira
iras
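Step 2 differs from step 1 in that the search itself is confined to RV. A minimal sketch, using a deliberately small subset of the suffix list and assuming the RV start index has already been computed (the names `VERB_SUFFIXES` and `step2_subset` are hypothetical):

```python
# An illustrative subset of the step 2 suffixes, not the full list.
VERB_SUFFIXES = ("ando", "endo", "indo", "aram", "amos",
                 "ar", "er", "ir", "ou", "am")


def step2_subset(word, rv):
    """Delete the longest listed suffix that lies wholly inside RV."""
    in_rv = word[rv:]  # only this part of the word is searched
    matches = [s for s in VERB_SUFFIXES if in_rv.endswith(s)]
    if not matches:
        return word
    longest = max(matches, key=len)
    return word[: len(word) - len(longest)]
```

With RV starting at index 3, `step2_subset("cantando", 3)` removes ando, whereas `step2_subset("amar", 3)` changes nothing because only the final r lies in RV.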
If the last step to be obeyed (either step 1 or step 2) altered the word,
do step 3.
Step 3
Delete suffix i if in RV and preceded by c
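A sketch of this rule, assuming the RV start index is passed in and using the hypothetical name `step3`. Only the i has to be in RV; the preceding c is merely tested.

```python
def step3(word, rv):
    """Delete a final i that lies in RV and is preceded by c."""
    if word.endswith("ci") and len(word) - 1 >= rv:
        return word[:-1]
    return word
```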
Alternatively, if neither step 1 nor step 2 altered the word, do step 4.
Step 4: Residual suffix
If the word ends with one of the suffixes
os a i o á í ó
in RV, delete it
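The residual suffix rule can be sketched as follows, again assuming a precomputed RV start index. The names `RESIDUAL_SUFFIXES` and `step4` are hypothetical.

```python
RESIDUAL_SUFFIXES = ("os", "a", "i", "o", "á", "í", "ó")


def step4(word, rv):
    """Delete a residual suffix if it lies wholly inside RV."""
    in_rv = word[rv:]
    for suffix in RESIDUAL_SUFFIXES:
        # At most one of these can match the end of a given word.
        if in_rv.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word
```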
Always do step 5
Step 5:
If the word ends with one of
e é ê
in RV, delete it, and if preceded by gu (or ci) with the u (or i) in RV,
delete the u (or i).
Or, if the word ends with ç, remove the cedilla.
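One way to read step 5 in code, assuming a precomputed RV start index and using the hypothetical name `step5`. The cedilla rule is treated as the alternative branch and is not restricted to RV, which is how the text above is interpreted here.

```python
def step5(word, rv):
    """Remove a final e, é or ê in RV, then tidy gu/ci; otherwise fix a final ç."""
    if word and word[-1] in "eéê" and len(word) - 1 >= rv:
        word = word[:-1]
        # The u of gu (or the i of ci) must itself lie in RV to be removed.
        if word.endswith(("gu", "ci")) and len(word) - 1 >= rv:
            word = word[:-1]
    elif word.endswith("ç"):
        word = word[:-1] + "c"
    return word
```

For instance, with RV starting at index 3, `step5("segue", 3)` removes the e and then the u, leaving "seg".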
And finally:
Turn a~, o~ back into ã, õ
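The order in which the steps are obeyed can be summarised in a small driver. In this sketch the five steps are passed in as functions from word to word, so only the control flow and the a~/o~ bookkeeping are shown. The name `run_steps` and this calling convention are assumptions made for illustration, and "removed an ending" is approximated by "changed the word".

```python
def run_steps(word, step1, step2, step3, step4, step5):
    """Apply the five steps in the order described, restoring ã and õ at the end."""
    word = word.replace("ã", "a~").replace("õ", "o~")
    before = word
    word = step1(word)              # always do step 1
    if word == before:
        word = step2(word)          # step 2 only if step 1 changed nothing
    if word != before:
        word = step3(word)          # step 1 or step 2 altered the word
    else:
        word = step4(word)          # neither step altered the word
    word = step5(word)              # always do step 5
    return word.replace("a~", "ã").replace("o~", "õ")
```

Passing identity functions for all five steps returns the input unchanged, which is a quick check that the nasal vowel marking round-trips.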