Portuguese stemming algorithm


 

Links to resources

Snowball main page
The stemmer in Snowball
The ANSI C stemmer
— and its header
Sample Portuguese vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent in two columns
Tar-gzipped file of all of the above

Portuguese stop word list
The stemmer in Snowball — MS DOS Latin I encodings
Romance language stemmers


Here is a sample of Portuguese vocabulary, with the stemmed forms that will be generated with this algorithm.

word           stem        word                 stem
boa        =>  boa         quiabo           =>  quiab
boainain   =>  boainain    quicaram         =>  quic
boas       =>  boas        quickly          =>  quickly
bôas       =>  bôas        quieto           =>  quiet
boassu     =>  boassu      quietos          =>  quiet
boataria   =>  boat        quilate          =>  quilat
boate      =>  boat        quilates         =>  quilat
boates     =>  boat        quilinhos        =>  quilinh
boatos     =>  boat        quilo            =>  quil
bob        =>  bob         quilombo         =>  quilomb
boba       =>  bob         quilométricas    =>  quilométr
bobagem    =>  bobag       quilométricos    =>  quilométr
bobagens   =>  bobagens    quilômetro       =>  quilômetr
bobalhões  =>  bobalhõ     quilômetros      =>  quilômetr
bobear     =>  bob         quilos           =>  quil
bobeira    =>  bobeir      química          =>  químic
bobinho    =>  bobinh      químicas         =>  químic
bobinhos   =>  bobinh      químico          =>  químic
bobo       =>  bob         químicos         =>  químic
bobs       =>  bobs        quimioterapia    =>  quimioterap
boca       =>  boc         quimioterápicos  =>  quimioteráp
bocadas    =>  boc         quimono          =>  quimon
bocadinho  =>  bocadinh    quincas          =>  quinc
bocado     =>  boc         quinhão          =>  quinhã
bocaiúva   =>  bocaiúv     quinhentos       =>  quinhent
boçal      =>  boçal       quinn            =>  quinn
bocarra    =>  bocarr      quino            =>  quin
bocas      =>  boc         quinta           =>  quint
bode       =>  bod         quintal          =>  quintal
bodoque    =>  bodoqu      quintana         =>  quintan
body       =>  body        quintanilha      =>  quintanilh
boeing     =>  boeing      quintão          =>  quintã
boem       =>  boem        quintessência    =>  quintessent
boemia     =>  boem        quintino         =>  quintin
boêmio     =>  boêmi       quinto           =>  quint
boêmios    =>  boêmi       quintos          =>  quint
bogotá     =>  bogot       quintuplicou     =>  quintuplic
boi        =>  boi         quinze           =>  quinz
bóia       =>  bói         quinzena         =>  quinzen
boiando    =>  boi         quiosque         =>  quiosqu



 

The stemming algorithm



Letters in Portuguese include the following accented forms,
á   é   í   ó   ú   â   ê   ô   ç   ã   õ   ü   ñ
The following letters are vowels:
a   e   i   o   u   á   é   í   ó   ú   â   ê   ô
And the two nasalised vowel forms,
ã   õ
should be treated as a vowel followed by a consonant.

ã and õ are therefore replaced by a~ and o~ in the word, where ~ is a separate character to be treated as a consonant. The regions are then determined on the rewritten word.
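The substitution, and its reversal at the very end of the algorithm, can be sketched in Python as follows. This is only an illustration: the function names `prelude` and `postlude` are borrowed from the Snowball routines of the same name, and the NFC normalisation is an assumption added here so that ã and õ are single code points.

```python
import unicodedata


def prelude(word: str) -> str:
    """Rewrite each nasalised vowel as a plain vowel followed by '~'."""
    # Assumption: normalise to NFC so that a-tilde and o-tilde are single characters.
    word = unicodedata.normalize("NFC", word)
    return word.replace("ã", "a~").replace("õ", "o~")


def postlude(word: str) -> str:
    """Undo prelude(): turn a~ and o~ back into the nasalised vowels."""
    return word.replace("a~", "ã").replace("o~", "õ")
```

For instance, `prelude("quinhão")` gives `quinha~o`, in which `~` counts as a consonant sitting between `a` and `o`.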

R2 (see the note on R1 and R2) and RV have the same definition as in the Spanish stemmer.
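As a rough sketch of how the three regions can be located, the Python below follows the `mark_regions` routine in the Snowball listing further down this page. The names `mark_regions` and `_go_past` are chosen for this illustration; the function returns the start offsets of RV, R1 and R2, each defaulting to the end of the word when the region cannot be found.

```python
VOWELS = frozenset("aeiouáéíóúâêô")


def _go_past(word, start, want_vowel):
    """Offset just after the first letter at or after `start` whose vowel
    status equals `want_vowel`, or None if no such letter exists."""
    for i in range(start, len(word)):
        if (word[i] in VOWELS) == want_vowel:
            return i + 1
    return None


def mark_regions(word):
    """Return (pV, p1, p2), the start offsets of RV, R1 and R2."""
    n = len(word)
    pv = p1 = p2 = n  # defaults: empty regions at the end of the word

    if n >= 2:
        if word[1] not in VOWELS:
            # Second letter is a consonant: RV starts after the next vowel.
            found = _go_past(word, 2, True)
        elif word[0] in VOWELS:
            # First two letters are vowels: RV starts after the next consonant.
            found = _go_past(word, 2, False)
        else:
            # Consonant then vowel: RV starts after the third letter.
            found = 3 if n >= 3 else None
        if found is not None:
            pv = found

    # R1 starts after the first non-vowel that follows a vowel;
    # R2 is found the same way, starting from R1.
    after_vowel = _go_past(word, 0, True)
    if after_vowel is not None:
        start_r1 = _go_past(word, after_vowel, False)
        if start_r1 is not None:
            p1 = start_r1
            after_vowel = _go_past(word, p1, True)
            if after_vowel is not None:
                start_r2 = _go_past(word, after_vowel, False)
                if start_r2 is not None:
                    p2 = start_r2
    return pv, p1, p2
```

With this sketch, `boataria` has RV = `taria`, R1 = `aria` and R2 = `ia`.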

Always do step 1.

Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the action indicated.

eza   ezas   ico   ica   icos   icas   ismo   ismos   ável   ível   ista   istas   oso   osa   osos   osas   amento   amentos   imento   imentos   adora   ador   aça~o   adoras   adores   aço~es   ante   antes   ância
delete if in R2

logía   logías
replace with log if in R2

ución   uciones
replace with u if in R2

ência   ências
replace with ente if in R2

amente
delete if in R1
if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
if preceded by os, ic or ad, delete if in R2

mente
delete if in R2
if preceded by ante, avel or ível, delete if in R2

idade   idades
delete if in R2
if preceded by abil, ic or iv, delete if in R2

iva   ivo   ivas   ivos
delete if in R2
if preceded by at, delete if in R2

ira   iras
replace with ir if in RV and preceded by e

Do step 2 if no ending was removed by step 1.
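The rules of step 1 can be sketched in Python as below. This is an illustrative reading of the rules above, not the reference implementation. It assumes the word has already had ã and õ rewritten as a~ and o~, and that the caller supplies the start offsets of RV, R1 and R2 as `rv`, `r1` and `r2`. The names `step1`, `_longest` and `_drop_in` are hypothetical. The longest matching suffix is chosen first, and if its region test fails nothing is removed.

```python
DELETE_R2 = (
    "eza ezas ico ica icos icas ismo ismos ável ível ista istas "
    "oso osa osos osas amento amentos imento imentos adora ador "
    "aça~o adoras adores aço~es ante antes ância "
    "mente idade idades iva ivo ivas ivos"
).split()
REPLACE_R2 = {"logía": "log", "logías": "log", "ución": "u",
              "uciones": "u", "ência": "ente", "ências": "ente"}
# Second-level endings that are also deleted (if in R2) after the main suffix.
FOLLOW_UP = {"mente": ("ante", "avel", "ível"),
             "idade": ("abil", "ic", "iv"), "idades": ("abil", "ic", "iv"),
             "iva": ("at",), "ivo": ("at",), "ivas": ("at",), "ivos": ("at",)}
ALL_SUFFIXES = DELETE_R2 + list(REPLACE_R2) + ["amente", "ira", "iras"]


def _longest(word, suffixes):
    """Longest member of `suffixes` that ends `word`, or '' if none does."""
    best = ""
    for s in suffixes:
        if word.endswith(s) and len(s) > len(best):
            best = s
    return best


def _drop_in(word, suffixes, limit):
    """Remove the longest matching suffix if it starts at or after `limit`.
    Returns (word, removed_suffix); removed_suffix is '' if nothing changed."""
    s = _longest(word, suffixes)
    if s and len(word) - len(s) >= limit:
        return word[:-len(s)], s
    return word, ""


def step1(word, rv, r1, r2):
    """Return (new_word, removed) for the standard suffix removal step."""
    suffix = _longest(word, ALL_SUFFIXES)
    if not suffix:
        return word, False
    start = len(word) - len(suffix)
    stem = word[:start]

    if suffix in ("ira", "iras"):
        if start >= rv and stem.endswith("e"):
            return stem + "ir", True
        return word, False
    if suffix == "amente":
        if start < r1:
            return word, False
        stem, hit = _drop_in(stem, ("iv", "os", "ic", "ad"), r2)
        if hit == "iv":
            stem, _ = _drop_in(stem, ("at",), r2)
        return stem, True

    # Every remaining suffix must lie in R2.
    if start < r2:
        return word, False
    if suffix in REPLACE_R2:
        return stem + REPLACE_R2[suffix], True
    stem, _ = _drop_in(stem, FOLLOW_UP.get(suffix, ()), r2)
    return stem, True
```

For example, `step1("quintessência", 3, 4, 7)` yields `quintessente`, which step 5 later reduces to the stem `quintessent` shown in the sample vocabulary.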

Step 2: Verb suffixes
Search for the longest among the following suffixes in RV, and if found, delete.

ada   ida   ia   aria   eria   iria   ará   ara   erá   era   irá   ava   asse   esse   isse   aste   este   iste   ei   arei   erei   irei   am   iam   ariam   eriam   iriam   aram   eram   iram   avam   em   arem   erem   irem   assem   essem   issem   ado   ido   ando   endo   indo   ara~o   era~o   ira~o   ar   er   ir   as   adas   idas   ias   arias   erias   irias   arás   aras   erás   eras   irás   avas   es   ardes   erdes   irdes   ares   eres   ires   asses   esses   isses   astes   estes   istes   is   ais   eis   íeis   aríeis   eríeis   iríeis   áreis   areis   éreis   ereis   íreis   ireis   ásseis   ésseis   ísseis   áveis   ados   idos   ámos   amos   íamos   aríamos   eríamos   iríamos   áramos   éramos   íramos   ávamos   emos   aremos   eremos   iremos   ássemos   êssemos   íssemos   imos   armos   ermos   irmos   eu   iu   ou   ira   iras
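A minimal Python sketch of step 2 follows. It assumes the word is already in its a~/o~ form and that `rv` is the start offset of RV. The name `step2` and the `(word, removed)` return shape are choices made for this illustration. Only suffixes that lie wholly inside RV are considered, which mirrors the `setlimit tomark pV` construct in the Snowball listing on this page.

```python
VERB_SUFFIXES = (
    "ada ida ia aria eria iria ará ara erá era irá ava asse esse isse "
    "aste este iste ei arei erei irei am iam ariam eriam iriam aram eram "
    "iram avam em arem erem irem assem essem issem ado ido ando endo indo "
    "ara~o era~o ira~o ar er ir as adas idas ias arias erias irias arás "
    "aras erás eras irás avas es ardes erdes irdes ares eres ires asses "
    "esses isses astes estes istes is ais eis íeis aríeis eríeis iríeis "
    "áreis areis éreis ereis íreis ireis ásseis ésseis ísseis áveis ados "
    "idos ámos amos íamos aríamos eríamos iríamos áramos éramos íramos "
    "ávamos emos aremos eremos iremos ássemos êssemos íssemos imos armos "
    "ermos irmos eu iu ou ira iras"
).split()


def step2(word, rv):
    """Delete the longest verb suffix found inside RV.
    Returns (new_word, removed)."""
    region = word[rv:]  # only this part of the word may contain the suffix
    best = max((s for s in VERB_SUFFIXES if region.endswith(s)),
               key=len, default="")
    if best:
        return word[:-len(best)], True
    return word, False
```

With RV starting at offset 3, `step2("quicaram", 3)` removes `aram` rather than the shorter `am`, giving `quic`.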

If the last step to be obeyed (either step 1 or 2) altered the word, do step 3.

Step 3
Delete suffix i if in RV and preceded by c.

Alternatively, if neither step 1 nor step 2 altered the word, do step 4.
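The way steps 1 to 4 hand over to one another, together with step 3 itself, might be sketched like this in Python. The parameters `step1`, `step2` and `step4` are hypothetical callables supplied by the caller: the first two take a word and return `(new_word, altered)`, and the last takes a word and returns a word. `rv` is the start offset of RV.

```python
def step3(word, rv):
    """Delete a final i if it lies in RV and is preceded by c."""
    if word.endswith("ci") and len(word) - 1 >= rv:
        return word[:-1]
    return word


def run_steps_1_to_4(word, rv, step1, step2, step4):
    """Apply steps 1 to 4 in the order described in the text."""
    result, altered = step1(word)
    if not altered:
        # Step 2 is tried only when step 1 removed nothing.
        result, altered = step2(word)
    if altered:
        # Step 1 or step 2 changed the word: tidy up with step 3.
        return step3(result, rv)
    # Neither step changed the word: fall back to step 4.
    return step4(word)
```

Passing the steps in as callables keeps the sketch short; a real implementation would call its own step routines directly.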

Step 4: Residual suffix
If the word ends with one of the suffixes

os   a   i   o   á   í   ó

in RV, delete it
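Step 4 is small enough to sketch in a few lines of Python. The name `step4` is hypothetical and `rv` is assumed to be the start offset of RV.

```python
RESIDUAL_SUFFIXES = ("os", "a", "i", "o", "á", "í", "ó")


def step4(word, rv):
    """Delete the longest residual suffix if it lies in RV."""
    best = max((s for s in RESIDUAL_SUFFIXES if word.endswith(s)),
               key=len, default="")
    if best and len(word) - len(best) >= rv:
        return word[:-len(best)]
    return word
```

With RV starting at offset 3, `quilos` becomes `quil`, while `boi` is left alone because its final `i` lies before RV.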
Always do step 5

Step 5:
If the word ends with one of

e   é   ê

in RV, delete it, and if preceded by gu (or ci) with the u (or i) in RV, delete the u (or i).

Or if the word ends with ç, remove the cedilla.
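A Python sketch of step 5 is given below, under the same assumptions as before: `rv` is the start offset of RV and the name `step5` is hypothetical. After a final e, é or ê is removed, a trailing u of gu or i of ci is also removed when that letter is itself in RV.

```python
def step5(word, rv):
    """Remove a final e, é or ê in RV, or turn a final ç into c."""
    if word.endswith(("e", "é", "ê")):
        if len(word) - 1 >= rv:
            word = word[:-1]
            # gu -> g and ci -> c, provided the u or i is in RV.
            if word.endswith(("gu", "ci")) and len(word) - 1 >= rv:
                word = word[:-1]
    elif word.endswith("ç"):
        word = word[:-1] + "c"
    return word
```

For `bodoque` with RV at offset 3 only the `e` is removed, because the word then ends in `qu` rather than `gu`.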
And finally:
Turn a~, o~ back into ã, õ

 

The same algorithm in Snowball


routines (
           prelude postlude mark_regions
           RV R1 R2
           standard_suffix
           verb_suffix
           residual_suffix
           residual_form
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( v )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a'   hex 'E1'  // a-acute
stringdef a^   hex 'E2'  // a-circumflex e.g. 'bota^nico
stringdef e'   hex 'E9'  // e-acute
stringdef e^   hex 'EA'  // e-circumflex
stringdef i'   hex 'ED'  // i-acute
stringdef o^   hex 'F4'  // o-circumflex
stringdef o'   hex 'F3'  // o-acute
stringdef u'   hex 'FA'  // u-acute
stringdef c,   hex 'E7'  // c-cedilla
stringdef a~   hex 'E3'  // a-tilde
stringdef o~   hex 'F5'  // o-tilde

define v 'aeiou{a'}{e'}{i'}{o'}{u'}{a^}{e^}{o^}'

define prelude as repeat (
    [substring] among(
        '{a~}' (<- 'a~')
        '{o~}' (<- 'o~')
        ''     (next)
    ) //or next
)

define mark_regions as (

    $pV = limit
    $p1 = limit
    $p2 = limit  // defaults

    do (
        ( v (non-v gopast v) or (v gopast non-v) )
        or
        ( non-v (non-v gopast v) or (v next) )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (
    [substring] among(
        'a~' (<- '{a~}')
        'o~' (<- '{o~}')
        ''   (next)
    ) //or next
)

backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define standard_suffix as (
        [substring] among(

            'eza' 'ezas'
            'ico' 'ica' 'icos' 'icas'
            'ismo' 'ismos'
            '{a'}vel'
            '{i'}vel'
            'ista' 'istas'
            'oso' 'osa' 'osos' 'osas'
            'amento' 'amentos'
            'imento' 'imentos'

            'adora' 'ador' 'a{c,}a~o'
            'adoras' 'adores' 'a{c,}o~es'  // no -ic test
            'ante' 'antes' '{a^}ncia' // Note 1
            (
                R2 delete
            )
            'log{i'}a'
            'log{i'}as'
            (
                R2 <- 'log'
            )
            'uci{o'}n' 'uciones'
            (
                R2 <- 'u'
            )
            '{e^}ncia' '{e^}ncias'
            (
                R2 <- 'ente'
            )
            'amente'
            (
                R1 delete
                try (
                    [substring] R2 delete among(
                        'iv' (['at'] R2 delete)
                        'os'
                        'ic'
                        'ad'
                    )
                )
            )
            'mente'
            (
                R2 delete
                try (
                    [substring] among(
                        'ante' // Note 1
                        'avel'
                        '{i'}vel' (R2 delete)
                    )
                )
            )
            'idade'
            'idades'
            (
                R2 delete
                try (
                    [substring] among(
                        'abil'
                        'ic'
                        'iv'   (R2 delete)
                    )
                )
            )
            'iva' 'ivo'
            'ivas' 'ivos'
            (
                R2 delete
                try (
                    ['at'] R2 delete // but not a further ['ic'] R2 delete
                )
            )
            'ira' 'iras'
            (
                RV 'e'  // -eira -eiras usually non-verbal
                <- 'ir'
            )
        )
    )

    define verb_suffix as setlimit tomark pV for (
        [substring] among(
            'ada' 'ida' 'ia' 'aria' 'eria' 'iria' 'ar{a'}' 'ara' 'er{a'}'
            'era' 'ir{a'}' 'ava' 'asse' 'esse' 'isse' 'aste' 'este' 'iste'
            'ei' 'arei' 'erei' 'irei' 'am' 'iam' 'ariam' 'eriam' 'iriam'
            'aram' 'eram' 'iram' 'avam' 'em' 'arem' 'erem' 'irem' 'assem'
            'essem' 'issem' 'ado' 'ido' 'ando' 'endo' 'indo' 'ara~o'
            'era~o' 'ira~o' 'ar' 'er' 'ir' 'as' 'adas' 'idas' 'ias'
            'arias' 'erias' 'irias' 'ar{a'}s' 'aras' 'er{a'}s' 'eras'
            'ir{a'}s' 'avas' 'es' 'ardes' 'erdes' 'irdes' 'ares' 'eres'
            'ires' 'asses' 'esses' 'isses' 'astes' 'estes' 'istes' 'is'
            'ais' 'eis' '{i'}eis' 'ar{i'}eis' 'er{i'}eis' 'ir{i'}eis'
            '{a'}reis' 'areis' '{e'}reis' 'ereis' '{i'}reis' 'ireis'
            '{a'}sseis' '{e'}sseis' '{i'}sseis' '{a'}veis' 'ados' 'idos'
            '{a'}mos' 'amos' '{i'}amos' 'ar{i'}amos' 'er{i'}amos'
            'ir{i'}amos' '{a'}ramos' '{e'}ramos' '{i'}ramos' '{a'}vamos'
            'emos' 'aremos' 'eremos' 'iremos' '{a'}ssemos' '{e^}ssemos'
            '{i'}ssemos' 'imos' 'armos' 'ermos' 'irmos' 'eu' 'iu' 'ou'
            'ira' 'iras'
                (delete)
        )
    )

    define residual_suffix as (
        [substring] among(
            'os'
            'a' 'i' 'o' '{a'}' '{i'}' '{o'}'
                ( RV delete )
        )
    )

    define residual_form as (
        [substring] among(
            'e' '{e'}' '{e^}'
                ( RV delete [('u'] test 'g') or
                             ('i'] test 'c') RV delete )
            '{c,}' (<-'c')
        )
    )
)

define stem as (
    do prelude
    do mark_regions
    backwards (
        do (
            ( ( standard_suffix or verb_suffix )
              and do ( ['i'] test 'c' RV delete )
            ) or residual_suffix
        )
        do residual_form
    )
    do postlude
)

/*
    Note 1: additions of 15 Jun 2005
*/