Italian stemming algorithm


 

Links to resources

Snowball main page
The stemmer in Snowball
The ANSI C stemmer
— and its header
Sample Italian vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent in two columns
Tar-gzipped file of all of the above

Italian stop word list
The stemmer in Snowball — MS DOS Latin I encodings
Romance language stemmers


Here is a sample of Italian vocabulary, with the stemmed forms that this algorithm generates.

word                stem        word                stem
abbandonata     =>  abbandon    pronto          =>  pront
abbandonate     =>  abbandon    pronuncerà      =>  pronunc
abbandonati     =>  abbandon    pronuncia       =>  pronunc
abbandonato     =>  abbandon    pronunciamento  =>  pronunc
abbandonava     =>  abbandon    pronunciare     =>  pronunc
abbandonerà     =>  abbandon    pronunciarsi    =>  pronunc
abbandoneranno  =>  abbandon    pronunciata     =>  pronunc
abbandonerò     =>  abbandon    pronunciate     =>  pronunc
abbandono       =>  abband      pronunciato     =>  pronunc
abbandonò       =>  abbandon    pronunzia       =>  pronunz
abbaruffato     =>  abbaruff    pronunziano     =>  pronunz
abbassamento    =>  abbass      pronunziare     =>  pronunz
abbassando      =>  abbass      pronunziarle    =>  pronunz
abbassandola    =>  abbass      pronunziato     =>  pronunz
abbassandole    =>  abbass      pronunzio       =>  pronunz
abbassar        =>  abbass      pronunziò       =>  pronunz
abbassare       =>  abbass      propaga         =>  propag
abbassarono     =>  abbass      propagamento    =>  propag
abbassarsi      =>  abbass      propaganda      =>  propagand
abbassassero    =>  abbass      propagare       =>  propag
abbassato       =>  abbass      propagarla      =>  propag
abbassava       =>  abbass      propagarsi      =>  propag
abbassi         =>  abbass      propagasse      =>  propag
abbassò         =>  abbass      propagata       =>  propag
abbastanza      =>  abbast      propagazione    =>  propag
abbatté         =>  abbatt      propaghino      =>  propaghin
abbattendo      =>  abbatt      propalate       =>  propal
abbattere       =>  abbatt      propende        =>  prop
abbattersi      =>  abbatt      propensi        =>  propens
abbattesse      =>  abbattess   propensione     =>  propension
abbatteva       =>  abbatt      propini         =>  propin
abbattevamo     =>  abbatt      propio          =>  prop
abbattevano     =>  abbatt      propizio        =>  propiz
abbattimento    =>  abbatt      propone         =>  propon
abbattuta       =>  abbatt      proponendo      =>  propon
abbattuti       =>  abbatt      proponendosi    =>  propon
abbattuto       =>  abbatt      proponenti      =>  proponent
abbellita       =>  abbell      proponeva       =>  propon
abbenché        =>  abbenc      proponevano     =>  propon
abbi            =>  abbi        proponga        =>  propong



 

The stemming algorithm

Italian can include the following accented forms:
á   é   í   ó   ú   à   è   ì   ò   ù
First, replace all acute accents with grave accents. Then, as in French, put u after q, and u or i between vowels, into upper case. (See note on vowel marking.) The vowels are then
a   e   i   o   u   à   è   ì   ò   ù
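The preparation described above can be sketched in Python as follows. This is an illustrative sketch rather than the reference implementation: the names `VOWELS`, `ACUTE_TO_GRAVE` and `prelude` are hypothetical, and the input is assumed to be lower case already.

```python
VOWELS = "aeiouàèìòù"
# Map each acute-accented vowel to its grave-accented counterpart.
ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")


def prelude(word: str) -> str:
    """Normalise accents and mark non-vowel u/i by upper-casing them.

    Assumes `word` is already lower case.
    """
    word = word.translate(ACUTE_TO_GRAVE)
    word = word.replace("qu", "qU")          # u after q
    chars = list(word)
    for i in range(1, len(chars) - 1):
        # u or i between two vowels; an already marked U/I is not in VOWELS,
        # so it does not count as a neighbouring vowel.
        if chars[i] in "ui" and chars[i - 1] in VOWELS and chars[i + 1] in VOWELS:
            chars[i] = chars[i].upper()
    return "".join(chars)
```

For example, `prelude("gioia")` marks the second i because it sits between o and a, while the first i follows a consonant and is left alone.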
R2 (see the note on R1 and R2) and RV have the same definition as in the Spanish stemmer.
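As a rough illustration of the three regions, the sketch below returns the start index of each one. It assumes the usual Snowball definitions: R1 starts after the first non-vowel that follows a vowel, R2 is the same rule applied inside R1, and RV follows the Spanish rule (after the next vowel if the second letter is a consonant, after the next consonant if the word starts with two vowels, otherwise after the third letter). A region that cannot be found starts at the end of the word. All function names here are hypothetical.

```python
VOWELS = "aeiouàèìòù"


def _after_first(word, want_vowel, start):
    """Index just past the first vowel (or non-vowel) at or after `start`."""
    for i in range(start, len(word)):
        if (word[i] in VOWELS) == want_vowel:
            return i + 1
    return len(word)


def rv_start(word):
    if len(word) < 2:
        return len(word)
    if word[1] not in VOWELS:              # second letter is a consonant
        return _after_first(word, True, 2)
    if word[0] in VOWELS:                  # word starts with two vowels
        return _after_first(word, False, 2)
    return min(3, len(word))               # consonant-vowel: after third letter


def r1_start(word, start=0):
    """Index just past the first non-vowel following a vowel, from `start`."""
    for i in range(start + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def regions(word):
    """Return the start indices (rv, r1, r2) for an already prepared word."""
    r1 = r1_start(word)
    r2 = r1_start(word, r1)
    return rv_start(word), r1, r2
```

A suffix is "in" a region when the index at which the suffix begins is at or after that region's start index.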

Always do steps 0 and 1.

Step 0: Attached pronoun
Search for the longest among the following suffixes

ci   gli   la   le   li   lo   mi   ne   si   ti   vi   sene   gliela   gliele   glieli   glielo   gliene   mela   mele   meli   melo   mene   tela   tele   teli   telo   tene   cela   cele   celi   celo   cene   vela   vele   veli   velo   vene

following one of

(a) ando   endo
(b) ar   er   ir

in RV. In case (a) the suffix is deleted; in case (b) it is replaced by e (guardandogli -> guardando, accomodarci -> accomodare)
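A minimal sketch of step 0, assuming `rv` is the start index of RV computed elsewhere. The names `PRONOUNS` and `attached_pronoun` are hypothetical. Only the longest matching pronoun is considered; if what precedes it does not qualify, the word is left unchanged.

```python
PRONOUNS = sorted(
    ("ci gli la le li lo mi ne si ti vi sene gliela gliele glieli glielo "
     "gliene mela mele meli melo mene tela tele teli telo tene cela cele "
     "celi celo cene vela vele veli velo vene").split(),
    key=len,
    reverse=True,                      # longest candidates first
)


def attached_pronoun(word, rv):
    """Step 0: strip an attached pronoun after a gerund or infinitive ending."""
    for pronoun in PRONOUNS:
        if not word.endswith(pronoun):
            continue
        base = word[: -len(pronoun)]
        for ending in ("ando", "endo"):          # case (a): delete the pronoun
            if base.endswith(ending) and len(base) - len(ending) >= rv:
                return base
        for ending in ("ar", "er", "ir"):        # case (b): replace it by e
            if base.endswith(ending) and len(base) - len(ending) >= rv:
                return base + "e"
        return word                              # longest match did not qualify
    return word
```

With the RV start indices 3 and 4 for the two words in the example above, `attached_pronoun("guardandogli", 3)` and `attached_pronoun("accomodarci", 4)` give the forms shown.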
Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the action indicated.

anza   anze   ico   ici   ica   ice   iche   ichi   ismo   ismi   abile   abili   ibile   ibili   ista   iste   isti   istà   istè   istì   oso   osi   osa   ose   mente   atrice   atrici   ante   anti
delete if in R2

azione   azioni   atore   atori
delete if in R2
if preceded by ic, delete if in R2

logia   logie
replace with log if in R2

uzione   uzioni   usione   usioni
replace with u if in R2

enza   enze
replace with ente if in R2

amento   amenti   imento   imenti
delete if in RV

amente
delete if in R1
if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
if preceded by os, ic or abil, delete if in R2

ità
delete if in R2
if preceded by abil, ic or iv, delete if in R2

ivo   ivi   iva   ive
delete if in R2
if preceded by at, delete if in R2 (and if further preceded by ic, delete if in R2)
Do step 2 if no ending was removed by step 1.
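The step 1 rules can be sketched as a lookup from suffix to rule group. This is a sketch under stated assumptions, not the reference implementation: the group names in `GROUPS` are hypothetical labels, and `rv`, `r1` and `r2` are assumed to be region start indices computed elsewhere. The function returns the new word together with a flag saying whether an ending was removed, which is what decides whether step 2 runs.

```python
GROUPS = {  # hypothetical group labels -> suffixes sharing one rule
    "plain": "anza anze ico ici ica ice iche ichi ismo ismi abile abili ibile "
             "ibili ista iste isti istà istè istì oso osi osa ose mente atrice "
             "atrici ante anti",
    "azione": "azione azioni atore atori",
    "logia": "logia logie",
    "uzione": "uzione uzioni usione usioni",
    "enza": "enza enze",
    "amento": "amento amenti imento imenti",
    "amente": "amente",
    "ità": "ità",
    "ivo": "ivo ivi iva ive",
}
KIND = {s: kind for kind, group in GROUPS.items() for s in group.split()}
SUFFIXES = sorted(KIND, key=len, reverse=True)
REPLACEMENT = {"logia": "log", "uzione": "u", "enza": "ente"}


def strip_if(word, candidates, start):
    """Drop the longest candidate ending `word` if it begins at or after `start`."""
    for s in sorted(candidates, key=len, reverse=True):
        if word.endswith(s):
            return word[: -len(s)] if len(word) - len(s) >= start else word
    return word


def standard_suffix(word, rv, r1, r2):
    """Step 1: return (new_word, removed)."""
    suffix = next((s for s in SUFFIXES if word.endswith(s)), None)
    if suffix is None:
        return word, False
    kind = KIND[suffix]
    stem = word[: -len(suffix)]
    region = {"amento": rv, "amente": r1}.get(kind, r2)
    if len(stem) < region:                 # suffix is not in its region
        return word, False
    if kind in REPLACEMENT:
        return stem + REPLACEMENT[kind], True
    if kind == "azione":
        stem = strip_if(stem, ["ic"], r2)
    elif kind == "amente":
        shorter = strip_if(stem, ["iv", "os", "ic", "abil"], r2)
        if shorter != stem and stem.endswith("iv"):
            shorter = strip_if(shorter, ["at"], r2)
        stem = shorter
    elif kind == "ità":
        stem = strip_if(stem, ["abil", "ic", "iv"], r2)
    elif kind == "ivo":
        shorter = strip_if(stem, ["at"], r2)
        if shorter != stem:
            shorter = strip_if(shorter, ["ic"], r2)
        stem = shorter
    return stem, True
```

Because only the longest matching suffix is tried, a word ending in amente is handled by the amente rule (R1) and never by the shorter mente entry.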

Step 2: Verb suffixes
Search for the longest among the following suffixes in RV, and if found, delete.

ammo   ando   ano   are   arono   asse   assero   assi   assimo   ata   ate   ati   ato   ava   avamo   avano   avate   avi   avo   emmo   enda   ende   endi   endo   erà   erai   eranno   ere   erebbe   erebbero   erei   eremmo   eremo   ereste   eresti   erete   erò   erono   essero   ete   eva   evamo   evano   evate   evi   evo   Yamo   iamo   immo   irà   irai   iranno   ire   irebbe   irebbero   irei   iremmo   iremo   ireste   iresti   irete   irò   irono   isca   iscano   isce   isci   isco   iscono   issero   ita   ite   iti   ito   iva   ivamo   ivano   ivate   ivi   ivo   ono   uta   ute   uti   uto   ar   ir
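Step 2 differs from step 1 in that the search itself is confined to RV, so a shorter suffix can still match when a longer one would reach outside the region. A minimal sketch, assuming `rv` is the RV start index computed elsewhere (the names `VERB_SUFFIXES` and `verb_suffix` are hypothetical):

```python
VERB_SUFFIXES = sorted(
    ("ammo ando ano are arono asse assero assi assimo ata ate ati ato ava "
     "avamo avano avate avi avo emmo enda ende endi endo erà erai eranno ere "
     "erebbe erebbero erei eremmo eremo ereste eresti erete erò erono essero "
     "ete eva evamo evano evate evi evo Yamo iamo immo irà irai iranno ire "
     "irebbe irebbero irei iremmo iremo ireste iresti irete irò irono isca "
     "iscano isce isci isco iscono issero ita ite iti ito iva ivamo ivano "
     "ivate ivi ivo ono uta ute uti uto ar ir").split(),
    key=len,
    reverse=True,                      # longest candidates first
)
# 'Yamo' is copied from the list above as-is; it cannot match a word that
# contains no upper-case Y.


def verb_suffix(word, rv):
    """Step 2: delete the longest verb suffix lying entirely inside RV."""
    for suffix in VERB_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= rv:
            return word[: -len(suffix)]
    return word
```

With an RV start of 4, `verb_suffix("abbandono", 4)` removes ono, which accounts for the stem abband in the sample vocabulary.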

Always do steps 3a and 3b.
Step 3a
Delete a final a, e, i, o, à, è, ì or ò if it is in RV, and a preceding i if it is in RV (crocchi -> crocch, crocchio -> crocch)
Step 3b
Replace final ch (or gh) with c (or g) if in RV (crocch -> crocc)
Finally,
turn I and U back into lower case
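Steps 3a and 3b and the final clean-up can be sketched as below, again assuming `rv` is the RV start index computed elsewhere. The names `vowel_suffix` and `postlude` are borrowed from the Snowball routines, but this Python rendering is only an illustration. In step 3b the sketch requires the whole ch or gh pair to lie in RV.

```python
def vowel_suffix(word, rv):
    """Steps 3a and 3b."""
    # Step 3a: delete a final vowel in RV, then a preceding i in RV.
    if word and word[-1] in "aeioàèìò" and len(word) - 1 >= rv:
        word = word[:-1]
        if word.endswith("i") and len(word) - 1 >= rv:
            word = word[:-1]
    # Step 3b: ch -> c, gh -> g when the pair lies in RV.
    if word.endswith(("ch", "gh")) and len(word) - 2 >= rv:
        word = word[:-1]
    return word


def postlude(word):
    """Turn the marker letters I and U back into lower case."""
    return word.replace("I", "i").replace("U", "u")
```

Applied to the examples above with an RV start of 3, both crocchi and crocchio come out as crocc.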

 

The same algorithm in Snowball


routines (
           prelude postlude mark_regions
           RV R1 R2
           attached_pronoun
           standard_suffix
           verb_suffix
           vowel_suffix
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( v AEIO CG )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a'   hex 'E1'
stringdef a`   hex 'E0'
stringdef e'   hex 'E9'
stringdef e`   hex 'E8'
stringdef i'   hex 'ED'
stringdef i`   hex 'EC'
stringdef o'   hex 'F3'
stringdef o`   hex 'F2'
stringdef u'   hex 'FA'
stringdef u`   hex 'F9'

define v 'aeiou{a`}{e`}{i`}{o`}{u`}'

define prelude as (
    test repeat (
        [substring] among(
            '{a'}' (<- '{a`}')
            '{e'}' (<- '{e`}')
            '{i'}' (<- '{i`}')
            '{o'}' (<- '{o`}')
            '{u'}' (<- '{u`}')
            'qu'   (<- 'qU')
            ''     (next)
        )
    )
    repeat goto (
        v [ ('u' ] v <- 'U') or
            ('i' ] v <- 'I')
    )
)

define mark_regions as (

    $pV = limit
    $p1 = limit
    $p2 = limit // defaults

    do (
        ( v (non-v gopast v) or (v gopast non-v) )
        or
        ( non-v (non-v gopast v) or (v next) )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (

    [substring] among(
        'I'  (<- 'i')
        'U'  (<- 'u')
        ''   (next)
    )

)

backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define attached_pronoun as (
        [substring] among(
            'ci' 'gli' 'la' 'le' 'li' 'lo'
            'mi' 'ne' 'si' 'ti' 'vi'
            // the compound forms are:
            'sene' 'gliela' 'gliele' 'glieli' 'glielo' 'gliene'
            'mela' 'mele' 'meli' 'melo' 'mene'
            'tela' 'tele' 'teli' 'telo' 'tene'
            'cela' 'cele' 'celi' 'celo' 'cene'
            'vela' 'vele' 'veli' 'velo' 'vene'
        )
        among( (RV)
            'ando' 'endo'   (delete)
            'ar' 'er' 'ir'  (<- 'e')
        )
    )

    define standard_suffix as (
        [substring] among(

            'anza' 'anze' 'ico' 'ici' 'ica' 'ice' 'iche' 'ichi' 'ismo'
            'ismi' 'abile' 'abili' 'ibile' 'ibili' 'ista' 'iste' 'isti'
            'ist{a`}' 'ist{e`}' 'ist{i`}' 'oso' 'osi' 'osa' 'ose' 'mente'
            'atrice' 'atrici'
            'ante' 'anti' // Note 1
               ( R2 delete )
            'azione' 'azioni' 'atore' 'atori'
               ( R2 delete
                 try ( ['ic'] R2 delete )
               )
            'logia' 'logie'
               ( R2 <- 'log' )
            'uzione' 'uzioni' 'usione' 'usioni'
               ( R2 <- 'u' )
            'enza' 'enze'
               ( R2 <- 'ente' )
            'amento' 'amenti' 'imento' 'imenti'
               ( RV delete )
            'amente' (
                R1 delete
                try (
                    [substring] R2 delete among(
                        'iv' ( ['at'] R2 delete )
                        'os' 'ic' 'abil'
                    )
                )
            )
            'it{a`}' (
                R2 delete
                try (
                    [substring] among(
                        'abil' 'ic' 'iv' (R2 delete)
                    )
                )
            )
            'ivo' 'ivi' 'iva' 'ive' (
                R2 delete
                try ( ['at'] R2 delete ['ic'] R2 delete )
            )
        )
    )

    define verb_suffix as setlimit tomark pV for (
        [substring] among(
            'ammo' 'ando' 'ano' 'are' 'arono' 'asse' 'assero' 'assi'
            'assimo' 'ata' 'ate' 'ati' 'ato' 'ava' 'avamo' 'avano'
            'avate' 'avi' 'avo' 'emmo' 'enda' 'ende' 'endi' 'endo'
            'er{a`}' 'erai' 'eranno' 'ere' 'erebbe' 'erebbero' 'erei'
            'eremmo' 'eremo' 'ereste' 'eresti' 'erete' 'er{o`}' 'erono'
            'essero' 'ete' 'eva' 'evamo' 'evano' 'evate' 'evi' 'evo'
            'Yamo' 'iamo' 'immo' 'ir{a`}' 'irai' 'iranno' 'ire' 'irebbe'
            'irebbero' 'irei' 'iremmo' 'iremo' 'ireste' 'iresti' 'irete'
            'ir{o`}' 'irono' 'isca' 'iscano' 'isce' 'isci' 'isco'
            'iscono' 'issero' 'ita' 'ite' 'iti' 'ito' 'iva' 'ivamo'
            'ivano' 'ivate' 'ivi' 'ivo' 'ono' 'uta' 'ute' 'uti' 'uto'

            'ar' 'ir' // but 'er' is problematical
                (delete)
        )
    )

    define AEIO 'aeio{a`}{e`}{i`}{o`}'
    define CG 'cg'

    define vowel_suffix as (
        try (
            [AEIO] RV delete
            ['i'] RV delete
        )
        try (
            ['h'] CG RV delete
        )
    )
)

define stem as (
    do prelude
    do mark_regions
    backwards (
        do attached_pronoun
        do (standard_suffix or verb_suffix)
        do vowel_suffix
    )
    do postlude
)

/*
    Note 1: additions of 15 Jun 2005
*/