[Snowball-discuss] Re: Spanish word stemmer

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Tue Jun 14 2005 - 17:07:41 BST


Felipe,

Thank you for the suffix list. It has proved interesting to work through it. The
general answer to your question is that the non-inclusion of these suffixes
(apart from ante/antes -- see below) is intentional, and also is best for the
algorithm.

You must remember that if a word ends with X, and X is a suffix in the language,

a) X may significantly alter the meaning of a word. In this case it should not,
in an IR context, be removed.

b) X may not be a true suffix, but merely form the end of the stem.

c) X may be rare in the language, and therefore hardly worth removing.

d) X may be removable, but not worth removing because it leads to no further
conflations.
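
Point (d) can be checked mechanically. The Python sketch below is only an illustration: the vocabulary is made up, and toy_stem is a hypothetical stand-in for a real stemmer, not the Snowball Spanish algorithm. It lists the stem classes that would merge if a candidate suffix were additionally stripped from the current stems.

```python
def toy_stem(word):
    """Hypothetical stand-in stemmer: drop a final 's', then a final a/e/o."""
    if word.endswith("s"):
        word = word[:-1]
    if word and word[-1] in "aeo":
        word = word[:-1]
    return word


def conflations_gained(vocabulary, stem, suffix):
    """Return pairs of word groups that would merge if `suffix` were
    also removed from the stems produced by `stem`."""
    # Group the vocabulary by its current stem.
    classes = {}
    for word in vocabulary:
        classes.setdefault(stem(word), []).append(word)

    gained = []
    for current, words in classes.items():
        if not current.endswith(suffix):
            continue
        shorter = current[: -len(suffix)]
        # A conflation is gained only if the shorter stem already exists.
        if shorter in classes:
            gained.append((words, classes[shorter]))
    return gained


# Made-up sample vocabulary, for illustration only.
VOCAB = ["alarmante", "alarma", "abundante", "abunda", "camuflaje"]

print(conflations_gained(VOCAB, toy_stem, "ant"))
print(conflations_gained(VOCAB, toy_stem, "aj"))
```

On this toy data, stripping "ant" merges two pairs of classes, while stripping "aj" merges none, which is the situation described in (d). A real evaluation would need a representative vocabulary and the actual stemmer.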

Most of the suffixes you instance exhibit one or more of these features. Here is
your list:

SUFFIX EXAMPLES STEMMER RIGHT

orio/a/os/as <--- changes meaning too much
 absolutorio absolutori absolut
 accesorio accesori acces
 consultorio consultori consult

atorio/ia/ios/ias <--- changes meaning too much
 adoratorio adoratori ador
 eliminatorias eliminatori elimin
 acusatorias acusatori acus
 amatorias amatori amat
 aclaratorio aclaratori aclar

ante/es <================= done
 agonizante agonizant agoniz
 alarmante alarmant alarm
 abundante abundant abund
 caminante caminant camin
 emigrante emigrant emigr
 participante participant particip

io/ia/ios/ias
 agravio agravi agrav
 alergia alergi alerg
 agraria agrari agrar
 academia academi academ

or/ora/ores/oras
 agresor agresor agres

ion <--- removal rarely results in useful conflation
 agresión agresion agres
 admisión admision admis
 adopción adopcion adopc
 afección afeccion afecc

ito/ita/itos/itas <--- diminutive; often not an ending; rare
 ahorita ahorit ahor
 abuelita abuelit abuel

esa/esas <--- feminine; often not an ending; rare
 alcaldesa alcaldes alcald

ador/edor/idor <--- alters meaning too much:
                                    abrir open; abridor can-, bottle-opener
                                    conocer know; conocedor expert
 nadador nadador nad
 corredor corredor corr
 abridor abridor abr
 ganador ganador gan
 rompedor rompedor romp
 seguidor seguidor segu

ia see ia above
 alemania alemani aleman
 italia itali ital
 francia franci franc

icio <--- rare: alimentar feed; alimenticio nourishing
 alimenticio alimentici aliment

al <--- rarer than English; many exceptions
                                    cardenal: cardinal (Math), cardeno purple
 ambiental ambiental ambient
 opcional opcional opcion
 monumental monumental monument
 doctoral doctoral doctor
 arbitral arbitral arbitr
 semanal semanal seman
 accidental accidental accident

ote/ota <--- augmentative
 amigote amigot amig
 grandote grandot grand
 palabrota palabrot palabr

ete/etes <--- ?
 abogadete abogadet abog

illo/a/os/as
 abogadillo abogadill abog

ato/atos
 anonimato anonimat anonim
 asesinato asesinat asesin
 alegato alegat aleg

aje/ajes
 arbitraje arbitraj arbitr
 aterrizaje aterrizaj aterriz
 camuflaje camuflaj camufl
 doblaje doblaj dobl

edad/edades <--- stem too short for this case
 brevedad breved brev
 enfermedad enfermed enferm
 gravedad graved grav
 salvedad salved salv

ísimo/ísimos <-- like ital. issimo
 buenísimo buenisim buen
 malísimo malisim mal
 rarísimo rarisim rar

ez/eces
 estupidez estupidez estupid
 sencillez sencillez sencill
 acidez acidez acid
 robustez robustez robust

izar
 actualizar actualiz actual
 mecanizar mecaniz mecan
 colonizar coloniz colon
 agilizar agiliz agil
 civilizar civiliz civil

So -orio on the whole changes meaning too much (acceso = access and accesorio =
accessory differ as much in Spanish as in English); -atorio similarly (aclarar =
to rinse, clear (in a very general sense), brighten up; aclaratorio = explanatory).

Diminutives and augmentatives usually fall under (a) and (c). -illo, -ote and
-ísimo are in this category.

-al and -iz look like plausible candidates for ending removal, but, unlike their
English counterparts, removing them makes little difference or improvement.
Similarly with -ion removal after -s.

There is a difficulty with pure vowel endings, and the stemmer can't always get
this right. So in English 'academic' is stemmed to 'academ' but 'academy' does
not lose the final -y (or -i). This explains the residual vowels with -io, -ia
endings etc.

Your -edad endings are not removed when the stem is this short: the shorter the
stem the more chance there is of a suffix strongly altering word meaning (see
the original Porter stemmer discussion).
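
One way to make "stem too short" concrete is the R1/R2 region test used throughout the Snowball stemmers: R1 is the part of the word after the first non-vowel that follows a vowel, and R2 is the same construction applied inside R1. The Python sketch below is an illustration under stated assumptions, not the stemmer's own code: the function names are hypothetical, and the rule "strip -edad only when it lies wholly in R2" is an assumed example of a length condition, not a rule taken from the Spanish stemmer.

```python
VOWELS = set("aeiouáéíóúü")


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the R1 and R2 regions of `word` as strings."""
    r1 = region_start(word, 0)
    r2 = region_start(word, r1)
    return word[r1:], word[r2:]


def suffix_in_r2(word, suffix):
    """Assumed example rule: the suffix counts as removable only if
    the whole suffix lies inside R2."""
    r2 = region_start(word, region_start(word, 0))
    return word.endswith(suffix) and len(word) - len(suffix) >= r2


for w in ["brevedad", "gravedad", "salvedad"]:
    print(w, r1_r2(w), suffix_in_r2(w, "edad"))
```

For brevedad, R1 is "edad" and R2 is "ad", so -edad does not fit inside R2 and would be left alone under this assumed rule; gravedad and salvedad behave the same way. The sketch only illustrates the idea of a stem-length condition and does not claim to reproduce the stemmer's actual output for every word in the table.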

But you spotted ante/antes, which is useful and which I have added in (new
release soon). I can see historically how this came to be omitted, but I won't
bore you with the details.

In the case of attached pronouns, I only included the commoner forms. (For
example, '-noslo' appeared nowhere in our sample data.)

I did not understand your question about 'Among'. Is this in the generated Java
code?

I hope this answers your various questions.

Martin


