[Snowball-discuss] Re: R: italian stemming

From: Martin Porter (martin_porter@softhome.net)
Date: Thu Sep 05 2002 - 09:04:01 BST


At 01:51 PM 9/3/02 +0200, enea wrote:
>I encountered another problem with your stemmer.
>israele is stemmed to israel
>israeliano is stemmed to israel (ano is a verb suffix)
>israeliani, israeliane and israeliana are stemmed to israelian
>I wonder if I should remove ano from the verb suffixes and add ano, ani, ana
>and ane as standard suffixes since this problem is frequent (eg.
>italiano, italiani; partigiano, partigiani; gabbiano, gabbiani; indiano,
>indiani; isolano, isolani; romano, romani...). What do you think?
>Regards,
>
>Enea
>

Enea,

Yes, -ano is problematical. It is interesting that the corresponding French
ending -ent (3rd person plural present indicative) is also problematical,
and is in fact not removed in the French stemmer. Of course -ent is slightly
broader than -ano, since it is the ending for all three classes of verb
conjugation.

One possibility is not to remove -ano at all. You might look at that.

Your idea of removing -ana, -ani, -ane is of a kind that often crops up in
stemmer design, and is quite sensible. You can think of these adjectival
endings as

    -ano + -a
    -ano + -i
    -ano + -e

Finding -a, -i, -e here implies a noun or adjective form. That knowledge is
then discarded, and what remains is removed as if it were the verb ending
-ano, so that these words match -ano words whose ending has been removed as a
verb ending when in fact it is part of the stem. More generally, in an ending
-A + -B, B may tell us that A is not a true ending, but we choose to discard
that information. In fact the Porter stemmer is a bit like that - no state
information is preserved following ending removal in the different steps.
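
To make the point concrete, here is a minimal Python sketch. It is a toy and
not the Snowball Italian stemmer: the names toy_stem, strip_final_vowel and
EXTRA, the length guard, and the single final-vowel rule are all assumptions
made for illustration. It reproduces only the behaviour quoted at the top of
this message, and shows what changes when -ana, -ani, -ane are treated like
-ano.

```python
# Toy illustration only: these rules are assumptions, NOT the Snowball
# Italian stemmer. They mimic just the behaviour reported in this thread.

EXTRA = ("ana", "ani", "ane")  # the experimental endings under discussion


def strip_final_vowel(word):
    """Remove one final -a, -e, -i or -o (crude stand-in for gender/number)."""
    return word[:-1] if word.endswith(("a", "e", "i", "o")) else word


def toy_stem(word, extra_suffixes=()):
    """Strip -ano (plus any extra suffixes), then one final vowel.

    Nothing records *why* a suffix matched, so a noun or adjective ending
    is treated exactly like the verb ending -ano.
    """
    for suffix in ("ano",) + tuple(extra_suffixes):
        # Hypothetical guard: leave at least three letters of stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return strip_final_vowel(word[: -len(suffix)])
    return strip_final_vowel(word)


for w in ("israele", "israeliano", "israeliani", "israeliana"):
    print(w, toy_stem(w), toy_stem(w, EXTRA))
```

With only -ano removed, israeliano goes to israel while israeliani and
israeliana stop at israelian. With the extra endings all four forms meet at
israel, which is the conflation Enea is after.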

Whether -ana, -ani, -ane might usefully be added to the stemmer as endings I
cannot say; you will need to experiment (create a file like
italian/diffs.txt with the consequences of the stemmer change visible in a
third column, and inspect it). I may have tried this myself at one time, but
cannot remember. The danger is of course overstemming: -ana/-ane endings
will be removed from feminine nouns that have no corresponding -ano form.
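
One way to set up such an experiment is sketched below in Python. Everything
here is hypothetical: make_toy_stemmer and diff_rows are illustrative names,
the toy rules stand in for the real old and new stemmers, and the word list is
a hand-picked sample, not the Snowball vocabulary. The idea is only to list
each word whose stem changes, with the old stem in the second column and the
new stem in the third.

```python
# Sketch of a three-column comparison. The stemmers are toys (assumptions),
# NOT the Snowball Italian stemmer; swap in the real old and new stemmers.

def make_toy_stemmer(suffixes):
    """Build a toy stemmer that strips one listed suffix, then a final vowel."""
    def stem(word):
        for suffix in suffixes:
            # Hypothetical guard: leave at least three letters of stem.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        return word[:-1] if word.endswith(("a", "e", "i", "o")) else word
    return stem


def diff_rows(words, old, new):
    """Return (word, old stem, new stem) for every word whose stem changes."""
    return [(w, old(w), new(w)) for w in words if old(w) != new(w)]


old_stem = make_toy_stemmer(("ano",))
new_stem = make_toy_stemmer(("ano", "ana", "ani", "ane"))

sample = ["israeliano", "israeliani", "italiana", "settimana", "settimo"]
for row in diff_rows(sample, old_stem, new_stem):
    print("\t".join(row))
```

In this toy run settimana (week) changes from settiman to settim, which is
also the stem of settimo (seventh). That is the kind of overstemming to look
for when inspecting the third column.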

My preferred advice, however, is to put the stemmer into service in its
present form and see what reactions you get.

Martin

 


