Re: [Snowball-discuss] Can snowball be run backwards to generate words?

From: Martin Porter (martin_porter@softhome.net)
Date: Sat Dec 22 2001 - 21:56:28 GMT


You can turn the Porter stemmer inside out, and generate all endings that
the stemmer will recognise, but there are several problems. One is that the
endings go in circles, e.g.

   ize + ation as in realization
   ation + al as in operational
   al + ize as in normalize

- suggesting infinite endings izationalizational... You can break the loop
by noting that four is the upper limit on the number of derivational
suffixes that can be attached to a word in English.
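
A minimal Python sketch of that cut-off, assuming only the three follow-on
pairs from the example above. The table FOLLOWS, the helper join() and its
e-dropping rule are illustrative assumptions, not part of the Porter stemmer:

```python
# Which suffix may follow which, taken only from the three examples above.
FOLLOWS = {"ize": ["ation"], "ation": ["al"], "al": ["ize"]}
MAX_SUFFIXES = 4  # the upper limit on derivational suffixes quoted above


def join(ending, suffix):
    """Append a suffix, dropping a final 'e' before a vowel (ize + ation)."""
    if ending.endswith("e") and suffix[0] in "aeiou":
        ending = ending[:-1]
    return ending + suffix


def generate_endings(max_suffixes=MAX_SUFFIXES):
    """Enumerate every compound ending built from at most max_suffixes parts."""
    results = set()

    def extend(ending, last, depth):
        results.add(ending)
        if depth == max_suffixes:
            return  # the cap is what breaks the ize -> ation -> al loop
        for nxt in FOLLOWS.get(last, []):
            extend(join(ending, nxt), nxt, depth + 1)

    for start in FOLLOWS:
        extend(start, start, 1)
    return sorted(results)


endings = generate_endings()
```

With the cap at four the enumeration stops at forms such as izationalize
instead of running on for ever.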

If you do this, you still end up with quite a lot of endings. Here is a
list I put together recently:

Inflexional: ed ing ings s

Derivational:
    ic              ioned           *ationed        *icationed
    *izationed      *alizationed    ered            *izered
    *alizered       *icalizered     *ionalizered    ated
    icated          ized            alized          *icalized
    *ionalized      *ationalized    ance            ence
    able            ible            ate             icate
    ive             ative           icative         ize
    alize           *icalize        *ionalize       *ationalize
    ioning          *ationing       *icationing     *izationing
    *alizationing   ering           *izering        *alizering
    *icalizering    *ionalizering   ating           icating
    izing           *alizing        *icalizing      *ionalizing
    *ationalizing   al              ical            ional
    ational         *icational      *izational      ful
    ism             alism           *icalism        *ionalism
    *ationalism     ion             ation           ication
    ization         alization       er              izer
    *alizer         *icalizer       *ionalizer      ator
    ics             ances           ences           ancies
    encies          ities           icities         alities
    *icalities      ionalities      *ationalities   abilities
    ibilities       *ivities        *ativities      *icativities
    ables           ibles           nesses          *ivenesses
    *ativenesses    *icativenesses  *alnesses       *icalnesses
    *ionalnesses    *ationalnesses  *fulnesses      *ousnesses
    ates            icates          ives            atives
    *icatives       izes            *alizes         *icalizes
    *ionalizes      *ationalizes    als             icals
    ionals          *ationals       *icationals     *izationals
    isms            *alisms         *icalisms       *ionalisms
    *ationalisms    ions            ations          ications
    izations        *alizations     ers             izers
    *alizers        *icalizers      *ionalizers     ators
    ness            iveness         *ativeness      *icativeness
    alness          *icalness       ionalness       *ationalness
    fulness         ousness         ants            ents
    ments           ements          ous             ant
    ent             ment            ement           ancy
    ency            ly              ably            ibly
    ately           *icately        ively           atively
    *icatively      ally            ically          ionally
    ationally       ously           ently           *mently
    *emently        ity             icity           ality
    icality         ionality        *ationality     ability
    ibility         ivity           *ativity        *icativity

- sorted by ending and arranged in 4 columns. The endings marked * are very
rare or non-existent and could be ignored. There are some extra rules, e.g.
endings beginning with ion should follow s or t in the stem. This is a
minimum list: you can argue for other forms (ableness, for example).
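
As a rough illustration, the two conventions (the * marker and the ion rule)
could be applied like this in Python. SAMPLE is a small hand-picked subset of
the list, and the function names are made up for the sketch:

```python
# A hand-picked subset of the list above; '*' marks a rare ending.
SAMPLE = "ed ing s ion ions ional *ionalize ation *ationed ness *ivenesses"


def parse_endings(text):
    """Return (ending, is_rare) pairs, stripping the leading '*' marker."""
    return [(tok.lstrip("*"), tok.startswith("*")) for tok in text.split()]


def ending_allowed(stem, ending):
    """Endings beginning with 'ion' should follow s or t in the stem."""
    if ending.startswith("ion"):
        return stem.endswith(("s", "t"))
    return True


def candidate_words(stem, endings, include_rare=False):
    """Attach every permitted ending to the stem."""
    return [
        stem + ending
        for ending, rare in endings
        if (include_rare or not rare) and ending_allowed(stem, ending)
    ]


ENDINGS = parse_endings(SAMPLE)
connect_forms = candidate_words("connect", ENDINGS)
walk_forms = candidate_words("walk", ENDINGS)
```

For the stem connect this keeps connection and connections, while for walk
the ion endings are filtered out. Nothing here checks that a generated form
is a real word (connectness is produced too).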

If a word has the form se, where s is the stem and e the ending, then
looking up every s* (where * here stands for any of these endings) could be
quite expensive.
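
To make the cost concrete, here is a small Python sketch that counts the
lookups. ENDINGS is a short sample of the list, and VOCABULARY and
expand_stem() are hypothetical stand-ins for whatever index the IR system
really has:

```python
# A short sample of endings; the full list above is far longer.
ENDINGS = ["ed", "ing", "s", "ion", "ions", "ation", "ational", "ize", "ness"]

# Hypothetical vocabulary standing in for the index terms of an IR system.
VOCABULARY = {
    "connect", "connected", "connecting", "connection",
    "connections", "connects", "connective",
}


def expand_stem(stem, endings, vocabulary):
    """Try stem + ending for every ending; return the hits and the probe count."""
    found = []
    probes = 0
    for ending in endings:
        probes += 1  # one vocabulary lookup per ending, hit or miss
        word = stem + ending
        if word in vocabulary:
            found.append(word)
    return found, probes


found, probes = expand_stem("connect", ENDINGS, VOCABULARY)
```

Each stem costs one probe per ending whether or not the word exists, so the
work grows with the length of the ending list. Note that connective is
missed here only because ive is not in the sample.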

Sometimes classes of endings can be eliminated on grammatical grounds. For
example, ness forms nouns from adjectives, and able forms adjectives from
nouns, so you would not expect them to attach to the same word. But there
are many exceptions to rules like this.

I think ending generation helps in understanding stemmers, but I'm not sure
that classes of endings are usable by IR systems, if only because there are
so many of them.

Martin