[Snowball-discuss] RE: Snowball-discuss digest, Vol 1 #25 - 1 msg

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Mon Dec 24 2001 - 14:39:42 GMT


Dear Martin,
        Thanks for this information. I have a few comments.

1. The endings -er (and -ier) and -est (and -iest) for
comparative and superlative forms of adjectives and adverbs
seem to be missing. I suggest they belong in the list of
Inflexional endings.

2. You said:
> This is a minimum
> list: you can argue for other forms (ableness for example).
So I presume this list was created in a somewhat manual way.
Another advantage of having a generate mode for Snowball:
Any change to the Snowball code and/or rules for a language
could be automatically tested by comparing the new list of
endings with the list before the change.
This would be very useful for QA (quality assurance).
As you point out, if a generate mode is added then there also
needs to be a way to set the maximum number of suffixes that
can be stacked on a stem.
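
A minimal sketch of the comparison described in point 2, in Python.
Snowball has no generate mode as far as this thread establishes, so
the two lists here are typed in by hand; diff_endings and the sample
data are hypothetical names used only for illustration.

```python
def diff_endings(before, after):
    """Return (added, removed) endings between two generated lists."""
    before_set, after_set = set(before), set(after)
    added = sorted(after_set - before_set)
    removed = sorted(before_set - after_set)
    return added, removed


# Hand-written sample lists standing in for the output of a generate mode.
old_endings = ["ed", "ing", "ings", "s"]
new_endings = ["ed", "ing", "ings", "s", "er", "est"]

added, removed = diff_endings(old_endings, new_endings)
print("added:", added)
print("removed:", removed)
```

A QA run could fail whenever either list is non-empty and the change
was not expected.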

3. You said:
> I think ending generation helps understand stemmers, but I'm
> not sure that
> classes of endings are utilizable by IR systems, if only
> because there are
> so many of them.

But modern computers are really fast and have large main memories
compared to years ago. I think a system could generate all these,
and look them up even in a very large wordlist in < 0.01 second.
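
A rough sketch of that brute-force lookup, in Python. The ending
list is a tiny hand-picked sample, not the full list from the quoted
message, and related_words is a hypothetical helper. It also assumes
the stem can be joined to an ending by plain concatenation, which a
real stemmer does not guarantee.

```python
# Small sample of endings; "" stands for the bare stem itself.
SAMPLE_ENDINGS = ["", "ed", "ing", "ings", "s", "ion", "ions"]


def related_words(stem, wordlist, endings=SAMPLE_ENDINGS):
    """Return every stem+ending combination that occurs in wordlist."""
    return [stem + e for e in endings if stem + e in wordlist]


# A set gives constant-time membership tests on average.
words = {"connect", "connected", "connecting", "connects", "connection"}
print(related_words("connect", words))
```

The cost is one set lookup per ending, so it grows with the length
of the ending list rather than with the size of the wordlist.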

However, I agree that there are so many that it might be worthwhile
to try the reverse strategy, i.e. start from the dictionary and
test all the words that share the same first several letters with
the given word. So my next question is whether there is a formula
for the maximum-length common prefix. Given a word, w,
we can find its stem, s, quickly. Suppose the stem is of length
n. Is there a formula, e.g. n-2, that ensures that all words
having the same stem as w will begin with the first n-2 characters
of s? I suspect so. Further, I suspect that this formula
may be made more efficient by a few extra tests, e.g. if the
stem ends with "i" use n-2, otherwise n-1.
(That's an example -- the real rules are probably somewhat more
complex.) Given these rules, it might be faster to scan the
dictionary and then generate and test stems.
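
The reverse strategy could be sketched as follows, in Python. This
is only an illustration under stated assumptions: toy_stem is a
hypothetical stand-in that strips a few sample endings and is not
the Porter stemmer, and the slack of 2 is the guessed n-2 formula
from the paragraph above, not a proven bound.

```python
import bisect


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one sample ending."""
    for ending in ("ings", "ing", "ed", "es", "s"):
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def same_stem_words(word, sorted_words, stem=toy_stem, slack=2):
    """Scan the dictionary entries sharing the first n-slack letters of
    the stem, and keep those whose stem equals the stem of word."""
    s = stem(word)
    prefix = s[: max(1, len(s) - slack)]
    # In a sorted list, all words with a given prefix are contiguous.
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for candidate in sorted_words[start:]:
        if not candidate.startswith(prefix):
            break
        if stem(candidate) == s:
            matches.append(candidate)
    return matches


words = sorted(["walk", "walked", "walking", "walks", "wall", "want"])
print(same_stem_words("walking", words))
```

Only the candidates in the prefix range are stemmed, so the work
depends on how many dictionary words share the prefix, not on the
number of possible endings.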

P.S. Yes, keeping the name Snowball is fine.
I sent the earlier email so we would know about that
other project.
 
Hopefully helpfully yours,
Steve

-- 
Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.

> -----Original Message-----
> From: snowball-discuss-request@lists.sourceforge.net
> [mailto:snowball-discuss-request@lists.sourceforge.net]
> Sent: Sunday, December 23, 2001 3:28 PM
> To: snowball-discuss@lists.sourceforge.net
> Subject: Snowball-discuss digest, Vol 1 #25 - 1 msg ...
>
> Message: 1
> To: "Tolkin, Steve" <Steve.Tolkin@FMR.COM>
> From: martin_porter@softhome.net (Martin Porter)
> Subject: Re: [Snowball-discuss] Can snowball be run backwards
> to generate words?
> Cc: snowball-discuss@lists.sourceforge.net
> Date: Sat, 22 Dec 2001 14:56:28 -0700
>
>
> You can turn the Porter stemmer inside out, and generate all endings that
> the stemmer will recognise, but there are several problems. One is that the
> endings go in a circles, e.g.
>
>     ize + ation    as in realization
>     ation + al     as in operational
>     al + ize       as in normalize
>
> - suggesting infinite endings izationalizational... You can break the loop
> by noting that four is the upper limit on the number of derivational
> suffixes that can be attached to a word in English.
>
> If you do this, you end up with really quite a lot of endings. Here is a
> list I put together recently,
>
> Inflexional: ed ing ings s
>
> Derivational:
> ic              ioned           *ationed        *icationed
> *izationed      *alizationed    ered            *izered
> *alizered       *icalizered     *ionalizered    ated
> icated          ized            alized          *icalized
> *ionalized      *ationalized    ance            ence
> able            ible            ate             icate
> ive             ative           icative         ize
> alize           *icalize        *ionalize       *ationalize
> ioning          *ationing       *icationing     *izationing
> *alizationing   ering           *izering        *alizering
> *icalizering    *ionalizering   ating           icating
> izing           *alizing        *icalizing      *ionalizing
> *ationalizing   al              ical            ional
> ational         *icational      *izational      ful
> ism             alism           *icalism        *ionalism
> *ationalism     ion             ation           ication
> ization         alization       er              izer
> *alizer         *icalizer       *ionalizer      ator
> ics             ances           ences           ancies
> encies          ities           icities         alities
> *icalities      ionalities      *ationalities   abilities
> ibilities       *ivities        *ativities      *icativities
> ables           ibles           nesses          *ivenesses
> *ativenesses    *icativenesses  *alnesses       *icalnesses
> *ionalnesses    *ationalnesses  *fulnesses      *ousnesses
> ates            icates          ives            atives
> *icatives       izes            *alizes         *icalizes
> *ionalizes      *ationalizes    als             icals
> ionals          *ationals       *icationals     *izationals
> isms            *alisms         *icalisms       *ionalisms
> *ationalisms    ions            ations          ications
> izations        *alizations     ers             izers
> *alizers        *icalizers      *ionalizers     ators
> ness            iveness         *ativeness      *icativeness
> alness          *icalness       ionalness       *ationalness
> fulness         ousness         ants            ents
> ments           ements          ous             ant
> ent             ment            ement           ancy
> ency            ly              ably            ibly
> ately           *icately        ively           atively
> *icatively      ally            ically          ionally
> ationally       ously           ently           *mently
> *emently        ity             icity           ality
> icality         ionality        *ationality     ability
> ibility         ivity           *ativity        *icativity
>
> - sorted by ending and arranged in 4 columns. The endings marked * are very
> rare or non-existent and could be ignored. There are some extra rules:
> endings beginning ion should follow s or t in the stem. This is a minimum
> list: you can argue for other forms (ableness for example).
>
> If a word is se, where s is the stem and e the ending, looking up all the s*
> where * is any of these endings could be quite expensive therefore.
>
> Sometimes classes of endings can be eliminated on grammatical grounds. For
> example, ness forms nouns from adjectives, and able forms adjectives from
> nouns, so you would not expect them to attach to the same word. But there
> are many exceptions to rules like this.
>
> I think ending generation helps understand stemmers, but I'm not sure that
> classes of endings are utilizable by IR systems, if only because there are
> so many of them.
>
> Martin
>
>
> --__--__--
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/snowball-discuss
>
>
> End of Snowball-discuss Digest




This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:40 BST