[Snowball-discuss] Can snowball be run backwards to generate words?

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Wed Dec 19 2001 - 21:53:47 GMT


Currently snowball runs "forwards", i.e. given a word (or any string)
it reduces it to its stem.

It would also be desirable to provide the stem as input
and generate all the possible words (really strings) that would
be reduced to that stem.

Some formal grammars permit either of these operations,
i.e. both parsing and generating. But it seems that
the current snowball approach does not naturally lend itself
to the task of generation.

As a "use case" for this capability suppose I have a word,
and a dictionary, and I want to find all the other words
that would conflate with the input word. It would be far more
efficient to expand the stem to all its strings, and then
test these against the dictionary than to test all the
words in the dictionary.
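
To make the cost argument concrete, here is a minimal sketch in Python.
It assumes a toy suffix stripper (toy_stem, with a made-up suffix list)
standing in for a real stemmer; it is not the Porter algorithm, and the
function names are mine. The scan stems every dictionary entry, while
the expansion does one set lookup per candidate string.

```python
# Illustrative suffix list only; a real stemmer has many more rules.
SUFFIXES = ["ically", "ation", "ings", "ing", "ed", "ly", "s"]

def toy_stem(word):
    """Toy stand-in for a stemmer: strip the longest matching suffix."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def conflate_by_scan(word, dictionary):
    """Stem every dictionary entry; work grows with dictionary size."""
    target = toy_stem(word)
    return {w for w in dictionary if toy_stem(w) == target}

def conflate_by_expansion(word, dictionary):
    """Expand the stem, then test only the candidates against the dictionary."""
    target = toy_stem(word)
    candidates = {target} | {target + s for s in SUFFIXES}
    # Re-stem each hit so only strings that really conflate are kept.
    return {c for c in candidates if c in dictionary and toy_stem(c) == target}

dictionary = {"walk", "walks", "walked", "walking", "talk", "talked", "wall"}
print(sorted(conflate_by_expansion("walking", dictionary)))
```

For this single-pass toy the two functions agree. With a multi-step
stemmer the candidate set would have to be built with more care for
that to hold.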

A practical application of this would be query expansion.
Even if there is no dictionary, adding all the strings to the query
might not be harmful, as the non-words are likely to be rare.
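
A rough sketch of that kind of query expansion, in Python. The function
name expand_query and the "OR" query syntax are assumptions for
illustration, not the interface of any particular search system; the
dictionary filter is optional, as described above.

```python
def expand_query(stem, suffixes, dictionary=None):
    """Build an OR query from a stem and a list of candidate suffixes."""
    candidates = [stem] + [stem + s for s in suffixes]
    if dictionary is not None:
        # Keep only the strings that are known words.
        candidates = [c for c in candidates if c in dictionary]
    return " OR ".join(candidates)

print(expand_query("foobar", ["s", "ing"]))
```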

Partly the point of asking this is to refute the claim of some systems that
stemming matches all "words" that have the same stem. Of course
it will really match all "strings" that have the same stem, regardless
of whether that string is actually a word.

Just for fun, here are some of the possible suffixes
attached to the dummy input stem "foobar".

foobars -> foobar
foobarred -> foobar
foobared -> foobar
foobarly -> foobar
foobaric -> foobar
foobarical -> foobar
foobarically -> foobar
foobaration -> foobar
foobaring -> foobar
foobarings -> foobar
foobarative -> foobar
foobaratively -> foobar
foobaral -> foobar
foobarals -> foobar
foobarous -> foobar
foobarously -> foobar
foobarment -> foobar

But there may be more -- I did this from memory of the suffixes in Porter.
(I omitted some of the plural forms.)
If we could run the stemmer in reverse we could reliably generate all
the inputs that reduce to foobar.

(Of course other stems would accept a different set of suffixes,
depending on their last letters.)
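
Short of a true reverse mode, one approximation is to generate
candidates and keep only those the forward stemmer maps back to the
stem. The Python sketch below assumes a toy stemmer (toy_stem) with a
tiny suffix list and a simplified consonant-undoubling rule; it is not
the Porter algorithm, and it only finds strings that the candidate
generator thinks to try.

```python
SUFFIXES = ["s", "ed", "ing", "ly"]  # illustrative, far from complete

def toy_stem(word):
    """Toy forward stemmer: strip one suffix, then undo a doubled consonant."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            base = word[: -len(suffix)]
            # After "ed"/"ing", reduce a doubled final consonant (not l, s, z).
            if (suffix in ("ed", "ing") and base[-1] == base[-2]
                    and base[-1] not in "aeioulsz"):
                base = base[:-1]
            return base
    return word

def generate(stem):
    """Return candidate strings that toy_stem reduces to the given stem."""
    bases = {stem, stem + stem[-1]}  # also try a doubled last letter
    candidates = {b + s for b in bases for s in [""] + SUFFIXES}
    # Round-trip check: keep only strings the forward stemmer accepts.
    return sorted(c for c in candidates if toy_stem(c) == stem)

print(generate("foobar"))
```

The round-trip check is what makes the accepted set depend on the
stem's last letters: "foobarred" survives, while for a stem ending in
"l" the doubled form does not.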

Hopefully helpfully yours,
Steve

-- 
Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.
