RE: [Snowball-discuss] Some possible improvements in English

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Wed Nov 21 2001 - 16:58:52 GMT

Dear Martin,
        My comments interspersed below.

> -----Original Message-----
> Sent: Wednesday, November 21, 2001 9:28 AM
> Subject: Re: [Snowball-discuss] Some possible improvements in English
> Steve,
> I've added 'cosmos', 'atlas' to porter2 as exceptions, just to show how
> amenable I am.

Thanks. I thought of one more: 'bias' should stay 'bias'.
Currently Porter2 produces:
bias -> bia
biases -> bias
biased -> bias
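
A minimal sketch of the exception-list idea, assuming a stemmer that is
wrapped by a lookup table. The names toy_stem, stem_with_exceptions and
EXCEPTIONS are made up for illustration, and toy_stem is only a crude
suffix stripper that happens to reproduce the three outputs above; it is
not Porter2.

```python
# Hypothetical exception list: these words are returned unchanged.
EXCEPTIONS = {"cosmos", "atlas", "bias"}


def toy_stem(word):
    """Toy stand-in for a real stemmer (NOT Porter2): strip one suffix."""
    for suffix in ("es", "ed", "s"):
        # Keep at least two letters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def stem_with_exceptions(word, stem=toy_stem, exceptions=EXCEPTIONS):
    """Consult the exception list first, then fall back to the stemmer."""
    w = word.lower()
    return w if w in exceptions else stem(w)


for w in ("bias", "biases", "biased"):
    print(w, "->", toy_stem(w), "| with exceptions ->", stem_with_exceptions(w))
```

With 'bias' in the list, all three forms end up as 'bias' in this toy
setup, which is the conflation I am after.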

The word 'gas' has a related problem that cannot be solved
so easily:
gas -> ga
gases -> gase
gassed -> gass

There are probably a bunch of others in the "singular but
ends with s" category.

> The -ive endings are a different matter: the word pairs share polysemic
> forms, the question being to what extent the sharing is small enough to
> warrant separating the words (see the introductory paper). To solve this
> problem one should be using the stemmer in conjunction with a dictionary.
> Anyway there are many worse examples: I discover that 'combative' stems
> to 'comb'!

I am not sure that combative -> comb is worse because both of these
are less common than e.g. productive -> product etc.
But I think the problem with *-ive is serious. Besides the four
I gave originally, and your combative, I think there are lots
more: progressive, elective, recessive, etc.
The fundamental problem seems to be that the *-ive form
has acquired a specialized meaning quite different
from that of the stem word.

Similarly this conflation:
derivative -> deriv
derive -> deriv
(where the stem is not a word) seems bad,
because derivative is so often used in a technical sense,
e.g. in mathematics or finance.

I am leaning to the conclusion that it is better to
reduce *-ive and *-ivity etc. to just *-iv rather than further.
(Aside: it seems that these are all *-tive and *-sive.
Are there other letters that precede 'ive'? Does this
observation help?)
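
To make the proposal concrete, here is a rough sketch, not a patch to
Porter2. The suffix list (my reading of "etc."), the vowel guard, and the
names reduce_ive and letters_before_ive are all assumptions of mine. The
second function is one way to check the aside against a word list.

```python
from collections import Counter

# Assumed family of endings, longest first so "ivity" is tried before "ive".
IVE_SUFFIXES = ("ivities", "iveness", "ivity", "ively", "ives", "ive")


def reduce_ive(word):
    """Reduce *-ive, *-ivity etc. to *-iv and go no further."""
    w = word.lower()
    for suffix in IVE_SUFFIXES:
        if w.endswith(suffix):
            head = w[: -len(suffix)]
            # Crude guard: leave short words such as "give" or "five" alone.
            if any(c in "aeiouy" for c in head):
                return head + "iv"
            return w
    return w


def letters_before_ive(words):
    """Count which letter immediately precedes a final 'ive'."""
    return Counter(w[-4] for w in words if w.endswith("ive") and len(w) > 3)


print(reduce_ive("productive"), reduce_ive("productivity"), reduce_ive("product"))
print(letters_before_ive(["productive", "progressive", "derive"]))
```

Under this sketch 'productive' and 'productivity' share the stem
'productiv', while 'product' stays separate.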

> As I said you can always add pet exceptions to a private version.

Yes, but I'd prefer not to have a private version. You also said
"Researchers frequently pick up faulty versions
of the stemmer and report that they have
applied 'Porter stemming', with the result that
their experiments are not quite repeatable."

I believe it would be best to have a common stemmer,
i.e. yours. Even if it evolved over time people could
state that they used Porter-2001-11 for example.

What you might want to do is apply the slogan
"Given enough eyes all bugs are shallow"
but apply it to data rather than code.
If there were some way to get lots of people to critique
the results it might be possible to produce a consensus list.

> What was the context in which you noticed these weaknesses of the stemmer
> incidentally? I'd be interested to know.

I just noticed them at some point and wrote them down.
(I was working on a text summarization project.
I also noticed many problems with /usr/dict/words
e.g. it is missing "instrument", and many other words,
and most plurals, etc. Is there a good dictionary somewhere,
not so big as yawl or cracklib, e.g. a dictionary of the
60,000 most common ordinary words?)

Also, it would be useful to have a way to conflate
British and American spelling for certain common words:
especially s <-> z and ou <-> o, e.g.
analyze, color, favorite, etc.
I am not looking for cookie -> biscuit :)
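
One simple way to do this would be a small lookup table applied before
stemming. The sketch below assumes such a table; the three entries and
the names BRITISH_TO_AMERICAN and normalize_spelling are illustrative
only, and a real list would need many more words.

```python
# Hypothetical table: British spelling -> American spelling, common words only.
BRITISH_TO_AMERICAN = {
    "analyse": "analyze",
    "colour": "color",
    "favourite": "favorite",
}


def normalize_spelling(word, table=BRITISH_TO_AMERICAN):
    """Map a known British spelling to its American form; else pass through."""
    w = word.lower()
    return table.get(w, w)


for w in ("colour", "color", "analyse", "biscuit"):
    print(w, "->", normalize_spelling(w))
```

A table keeps the conflation limited to the listed words, rather than
applying a blanket s <-> z or ou <-> o rule that would also hit words
like 'hour' or 'wise'.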

> Martin
Hopefully helpfully yours,

Steven Tolkin      617-563-0516 
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.
