RE: [Snowball-discuss] 99% of English words ending -sis and -xis are not plurals

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Thu Dec 06 2001 - 14:05:58 GMT


Summary:
I think this new rule would easily justify itself even if the
only words affected were: thesis osmosis analysis emphasis
diagnosis prognosis synthesis and hypothesis.

Details:
I completely agree with your general line of argument, that the
benefits of any stemming rule must clearly exceed its "costs",
and that this principle applies even more strongly when the rule
is a change to an existing rule.

I have been thinking about how to formalize this notion of the
effectiveness of a stemming rule, but have not worked it out yet.
The basic idea is to count the number of true positives,
false positives, and false negatives, assign each category
a score, e.g. +1 for true positive, -2 for false positive, etc.
Then further assign each word (and/or stem) a weight based on
its expected frequency in the corpus of documents (and/or queries).
Define effectiveness as some formula that combines the scores and
weights. Then two rules can be compared.
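
One way to make the idea above concrete is sketched here in Python.
This is only an illustration, not a worked-out measure: the +1 and -2
scores come from the text, but the false-negative score, the function
name, the example outcomes and the frequency weights are all
assumptions.

```python
# Sketch of a weighted effectiveness score for a stemming rule.
# Only the +1 (true positive) and -2 (false positive) scores come from
# the message; the false-negative score of -1 is an assumed placeholder.
SCORES = {"tp": 1.0, "fp": -2.0, "fn": -1.0}


def effectiveness(outcomes, weights, scores=SCORES):
    """Combine per-word outcomes and frequency weights into one number.

    outcomes: dict mapping word -> "tp", "fp" or "fn" under the rule
    weights:  dict mapping word -> expected frequency in the corpus
              (words missing from weights count with weight 1.0)
    """
    return sum(scores[kind] * weights.get(word, 1.0)
               for word, kind in outcomes.items())


# Hypothetical weights and outcomes, for illustration only.
weights = {"analysis": 10.0, "thesis": 4.0, "taxis": 1.0}
rule_a = {"analysis": "fn", "thesis": "fn", "taxis": "tp"}
rule_b = {"analysis": "tp", "thesis": "tp", "taxis": "fp"}

# The rule with the higher score is preferred under this measure.
print(effectiveness(rule_a, weights), effectiveness(rule_b, weights))
```

With weights taken from a real corpus or query log, the same function
could be used to compare an existing rule against a proposed change.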

In the absence of this formula we can still see the benefit
of my proposed rule by inspection.

The existing -sis -> -s rule fails on almost all of the -sis words.
My proposed new rule succeeds on almost all of the -sis words.
Similarly for the proposed -xis to -xes rule.

Here is an example of the benefit of the new rule.
The following is from the current Porter2:
analyse -> analys
analysed -> analys
analyser -> analys
analysers -> analys
analyses -> analys
analysing -> analys
analysis -> analysi

All these words conflate -- except for analysis itself!
That is quite bad.
A similar pattern is seen for the other -sis words.
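
The failure to conflate can be checked mechanically by grouping words
by stem. The short Python sketch below hard-codes the word/stem pairs
listed above rather than calling a stemmer; the helper name is an
assumption made for the illustration.

```python
from collections import defaultdict

# Word -> stem pairs copied from the Porter2 output quoted above.
PORTER2_STEMS = {
    "analyse": "analys",
    "analysed": "analys",
    "analyser": "analys",
    "analysers": "analys",
    "analyses": "analys",
    "analysing": "analys",
    "analysis": "analysi",
}


def conflation_classes(stems):
    """Group words that share a stem (hypothetical helper)."""
    classes = defaultdict(list)
    for word, stem in stems.items():
        classes[stem].append(word)
    return dict(classes)


classes = conflation_classes(PORTER2_STEMS)
for stem, words in classes.items():
    print(stem, "<-", " ".join(words))
```

Any word that ends up alone in its class, as analysis does here, is a
missed conflation for that word family.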

I think this new rule would easily justify itself even if the
only words affected were: thesis osmosis analysis emphasis
diagnosis prognosis synthesis hypothesis .

Of the 810 -sis words I think the following are reasonably common.
(Ordered by increasing length and then alphabetically.)

basis oasis crisis miosis stasis thesis chassis genesis kinesis
meiosis mitosis nemesis osmosis analysis dialysis dieresis ellipsis
emphasis hypnosis narcosis necrosis neurosis synopsis catalysis
catharsis cirrhosis coreopsis diaeresis diagnosis halitosis oogenesis
pertussis prognosis psoriasis psychosis sclerosis scoliosis silicosis
symbiosis synthesis hypothesis hysteresis thrombosis

There are few common -xis words that are singular: perhaps
axis, praxis, cathexis, and prophylaxis are the most common.
But there are two other reasons to make that change. Of the
58 -xis words only xis, maxis, and taxis are plurals; all the
rest are singular words whose plural is -xes.
Thus we greatly increase accuracy.
In addition almost all of these are technical terms. While
I personally have not searched on e.g. phototaxis or
tropotaxis I expect that scientists do want to search
on one of the 30+ words that end with -taxis.

 
Hopefully helpfully yours,
Steve

-- 
Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.

> -----Original Message-----
> From: martin_porter@softhome.net [mailto:martin_porter@softhome.net]
> Sent: Thursday, December 06, 2001 7:07 AM
> To: the Tolkin family; snowball-discuss@lists.sourceforge.net
> Cc: Tolkin, Steve
> Subject: Re: [Snowball-discuss] 99% of English words ending -sis and
> -xis are not plurals
>
> Steve,
>
> Well, I am still reeling after the suggestion that -ive should not be
> removed, :-) , which indeed is not a bad suggestion.
>
> I'm less sure about the -sis -xis idea, although it is very
> interesting. It is really a rule for handling Greek plurals, and
> English is unusual in accepting many plural forms from other
> languages: beaux (French), cognoscenti (Italian), cacti (Latin),
> hypotheses (Greek), seraphim (Hebrew). (It is a phenomenon quite easy
> to explain historically however.) It has occurred to me that one might
> be able to work out the language of a word by digram analysis - or
> something similar - and stem accordingly. So hypnotic is "obviously"
> Greek, and stems to hypnos, chateaux is "obviously" French and stems
> to chateau. Greek -sis endings are therefore part of a general
> problem.
>
> Remember in any case that an English stemmer is going to regard -ses
> endings as normal plural forms, and remove -s, abuses, bookcases,
> houses etc. The Porter stemmer removes -es from longer words, so the
> problem reduces to removing -is from analysis etc. The Lovins stemmer
> (which is more concerned with "scientific" vocabulary) does that, but
> also respells final -yt as -ys so that analysis, analyses, analytic
> conflate.
>
> The question is, how important is it in practice. One should not be
> too influenced by something like YAWL, which is more an aid to
> scrabble players than practical list of words for contemporary
> English. (although if someone put down "chaprassis" and claimed a
> triple word score I'd be most upset!) My sample vocabularies only
> instance two successful conflations with a rule like this:
> hypothesis/hypotheses and parenthesis/parentheses. I realise there are
> more in the language as a whole (oasis/oases for example), but the
> point is that a rule is hardly worth adding if it only affects one
> word in 20,000.
>
> The truth is words like bases (as a plural of basis), hypnoses,
> ellipses are not used very much. We tend to avoid forming plurals when
> the plural is dubious, and use a different construction. Everyone says
> "CVs" because they don't know the plural of "curriculum vitae", and
> avoid trying to pluralise words like chassis, chablis, cyclops, Mrs ...
>
> (As a general feature of English, exotic plurals are declining. Hippos
> for hippopotami, cactuses for cacti, eskimos for esquimaux etc.
> Americans say syllabi, but that sounds strange in England. Dice has
> become the singular form of what was once die. Perhaps one day the
> plural will be dices. Ignorance of foreign languages must help here.
> News broadcasters use paparazzi in the singular without thinking it
> strange.)
>
> Martin




This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:40 BST