[Snowball-discuss] 99% of English words ending -sis and -xis are not plurals

From: the Tolkin family (tolkin@mediaone.net)
Date: Wed Dec 05 2001 - 03:53:19 GMT


A while ago I said I might suggest more fundamental changes to
the approach used in the Porter2 stemmer.
Here is another one, probably my last.
(You can probably also improve the handling of f -> v, e.g. life, self, etc.)

There are over 800 words that end with -sis and of these only 11,
about 1%, are plurals.
Almost all the rest are singular words, whose plural ends with -ses.
The words that are plurals are generally quite uncommon. Here they are:
brindisis chaprassis dalasis kolbasis kolbassis lassis
pachisis parchesis parchisis reversis sannyasis tsotsis

So instead of the current rule, which simply removes the final -s, I propose
the following rule, which changes -sis to -ses, with a few exceptions.
(We generally want to conflate singular and plural. But there are too
many -ses words to go in the usual direction from plural to singular.
So this rule goes in the other direction.)
This must be run before the current rule 1, so I'll call it rule 0.5a.
I express this in pseudocode.

if word is theses then stem is thesis && stop
if word ends with sis {
  if word is sis then stem is sis && stop
  if word is psis then stem is psi && stop
  if word is thesis then stem is thesis && stop
  change final sis to ses
}
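
For anyone who wants to try this, here is a minimal Python sketch of rule 0.5a. The names rule_0_5a and SIS_EXCEPTIONS, and the (result, stop) return value, are invented for illustration and are not part of Snowball. Because theses does not end in -sis, the sketch looks up all whole-word exceptions before testing the suffix.

```python
# Illustrative sketch of proposed rule 0.5a; all names here are hypothetical.
# Whole-word exceptions map a word to its final stem, and stemming stops there.
SIS_EXCEPTIONS = {
    "sis": "sis",
    "psis": "psi",
    "thesis": "thesis",
    "theses": "thesis",  # does not end in -sis, so it is checked before the suffix test
}


def rule_0_5a(word):
    """Return (result, stop); stop=True means no later rule should run."""
    if word in SIS_EXCEPTIONS:
        return SIS_EXCEPTIONS[word], True
    if word.endswith("sis"):
        # Map the singular onto its plural spelling; later rules then see -ses.
        return word[:-3] + "ses", False
    return word, False
```

With this sketch, rule_0_5a("analysis") gives ("analyses", False), while rule_0_5a("thesis") and rule_0_5a("theses") both give ("thesis", True).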

I put special handling for thesis and theses because otherwise these
would become "these". Certainly thesis is a likely search term.
(Another possible stem for thesis and theses might be "thes".)

(The rule above could be written so that -sis must occur in the R1
or R2 region. That would remove the special cases for sis and psis,
but would cause the need to add several others.)
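
To make the R1 alternative concrete, here is a rough Python sketch. It assumes the basic Porter2 definition of R1 (the region after the first non-vowel that follows a vowel, with a e i o u y as vowels) and deliberately ignores the stemmer's special handling of consonantal y and of a few fixed prefixes. The function names are hypothetical.

```python
# Simplified R1 computation, assuming vowels are a e i o u y.
# This ignores Porter2's consonantal-Y marking and its prefix exceptions.
VOWELS = set("aeiouy")


def r1_start(word):
    """Index where R1 begins: just after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # no such position, so R1 is empty


def sis_in_r1(word):
    """True if the word ends with -sis and that suffix lies wholly inside R1."""
    return word.endswith("sis") and len(word) - 3 >= r1_start(word)
```

Under these assumptions sis_in_r1 is False for sis and psis, so they need no special case, and True for analysis and diagnosis. It is also False for thesis, basis and crisis, which would then be left unchanged and not conflate with their plurals. If I read the R1 definition correctly, these are the sort of additional cases that would need handling.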

The 11 true plurals above are no longer handled correctly, but those words
are rare and many other plurals are not handled correctly today, so I do not bother
to fix them. Perhaps one could special-case lassis -> lassi to avoid a clash with lass.

Another possible special case is "basis". The rule above conflates it with bases,
which is its plural, but that causes it to also conflate with base. One might want
to add another special case: if word is basis then stem is basis && stop.
This rule causes a few conflations that might not be as desirable as possible,
e.g. ellipsis and ellipses, synapsis and synapses, phasis and phases,
and whosis and whose.
These could also be worth adding to the list of special cases.
But I have tried to have as few as possible.

An analogous rule applies to -xis. Again, almost all of the roughly 60 words
ending with -xis are not plural.
The rule 0.5b below captures this, and the few exceptions.

if word ends with xis {
  if word is xis then stem is xi && stop
  if word is maxis then stem is maxi && stop
  if word is taxis then stem is taxi && stop
  change final xis to xes
}
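
The same kind of Python sketch works for rule 0.5b. As before, the names rule_0_5b and XIS_EXCEPTIONS and the (result, stop) return value are made up for illustration and do not come from Snowball.

```python
# Illustrative sketch of proposed rule 0.5b; all names here are hypothetical.
# These three words really are plurals, so they get their singular as the stem.
XIS_EXCEPTIONS = {
    "xis": "xi",
    "maxis": "maxi",
    "taxis": "taxi",
}


def rule_0_5b(word):
    """Return (result, stop); stop=True means no later rule should run."""
    if word in XIS_EXCEPTIONS:
        return XIS_EXCEPTIONS[word], True
    if word.endswith("xis"):
        # Map the singular onto its plural spelling; later rules then see -xes.
        return word[:-3] + "xes", False
    return word, False
```

Here rule_0_5b("axis") gives ("axes", False) and rule_0_5b("taxis") gives ("taxi", True).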

Here axis gets conflated with axes (its plural) but also with axe. That seems
acceptable. (There is a singular word taxis, with plural taxes, but both those
strings are far more common in their usual meaning. We do not want to
conflate taxis with tax.)

Misc.
I have written these as 2 separate rules, but a performance tweak might first
test whether the word ends with -is.
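
A sketch of that combined form in Python follows. The single function rule_0_5 and the merged EXCEPTIONS table are hypothetical names. The one exception that does not end in -is (theses) has to be tested before the shared -is check. Whether this is measurably faster in a real stemmer would need benchmarking.

```python
# Hypothetical merged form of rules 0.5a and 0.5b with a shared -is pre-test.
EXCEPTIONS = {
    "sis": "sis", "psis": "psi", "thesis": "thesis",
    "xis": "xi", "maxis": "maxi", "taxis": "taxi",
}


def rule_0_5(word):
    """Return (result, stop); stop=True means no later rule should run."""
    if word == "theses":  # the only exception that does not end in -is
        return "thesis", True
    if not word.endswith("is"):
        return word, False  # cheap early exit for the vast majority of words
    if word in EXCEPTIONS:
        return EXCEPTIONS[word], True
    if word.endswith(("sis", "xis")):
        return word[:-2] + "es", False  # -sis -> -ses, -xis -> -xes
    return word, False
```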

On a completely separate topic, "lens" is another word
that should be special-cased to return "lens" as its stem, so that
it conflates with lenses (and does not conflate with "len", the
common computer science abbreviation for length).
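
One way to picture such a special case is a whole-word table consulted before any suffix rules run. The Python sketch below is only an illustration. The table and function names are invented, and strip_final_s is a toy stand-in for the real stemmer, not Porter2.

```python
# Hypothetical exception table, consulted before any suffix rules run.
WHOLE_WORD_STEMS = {"lens": "lens"}


def stem_with_exceptions(word, fallback):
    """Return the fixed stem if the word is listed, else defer to fallback(word)."""
    if word in WHOLE_WORD_STEMS:
        return WHOLE_WORD_STEMS[word]
    return fallback(word)


def strip_final_s(word):
    """Toy stand-in for the real stemmer: it only drops one final s."""
    return word[:-1] if word.endswith("s") else word
```

With the toy fallback, stem_with_exceptions("lens", strip_final_s) returns "lens" rather than "len", while unlisted words such as "cats" still go through the fallback.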

References:
This analysis is based on the very large list of words known as YAWL (Yet Another
Word List) available from e.g. http://personal.riverusers.com/~thegrendel/software.html
and elsewhere.

Hopefully helpfully yours,
Steve

-- 
Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.



