Re: [Snowball-discuss] Common strings ending in s that may not be ordinary words

From: Martin Porter (martin_porter@softhome.net)
Date: Mon Nov 26 2001 - 15:46:51 GMT


Steve,

Well, of course I've heard of your IRS. You get similar acronyms in English
too: Ipswich Cooperative Society, University Superannuation Scheme, ...

My own feeling is that this is all part of the question "what is a word?"
and should be kept separate from stemming. You have to assume a process that
can separate real words from character sequences like US (United States),
C++, BM25, IBM, H2SO4 etc. In modern Anglo-American English it is reasonable
to assume that short capitalised words are of this special type, so long as
they occur among lower case forms. If whole sentences are capitalised it's a
different story of course.
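
As a rough illustration of that heuristic, here is a minimal Python sketch. The
function name, the length thresholds and the "more than half the neighbours are
lower case" test are all assumptions made for the example, not part of Snowball
or of any existing tokenizer.

```python
from typing import Sequence


def looks_like_special_token(token: str, sentence: Sequence[str],
                             min_len: int = 2, max_len: int = 5) -> bool:
    """Guess whether `token` is an acronym-like sequence, not an ordinary word.

    min_len and max_len are arbitrary illustrative thresholds (assumptions).
    """
    letters = [ch for ch in token if ch.isalpha()]
    if not letters or not (min_len <= len(token) <= max_len):
        return False
    # Every letter must be upper case; digits and symbols (C++, BM25) are allowed.
    if not all(ch.isupper() for ch in letters):
        return False
    # Capitalisation only tells us something if the neighbours are lower case.
    others = [t for t in sentence
              if t != token and any(ch.isalpha() for ch in t)]
    if not others:
        return False
    lower_count = sum(1 for t in others if t.islower())
    return 2 * lower_count > len(others)


words = "the IRS sent a letter about C++ and H2SO4".split()
special = [w for w in words if looks_like_special_token(w, words)]
```

Tokens flagged this way would simply bypass the stemmer. In a sentence typed
entirely in capitals the function declines to flag anything, which is the
"different story" case.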

I think it's still reasonable to say -s should not be removed when there are no
vowels - it is like saying that endings should not be removed when the
syllable count is zero.
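
A minimal Python sketch of that rule, for illustration only. Treating 'y' as a
vowel and leaving a final -ss alone are assumptions made for the example; this
is not the actual Snowball code.

```python
VOWELS = set("aeiouy")  # counting 'y' as a vowel is an assumption


def strip_plural_s(word: str) -> str:
    """Remove a trailing -s only if what remains contains a vowel."""
    if len(word) > 1 and word.endswith(("s", "S")) and not word.lower().endswith("ss"):
        stem = word[:-1]
        if any(ch in VOWELS for ch in stem.lower()):
            return stem
    # No vowel left (zero syllables), or no plain -s ending: leave the word alone.
    return word
```

With this, strip_plural_s("cats") gives "cat", while "DNS" and "hrs" are left
alone. Note that "IRS" still loses its s, because it contains a vowel, which is
why recognising such character sequences is better handled as a separate step.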

Similar remarks can be made for stopwords. One ought to have an indexer that
can detect from the punctuation that 'either' etc. are not being used as
stopwords in these sentences:

  In Elizabethan times 'either' was often used to mean 'both'
  <I>and</I> can be used to connect almost any type of grammatical construct
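
A minimal Python sketch of such an indexer, assuming that a single word in
single quotes or inside well-formed <I>...</I> markup is being mentioned rather
than used. The stopword list, the patterns and the function name are all
illustrative assumptions.

```python
import re

# Tiny illustrative stopword list (an assumption, not a real list).
STOPWORDS = {"either", "and", "in", "was", "often", "to", "the", "of",
             "be", "can", "any"}

# A single word in single quotes, or wrapped in <I>...</I>, counts as a mention.
MENTION = re.compile(r"(?<!\w)'(\w+)'(?!\w)|<I>(\w+)</I>", re.IGNORECASE)
WORD = re.compile(r"\w+")


def index_terms(sentence: str) -> list:
    """Return index terms: mentioned words are kept even if they are stopwords."""
    mentioned = []

    def keep(match):
        mentioned.append((match.group(1) or match.group(2)).lower())
        return " "  # remove the mention from the text that is scanned next

    rest = MENTION.sub(keep, sentence)
    ordinary = [w.lower() for w in WORD.findall(rest)
                if w.lower() not in STOPWORDS]
    return mentioned + ordinary
```

Here index_terms("In Elizabethan times 'either' was often used to mean 'both'")
keeps "either" as a term, whereas in "either way is fine" it is dropped as an
ordinary stopword.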

Martin

This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:40 BST