[Snowball-discuss] Re: Porter stemming.

From: Martin Porter (martin_porter@softhome.net)
Date: Sun Jun 16 2002 - 15:56:54 BST


Dear Per Kristen,

The points you raise are of general interest in stemming, so I trust you
won't mind if I post the answer to the Snowball discuss list.

The ize/ise discrepancy is frequently commented on. Here for example is a
mail I sent on 22 Feb 2001 on this subject:

---------------------------

Re: Stemming American English vs. English

Dear André,

I don't think you need worry too much about English/American spelling
differences, as far as the Porter stemming algorithm is concerned. The main
difference is that -ize and -ise endings are (as you note) applied
differently in American and English usage, and the algorithm treats -ize as
an ending but not -ise.

Many people have adapted the algorithm by adding -ise to the list of
endings, but on balance I think that is a mistake. There are too many words
ending -ise where -ise should not be removed.
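
To illustrate the point, here is a minimal Python sketch. It is not part of the Porter algorithm: the helper strip_suffix and both word lists are hypothetical, and the sketch deliberately ignores the conditions the real algorithm applies before removing an ending. It only shows what a blanket -ise rule does to words in which -ise is not a suffix.

```python
def strip_suffix(word, suffix):
    """Remove suffix from word if present; otherwise return word unchanged.

    Hypothetical helper for illustration only, not a Porter stemmer rule.
    """
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word


# Words where -ise is the British spelling of the -ize suffix.
suffixed = ["realise", "recognise", "characterise"]

# Words that merely end in -ise; here -ise is not a suffix.
not_suffixed = ["promise", "precise", "exercise", "expertise", "otherwise"]

for w in suffixed + not_suffixed:
    print(w, "->", strip_suffix(w, "ise"))
```

The first group is reduced in the same way as its -ize counterparts would be, but the second group is cut down to fragments such as "prom" and "prec" that no longer identify the word.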

American spelling is much more logical than English, and -ize/-ise usage is
no exception. So in fact the Porter stemmer probably does better with
American English than with English English!

As a matter of fact -ize usage in England used to be much closer to the
American style than it now is. Here are Thackeray's -ize endings from Vanity
Fair (published 1847):

agonized
apologize apologized
authorized
capitalized
characterize
cicatrized
civilized
harmonized
idolizes
particularize
patronize patronized patronizes
proselytizer
realize realized
recognize recognized
tyrannize tyrannized
victimized victimizer

Today many of these words would have to be spelled -ise in England, e.g.
characterise, realise, recognise ....

Hope this helps,

Martin

----------------------------

The color/colour discrepancy is occasionally noted by users of the
algorithm, and I have seen adaptations where -our endings are respelled as
-or endings (or vice versa). But it is important to separate stemming from
spelling normalization. The -our ending of 'colour' is not a suffix. So it
might be respelled as -or, but that has nothing to do with suffix stripping.
My own feeling is that spelling normalization has its place, but that it
should be kept well separated from the stemming process.
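
One way to picture that separation is as two independent stages, sketched below in Python. Everything here is an assumption made for illustration: the SPELLING_VARIANTS table, the function names, and in particular toy_stem, which is a placeholder that strips a final -s and is not the Porter algorithm.

```python
# Hypothetical lookup table of spelling variants; a real system would
# need a much larger, curated list.
SPELLING_VARIANTS = {
    "colour": "color",
    "colours": "colors",
    "flavour": "flavor",
}


def normalize_spelling(word):
    """Stage 1: map a known spelling variant to its preferred form."""
    return SPELLING_VARIANTS.get(word, word)


def toy_stem(word):
    """Stage 2: placeholder stemmer (NOT Porter); strips a final -s."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def index_term(word):
    """Run the two stages in sequence, keeping them independent."""
    return toy_stem(normalize_spelling(word.lower()))


for w in ["Colour", "colours", "color", "hour", "four"]:
    print(w, "->", index_term(w))
```

Because the respelling is a word-level lookup and not a suffix rule, 'colour' and 'color' end up as the same term while 'hour' and 'four' are left alone; a blanket -our to -or rule inside the stemmer would have damaged them.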

Any later comments you might have on the Nordic stemmers would be welcome.

Best Wishes,

Martin Porter
 
At 03:39 PM 6/16/02 +0200, Per Kristen Fredlund wrote:
>Hello!
>
>I'm a student at the Norwegian University of Science and Technology, NTNU.
>
>I am currently developing an information retrieval system.
>
>Looking at the rules set out for the Porter stemming algorithm, it is clear
>to me that this stemmer is intended for American English.
>
>One example : (from step2)
>
>IZER -> IZE
>
>But if one considers British English, it is clear that the following
>should also be stemmed:
>
>ISER -> ISE
>
>The same Z/S analogy applies to other rules in the other steps
>of the algorithm.
>
>And then there is also the issue of -or/-our, as in color/colour. Within a
>collection of documents, both COLOR and COLOUR can occur, and both
>terms should be stemmed to the same root. Obviously OR and
>OUR are two different words.
>
>I am sure that these issues can easily be handled in either of the two
>versions of your otherwise excellent stemming algorithm.
>
>
>
>Best regards,
>
>Per K Fredlund
>
>
>PS! I think I will also implement some of your other stemming algorithms for
>the Nordic languages Norwegian, Swedish and Danish. Your Norwegian stemmer
>is a lot easier to code than the Golden/Fjeldvig stemmer, which
>has more than 600 rules.
>
>



