RE: [Snowball-discuss] Small changes to English stemmer

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Fri Jan 13 2006 - 15:42:49 GMT


1. I don't understand what problem the first change (for ied and ies) is
intended to solve.

I think nowadays the most likely usage of "ied" is "improvised explosive
device". Stemming this to "ie" is no better than, and perhaps worse than,
producing "i". Perhaps the best treatment is to leave it alone, as "ied",
so it will conflate with "ieds".

The most likely use of "ie" (after "i.e." written without the periods)
is for Internet Explorer, but this will rarely be spelled "ies". The most
likely usage of "ies" is as an acronym: Google finds 16 million hits, and
the first 100 are all acronyms. So again, perhaps just leave it alone.

2. The most frequent use of a leading Y as a vowel is in proper names,
e.g., Yvonne (13 M hits) and Yvette (5 M). But I do not think these are
affected by the second change; they still produce:
yvonne -> yvonn
yvette -> yvett

Hopefully helpfully yours,
Steve

---
Steven Tolkin 
There is nothing so practical as a good theory.  Comments are by me, not
Fidelity Investments, its subsidiaries or affiliates.

-----Original Message-----
From: snowball-discuss-bounces@lists.tartarus.org [mailto:snowball-discuss-bounces@lists.tartarus.org] On Behalf Of martin.porter@grapeshot.co.uk
Sent: Monday, January 09, 2006 5:24 AM
To: Snowball Discuss
Subject: [Snowball-discuss] Small changes to English stemmer

There have been two small changes to the English (Porter2) stemming algorithm. The first is that the Rule

ied ies replace by ie if preceded by just one letter, otherwise by i

has been changed to

ied ies replace by i if preceded by more than one letter, otherwise by ie

There is a corresponding change in the Snowball script, from

'ied' 'ies' ((next atlimit <-'ie') or <-'i')

to

'ied' 'ies' ((hop 2 <-'i') or <-'ie')

This ONLY affects the two 'words' ied and ies. Formerly they stemmed to i, now they stem to ie.
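
As a rough illustration of the difference, the two versions of the rule can be sketched in Python. This is only a sketch of this single rule taken in isolation, not of the full stemmer: the function names are made up for the example, and any other conditions the real algorithm applies before reaching this rule are ignored.

```python
SUFFIXES = ("ied", "ies")


def ied_ies_old(word: str) -> str:
    """Old rule: replace by 'ie' if preceded by just one letter, otherwise by 'i'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            preceding = len(word) - len(suffix)  # letters before the suffix
            return word[:preceding] + ("ie" if preceding == 1 else "i")
    return word


def ied_ies_new(word: str) -> str:
    """New rule: replace by 'i' if preceded by more than one letter, otherwise by 'ie'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            preceding = len(word) - len(suffix)  # letters before the suffix
            return word[:preceding] + ("i" if preceding > 1 else "ie")
    return word


if __name__ == "__main__":
    # Only the bare 'words' ied and ies should differ between the two rules.
    for w in ("ied", "ies", "ties", "cries"):
        print(w, ied_ies_old(w), ied_ies_new(w))
```

The two functions disagree only when nothing precedes the suffix, which is the case of the bare strings ied and ies.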

The second is that the line,

do ( ['y'] v <-'Y' set Y_found)

which did not match the Rule

Set initial y ... to Y,

has been changed to

do ( ['y'] <-'Y' set Y_found)

which does.

(The problem was whether to make the rule match the coding or the coding match the rule. The point is that in English initial y, when followed by a consonant, is a vowel, but that only archaic words have this shape: yclept and so on. I have decided to keep things simple and treat initial y as a consonant in all cases.)
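
A small Python sketch of the initial-y handling before and after the change may make the distinction concrete. It covers only the initial-y line quoted above, not the rest of the stemmer. The function names are invented for the example, and the vowel set a e i o u y is assumed to be what the `v` grouping stands for.

```python
VOWELS = set("aeiouy")  # assumption: the letters covered by the 'v' grouping


def mark_initial_y_old(word: str) -> str:
    """Old coding: initial y becomes Y only when the next letter is a vowel."""
    if len(word) >= 2 and word[0] == "y" and word[1] in VOWELS:
        return "Y" + word[1:]
    return word


def mark_initial_y_new(word: str) -> str:
    """New coding: initial y always becomes Y, i.e. is treated as a consonant."""
    if word.startswith("y"):
        return "Y" + word[1:]
    return word


if __name__ == "__main__":
    # Words with initial y followed by a consonant are the only ones that differ.
    for w in ("yellow", "yclept", "yvonne"):
        print(w, mark_initial_y_old(w), mark_initial_y_new(w))
```

Under this sketch, a word such as yellow is marked the same way by both versions, while yclept, yvonne and yvette are marked only by the new one.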

Both these changes are trivial.

There is a rule to remove initial apostrophe in the stemmer, which I have come to think is a bit feeble, but it can be left in for now.

Martin



