[Snowball-discuss] Stemming 'communing' and 'communed'

From: Michael Edwards (mbedwards@gmail.com)
Date: Thu Mar 29 2007 - 02:46:55 BST


Greetings!

I am about to release the first version of my implementation of the Porter2
stemming algorithm in PHP (native PHP code, no C, no extensions). I have tested
it against the sample vocabulary word lists and am down to one error. The
sample word lists show that "communing" should stem to "commune", but my
implementation stems it to "commun". Although it is not in the sample
vocabulary, "communed" is likewise stemmed to "commune" by the online Porter2
demo hosted at the snowball.tartarus.org site, while my implementation again
produces "commun". I have run through the spec by hand many times and cannot
work out how to arrive at the expected stem.

Below is a run-through of how I am interpreting the spec to arrive at
'commun':

1) Begin with 'communing'
2) R1: ing (per prefix exceptions for 'gener', 'commun', 'arsen'), R2: null
3) Prelude
4) Step 0
5) Step 1a
6) Step 1b, delete 'ing', get 'commun'
Note: try as I might, I cannot figure out how to come away with the
conclusion that the word is short and thus I should add an 'e' to the end.
7) Step 1c
8) Step 2
9) Step 3
10) Step 4
11) Step 5
12) Postlude

Result: 'commun'
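
To make the Step 1b question concrete, here is a minimal sketch of that one
step, written in Python rather than PHP for brevity. It is not the reference
implementation: the function names are made up for illustration, only the
'ing' and 'ed' suffixes are handled, and the prelude's marking of consonant
'y' as 'Y' is skipped. It assumes one reading of the spec, namely that a word
is short when it ends in a short syllable and R1 is null. Under that reading,
R1 for 'commun' begins after the exceptional prefix and is therefore empty, so
the short-word test would pass regardless of how many letters the word has.

```python
VOWELS = set("aeiouy")  # lowercase y treated as a vowel; prelude's Y marking is skipped
R1_PREFIXES = ("gener", "commun", "arsen")  # exceptional prefixes for R1
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def r1_start(word):
    """Index at which R1 begins; len(word) means R1 is null."""
    for prefix in R1_PREFIXES:
        if word.startswith(prefix):
            return len(prefix)
    # General rule: R1 follows the first non-vowel that comes after a vowel.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def ends_in_short_syllable(word):
    # (a) non-vowel, vowel, non-vowel other than w, x or Y at the end of the word
    if len(word) >= 3:
        c1, v, c2 = word[-3], word[-2], word[-1]
        if c1 not in VOWELS and v in VOWELS and c2 not in VOWELS and c2 not in "wxY":
            return True
    # (b) a two-letter word made of a vowel followed by a non-vowel
    return len(word) == 2 and word[0] in VOWELS and word[1] not in VOWELS


def step_1b(word):
    """Hypothetical, partial Step 1b: 'ing' and 'ed' only ('eed' is not handled)."""
    p1 = r1_start(word)  # assumption: R1 is a position fixed before any deletion
    if word.endswith("eed"):
        return word  # the 'eed' rule is outside this sketch
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            break
    else:
        return word
    if not any(ch in VOWELS for ch in stem):
        return word  # suffix is only deleted if the preceding part has a vowel
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if stem[-2:] in DOUBLES:
        return stem[:-1]
    # Short word: ends in a short syllable and R1 is null (starts at or past the end).
    if ends_in_short_syllable(stem) and p1 >= len(stem):
        return stem + "e"
    return stem


for w in ("communing", "communed", "hoping", "fishing"):
    print(w, "->", step_1b(w))
```

If this reading is right, 'commun' ends in the short syllable 'mun' and has a
null R1, so the sketch appends the 'e'. The same result follows whether R1 is
kept as a fixed position or recomputed on 'commun', since the prefix exception
places it at the end of the word either way.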

Any thoughts or clarification would be much appreciated.

Best regards,
Michael Edwards


