Re: [Snowball-discuss] Stemming 'communing' and 'communed'

From: Michael Edwards (mbedwards@gmail.com)
Date: Thu Mar 29 2007 - 11:28:54 BST


On 3/29/07, Martin Porter <martin.porter@grapeshot.co.uk> wrote:
> > ... my algorithm stems it to "commun". I have run through the spec
> > 'by-hand' many times and cannot figure out how to get to the proper
> > stemming.
> >
> The reason is that prefix 'commun' is handled specially by Porter2 (see
> the 'mark_regions' routine) so that in effect it is treated as one
> syllable, rather than two syllables. So 'communing' behaves like
> 'tuning' etc. Similarly Porter2 stems 'communism' to 'communism' while
> Porter stems 'communism' to 'commun'.
>
> Were you thinking of contributing your PHP version to
>
> http://snowball.tartarus.org/otherlangs/index.html

Thanks for the reply!

I'm definitely planning to contribute the PHP version to the community
when I am confident it performs well in a production setting.

I currently have 'gener', 'commun', and 'arsen' as the exceptions you
reference. If I understand correctly, you are saying that I should
always treat these exceptional prefixes as short syllables? It is not
clear to me from reading the spec's definitions of short syllables and
short words that I should be doing this. Rather, it reads as though
the only difference is in the setting of R1, which is not intrinsically
linked to the definition of short syllables or short words in the
spec. So I am looking for a little more clarification, so that I
can try to future-proof my code with respect to additional exceptional
prefixes that may be added down the road.
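
To make the question concrete, here is a minimal Python sketch of how the
R1 setting and the short-word test might interact, based on my reading of
the published Porter2 definitions. The function names and the `prefixes`
parameter are hypothetical, the prefix list is only the three named in this
thread, and the marking of consonantal 'y' as 'Y' is omitted for brevity,
so treat it as an illustration rather than a reference implementation.

```python
VOWELS = set("aeiouy")

# Exceptional prefixes named in this thread; the spec may list others.
EXCEPTIONAL_PREFIXES = ("gener", "commun", "arsen")


def r1_start(word, prefixes=EXCEPTIONAL_PREFIXES):
    """Return the index where R1 begins (len(word) if R1 is empty)."""
    # Special case: R1 starts directly after an exceptional prefix.
    for prefix in prefixes:
        if word.startswith(prefix):
            return len(prefix)
    # General case: R1 starts after the first non-vowel that follows a vowel.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def ends_in_short_syllable(word):
    """Check the two short-syllable shapes at the end of the word."""
    n = len(word)
    if n == 2:
        # A vowel at the beginning of the word followed by a non-vowel.
        return word[0] in VOWELS and word[1] not in VOWELS
    if n >= 3:
        # Non-vowel, vowel, then a non-vowel other than w, x or Y.
        c1, v, c2 = word[-3], word[-2], word[-1]
        return (c1 not in VOWELS and v in VOWELS
                and c2 not in VOWELS and c2 not in "wxY")
    return False


def is_short_word(word, r1):
    """A word is short if it ends in a short syllable and R1 is empty."""
    return ends_in_short_syllable(word) and r1 >= len(word)


# R1 is computed once on the original word, then reused after 'ing' is removed.
special = r1_start("communing")               # uses the 'commun' exception
general = r1_start("communing", prefixes=())  # ignores the exceptions
print(special, is_short_word("commun", special))
print(general, is_short_word("commun", general))
```

If this reading is right, the definition of a short syllable itself does
not change. The only link is that a short word requires an empty R1, so
moving R1 past 'commun' is what lets 'commun' count as a short word once
'ing' has been removed, just as 'tun' does for 'tuning'.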

Best regards,
Michael


