Re: [Snowball-discuss] The Norwegian stemmer algorithm

From: Martin Porter (martin_porter@softhome.net)
Date: Tue Nov 27 2001 - 15:14:14 GMT


>I'm making a port of the scandinavian stemmer algorithm
>for perl. You can fetch it from:
>
>http://www.unixmonks.net/~ask/Stemmer-Norwegian-0.3.tar.gz

Thanks, Ask, that is most interesting. I think it would be useful eventually
to have a collection of links to resources from the Snowball site. Could we
put your version in?

At present it is early days with Snowball, but if we imagine 10 to 20
stemmers coded up in 5 to 10 different languages around the Web, we could be
talking about a lot of information.

 
>
>There is one thing I can't understand, though,
>on the description of the algorithm you say:
>
>> R2 is not used: R1 is defined in the same way as in the German
>> stemmer.
>
>And on the German page, it says:
>
>> R1 and R2 are first set up in the standard way (see 3.1), but then R1
>> is adjusted so that the region before it contains at least 3 letters.
>
>Where is "3.1" ? :-)

I'm sorry about that. 3.1 is part of an old numbering scheme which I thought
I'd eliminated. I'll fix it. Go to the Porter stemmer for the definition of
R1 and R2, although I guess you must know what the definition is.

>If you unpack that tarball and try to run it against the diff.txt:
>% perl stemmer.pl diffs.txt | wc -l
>you'll see that 120 out of 20628 differs.
>
>Why???
>
>I'd guess this has something
>to with the snowball thingie:
>
>> $p1 = limit
>> goto v gopast non-v setmark p1
>> try ($p1 < 3 $p1 = 3)

Mmmm - I think no-one is reading the Snowball manual :-) . It sets p1 to 3
if it is less than 3. So p1 is (a) after the first non-vowel following a
vowel, or (b) after the 3rd letter, whichever position is further right.
Basically, 2 letters is too little for a residual stem in German, and I think
in Norwegian too.
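
For anyone porting this, the rule can be sketched in Python as below. This is
only an illustration of the description above, not the Snowball-generated
code: the function name mark_p1, the min_prefix parameter and the vowel set
are assumptions made for the example (check the stemmer description for the
vowel set it actually uses). If no vowel followed by a non-vowel is found, p1
stays at the end of the word, and words shorter than 3 letters get no special
handling here.

```python
# Assumed vowel set, for illustration only.
VOWELS = frozenset("aeiouyæåø")


def mark_p1(word, vowels=VOWELS, min_prefix=3):
    """Return p1, the offset at which the region R1 starts in `word`."""
    n = len(word)

    # "goto v": move forward to the first vowel.
    i = 0
    while i < n and word[i] not in vowels:
        i += 1

    # "gopast non-v": from there, move to just after the next non-vowel.
    j = i
    while j < n and word[j] in vowels:
        j += 1

    if i >= n or j >= n:
        # No vowel followed by a non-vowel: p1 keeps its initial
        # value, the end of the word ("$p1 = limit").
        return n

    p1 = j + 1  # "setmark p1"

    # "try ($p1 < 3 $p1 = 3)": at least min_prefix letters before R1.
    if p1 < min_prefix:
        p1 = min_prefix
    return p1
```

With these assumptions, mark_p1("avis") gives 3 rather than 2, because the
region before R1 would otherwise hold only 2 letters, while mark_p1("strand")
gives 5. R1 itself is then word[p1:].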

----

Any observations on the stemmer would be useful - I know little about Norwegian. Is a stemmer for Nynorsk of any importance?

Incidentally, how did you come across Snowball? It is not widely known as yet.

Martin





This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:40 BST