Re: [Snowball-discuss] The Norwegian stemmer algorithm

From: Martin Porter (martin_porter@softhome.net)
Date: Thu Nov 29 2001 - 08:30:17 GMT


Ask,

>But as far as I can tell, this algorithm already takes a lot of nynorsk,
>because -ar, -ande, -ast, -ane, -eleg, -eig and -leg is not "bokmål" but
>nynorsk.

I developed the algorithm with a particular vocabulary which I put together
myself from texts downloaded from the Web. I had assumed that the texts were
entirely bokmål, but I must have been in error here. I am quite willing to
redo the work if you can guide me to separate texts in nynorsk and bokmål.
You need about 4-5 megabytes of a language as a sample, and the texts
should be as plain as possible as far as mark-up goes, and representative of
the contemporary language. If, on the other hand, the simple Norwegian
stemmer I've presented works equally well on bokmål and nynorsk, so much the
better. I suppose that in setting up IR systems for Norwegian text it must
be an inconvenience to have to separate the two dialects.

Martin




