[Snowball-discuss] Norwegian stemmer charset variants not in step

From: Olly Betts (olly@survex.com)
Date: Mon Sep 11 2006 - 13:07:09 BST


The two character set variants of the Norwegian stemmer differ in more
than just their encoding. I believe the ISO-8859-1 version is the more
up-to-date one:

--- norwegian/stem_ISO_8859_1.sbl 2006-09-11 13:00:37.000000000 +0100
+++ norwegian/stem_MS_DOS_Latin_I.sbl 2006-09-11 13:00:37.000000000 +0100
@@ -13,15 +13,15 @@
 
 stringescapes {}
 
-/* special characters (in ISO Latin I) */
+/* special characters (in MS-DOS Latin I) */
 
-stringdef ae hex 'E6'
-stringdef ao hex 'E5'
-stringdef o/ hex 'F8'
+stringdef ae hex '91'
+stringdef ao hex '86'
+stringdef o/ hex '9B'
 
 define v 'aeiouy{ae}{ao}{o/}'
 
-define s_ending 'bcdfghjlmnoprtvyz'
+define s_ending 'bcdfghjklmnoprtvyz'
 
 define mark_regions as (
 
@@ -43,7 +43,7 @@
             'hetens' 'ers' 'ets' 'et' 'het' 'ast'
                 (delete)
             's'
-                (s_ending or ('k' non-v) delete)
+                (s_ending delete)
             'erte' 'ert'
                 (<-'er')
         )
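
To make the behavioural difference in the second hunk concrete, here is a
minimal Python sketch of just the 's' rule in each variant. It is an
illustration under simplifying assumptions, not the Snowball implementation:
it ignores the R1 region limit and every other suffix in the among, and the
function names strip_s_iso and strip_s_dos are made up for this example.

```python
# Vowel grouping 'v' from the stemmer source.
VOWELS = set("aeiouyæåø")

# s_ending as defined in each variant (the MS-DOS one also contains 'k').
S_ENDING_ISO = set("bcdfghjlmnoprtvyz")
S_ENDING_DOS = set("bcdfghjklmnoprtvyz")


def strip_s_iso(word: str) -> str:
    """ISO-8859-1 variant: (s_ending or ('k' non-v) delete).

    Simplification: no R1 check, only the final 's' is considered.
    """
    if len(word) < 2 or not word.endswith("s"):
        return word
    prev = word[-2]
    if prev in S_ENDING_ISO:
        return word[:-1]
    # 'k' only qualifies when the letter before it exists and is not a vowel.
    if prev == "k" and len(word) >= 3 and word[-3] not in VOWELS:
        return word[:-1]
    return word


def strip_s_dos(word: str) -> str:
    """MS-DOS Latin I variant: (s_ending delete), with 'k' in s_ending.

    Simplification: no R1 check, only the final 's' is considered.
    """
    if len(word) < 2 or not word.endswith("s"):
        return word
    if word[-2] in S_ENDING_DOS:
        return word[:-1]
    return word


for w in ("verks", "laks", "hus"):
    print(w, strip_s_iso(w), strip_s_dos(w))
```

Under these assumptions the two variants agree on "verks" (consonant before
the 'k') but disagree on "laks": the ISO-8859-1 rule keeps the 's' because
the 'k' follows a vowel, while the MS-DOS rule removes it.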

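The stringdef changes in the first hunk, by contrast, are the expected
encoding difference. A quick way to check the hex values is to encode the
three letters in Python. This assumes that "MS-DOS Latin I" corresponds to
Python's cp850 codec; the helper name hex_codes is only for illustration.

```python
# Snowball stringdef names mapped to the letters they stand for.
LETTERS = {"ae": "æ", "ao": "å", "o/": "ø"}


def hex_codes(codec: str) -> dict:
    """Return the single-byte hex value of each letter in the given codec."""
    return {name: ch.encode(codec).hex().upper() for name, ch in LETTERS.items()}


iso_codes = hex_codes("iso-8859-1")
dos_codes = hex_codes("cp850")  # assumption: MS-DOS Latin I == code page 850

print(iso_codes)  # {'ae': 'E6', 'ao': 'E5', 'o/': 'F8'}
print(dos_codes)  # {'ae': '91', 'ao': '86', 'o/': '9B'}
```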
 
The other stemmers are consistent between character set variations,
except for a differently indented closing round bracket in the
Swedish stemmer.

Cheers,
    Olly


