The Snowball scripts on this site define the codings of accented letters and other non-ASCII forms in a series of explicit declarations. For example, the German stemmer includes the lines
    /* special characters (in ISO Latin I) */

    stringdef a"   hex 'E4'
    stringdef o"   hex 'F6'
    stringdef u"   hex 'FC'
    stringdef ss   hex 'DF'

In the ISO Latin I code set, hex E4, F6, FC and DF are the numeric values of characters ä, ö, ü and ß respectively. These codings in the stemmer scripts then correspond to the codings used in the sample data.
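The correspondence between the hex values and the characters can be checked outside Snowball. The following is a minimal sketch in Python, assuming that Python's `latin-1` codec is an adequate stand-in for ISO Latin I; the names `GERMAN_STRINGDEFS` and `decode_latin1` are illustrative and not part of Snowball.

```python
# Hex values copied from the stringdef lines of the German stemmer.
GERMAN_STRINGDEFS = {
    'a"': 0xE4,
    'o"': 0xF6,
    'u"': 0xFC,
    'ss': 0xDF,
}


def decode_latin1(value: int) -> str:
    """Return the character that a single byte value denotes in ISO Latin I.

    Assumption: Python's "latin-1" codec is used to represent ISO Latin I.
    """
    return bytes([value]).decode("latin-1")


# Map each stringdef name to the character its hex value stands for.
decoded = {name: decode_latin1(value) for name, value in GERMAN_STRINGDEFS.items()}

if __name__ == "__main__":
    for name, char in decoded.items():
        print(f"{name:<4} hex '{GERMAN_STRINGDEFS[name]:02X}' is {char}")
```

Running the script lists each stringdef name next to the character its byte value denotes, which should agree with the four characters named above.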
For a more general approach, you may wish to replace the set of stringdefs by a get directive of the form,
    get 'ISO-Latin-1'

possibly compiling with an -include option that declares the directory where this and other files are held,
    Snowball gstem.sbl -o gstem ... -include /home/shazzer/snowball/codesets

Appropriate code sets for ISO Latin I and MS-DOS Latin I are provided via the links above, and others will be added on demand or if submitted to us.
For Russian, two sets of stringdefs are given in the script: KOI8-R and (commented out) Unicode. For the other stemmers currently on offer, the Unicode placings correspond to the ISO Latin I placings, so no extra headers for Unicode need be given at present.
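The difference between the Russian and the Latin-alphabet cases can be illustrated with a short check. This is a sketch in Python, assuming its `latin-1` and `koi8_r` codecs represent ISO Latin I and KOI8-R; the helper `byte_value` is hypothetical and not part of Snowball.

```python
def byte_value(char: str, codec: str) -> int:
    """Return the single byte that encodes `char` in the given 8-bit codec."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {codec}")
    return encoded[0]


# For the German special characters, the ISO Latin I byte value and the
# Unicode code point are the same number.
LATIN_CHARS = "äöüß"
latin1_matches_unicode = all(
    byte_value(c, "latin-1") == ord(c) for c in LATIN_CHARS
)

# For Cyrillic, the KOI8-R byte value and the Unicode code point differ,
# which is why a separate set of stringdefs is needed.
CYRILLIC_A = "\u0430"  # CYRILLIC SMALL LETTER A
koi8r_value = byte_value(CYRILLIC_A, "koi8_r")
unicode_value = ord(CYRILLIC_A)

if __name__ == "__main__":
    print("Latin I values equal Unicode code points:", latin1_matches_unicode)
    print(f"Cyrillic a: KOI8-R {koi8r_value:02X}, Unicode {unicode_value:04X}")
```

The check covers only the four German characters and one Cyrillic letter; it illustrates the point rather than verifying every character used by the stemmers.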
If you wish to describe other Latin-alphabet based codesets for use in Snowball headers, you should adhere to the following conventions:
The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian o/, Icelandic d/ and Polish l/.
Use ae and ss for æ ligature and the German ß, with upper case forms AE and SS. Use th for Icelandic thorn.
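When describing another 8-bit codeset, the hex values for a set of stringdefs can be derived mechanically instead of being looked up by hand. The sketch below does this in Python for the four German characters shown earlier. It assumes that Python's `cp850` codec corresponds to MS-DOS Latin I, and the function `stringdef_lines` is purely illustrative; it is not a Snowball tool, and the output layout simply imitates the stringdef lines of the German stemmer.

```python
from typing import List

# Names follow the stringdef names used in the German stemmer.
CHARACTERS = {'a"': "ä", 'o"': "ö", 'u"': "ü", "ss": "ß"}


def stringdef_lines(codec: str) -> List[str]:
    """Build one stringdef line per character for an 8-bit codec.

    Raises ValueError if a character does not fit in a single byte,
    since each line here carries exactly one two-digit hex value.
    """
    lines = []
    for name, char in CHARACTERS.items():
        encoded = char.encode(codec)
        if len(encoded) != 1:
            raise ValueError(f"{char!r} is not a single byte in {codec}")
        lines.append(f"stringdef {name:<4} hex '{encoded[0]:02X}'")
    return lines


if __name__ == "__main__":
    # Assumption: "cp850" stands in for MS-DOS Latin I.
    for line in stringdef_lines("cp850"):
        print(line)
```

Calling `stringdef_lines("latin-1")` reproduces the values E4, F6, FC and DF given for the German stemmer, which is a quick way to confirm that the sketch behaves as intended before trying another codec.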