Character codes
Links
The Snowball scripts on this site define the codings of accented letters and other non-ASCII forms in a series of explicit declarations. For example, the German stemmer includes the lines

    /* special characters (in ISO Latin I) */

    stringdef a"   hex 'E4'
    stringdef o"   hex 'F6'
    stringdef u"   hex 'FC'
    stringdef ss   hex 'DF'

In the ISO Latin I code set, hex E4, F6, FC and DF are the numeric values of characters ä, ö, ü and ß respectively. These codings in the stemmer scripts then correspond to the codings used in the sample data.

For a more general approach, you may wish to replace the set of stringdefs by a get directive of the form

    get 'ISO-Latin-1'

possibly compiling with an -include option that declares the directory where this and other files are held:

    Snowball gstem.sbl -o gstem ... -include /home/shazzer/snowball/codesets

Appropriate code sets for ISO Latin I and MS-DOS Latin I are provided via the links above, and others will be added on demand or if submitted to us.

For Russian, two sets of stringdefs are given in the script: KOI8-R, and (commented out) Unicode. For the other stemmers currently on offer the Unicode placings correspond to the ISO Latin I placings, so no extra headers for Unicode need, at present, be given.

If you wish to describe other Latin-alphabet based codesets for use in Snowball headers, you should adhere to the following conventions:
The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian o/, Icelandic d/ and Polish l/. Use ae and ss for the æ ligature and the German ß, with upper case forms AE and SS. Use th for Icelandic thorn.
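
The relationship between the hex values in the German stringdefs and the code sets mentioned above can be checked with a short script. The following is a minimal sketch in Python, not part of the Snowball distribution. It assumes that Python's `latin-1` codec corresponds to ISO Latin I, that `cp850` corresponds to MS-DOS Latin I, and that `koi8-r` corresponds to KOI8-R. The names `GERMAN_STRINGDEFS` and `decode_byte` are hypothetical and exist only for this illustration.

```python
# Hex values copied from the German stemmer's stringdef lines quoted above.
GERMAN_STRINGDEFS = {
    'a"': 0xE4,
    'o"': 0xF6,
    'u"': 0xFC,
    'ss': 0xDF,
}


def decode_byte(value: int, codec: str) -> str:
    """Decode a single byte value under the named Python codec."""
    return bytes([value]).decode(codec)


# Assumption: Python's 'latin-1' codec is the ISO Latin I code set.
latin1_chars = {
    name: decode_byte(value, "latin-1")
    for name, value in GERMAN_STRINGDEFS.items()
}

# For these characters the Unicode code point equals the ISO Latin I value,
# which is why no separate Unicode header is needed for them.
unicode_matches = all(
    ord(char) == GERMAN_STRINGDEFS[name]
    for name, char in latin1_chars.items()
)

# Assumption: Python's 'cp850' codec is the MS-DOS Latin I code set.
# The same characters sit at different byte values there, so a script
# compiled against the ISO Latin I stringdefs would not match such data.
cp850_values = {
    name: char.encode("cp850")[0]
    for name, char in latin1_chars.items()
}

# Cyrillic small letter a (U+0430): its KOI8-R byte value differs from its
# Unicode code point, so Russian needs a distinct set of stringdefs per coding.
cyrillic_a = "\u0430"
koi8r_value = cyrillic_a.encode("koi8-r")[0]

if __name__ == "__main__":
    for name, char in latin1_chars.items():
        print(f"{name:3} latin-1 {GERMAN_STRINGDEFS[name]:02X} "
              f"cp850 {cp850_values[name]:02X} {char}")
    print("Unicode matches ISO Latin I:", unicode_matches)
    print(f"Cyrillic a: KOI8-R {koi8r_value:02X}, Unicode {ord(cyrillic_a):04X}")
```

Running the script lists each stringdef name with its value in both Latin code sets, which gives a quick way to sanity-check a new header file against the sample data it is meant to match.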