Character codes

Links

Snowball main page

Snowball header files for
ISO Latin 1 (hex form)
ISO Latin 1 (decimal form)
MS-DOS Latin 1


The Snowball scripts on this site define the codings of accented letters and other non-ASCII forms in a series of explicit declarations. For example, the German stemmer includes the lines
    /* special characters (in ISO Latin I) */

    stringdef a"   hex 'E4'
    stringdef o"   hex 'F6'
    stringdef u"   hex 'FC'
    stringdef ss   hex 'DF'

In the ISO Latin 1 code set, hex E4, F6, FC and DF are the numeric values of the characters ä, ö, ü and ß respectively. The codings in the stemmer scripts therefore correspond to the codings used in the sample data.
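
As a quick check of these values outside Snowball, the following Python sketch decodes each byte as ISO Latin 1. It assumes only that Python's "latin-1" codec corresponds to the ISO Latin 1 code set; the variable names are illustrative.

```python
# The four stringdef values from the German stemmer, as byte values.
codes = {'a"': 0xE4, 'o"': 0xF6, 'u"': 0xFC, 'ss': 0xDF}

# Decode each single byte as ISO Latin 1 (Python codec name "latin-1").
decoded = {name: bytes([value]).decode("latin-1") for name, value in codes.items()}

for name, char in decoded.items():
    print(f"{name:3} hex {codes[name]:02X} -> {char}")
```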

For a more general approach, you may wish to replace the set of stringdefs with a get directive of the form
    get 'ISO-Latin-1'
possibly compiling with an -include option that declares the directory where this and other files are held,
    Snowball gstem.sbl -o gstem ... -include /home/shazzer/snowball/codesets
Appropriate code sets for ISO Latin 1 and MS-DOS Latin 1 are provided via the links above, and others will be added on demand or if submitted to us.
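
To see how the same letters are placed in different code sets, here is a minimal Python sketch that prints stringdef lines for a chosen codec. It is not the tool used to produce the headers on this site. The helper name stringdefs is hypothetical, and treating Python's "cp850" codec as MS-DOS Latin 1 is an assumption.

```python
def stringdefs(chars, codec):
    """Return Snowball stringdef lines for a {name: character} mapping.

    Hypothetical helper: each character must encode to one byte in `codec`.
    """
    lines = []
    for name, char in chars.items():
        encoded = char.encode(codec)  # raises UnicodeEncodeError if not in the code set
        if len(encoded) != 1:
            raise ValueError(f"{char!r} is not a single byte in {codec}")
        lines.append(f"stringdef {name:<4} hex '{encoded[0]:02X}'")
    return lines


german = {'a"': "ä", 'o"': "ö", 'u"': "ü", "ss": "ß"}

latin1_lines = stringdefs(german, "latin-1")  # ISO Latin 1
msdos_lines = stringdefs(german, "cp850")     # assumed to match MS-DOS Latin 1

print("\n".join(latin1_lines))
print("\n".join(msdos_lines))
```

The names stay the same across code sets; only the hex values change, which is what allows a stemmer script to switch code sets by switching headers.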

For Russian, two sets of stringdefs are given in the script: KOI8-R and (commented out) Unicode. For the other stemmers currently on offer, the Unicode placings correspond to the ISO Latin 1 placings, so no extra headers for Unicode need be given at present.
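
The correspondence between Unicode and ISO Latin 1 placings, and the lack of one for KOI8-R, can be checked with a short Python sketch. The sample letters are chosen here for illustration and are not taken from the stemmer scripts; the codec names "latin-1" and "koi8_r" are Python's.

```python
# A sample of accented letters that exist in ISO Latin 1 (illustrative choice).
latin1_letters = "äöüßéàçñåøæ"

# For each letter, the Unicode code point equals the ISO Latin 1 byte value.
same_in_unicode = all(ord(c) == c.encode("latin-1")[0] for c in latin1_letters)

# For Russian the two codings differ, so separate stringdefs are needed.
russian = "\u0430"                    # CYRILLIC SMALL LETTER A
koi8 = russian.encode("koi8_r")[0]    # its byte value in KOI8-R
unicode_cp = ord(russian)             # its Unicode code point

print(same_in_unicode, hex(koi8), hex(unicode_cp))
```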

If you wish to describe other Latin-alphabet based codesets for use in Snowball headers, you should adhere to the following conventions:
    accent          ASCII form      example
    acute           single quote    e'
    grave           grave           a`
    umlaut          double quote    u"
    circumflex      circumflex      i^
    cedilla         comma           c,
    tilde           tilde           n~
    ring            letter o        ao
    line through    solidus         o/
And, should they ever arise, use b for breve (as in Romanian), l and r for left and right hook (as in Polish), q for double acute (as in Hungarian) and v for háček (as in Czech).

The ‘line-through’ accent covers a number of miscellaneous cases: the Scandinavian o/, Icelandic d/ and Polish l/.

Use ae and ss for the æ ligature and the German ß, with upper case forms AE and SS. Use th for Icelandic thorn.
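
The naming conventions above can be applied mechanically. The Python sketch below derives a Snowball-style name from a character by splitting it into base letter and combining accent with the standard unicodedata module. The function snowball_name and both lookup tables are hypothetical and written for illustration. The left and right hook cases are left out, since their mapping to Unicode accents is not spelled out here.

```python
import unicodedata

# Combining accent -> ASCII suffix, following the conventions above.
SUFFIX = {
    "\u0301": "'",   # acute
    "\u0300": "`",   # grave
    "\u0308": '"',   # umlaut
    "\u0302": "^",   # circumflex
    "\u0327": ",",   # cedilla
    "\u0303": "~",   # tilde
    "\u030A": "o",   # ring
    "\u0306": "b",   # breve
    "\u030B": "q",   # double acute
    "\u030C": "v",   # hacek
}

# Letters that do not decompose into base letter plus accent.
SPECIAL = {
    "\u00F8": "o/",  # Scandinavian o with line through
    "\u00F0": "d/",  # Icelandic d with line through
    "\u0142": "l/",  # Polish l with line through
    "\u00E6": "ae",  # ae ligature
    "\u00C6": "AE",  # upper case ae ligature
    "\u00DF": "ss",  # German sharp s
    "\u00FE": "th",  # Icelandic thorn
}


def snowball_name(char):
    """Return the conventional ASCII name for one accented character."""
    char = unicodedata.normalize("NFC", char)
    if char in SPECIAL:
        return SPECIAL[char]
    decomposed = unicodedata.normalize("NFD", char)
    if len(decomposed) == 2 and decomposed[1] in SUFFIX:
        return decomposed[0] + SUFFIX[decomposed[1]]
    raise ValueError(f"no convention known for {char!r}")


# Normalize first so that each accented letter is a single character.
for char in unicodedata.normalize("NFC", "éàüîçñåøæßþ"):
    print(char, snowball_name(char))
```

A table-driven approach is used so that a new convention can be added as one dictionary entry rather than as new code.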