Links to resources | |||||||||||||||||||||||||||||||||||||||||||
Snowball definitionSnowball is a small string-handling language, and its name was chosen as a tribute to SNOBOL (Farber 1964, Griswold 1968 — see the references at the end of the introduction), with which it shares the concept of string patterns delivering signals that are used to control the flow of the program.1 Data typesThe basic data types handled by Snowball are strings of characters, signed integers, and boolean truth values, or more simply strings, integers and booleans. Snowball's characters are either 8-bit wide, or 16-bit, depending on the mode of use. In particular, both 8-bit ASCII and 16-bit Unicode are supported.2 NamesA name in Snowball is a letter followed by zero or more letters, digits and underlines. A name can be of type string, integer, boolean, routine, external or grouping. All names must be declared. A declaration has the form
Ts ( ... )
where symbol T is one of string, integer etc, and the region in
brackets contains a list of names separated by whitespace. For example,
integers ( p1 p2 )
booleans ( Y_found )
routines (
shortv
R1 R2
Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5a Step_5b
)
externals ( stem )
groupings ( v v_WXY v_LSZ )
p1 and p2 are integers, Y_found is boolean, and so on. Snowball is quite
strict about the declarations, so all the names go in the same name space,
no name may be declared twice, all used names must be declared, no two
routine definitions can have the same name, etc. Names declared and
subsequently not used are merely reported in a warning message. A name may
not be one of the reserved words of Snowball.
3 LiteralsA literal integer is a digit sequence, and is always interpreted as decimal. A literal string is written between single quotes, for example,
'aeiouy'
In a stringdef (see below), string may be preceded by the word hex,
or the word decimal,
in which case the contents are
interpreted as characters written out in hexadecimal, or decimal, notation.
The characters should be separated by spaces. For example,
hex 'DA' /* is character hex DA */
hex 'D A' /* is the two characters, hex D and A (carriage
return, and line feed) */
decimal '10' /* character 10 (line feed) */
decimal '13 10' /* characters 13 and 10 (carriage return, and
line feed) */
The following forms are equivalent,
hex 'd a' /* lower case also allowed */
hex '0D 000A' /* leading zeroes ignored */
hex ' D A ' /* extra spacing is harmless */
stringdefs define special string macros, to handle unusual
character combinations.
Macro m may is defined in the form stringdef m 'S', where 'S' is a string, and m a sequence of one or more printing characters terminating with whitespace. Two special insert characters are defined by the directive stringescapes AB, where A and B are printing characters, and A is not single quote. (B may equal A, but then A itself can never be escaped.) For example,
stringescapes []
A subsequent occurrence of the same directive redefines the insert
characters.
Thereafter, [m] inside a string causes S to be substituted in place of m. Immediately after the stringescapes directive, ['] will substitute ' and [[] will substitute [, although macros ' and [ may subsequently be redefined. A further feature is that [W] inside a string, where W is a sequence of whitespace characters including one or more newlines, is ignored. This enables long strings to be written over a number of lines. For example,
stringescapes []
/* special Spanish characters (in MS-DOS Latin I) */
stringdef a' hex 'A0' // a-acute
stringdef e' hex '82' // e-acute
stringdef i' hex 'A1' // i-acute
stringdef o' hex 'A2' // o-acute
stringdef u' hex 'A3' // u-acute
stringdef u" hex '81' // u-diaeresis
stringdef n~ hex 'A4' // n-tilde
/* and in the next string we define all the characters in Spanish
used to represent vowels
*/
define v 'aeiou[a'][e'][i'][o'][u'][u"]'
4 RoutinesA routine definition has the form
define R as C
where R is the routine name and C is a command, or bracketed group of
commands. So a routine is defined as a sequence of zero or more commands.
Snowball routines do not (at present) take parameters. For example,
define Step_5b as ( // this defines Step_5b
['l'] // three commands here: [, 'l' and ]
R2 'l' // two commands, R2 and 'l'
delete // delete is one command
)
define R1 as $p1 <= cursor
/* R1 is defined as the single command "$p1 <= cursor" */
A routine is called simply by using its name, R, as a command.
5 Commands and signalsThe flow of control in Snowball is arranged by the implicit use of signals, rather than the explicit use of constructs like the if, then, break of C. The scheme is designed for handling strings, but is perhaps easier to introduce using integers. Suppose x, y, z ... are integers. The command
$x = 1
sets x to 1. The command
$x > 0
tests if x is greater than zero. Both commands give a signal t or f,
(true or false), but while the second command gives t if x is greater
than zero and f otherwise, the first command always gives t. In Snowball,
every command gives a t or f signal. A sequence of commands can be turned
into as a single command by putting them in a list surrounded by round
brackets:
( C1 C2 C3 ... Ci Ci+1 ... )
When this is obeyed, Ci+1 will be obeyed if each of the preceding C1 ...
Ci give t, but as soon as a Ci gives f, the subsequent Ci+1 Ci+2 ...
are ignored, and the whole sequence gives signal f. If all the Ci give t,
however, the bracketed command sequence also gives t. So,
$x > 0 $y = 1
sets y to 1 if x is greater than zero. If x is less than or equal to zero
the two commands give f.
If C1 and C2 are commands, we can build up the larger commands,
try($x > 0 $z > 0 $y = 1)
so that and seems unnecessary here. But we will see that and has a
particular significance in string commands.
When a ‘monadic’ construct like not, try or fail is not followed by a round bracket, the construct applies to the shortest following valid command. So for example
try not $x < 1 $z > 0
would mean
try ( not ( $x < 1 ) ) $z > 0
because $x < 1 is the shortest valid command following not, and then
not $x < 1 is the shortest valid command following try.
The ‘dyadic’ constructs like and and or must sit in a bracketed list of commands anyway, for example,
( C1 C2 and C3 C4 or C5 )
And then in this case C2 and C3 are connected by the and; C4 and C5 are
connected by the or. So
$x > 0 not $y > 0 or not $z > 0 $t > 0
means
$x > 0 ((not ($y > 0)) or (not ($z > 0))) $t > 0
and and or are equally binding, and bind from left to right,
so C1 or C2 and C3 means (C1 or C2) and C3 etc.
6 AEs and integer commandsAn AE (arithmetic expression) consists of integer names and literal numbers connected by dyadic +, -, * and /, and monadic -, with the same binding powers and semantics as C. An integer command has the form
$X op AE
where X is an integer name and op is one of the six tests ==, !=, >=, >,
<=, <, or five assignments =, +=, -=, *=, /=. Again, the meanings are the
same as in C.
As well as integer names and literal numbers, the following may be used in AEs:
Examples of integer commands are,
$p1 <= cursor // signal is f if the cursor is before position p1
$p1 = limit // set p1 to the string limit
7 String commandsIf s is a string name, a string command has the form
$s C
where C is a command that operate on the string. Strings can be processed
left-to-right or right-to-left, but we will describe only the
left-to-right case for now. The string has a cursor, which we will
denote by c, and a limit point, or limit, which we will denote by l. c
advances towards l in the course of a string command, but the various
constructs and, or, not etc have side-effects which keep moving it
backwards. Initially c is at the start and l the end of the string. For
example,
'a|n|i|m|a|d|v|e|r|s|i|o|n'
| |
c l
c, and l, mark the boundaries between characters, and not
characters themselves. The characters between c and l will be denoted by
c:l.
If C gives t, the cursor c will have a new, well-defined value. But if C gives f, c is undefined. Its later value will in fact be determined by the outer context of commands in which C came to be obeyed, not by C itself. Here is a list of the commands that can be used to operate on strings. a) Setting a value
b) Basic tests
c) Moving text aboutWe have seen in (a) that $x = y, when x and y are strings, sets c:l of x to the value of y. Conversely
$x => y
sets the value of y to the c:l region of x.
A more delicate mechanism for pushing text around is to define a substring, or slice of the string being tested. Then
/* assume x holds 'animadversion' */
$x ( [ // '[animadversion' - [ set as indicated
loop 2 gopast 'a'
// '[anima|dversion' - c is marked by '|'
] // '[anima]dversion' - ] set as indicated
-> y // y is 'anima'
)
For any string, the slice ends should be assumed to be unset until they are
set with the two commands [, ]. Thereafter the slice ends will retain
the same values until altered.
define vowel ('a' or 'e' or 'i' or 'o' or 'u')
....
$ x repeat ( gopast([vowel]) delete )
As this example shows, the slice markers [ and ] often appear as
pairs in a bracketed style, which makes for easy reading of the Snowball
scripts. But it must be remembered that, unusually in a computer
programming language, they are not true brackets.
More simply, text can be inserted at c.
d) MarksThe cursor, c, (and the limit, l) can be thought of as having a numeric value, from zero upwards:
| a | n | i | m | a | d | v | e | r | s | i | o | n |
0 1 2 3 4 5 6 7 8 9 10 11 12 13
It is these numeric values of c and l which are accessible through
cursor and limit in arithmetic expressions.
In the stemming algorithms, certain regions of the word are defined by setting marks, and later the failure condition of tomark is used to see if c is inside a particular region. Two other commands put c at l, and test if c is at l,
e) Changing lIn this account of string commands we see c moving right towards l, while l stays fixed at the end. In fact l can be reset to a new position between c and its old position, to act as a shorter barrier for the movement of c.
f) Backward processingString commands have been described with c to the left of l and moving right. But the process can be reversed.
$x (
'ani' 'mad' 'version' atlimit
)
$x backwards (
'version' 'mad' 'ani' atlimit
)
If a routine is defined for backwards mode processing, it must be included
inside a backwardmode(...) declaration.
g) substring and amongThe use of substring and among is central to the implementation of the stemming algorithms. It is like a case switch on strings. In its simpler form,
substring among('S1' 'S2' 'S3' ...)
searches for the longest matching substring 'S1' or 'S2' or 'S3' ... from
position c. (The 'Si' must all be different.) So this has the same
semantics as
('S1' or 'S2' or 'S3' ...)
— so long as the 'Si' are written out in decreasing order of length.
substring may be omitted, in which case it is attached to its following among, so
among(...)
without a preceding substring is equivalent to
(substring among(...))
substring may also be detached from its among, although it must
precede it textually in the same routine in which the among appears.
The more general form of substring ... among is,
substring
...
among( 'S11' 'S12' ... (C1)
'S21' 'S22' ... (C2)
...
'Sn1' 'Sn2' ... (Cn)
)
Obeying substring searches for a longest match among the 'Sij'. The
signal from substring is t if a match is found, otherwise f. When the
among comes to be obeyed, the Ci corresponding to the matched 'Sij' is
obeyed, and its signal becomes the signal of the among command.
substring/among pairs must match up textually inside each routine definition. But there is no problem with an among containing other substring/among pairs, and substring is optional before among anyway. The essential constraint is that two substrings must be separated by an among, and each substring must be followed by an among. The effect of obeying among when the preceding substring is not obeyed is undefined. This would happen for example here,
try($x != 617 substring)
among(...) // 'substring' is bypassed in the exceptional case where x == 617
The significance of separating the substring from the among is to allow
them to work in different contexts. For example,
setlimit tomark L for substring
among( 'S11' 'S12' ... (C1)
...
'Sn1' 'Sn2' ... (Cn)
)
Here the test for the longest 'Sij' is constrained to the region between c
and the mark point given by integer L. But the commands Ci operate outside
this limit. Another example is
reverse substring
among( 'S11' 'S12' ... (C1)
...
'Sn1' 'Sn2' ... (Cn)
)
The substring test is in the opposite direction in the string to the
direction of the commands Ci.
The last (Cn) may be omitted, in which case (true) is assumed. Another possible abbreviation is that when substring is omitted, a construct such as
among( 'S11' 'S12' ... (C C1)
'S21' 'S22' ... (C C2)
...
'Sn1' 'Sn2' ... (C Cn)
)
can be written
among( (C)
'S11' 'S12' ... (C1)
'S21' 'S22' ... (C2)
...
'Sn1' 'Sn2' ... (Cn)
)
and this is just equivalent to
substring C
among( 'S11' 'S12' ... (C1)
'S21' 'S22' ... (C2)
...
'Sn1' 'Sn2' ... (Cn)
)
In its most general form, each string 'Sij' may be optionally followed by a
routine name,
among( (C)
'S11' R11 'S12' R12 ... (C1)
'S21' R21 'S22' R22 ... (C2)
...
'Sn1' Rn1 'Sn2' Rn1 ... (Cn)
)
So here each Rij is either a routine name or is null. If null, it is equivalent
to a routine which simply returns signal t,
define null as true
— so we can imagine each 'Sij' having its associated routine
Rij. Then obeying the among causes a search for the longest
'Sij' whose corresponding routine
Rij gives t. The routines
Rij should be written without any side-effects, other than the inevitable cursor
movement. (c is in any case set back to its old value following a call of
Rij.)
8 Booleansset B and unset B set B to true and false respectively, where B is a boolean name. B as a command gives a signal t if it is set true, f otherwise. For example,
booleans ( Y_found ) // declare the boolean
....
unset Y_found // unset it
do ( ['y'] <-'Y' set Y_found )
/* if c:l begins 'y' replace it by 'Y' and set Y_found */
do repeat(goto (v ['y']) <-'Y' set Y_found)
/* repeatedy move down the string looking for v 'y' and
replacing 'y' with 'Y'. Whenever the replacement takes
place set Y_found. v is a test for a vowel, defined as
a grouping (see below). */
/* Y_found means there are some letters Y in the string.
Later we can use this to trigger a conversion back to
lower case y. */
....
do (Y_found repeat(goto (['Y']) <- 'y')
9 GroupingsA grouping brings characters together and enables them to be looked for with a single test.If G is declared as a grouping, it can be defined by
define G G1 op G2 op G3 ...
where op is + or -, and G1, G2, G3 are literal strings, or groupings that
have already been defined. (There can be zero or more of these additional
op components). For example,
define capital_letter 'ABDEFGHIJKLMNOPQRSTUVWXYZ'
define small_letter 'abdefghijklmnopqrstuvwxyz'
define letter capital_letter + small_letter
define vowel 'aeiou' + 'AEIOU'
define consonant letter - vowel
define digit '0123456789'
define alphanumeric letter + digit
Once G is defined, it can be used as a command, and is equivalent to a test
'ch1' or 'ch2' or ...
where ch1, ch2 ... list all the characters in the grouping.
non G is the converse test, and matches any character except the characters of G. Note that non G is not the same as not G, in fact
non G is equivalent to (not G next)
non may be optionally followed by hyphen, so one may write
non-vowel
non-digit
etc.
10 A Snowball programA complete program consists of a sequence of declarations followed by a sequence of definitions of groupings and routines. Routines which are implicitly defined as operating on c:l from right to left must be included in a backwardmode(...) declaration.A Snowball program is called up via a simple API through its defined externals. For example,
externals ( stem1 stem2 )
....
define stem1 as ( ... /* stem1 commands */ )
define stem2 as ( ... /* stem2 commands */ )
The API also allows a current string to be defined, and this becomes the
c:l string for the external routine to work on. Its final value is the
result handed back through the API.
The strings, integers and booleans are accessible from any point in the program, and exist throughout the running of the Snowball program. They are therefore like static declarations in C. 11 Comments, and other whitespace fillersAt a deeper level, a program is a sequence of tokens, interspersed with whitespace. Names, reserved words, literal numbers and strings are all tokens. Various symbols, made up of non-alphanumerics, are also tokens.A name, reserved word or number is terminated by the first character that cannot form part of it. A symbol is recognised as the longest sequence of characters that forms a valid symbol. So +=- is two symbols, += and -, because += is a valid symbol in the language while +=- is not. Whitespace separates tokens but is otherwise ignored. This of course is like C. Anywhere that whitespace can occur, there may also occur: (a) Comments, in the usual multi-line /* .... */ or single line // ... format. (b) Get directives. These are like #include commands in C, and have the form get 'S', where 'S' is a literal string. For example,
get '/home/martin/snowball/main-hdr' // include the file contents
(c) stringescapes XY where X and Y are any two printing characters.
(d) stringdef m 'S' where m is sequence of characters not including whitespace and terminated with whitespace, and 'S' is a literal string. 12 Character representationIn this description of Snowball, it is assumed that strings are composed of characters, and that characters can be defined numerically, but the numeric range of these characters is not defined. As implemented, three different schemes are supported. Characters can either be (a) bytes in the range 0 to 255, as in traditional C strings, or (b) byte pairs in the range 0 to 65535, as in Java strings, or (c) UTF-8 encoded bytes sequences in the range 0 to 65535, so that a character may occupy 1, 2 or 3 bytes.For case (c), we need to make a slight separation of the concept of characters into symbols, the units of text being represented, and slots, the units of space into which they map. (So in case (a), all slots are one byte; in case (b) all slots are two bytes.) c and l have numeric values that can be used in AEs (arithmetic expressions). These values count the number of slots. Similarly setmark, tomark and atmark are remembering and then using slot counts. size and sizeof measure string size in slots, not symbols. However, hop N moves c over N symbols, not N slots, and next is equivalent to hop 1. So long as these simple distinctions are recognised, the same Snowball script can be compiled to work with any of the three encoding schemes. | |||||||||||||||||||||||||||||||||||||||||||
Snowball syntax
|