Links to resources | |||||||||||||||||||||||||||||||||||||||||||
Snowball definitionSnowball is a small string-handling language, and its name was chosen as a tribute to SNOBOL (Farber 1964, Griswold 1968 — see the references at the end of the introduction), with which it shares the concept of string patterns delivering signals that are used to control the flow of the program.1 Data typesThe basic data types handled by Snowball are strings of characters, signed integers, and boolean truth values, or more simply strings, integers and booleans. Snowball's characters are either 8-bit wide, or 16-bit, depending on the mode of use. In particular, both 8-bit ASCII and 16-bit Unicode are supported.2 NamesA name in Snowball is a letter followed by zero or more letters, digits and underlines. A name can be of type string, integer, boolean, routine, external or grouping. All names must be declared. A declaration has the formTs ( ... )where symbol T is one of string, integer etc, and the region in brackets contains a list of names separated by whitespace. For example, integers ( p1 p2 ) booleans ( Y_found ) routines ( shortv R1 R2 Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5a Step_5b ) externals ( stem ) groupings ( v v_WXY v_LSZ )p1 and p2 are integers, Y_found is boolean, and so on. Snowball is quite strict about the declarations, so all the names go in the same name space, no name may be declared twice, all used names must be declared, no two routine definitions can have the same name, etc. Names declared and subsequently not used are merely reported in a warning message. A name may not be one of the reserved words of Snowball. 3 LiteralsA literal integer is a digit sequence, and is always interpreted as decimal. A literal string is written between single quotes, for example,'aeiouy'In a stringdef (see below), string may be preceded by the word hex, or the word decimal, in which case the contents are interpreted as characters written out in hexadecimal, or decimal, notation. The characters should be separated by spaces. For example, hex 'DA' /* is character hex DA */ hex 'D A' /* is the two characters, hex D and A (carriage return, and line feed) */ decimal '10' /* character 10 (line feed) */ decimal '13 10' /* characters 13 and 10 (carriage return, and line feed) */The following forms are equivalent, hex 'd a' /* lower case also allowed */ hex '0D 000A' /* leading zeroes ignored */ hex ' D A ' /* extra spacing is harmless */stringdefs define special string macros, to handle unusual character combinations. Macro m may is defined in the form stringdef m 'S', where 'S' is a string, and m a sequence of one or more printing characters terminating with whitespace. Two special insert characters are defined by the directive stringescapes AB, where A and B are printing characters, and A is not single quote. (B may equal A, but then A itself can never be escaped.) For example, stringescapes []A subsequent occurrence of the same directive redefines the insert characters. Thereafter, [m] inside a string causes S to be substituted in place of m. Immediately after the stringescapes directive, ['] will substitute ' and [[] will substitute [, although macros ' and [ may subsequently be redefined. A further feature is that [W] inside a string, where W is a sequence of whitespace characters including one or more newlines, is ignored. This enables long strings to be written over a number of lines. For example, stringescapes [] /* special Spanish characters (in MS-DOS Latin I) */ stringdef a' hex 'A0' // a-acute stringdef e' hex '82' // e-acute stringdef i' hex 'A1' // i-acute stringdef o' hex 'A2' // o-acute stringdef u' hex 'A3' // u-acute stringdef u" hex '81' // u-diaeresis stringdef n~ hex 'A4' // n-tilde /* and in the next string we define all the characters in Spanish used to represent vowels */ define v 'aeiou[a'][e'][i'][o'][u'][u"]' 4 RoutinesA routine definition has the formdefine R as Cwhere R is the routine name and C is a command, or bracketed group of commands. So a routine is defined as a sequence of zero or more commands. Snowball routines do not (at present) take parameters. For example, define Step_5b as ( // this defines Step_5b ['l'] // three commands here: [, 'l' and ] R2 'l' // two commands, R2 and 'l' delete // delete is one command ) define R1 as $p1 <= cursor /* R1 is defined as the single command "$p1 <= cursor" */A routine is called simply by using its name, R, as a command. 5 Commands and signalsThe flow of control in Snowball is arranged by the implicit use of signals, rather than the explicit use of constructs like the if, then, break of C. The scheme is designed for handling strings, but is perhaps easier to introduce using integers. Suppose x, y, z ... are integers. The command$x = 1sets x to 1. The command $x > 0tests if x is greater than zero. Both commands give a signal t or f, (true or false), but while the second command gives t if x is greater than zero and f otherwise, the first command always gives t. In Snowball, every command gives a t or f signal. A sequence of commands can be turned into as a single command by putting them in a list surrounded by round brackets: ( C1 C2 C3 ... Ci Ci+1 ... )When this is obeyed, Ci+1 will be obeyed if each of the preceding C1 ... Ci give t, but as soon as a Ci gives f, the subsequent Ci+1 Ci+2 ... are ignored, and the whole sequence gives signal f. If all the Ci give t, however, the bracketed command sequence also gives t. So, $x > 0 $y = 1sets y to 1 if x is greater than zero. If x is less than or equal to zero the two commands give f. If C1 and C2 are commands, we can build up the larger commands,
try($x > 0 $z > 0 $y = 1)so that and seems unnecessary here. But we will see that and has a particular significance in string commands. When a ‘monadic’ construct like not, try or fail is not followed by a round bracket, the construct applies to the shortest following valid command. So for example try not $x < 1 $z > 0would mean try ( not ( $x < 1 ) ) $z > 0because $x < 1 is the shortest valid command following not, and then not $x < 1 is the shortest valid command following try. The ‘dyadic’ constructs like and and or must sit in a bracketed list of commands anyway, for example, ( C1 C2 and C3 C4 or C5 )And then in this case C2 and C3 are connected by the and; C4 and C5 are connected by the or. So $x > 0 not $y > 0 or not $z > 0 $t > 0means $x > 0 ((not ($y > 0)) or (not ($z > 0))) $t > 0and and or are equally binding, and bind from left to right, so C1 or C2 and C3 means (C1 or C2) and C3 etc. 6 AEs and integer commandsAn AE (arithmetic expression) consists of integer names and literal numbers connected by dyadic +, -, * and /, and monadic -, with the same binding powers and semantics as C. An integer command has the form$X op AEwhere X is an integer name and op is one of the six tests ==, !=, >=, >, <=, <, or five assignments =, +=, -=, *=, /=. Again, the meanings are the same as in C. As well as integer names and literal numbers, the following may be used in AEs:
Examples of integer commands are, $p1 <= cursor // signal is f if the cursor is before position p1 $p1 = limit // set p1 to the string limit 7 String commandsIf s is a string name, a string command has the form$s Cwhere C is a command that operate on the string. Strings can be processed left-to-right or right-to-left, but we will describe only the left-to-right case for now. The string has a cursor, which we will denote by c, and a limit point, or limit, which we will denote by l. c advances towards l in the course of a string command, but the various constructs and, or, not etc have side-effects which keep moving it backwards. Initially c is at the start and l the end of the string. For example, 'a|n|i|m|a|d|v|e|r|s|i|o|n' | | c lc, and l, mark the boundaries between characters, and not characters themselves. The characters between c and l will be denoted by c:l. If C gives t, the cursor c will have a new, well-defined value. But if C gives f, c is undefined. Its later value will in fact be determined by the outer context of commands in which C came to be obeyed, not by C itself. Here is a list of the commands that can be used to operate on strings. a) Setting a value
b) Basic tests
c) Moving text aboutWe have seen in (a) that $x = y, when x and y are strings, sets c:l of x to the value of y. Conversely$x => ysets the value of y to the c:l region of x. A more delicate mechanism for pushing text around is to define a substring, or slice of the string being tested. Then
/* assume x holds 'animadversion' */ $x ( [ // '[animadversion' - [ set as indicated loop 2 gopast 'a' // '[anima|dversion' - c is marked by '|' ] // '[anima]dversion' - ] set as indicated -> y // y is 'anima' )For any string, the slice ends should be assumed to be unset until they are set with the two commands [, ]. Thereafter the slice ends will retain the same values until altered.
define vowel ('a' or 'e' or 'i' or 'o' or 'u') .... $ x repeat ( gopast([vowel]) delete )As this example shows, the slice markers [ and ] often appear as pairs in a bracketed style, which makes for easy reading of the Snowball scripts. But it must be remembered that, unusually in a computer programming language, they are not true brackets. More simply, text can be inserted at c.
d) MarksThe cursor, c, (and the limit, l) can be thought of as having a numeric value, from zero upwards:| a | n | i | m | a | d | v | e | r | s | i | o | n | 0 1 2 3 4 5 6 7 8 9 10 11 12 13It is these numeric values of c and l which are accessible through cursor and limit in arithmetic expressions.
In the stemming algorithms, certain regions of the word are defined by setting marks, and later the failure condition of tomark is used to see if c is inside a particular region. Two other commands put c at l, and test if c is at l,
e) Changing lIn this account of string commands we see c moving right towards l, while l stays fixed at the end. In fact l can be reset to a new position between c and its old position, to act as a shorter barrier for the movement of c.
f) Backward processingString commands have been described with c to the left of l and moving right. But the process can be reversed.
$x ( 'ani' 'mad' 'version' atlimit ) $x backwards ( 'version' 'mad' 'ani' atlimit )If a routine is defined for backwards mode processing, it must be included inside a backwardmode(...) declaration. g) substring and amongThe use of substring and among is central to the implementation of the stemming algorithms. It is like a case switch on strings. In its simpler form,substring among('S1' 'S2' 'S3' ...)searches for the longest matching substring 'S1' or 'S2' or 'S3' ... from position c. (The 'Si' must all be different.) So this has the same semantics as ('S1' or 'S2' or 'S3' ...)— so long as the 'Si' are written out in decreasing order of length. substring may be omitted, in which case it is attached to its following among, so among(...)without a preceding substring is equivalent to (substring among(...))substring may also be detached from its among, although it must precede it textually in the same routine in which the among appears. The more general form of substring ... among is, substring ... among( 'S11' 'S12' ... (C1) 'S21' 'S22' ... (C2) ... 'Sn1' 'Sn2' ... (Cn) )Obeying substring searches for a longest match among the 'Sij'. The signal from substring is t if a match is found, otherwise f. When the among comes to be obeyed, the Ci corresponding to the matched 'Sij' is obeyed, and its signal becomes the signal of the among command. substring/among pairs must match up textually inside each routine definition. But there is no problem with an among containing other substring/among pairs, and substring is optional before among anyway. The essential constraint is that two substrings must be separated by an among, and each substring must be followed by an among. The effect of obeying among when the preceding substring is not obeyed is undefined. This would happen for example here, try($x != 617 substring) among(...) // 'substring' is bypassed in the exceptional case where x == 617The significance of separating the substring from the among is to allow them to work in different contexts. For example, setlimit tomark L for substring among( 'S11' 'S12' ... (C1) ... 'Sn1' 'Sn2' ... (Cn) )Here the test for the longest 'Sij' is constrained to the region between c and the mark point given by integer L. But the commands Ci operate outside this limit. Another example is reverse substring among( 'S11' 'S12' ... (C1) ... 'Sn1' 'Sn2' ... (Cn) )The substring test is in the opposite direction in the string to the direction of the commands Ci. The last (Cn) may be omitted, in which case (true) is assumed. Another possible abbreviation is that when substring is omitted, a construct such as among( 'S11' 'S12' ... (C C1) 'S21' 'S22' ... (C C2) ... 'Sn1' 'Sn2' ... (C Cn) )can be written among( (C) 'S11' 'S12' ... (C1) 'S21' 'S22' ... (C2) ... 'Sn1' 'Sn2' ... (Cn) )and this is just equivalent to substring C among( 'S11' 'S12' ... (C1) 'S21' 'S22' ... (C2) ... 'Sn1' 'Sn2' ... (Cn) )In its most general form, each string 'Sij' may be optionally followed by a routine name, among( (C) 'S11' R11 'S12' R12 ... (C1) 'S21' R21 'S22' R22 ... (C2) ... 'Sn1' Rn1 'Sn2' Rn1 ... (Cn) )So here each Rij is either a routine name or is null. If null, it is equivalent to a routine which simply returns signal t, define null as true— so we can imagine each 'Sij' having its associated routine Rij. Then obeying the among causes a search for the longest 'Sij' whose corresponding routine Rij gives t. The routines Rij should be written without any side-effects, other than the inevitable cursor movement. (c is in any case set back to its old value following a call of Rij.) 8 Booleansset B and unset B set B to true and false respectively, where B is a boolean name. B as a command gives a signal t if it is set true, f otherwise. For example,booleans ( Y_found ) // declare the boolean .... unset Y_found // unset it do ( ['y'] <-'Y' set Y_found ) /* if c:l begins 'y' replace it by 'Y' and set Y_found */ do repeat(goto (v ['y']) <-'Y' set Y_found) /* repeatedy move down the string looking for v 'y' and replacing 'y' with 'Y'. Whenever the replacement takes place set Y_found. v is a test for a vowel, defined as a grouping (see below). */ /* Y_found means there are some letters Y in the string. Later we can use this to trigger a conversion back to lower case y. */ .... do (Y_found repeat(goto (['Y']) <- 'y') 9 GroupingsA grouping brings characters together and enables them to be looked for with a single test.If G is declared as a grouping, it can be defined by define G G1 op G2 op G3 ...where op is + or -, and G1, G2, G3 are literal strings, or groupings that have already been defined. (There can be zero or more of these additional op components). For example, define capital_letter 'ABDEFGHIJKLMNOPQRSTUVWXYZ' define small_letter 'abdefghijklmnopqrstuvwxyz' define letter capital_letter + small_letter define vowel 'aeiou' + 'AEIOU' define consonant letter - vowel define digit '0123456789' define alphanumeric letter + digitOnce G is defined, it can be used as a command, and is equivalent to a test 'ch1' or 'ch2' or ...where ch1, ch2 ... list all the characters in the grouping. non G is the converse test, and matches any character except the characters of G. Note that non G is not the same as not G, in fact non G is equivalent to (not G next)non may be optionally followed by hyphen, so one may write non-vowel non-digitetc. 10 A Snowball programA complete program consists of a sequence of declarations followed by a sequence of definitions of groupings and routines. Routines which are implicitly defined as operating on c:l from right to left must be included in a backwardmode(...) declaration.A Snowball program is called up via a simple API through its defined externals. For example, externals ( stem1 stem2 ) .... define stem1 as ( ... /* stem1 commands */ ) define stem2 as ( ... /* stem2 commands */ )The API also allows a current string to be defined, and this becomes the c:l string for the external routine to work on. Its final value is the result handed back through the API. The strings, integers and booleans are accessible from any point in the program, and exist throughout the running of the Snowball program. They are therefore like static declarations in C. 11 Comments, and other whitespace fillersAt a deeper level, a program is a sequence of tokens, interspersed with whitespace. Names, reserved words, literal numbers and strings are all tokens. Various symbols, made up of non-alphanumerics, are also tokens.A name, reserved word or number is terminated by the first character that cannot form part of it. A symbol is recognised as the longest sequence of characters that forms a valid symbol. So +=- is two symbols, += and -, because += is a valid symbol in the language while +=- is not. Whitespace separates tokens but is otherwise ignored. This of course is like C. Anywhere that whitespace can occur, there may also occur: (a) Comments, in the usual multi-line /* .... */ or single line // ... format. (b) Get directives. These are like #include commands in C, and have the form get 'S', where 'S' is a literal string. For example, get '/home/martin/snowball/main-hdr' // include the file contents(c) stringescapes XY where X and Y are any two printing characters. (d) stringdef m 'S' where m is sequence of characters not including whitespace and terminated with whitespace, and 'S' is a literal string. 12 Character representationIn this description of Snowball, it is assumed that strings are composed of characters, and that characters can be defined numerically, but the numeric range of these characters is not defined. As implemented, three different schemes are supported. Characters can either be (a) bytes in the range 0 to 255, as in traditional C strings, or (b) byte pairs in the range 0 to 65535, as in Java strings, or (c) UTF-8 encoded bytes sequences in the range 0 to 65535, so that a character may occupy 1, 2 or 3 bytes.For case (c), we need to make a slight separation of the concept of characters into symbols, the units of text being represented, and slots, the units of space into which they map. (So in case (a), all slots are one byte; in case (b) all slots are two bytes.) c and l have numeric values that can be used in AEs (arithmetic expressions). These values count the number of slots. Similarly setmark, tomark and atmark are remembering and then using slot counts. size and sizeof measure string size in slots, not symbols. However, hop N moves c over N symbols, not N slots, and next is equivalent to hop 1. So long as these simple distinctions are recognised, the same Snowball script can be compiled to work with any of the three encoding schemes. | |||||||||||||||||||||||||||||||||||||||||||
Snowball syntax
|