Here is a case study on how to code up a stemming algorithm in Snowball. First, the definition of the Porter stemmer, as it appeared in Program, Vol. 14, No. 3, pp. 130-137, July 1980.
THE ALGORITHM

A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term ‘consonant’ is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.
A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms:

    CVCV ... C
    CVCV ... V
    VCVC ... C
    VCVC ... V

These may all be represented by the single form

    [C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using (VC)^m to denote VC repeated m times, this may again be written as

    [C](VC)^m [V]

m will be called the measure of any word or word part when represented in this form.
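The consonant/vowel classification above translates directly into code. Here is a rough Python sketch (the function name is mine, not part of the stemmer):

```python
def is_consonant(word, i):
    # A consonant is a letter other than a, e, i, o, u, and other than
    # y preceded by a consonant; so y preceded by a vowel, or at the
    # start of a word, counts as a consonant.
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

# In "toy" the consonants are t and y; in "syzygy" they are s, z and g.
print([c for i, c in enumerate("toy") if is_consonant("toy", i)])
print([c for i, c in enumerate("syzygy") if is_consonant("syzygy", i)])
```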
The rules for removing a suffix will be given in the form

    (condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

    (m > 1) EMENT ->

Here S1 is EMENT and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. The ‘condition’ part may also contain the following:

    *S  - the stem ends with S (and similarly for the other letters)
    *v* - the stem contains a vowel
    *d  - the stem ends with a double consonant (e.g. -TT, -SS)
    *o  - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)

And the condition part may also contain expressions with and, or and not, so that

    (m > 1 and (*S or *T))

tests for a stem with m greater than 1 ending in S or T.
In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with

    SSES -> SS
    IES  -> I
    SS   -> SS
    S    ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1 = SS) and CARES to CARE (S1 = S).
In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows:
Now, turning it into Snowball.
The Porter stemmer makes use of a measure, m, of the length of a word or word part. If C is a sequence of one or more consonants, and V a sequence of one or more vowels, any word part has the form

    [C](VC)^m [V]

as illustrated here for crepuscular, which has m = 4:

    c r   e   p   u   s c   u   l   a   r
    [C]   V   C   V   C     V   C   V   C
              1       2         3       4

Most of the rules for suffix removal involve leaving behind a stem whose measure exceeds some value, for example,

    (m > 0) EED -> EE
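The measure can be computed in a single left-to-right scan, counting each consonant group that immediately follows a vowel group. A Python sketch, restating the consonant definition from above (the helper names are mine):

```python
def measure(stem):
    # m = number of VC sequences in [C](VC)^m [V]: count each
    # consonant that immediately follows a vowel.
    def is_cons(i):
        ch = stem[i]
        if ch in "aeiou":
            return False
        if ch == "y":  # y after a consonant acts as a vowel
            return i == 0 or not is_cons(i - 1)
        return True
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_cons(i)
        if cons and prev_vowel:
            m += 1        # a vowel sequence has just been closed off
        prev_vowel = not cons
    return m

print(measure("tree"), measure("by"))           # m = 0
print(measure("trouble"), measure("oats"))      # m = 1
print(measure("troubles"), measure("private"))  # m = 2
print(measure("crepuscular"))                   # m = 4
```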
In fact the only tests on m in the Porter stemmer are m > 0, m > 1, and, at two interesting points, m = 1. This suggests that there are two critical positions in a word: the point at which, going from left to right, m > 0 becomes true, and then the point at which m > 1 becomes true. It turns out that m > 0 becomes true at the point after the first consonant following a vowel, and m > 1 becomes true at the point after the first consonant following a vowel following a consonant following a vowel. Calling these positions p1 and p2, we can determine them quite simply in Snowball:
    define v 'aeiouy'
    ....
    do (
        gopast v  gopast non-v  setmark p1
        gopast v  gopast non-v  setmark p2
    )

The region to the right of p1 will be denoted by R1, the region to the right of p2 by R2:
    c r e p u s c u l a r
           |   |
           p1  p2
           <----- R1 ----->
               <--- R2 --->

We can test for being in these regions with calls to R1 and R2, defined by,
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

and using these tests instead of computing m is acceptable, so long as the stemming process never alters the p1 and p2 positions, which is indeed true in the Porter stemmer.
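The effect of the two gopast pairs can be sketched in Python. This is an illustration under the simplifying assumption that the vowels are just 'aeiouy' (the real stemmers vary the grouping per language); the names are mine:

```python
def mark_p1_p2(word, is_vowel):
    # Each go_past call scans forward until it has passed a letter of
    # the wanted kind. An unreached mark is set to len(word), so a
    # later test p <= cursor can never succeed, as with gopast.
    def go_past(start, want_vowel):
        i = start
        while i < len(word) and is_vowel(word[i]) != want_vowel:
            i += 1
        return i + 1 if i < len(word) else len(word)
    p1 = go_past(go_past(0, True), False)   # gopast v  gopast non-v
    p2 = go_past(go_past(p1, True), False)  # same again, from p1
    return p1, p2

vowel = lambda ch: ch in "aeiouy"
p1, p2 = mark_p1_p2("crepuscular", vowel)
print("crepuscular"[:p1], "|", "crepuscular"[p1:])  # crep | uscular
print("crepuscular"[:p2], "|", "crepuscular"[p2:])  # crepus | cular
```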
A particularly interesting feature of the stemmers presented here is the common use they make of the positions p1 and p2. The details of marking p1 and p2 vary between the languages, because the definitions of vowel and consonant vary. For example, in French, i preceded and followed by a vowel should be treated as a consonant (inquiétude); in Portuguese, ã and õ should be treated as vowel-consonant pairs (São João). A third important position is pV, which tries to mark the position of the shortest acceptable verb stem. Its definition varies somewhat between the languages. The Porter stemmer does not use a pV explicitly, but the idea appears when the verb endings ing and ed are removed only if they are preceded by a vowel: in English, pV would therefore be defined as the position after the first vowel.
The Porter stemmer is divided into five steps; Step 1 is divided further into Steps 1a, 1b and 1c, and Step 5 into Steps 5a and 5b. Step 1 removes the i-suffixes, and Steps 2 to 4 the d-suffixes (*). Composite d-suffixes are reduced to single d-suffixes one at a time. So for example, if a word ends icational, Step 2 reduces it to icate and Step 3 to ic. Three steps are sufficient for this process in English. Step 5 does some tidying up.
One can see how easily the stemming rules translate into Snowball by comparing the definition of Step 1a from the 1980 paper,
Step 1a:

    SSES -> SS        caresses -> caress
    IES  -> I         ponies   -> poni
                      ties     -> ti
    SS   -> SS        caress   -> caress
    S    ->           cats     -> cat

with its Snowball equivalent,
    define Step_1a as (
        [substring] among (
            'sses' (<-'ss')
            'ies'  (<-'i')
            'ss'   ()
            's'    (delete)
        )
    )

The word to be stemmed is being scanned right to left from the end. The longest of 'sses', 'ies', 'ss' or 's' is searched for and defined as the slice. (If none are found, Step_1a signals f.) If 'sses' is found, it is replaced by 'ss', and so on. Of course, replacing 'ss' by 'ss' is a dummy action, so we can write
    'ss' ()

instead of

    'ss' (<-'ss')

Remember that delete just means <- ''.
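Ignoring Snowball's cursor machinery, the longest-match behaviour of this among can be sketched in Python (the function name is mine):

```python
def step_1a(word):
    # The patterns are tried longest first, mirroring among's
    # longest-match rule; the action of the matching pattern applies.
    for suffix, repl in (("sses", "ss"), ("ies", "i"),
                         ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + repl
    return word  # nothing matched: among (and Step_1a) signals f

for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", step_1a(w))
```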
The really tricky part of the whole algorithm is step 1b, which may be worth looking at in detail. Here it is, without the example words on the far right,
Step 1b:

    (m > 0) EED -> EE
    (*v*)   ED  ->
    (*v*)   ING ->

If the second or third of the rules in Step 1b is successful, the following is done:

    AT -> ATE
    BL -> BLE
    IZ -> IZE
    (*d and not (*L or *S or *Z)) -> single letter
    (m = 1 and *o) -> E

The first part of the rule means that eed maps to ee if eed is in R1 (which is equivalent to m > 0), or ed and ing are removed if they are preceded by a vowel. In Snowball this is simply,
    define Step_1b as (
        [substring] among (
            'eed' (R1 <-'ee')
            'ed'
            'ing' (test gopast v delete)
        )
    )

But this must be modified by the second part of the rule. *d indicates a test for a double consonant (bb, dd etc). *L, *S, *Z are tests for l, s, z. *o is a short vowel test: it is matched by consonant-vowel-consonant, where the consonant on the right is not w, x or y. If the short vowel test is satisfied, m = 1 is equivalent to the cursor being at p1. So the second part of the rule means: map at, bl, iz to ate, ble, ize; map certain double letters to single letters; and add e after a short vowel in words of one syllable.
We first need two extra groupings,
    define v     'aeiouy'
    define v_WXY v + 'wxY'  // v with 'w', 'x' and 'y'-consonant
    define v_LSZ v + 'lsz'  // v with 'l', 's', 'z'

and a test for a short vowel,
    define shortv as ( non-v_WXY v non-v )

(The v_WXY test comes first because we are scanning backwards, from right to left.)
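As a word-final check, the short vowel test can be sketched in Python (the names are mine; the real test runs at the current cursor and uses the consonant-Y marking, which this sketch glosses over):

```python
def shortv(word):
    # non-v_WXY v non-v, tested backwards from the end of the word:
    # last letter not a vowel and not w, x or y, preceded by a vowel,
    # preceded by a non-vowel.
    if len(word) < 3:
        return False
    v = set("aeiouy")
    v_wxy = v | set("wx")   # 'y' is already in v
    c2, c1, c0 = word[-3:]
    return c0 not in v_wxy and c1 in v and c2 not in v

print(shortv("hop"), shortv("fil"))                   # True True
print(shortv("snow"), shortv("box"), shortv("tray"))  # False False False
```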
The double to single letter map can be done as follows: first define the slice as the next non-v_LSZ and copy it to a string, ch, as a single character,
    strings ( ch )
    ....
    [non-v_LSZ] ->ch

A further test, ch, tests that the next letter of the string is the same as the one in ch, and if this gives signal t, delete deletes the slice,
    [non-v_LSZ] ->ch  ch  delete

Step_1b can then be written like this,
    define Step_1b as (
        [substring] among (
            'eed' (R1 <-'ee')
            'ed'
            'ing' (
                test gopast v  delete
                (test among('at' 'bl' 'iz')  <+ 'e')
                or
                ([non-v_LSZ] ->ch  ch  delete)
                or
                (atmark p1  test shortv  <+ 'e')
            )
        )
    )

But we can improve the appearance, and speed, of this by turning the second part of the rule into another among command, noting that the only letters that need undoubling are b, d, f, g, m, n, p, r and t,
    define Step_1b as (
        [substring] among (
            'eed' (R1 <-'ee')
            'ed'
            'ing' (
                test gopast v  delete
                test substring among(
                    'at' 'bl' 'iz'
                         (<+ 'e')
                    'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
                    // ignoring double c, h, j, k, q, v, w, and x
                         ([next] delete)
                    ''   (atmark p1  test shortv  <+ 'e')
                )
            )
        )
    )

Note the null string in the second among, which acts as a default case.
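To see the whole of Step 1b in one place, here is a hedged Python sketch of the same logic, taking p1 as computed earlier and simplifying the vowels to plain 'aeiouy' (the real stemmer marks consonant y as 'Y'); all names are mine:

```python
def step_1b(word, p1):
    # Sketch of Step_1b; p1 is the position marked earlier.
    v = set("aeiouy")
    if word.endswith("eed"):
        stem = word[:-3]
        # R1 test: the cursor (end of stem) must be at or past p1
        return stem + "ee" if len(stem) >= p1 else word
    for suf in ("ed", "ing"):
        if word.endswith(suf):
            stem = word[:-len(suf)]
            if not any(ch in v for ch in stem):  # test gopast v
                return word
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"
            if (len(stem) >= 2 and stem[-1] == stem[-2]
                    and stem[-1] in "bdfgmnprt"):  # undouble
                return stem[:-1]
            if (len(stem) == p1 and len(stem) >= 3  # atmark p1, shortv
                    and stem[-1] not in v | set("wx")
                    and stem[-2] in v and stem[-3] not in v):
                return stem + "e"
            return stem
    return word

for w, p in (("agreed", 2), ("feed", 4), ("plastered", 4),
             ("hopping", 3), ("hoping", 3), ("sing", 3)):
    print(w, "->", step_1b(w, p))
```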
The Porter stemmer in Snowball is given below. This is an exact implementation of the algorithm described in the 1980 paper, unlike the other implementations distributed by the author, which have, and have always had, three small points of difference (clearly indicated) from the original algorithm. Since all other implementations of the algorithm seen by the author are in some degree inexact, this may well be the first ever correct implementation.
The full algorithm in Snowball