The Lovins stemming algorithm
Links to resources
|
The first ever published stemming algorithm was: Lovins JB (1968) Development of
a stemming algorithm. Mechanical Translation and Computational Linguistics,
11: 22-31. Julie Beth Lovins’ paper was remarkable for the early date at which
it was done, and for its seminal influence on later work in
this area.
The design of the algorithm was much influenced by the technical vocabulary
with which Lovins found herself working (subject term keywords attached to
documents in the materials science and engineering field). The subject term
list may also have been slightly limiting in that certain common endings
are not represented (ements and ents for example, corresponding to
the singular forms ement and ent), and also in that the algorithm's
treatment of short words, or words with short stems, can be rather
destructive.
The Lovins algorithm is noticeably bigger than the Porter algorithm,
because of its very extensive endings list. But in one way that is used to
advantage: it is faster. It has effectively traded space for time, and with
its large suffix set it needs just two major steps to remove a suffix,
compared with the eight of the Porter algorithm.
The Lovins stemmer has 294 endings, 29 conditions and 35
transformation rules. Each ending is associated with one of the
conditions. In the first step the longest ending is found which satisfies
its associated condition, and is removed. In the second step the 35 rules
are applied to transform the ending. The second step is done whether or not
an ending is removed in the first step.
For example, nationally has the ending ationally, with associated
condition, B, ‘minimum stem length = 3’. Since removing ationally
would leave a stem of length 1 this is rejected. But it also has ending
ionally with associated condition A. Condition A is ‘no restriction on
stem length’, so ionally is removed, leaving nat.
The transformation rules handle features like letter undoubling (sitting
-> sitt -> sit), irregular plurals (matrix and matrices),
and English morphological oddities ultimately caused by the behaviour of
Latin verbs of the second conjugation (assume / assumption,
commit / commission etc). Although they are described as being
applied in turn, they can be broken into two stages, rule 1 being done in
stage 1, and either zero or one of rules 2 to 35 being done in stage 2.
Here is the list of endings as given in Appendix A of Lovins’ paper. They
are grouped by length, from 11 characters down to 1. Each ending is
followed by its condition code.
|
Appendix A. The list of endings
|
| | .11. |
| | alistically B | | arizability A | | izationally B |
|
| | .10. |
| | antialness A | | arisations A | | arizations A | | entialness A |
|
| | .09. |
| | allically C | | antaneous A | | antiality A | | arisation A |
| | arization A | | ationally B | | ativeness A | | eableness E |
| | entations A | | entiality A | | entialize A | | entiation A |
| | ionalness A | | istically A | | itousness A | | izability A |
| | izational A |
|
| | .08. |
| | ableness A | | arizable A | | entation A | | entially A |
| | eousness A | | ibleness A | | icalness A | | ionalism A |
| | ionality A | | ionalize A | | iousness A | | izations A |
| | lessness A |
|
| | .07. |
| | ability A | | aically A | | alistic B | | alities A |
| | ariness E | | aristic A | | arizing A | | ateness A |
| | atingly A | | ational B | | atively A | | ativism A |
| | elihood E | | encible A | | entally A | | entials A |
| | entiate A | | entness A | | fulness A | | ibility A |
| | icalism A | | icalist A | | icality A | | icalize A |
| | ication G | | icianry A | | ination A | | ingness A |
| | ionally A | | isation A | | ishness A | | istical A |
| | iteness A | | iveness A | | ivistic A | | ivities A |
| | ization F | | izement A | | oidally A | | ousness A |
|
| | .06. |
| | aceous A | | acious B | | action G | | alness A |
| | ancial A | | ancies A | | ancing B | | ariser A |
| | arized A | | arizer A | | atable A | | ations B |
| | atives A | | eature Z | | efully A | | encies A |
| | encing A | | ential A | | enting C | | entist A |
| | eously A | | ialist A | | iality A | | ialize A |
| | ically A | | icance A | | icians A | | icists A |
| | ifully A | | ionals A | | ionate D | | ioning A |
| | ionist A | | iously A | | istics A | | izable E |
| | lessly A | | nesses A | | oidism A |
|
| | .05. |
| | acies A | | acity A | | aging B | | aical A |
| | alist A | | alism B | | ality A | | alize A |
| | allic BB | | anced B | | ances B | | antic C |
| | arial A | | aries A | | arily A | | arity B |
| | arize A | | aroid A | | ately A | | ating I |
| | ation B | | ative A | | ators A | | atory A |
| | ature E | | early Y | | ehood A | | eless A |
| | elity A | | ement A | | enced A | | ences A |
| | eness E | | ening E | | ental A | | ented C |
| | ently A | | fully A | | ially A | | icant A |
| | ician A | | icide A | | icism A | | icist A |
| | icity A | | idine I | | iedly A | | ihood A |
| | inate A | | iness A | | ingly B | | inism J |
| | inity CC | | ional A | | ioned A | | ished A |
| | istic A | | ities A | | itous A | | ively A |
| | ivity A | | izers F | | izing F | | oidal A |
| | oides A | | otide A | | ously A |
|
| | .04. |
| | able A | | ably A | | ages B | | ally B |
| | ance B | | ancy B | | ants B | | aric A |
| | arly K | | ated I | | ates A | | atic B |
| | ator A | | ealy Y | | edly E | | eful A |
| | eity A | | ence A | | ency A | | ened E |
| | enly E | | eous A | | hood A | | ials A |
| | ians A | | ible A | | ibly A | | ical A |
| | ides L | | iers A | | iful A | | ines M |
| | ings N | | ions B | | ious A | | isms B |
| | ists A | | itic H | | ized F | | izer F |
| | less A | | lily A | | ness A | | ogen A |
| | ward A | | wise A | | ying B | | yish A |
|
| | .03. |
| | acy A | | age B | | aic A | | als BB |
| | ant B | | ars O | | ary F | | ata A |
| | ate A | | eal Y | | ear Y | | ely E |
| | ene E | | ent C | | ery E | | ese A |
| | ful A | | ial A | | ian A | | ics A |
| | ide L | | ied A | | ier A | | ies P |
| | ily A | | ine M | | ing N | | ion Q |
| | ish C | | ism B | | ist A | | ite AA |
| | ity A | | ium A | | ive A | | ize F |
| | oid A | | one R | | ous A |
|
| | .02. |
| | ae A | | al BB | | ar X | | as B |
| | ed E | | en F | | es E | | ia A |
| | ic A | | is A | | ly B | | on S |
| | or T | | um U | | us V | | yl R |
| | s' A | | 's A |
|
| | .01. |
| | a A | | e A | | i A | | o A |
| | s W | | y B |
|
Here are the 29 conditions, called A to Z, AA, BB and CC (* stands for any letter):
|
Appendix B. Codes for context-sensitive rules associated with
certain endings
-
| | A | | No restrictions on stem |
| | B | | Minimum stem length = 3 |
| | C | | Minimum stem length = 4 |
| | D | | Minimum stem length = 5 |
| | E | | Do not remove ending after e |
| | F | | Minimum stem length = 3 and do not remove ending after e |
| | G | | Minimum stem length = 3 and remove ending only after f |
| | H | | Remove ending only after t or ll |
| | I | | Do not remove ending after o or e |
| | J | | Do not remove ending after a or e |
| | K | | Minimum stem length = 3 and remove ending only after l, i or u*e |
| | L | | Do not remove ending after u, x or s, unless s follows o |
| | M | | Do not remove ending after a, c, e or m |
| | N | | Minimum stem length = 4 after s**, elsewhere = 3 |
| | O | | Remove ending only after l or i |
| | P | | Do not remove ending after c |
| | Q | | Minimum stem length = 3 and do not remove ending after l or n |
| | R | | Remove ending only after n or r |
| | S | | Remove ending only after dr or t, unless t follows t |
| | T | | Remove ending only after s or t, unless t follows o |
| | U | | Remove ending only after l, m, n or r |
| | V | | Remove ending only after c |
| | W | | Do not remove ending after s or u |
| | X | | Remove ending only after l, i or u*e |
| | Y | | Remove ending only after in |
| | Z | | Do not remove ending after f |
| | AA | | Remove ending only after d, f, ph, th, l, er, or, es or t |
| | BB | | Minimum stem length = 3 and do not remove ending after met or ryst |
| | CC | | Remove ending only after l |
|
There is an implicit assumption in each condition, A included, that the minimum
stem length is 2.
Finally, here are the 35 transformation rules.
|
|
Appendix C. Transformation rules used in recoding stem terminations
-
| | 1 | | remove one of double b, d, g, l, m, n, p, r, s, t |
| | 2 | | iev -> ief |
| | 3 | | uct -> uc |
| | 4 | | umpt -> um |
| | 5 | | rpt -> rb |
| | 6 | | urs -> ur |
| | 7 | | istr -> ister |
| | 7a | | metr -> meter |
| | 8 | | olv -> olut |
| | 9 | | ul -> l except following a, o, i |
| | 10 | | bex -> bic |
| | 11 | | dex -> dic |
| | 12 | | pex -> pic |
| | 13 | | tex -> tic |
| | 14 | | ax -> ac |
| | 15 | | ex -> ec |
| | 16 | | ix -> ic |
| | 17 | | lux -> luc |
| | 18 | | uad -> uas |
| | 19 | | vad -> vas |
| | 20 | | cid -> cis |
| | 21 | | lid -> lis |
| | 22 | | erid -> eris |
| | 23 | | pand -> pans |
| | 24 | | end -> ens except following s |
| | 25 | | ond -> ons |
| | 26 | | lud -> lus |
| | 27 | | rud -> rus |
| | 28 | | her -> hes except following p, t |
| | 29 | | mit -> mis |
| | 30 | | ent -> ens except following m |
| | 31 | | ert -> ers |
| | 32 | | et -> es except following n |
| | 33 | | yt -> ys |
| | 34 | | yz -> ys |
|
(Rule 30 as given here corrects a typographical error in the published
paper of 1968.)
The following examples show the intentions behind these rules.
-
| | 1 | | rubb[ing] -> rub, embedd[ed] -> embed etc |
| | 2 | | believ[e] -> belief |
| | 3 | | induct[ion] -> induc[e] |
| | 4 | | consumpt[ion] -> consum[e] |
| | 5 | | absorpt[ion] -> absorb |
| | 6 | | recurs[ive] -> recur |
| | 7 | | administr[ate] -> administ[er] |
| | 7a | | parametr[ic] -> paramet[er] |
| | 8 | | dissolv[ed] -> dissolut[ion] |
| | 9 | | angul[ar] -> angl[e] |
| | 10 | | vibex -> vibic[es] |
| | 11 | | index -> indic[es] |
| | 12 | | apex -> apic[es] |
| | 13 | | cortex -> cortic[al] |
| | 14 | | anthrax -> anthrac[ite] |
| | 15 | | ? |
| | 16 | | matrix -> matric[es] |
| | 17 | | ? |
| | 18 | | persuad[e] -> persuas[ion] |
| | 19 | | evad[e] -> evas[ion] |
| | 20 | | decid[e] -> decis[ion] |
| | 21 | | elid[e] -> elis[ion] |
| | 22 | | derid[e] -> deris[ion] |
| | 23 | | expand -> expans[ion] |
| | 24 | | defend -> defens[ive] |
| | 25 | | respond -> respons[ive] |
| | 26 | | collud[e] -> collus[ion] |
| | 27 | | obtrud[e] -> obtrus[ion] |
| | 28 | | adher[e] -> adhes[ion] |
| | 29 | | remit -> remis[s][ion] |
| | 30 | | extent -> extens[ion] |
| | 31 | | convert[ed] -> convers[ion] |
| | 32 | | parenthet[ic] -> parenthes[is] |
| | 33 | | analyt[ic] -> analys[is] |
| | 34 | | analyz[ed] -> analys[ed] |
|
The Lovins algorithm in Snowball
And here is the Lovins algorithm in Snowball. The natural representation
of the Lovins endings, conditions and rules in Snowball, is, I believe, a
vindication of the appropriateness of Snowball for stemming work. Once the
tables had been established, getting the Snowball version running was the
work of a few minutes.
|
-
stringescapes {}
routines (
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z AA BB CC
endings
undouble respell
)
externals ( stem )
backwardmode (
/* Lovins' conditions A, B ... CC, as given in her Appendix B, where
a test for a two letter prefix ('test hop 2') is implicitly
assumed. Note that 'e' next 'u' corresponds to her u*e because
Snowball is scanning backwards. */
define A as ( hop 2 )
define B as ( hop 3 )
define C as ( hop 4 )
define D as ( hop 5 )
define E as ( test hop 2 not 'e' )
define F as ( test hop 3 not 'e' )
define G as ( test hop 3 'f' )
define H as ( test hop 2 't' or 'll' )
define I as ( test hop 2 not 'o' not 'e' )
define J as ( test hop 2 not 'a' not 'e' )
define K as ( test hop 3 'l' or 'i' or ('e' next 'u') )
define L as ( test hop 2 not 'u' not 'x' not ('s' not 'o') )
define M as ( test hop 2 not 'a' not 'c' not 'e' not 'm' )
define N as ( test hop 3 ( hop 2 not 's' or hop 2 ) )
define O as ( test hop 2 'l' or 'i' )
define P as ( test hop 2 not 'c' )
define Q as ( test hop 2 test hop 3 not 'l' not 'n' )
define R as ( test hop 2 'n' or 'r' )
define S as ( test hop 2 'dr' or ('t' not 't') )
define T as ( test hop 2 's' or ('t' not 'o') )
define U as ( test hop 2 'l' or 'm' or 'n' or 'r' )
define V as ( test hop 2 'c' )
define W as ( test hop 2 not 's' not 'u' )
define X as ( test hop 2 'l' or 'i' or ('e' next 'u') )
define Y as ( test hop 2 'in' )
define Z as ( test hop 2 not 'f' )
define AA as ( test hop 2 among ( 'd' 'f' 'ph' 'th' 'l' 'er' 'or'
'es' 't' ) )
define BB as ( test hop 3 not 'met' not 'ryst' )
define CC as ( test hop 2 'l' )
/* The system of endings, as given in Appendix A. */
define endings as (
[substring] among(
'alistically' B 'arizability' A 'izationally' B
'antialness' A 'arisations' A 'arizations' A 'entialness' A
'allically' C 'antaneous' A 'antiality' A 'arisation' A
'arization' A 'ationally' B 'ativeness' A 'eableness' E
'entations' A 'entiality' A 'entialize' A 'entiation' A
'ionalness' A 'istically' A 'itousness' A 'izability' A
'izational' A
'ableness' A 'arizable' A 'entation' A 'entially' A
'eousness' A 'ibleness' A 'icalness' A 'ionalism' A
'ionality' A 'ionalize' A 'iousness' A 'izations' A
'lessness' A
'ability' A 'aically' A 'alistic' B 'alities' A
'ariness' E 'aristic' A 'arizing' A 'ateness' A
'atingly' A 'ational' B 'atively' A 'ativism' A
'elihood' E 'encible' A 'entally' A 'entials' A
'entiate' A 'entness' A 'fulness' A 'ibility' A
'icalism' A 'icalist' A 'icality' A 'icalize' A
'ication' G 'icianry' A 'ination' A 'ingness' A
'ionally' A 'isation' A 'ishness' A 'istical' A
'iteness' A 'iveness' A 'ivistic' A 'ivities' A
'ization' F 'izement' A 'oidally' A 'ousness' A
'aceous' A 'acious' B 'action' G 'alness' A
'ancial' A 'ancies' A 'ancing' B 'ariser' A
'arized' A 'arizer' A 'atable' A 'ations' B
'atives' A 'eature' Z 'efully' A 'encies' A
'encing' A 'ential' A 'enting' C 'entist' A
'eously' A 'ialist' A 'iality' A 'ialize' A
'ically' A 'icance' A 'icians' A 'icists' A
'ifully' A 'ionals' A 'ionate' D 'ioning' A
'ionist' A 'iously' A 'istics' A 'izable' E
'lessly' A 'nesses' A 'oidism' A
'acies' A 'acity' A 'aging' B 'aical' A
'alist' A 'alism' B 'ality' A 'alize' A
'allic'BB 'anced' B 'ances' B 'antic' C
'arial' A 'aries' A 'arily' A 'arity' B
'arize' A 'aroid' A 'ately' A 'ating' I
'ation' B 'ative' A 'ators' A 'atory' A
'ature' E 'early' Y 'ehood' A 'eless' A
'elity' A 'ement' A 'enced' A 'ences' A
'eness' E 'ening' E 'ental' A 'ented' C
'ently' A 'fully' A 'ially' A 'icant' A
'ician' A 'icide' A 'icism' A 'icist' A
'icity' A 'idine' I 'iedly' A 'ihood' A
'inate' A 'iness' A 'ingly' B 'inism' J
'inity'CC 'ional' A 'ioned' A 'ished' A
'istic' A 'ities' A 'itous' A 'ively' A
'ivity' A 'izers' F 'izing' F 'oidal' A
'oides' A 'otide' A 'ously' A
'able' A 'ably' A 'ages' B 'ally' B
'ance' B 'ancy' B 'ants' B 'aric' A
'arly' K 'ated' I 'ates' A 'atic' B
'ator' A 'ealy' Y 'edly' E 'eful' A
'eity' A 'ence' A 'ency' A 'ened' E
'enly' E 'eous' A 'hood' A 'ials' A
'ians' A 'ible' A 'ibly' A 'ical' A
'ides' L 'iers' A 'iful' A 'ines' M
'ings' N 'ions' B 'ious' A 'isms' B
'ists' A 'itic' H 'ized' F 'izer' F
'less' A 'lily' A 'ness' A 'ogen' A
'ward' A 'wise' A 'ying' B 'yish' A
'acy' A 'age' B 'aic' A 'als'BB
'ant' B 'ars' O 'ary' F 'ata' A
'ate' A 'eal' Y 'ear' Y 'ely' E
'ene' E 'ent' C 'ery' E 'ese' A
'ful' A 'ial' A 'ian' A 'ics' A
'ide' L 'ied' A 'ier' A 'ies' P
'ily' A 'ine' M 'ing' N 'ion' Q
'ish' C 'ism' B 'ist' A 'ite'AA
'ity' A 'ium' A 'ive' A 'ize' F
'oid' A 'one' R 'ous' A
'ae' A 'al'BB 'ar' X 'as' B
'ed' E 'en' F 'es' E 'ia' A
'ic' A 'is' A 'ly' B 'on' S
'or' T 'um' U 'us' V 'yl' R
'{'}s' A 's{'}' A
'a' A 'e' A 'i' A 'o' A
's' W 'y' B
(delete)
)
)
/* Undoubling is rule 1 of appendix C. */
define undouble as (
test substring among ('bb' 'dd' 'gg' 'll' 'mm' 'nn' 'pp' 'rr' 'ss'
'tt')
[next] delete
)
/* The other appendix C rules can be done together. */
define respell as (
[substring] among (
'iev' (<-'ief')
'uct' (<-'uc')
'umpt' (<-'um')
'rpt' (<-'rb')
'urs' (<-'ur')
'istr' (<-'ister')
'metr' (<-'meter')
'olv' (<-'olut')
'ul' (not 'a' not 'i' not 'o' <-'l')
'bex' (<-'bic')
'dex' (<-'dic')
'pex' (<-'pic')
'tex' (<-'tic')
'ax' (<-'ac')
'ex' (<-'ec')
'ix' (<-'ic')
'lux' (<-'luc')
'uad' (<-'uas')
'vad' (<-'vas')
'cid' (<-'cis')
'lid' (<-'lis')
'erid' (<-'eris')
'pand' (<-'pans')
'end' (not 's' <-'ens')
'ond' (<-'ons')
'lud' (<-'lus')
'rud' (<-'rus')
'her' (not 'p' not 't' <-'hes')
'mit' (<-'mis')
'ent' (not 'm' <-'ens')
/* 'ent' was 'end' in the 1968 paper - a typo. */
'ert' (<-'ers')
'et' (not 'n' <-'es')
'yt' (<-'ys')
'yz' (<-'ys')
)
)
)
define stem as (
backwards (
do endings
do undouble
do respell
)
)
|