Stemming early English

Links to resources

Snowball main page
The English (Porter2) stemmer in Snowball
The Porter stemmer in Snowball
The ‘official’ home page of the Porter stemming algorithm
demo of stemmed est, eth endings

The question occasionally arises of how far the English (or earlier Porter) stemming algorithm can be adapted to handle older forms of the English language.

Historically, English is usually divided into three periods of development,
1) Old English (or Anglo-Saxon), the language of Beowulf,
2) Middle English, the language of Chaucer,
3) Modern English, the language of Shakespeare, Dickens, and people today.
Old English is so different from Modern English that it may be regarded as a distinct language.

Middle English is problematical for a number of reasons. There is no standard spelling in the original texts, and the grammatical differences between Middle and Modern English prevent the spelling from being simply ‘modernised’. It is possible to normalise the spelling according to some modern scheme, but here again no scheme is standard. Middle English itself had great regional variations, so that, for example, the English of Chaucer and that of his contemporary the Gawain poet (both late 14th century) are strikingly different. Finally, grammar was fluid even for one writer, so Chaucer might use they love or they loven, he sitteth or he sit.

We may take Modern English to mean English which can be cast into a modern spelling form without too much damage being done to the original. From this point of view Shakespeare and the Authorised Version of the Bible are in Modern English. The ending structure of words in early Modern English differs from that of contemporary English in the est and eth endings of verbs in the present indicative:
I bring
thou bringest
he bringeth
we bring
you bring
they bring
Both of these endings underwent rapid decline. The eth form occurs in Shakespeare, but is much rarer than the modern s form. The language of the Authorised Version, in which both forms abound, seemed archaic even on its first publication. Consequently the eth form survives now only in the language of the traditional Bible and Book of Common Prayer. The est form disappeared more slowly, as the use of thou became displaced by you in conversation.

To put the endings into the Porter stemmer, the rules
Step 1b
(m>0) EED -> EE
(*v*) ED ->
(*v*) ING ->
should be extended to
Step 1b
(m>0) EED -> EE
(*v*) ED ->
(*v*) ING ->
(*v*) EST ->
(*v*) ETH ->
And to put the endings into the English stemmer, the list
ed   edly   ing   ingly
of Step 1b should be extended to
ed   edly   ing   ingly   est   eth
As far as the Snowball scripts are concerned, the endings 'est' and 'eth' must be added alongside the ending 'ing'.
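The extended Porter rules above can be sketched in Python. This is only an illustration, not the Snowball implementation: the function names (is_consonant, contains_vowel, measure, step1b_first_part) are invented here, and the sketch covers just the suffix-removal part of Step 1b, omitting the follow-up adjustments that the full algorithm applies after a removal (restoring a final e, undoubling consonants) as well as all other steps.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's definition: a letter other than a, e, i, o, u,
    and other than y preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """The (*v*) condition: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def measure(stem):
    """The measure m: the number of vowel-consonant sequences in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def step1b_first_part(word, extra_endings=("est", "eth")):
    """Suffix removal of Porter Step 1b, extended with EST and ETH.

    Pass extra_endings=() to get the unextended rule set.
    """
    # EED is tested first because it is the longest matching suffix;
    # if its condition fails, no other rule is tried.
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing") + tuple(extra_endings):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word


if __name__ == "__main__":
    for w in ["bringest", "bringeth", "bringing", "agreed", "rest"]:
        print(w, "->", step1b_first_part(w))
```

With the extra endings, bringest and bringeth both reduce to bring, the same result as for bringing, while rest is left alone because the remaining stem r contains no vowel.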

The inclusion of these endings does produce certain ‘side effects’. est is the ending of adjectival superlatives (greatest, unkindest), where it will also be removed. Words like brandreth, deforest will be mis-stemmed. Nevertheless, for the vocabulary of the Bible, the inclusion of these extra endings is not harmful (see this demonstration — for example, search for the text love in 1000 verses).