(A note by Martin Porter.)
The Schinke Latin stemming algorithm is described in,
Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52: 172-187.

It has the feature that it stems each word to two forms, noun and verb. For example,

                NOUN        VERB
                ----        ----
    aquila      aquil       aquila
    portat      portat      porta
    portis      port        por

Here (slightly reformatted) are the rules of the stemmer,

    1. (start)

    2. Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u',
       respectively.

    3. If the word ends in '-que' then
           if the word is on the list shown in Figure 4, then
               write the original word to both the noun-based and
               verb-based stem dictionaries and go to 8.
           else remove '-que'

       [Figure 4 was
           atque quoque neque itaque absque apsque abusque adaeque adusque
           denique deque susque oblique peraeque plenisque quandoque quisque
           quaeque cuiusque cuique quemque quamque quaque quique quorumque
           quarumque quibusque quosque quasque quotusquisque quousque ubique
           undique usque uterque utique utroque utribique torque coque
           concoque contorque detorque decoque excoque extorque obtorque
           optorque retorque recoque attorque incoque intorque praetorque]

    4. Match the end of the word against the suffix list shown in Figure
       6(a), removing the longest matching suffix, (if any).

       [Figure 6(a) was
           -ibus -ius  -ae   -am   -as   -em   -es   -ia
           -is   -nt   -os   -ud   -um   -us   -a    -e
           -i    -o    -u]

    5. If the resulting stem contains at least two characters then write
       this stem to the noun-based stem dictionary.

    6. Match the end of the word against the suffix list shown in Figure
       6(b), identifying the longest matching suffix, (if any).

       [Figure 6(b) was
           -iuntur -beris -erunt -untur -iunt -mini -ntur -stis
           -bor    -ero   -mur   -mus   -ris  -sti  -tis  -tur
           -unt    -bo    -ns    -nt    -ri   -m    -r    -s
           -t]

       If any of the following suffixes are found then convert them as
       shown:
           '-iuntur', '-erunt', '-untur', '-iunt', and '-unt', to '-i';
           '-beris', '-bor', and '-bo' to '-bi';
           '-ero' to '-eri'
       else remove the suffix in the normal way.

    7. If the resulting stem contains at least two characters then write
       this stem to the verb-based stem dictionary.

    8. (end)

Unfortunately I was not able to make the rules match the examples given, which led to the following email correspondence,

FROM Martin Porter TO Peter Willett ON Mon Sep 10 15:11:51 2001 Re: Stemming algorithms
FROM Peter Willett TO Martin Porter ON Mon Sep 10 20:25:24 2001 Re: Stemming algorithms
Following this, I was unable to contact Schinke, and so the problems have remained unresolved.
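For readers who want to experiment with rules 1 to 8 as listed above, here is a minimal Python sketch. It is not the Snowball stemmer from the zip file, and it does nothing to resolve the discrepancies discussed in the emails. The names (schinke_stem, longest_suffix and the constants) are invented for illustration. Two assumptions are made: the input is already lowercase, and a stem shorter than two characters is returned as None, since the rules simply do not write it to a dictionary.

```python
# Illustrative sketch of rules 1-8 above; all names here are hypothetical.

# Figure 4: words ending in '-que' that are kept whole.
QUE_WORDS = frozenset("""
    atque quoque neque itaque absque apsque abusque adaeque adusque
    denique deque susque oblique peraeque plenisque quandoque quisque
    quaeque cuiusque cuique quemque quamque quaque quique quorumque
    quarumque quibusque quosque quasque quotusquisque quousque ubique
    undique usque uterque utique utroque utribique torque coque
    concoque contorque detorque decoque excoque extorque obtorque
    optorque retorque recoque attorque incoque intorque praetorque
""".split())

# Figure 6(a) and Figure 6(b), without the leading hyphens.
NOUN_SUFFIXES = "ibus ius ae am as em es ia is nt os ud um us a e i o u".split()
VERB_SUFFIXES = ("iuntur beris erunt untur iunt mini ntur stis bor ero mur "
                 "mus ris sti tis tur unt bo ns nt ri m r s t").split()

# Rule 6: these verb suffixes are converted rather than removed.
VERB_CONVERSIONS = {
    "iuntur": "i", "erunt": "i", "untur": "i", "iunt": "i", "unt": "i",
    "beris": "bi", "bor": "bi", "bo": "bi",
    "ero": "eri",
}


def longest_suffix(word, suffixes):
    """Return the longest entry of `suffixes` that ends `word`, or ''."""
    return max((s for s in suffixes if word.endswith(s)), key=len, default="")


def schinke_stem(word):
    """Return (noun_stem, verb_stem) for a lowercase Latin word.

    A stem shorter than two characters is returned as None (an assumption;
    the rules just do not write such a stem to the dictionary).
    """
    word = word.replace("j", "i").replace("v", "u")        # rule 2
    if word.endswith("que"):                                # rule 3
        if word in QUE_WORDS:
            return word, word
        word = word[:-3]

    noun_suffix = longest_suffix(word, NOUN_SUFFIXES)      # rule 4
    noun = word[:len(word) - len(noun_suffix)]

    verb_suffix = longest_suffix(word, VERB_SUFFIXES)      # rule 6
    verb = word[:len(word) - len(verb_suffix)]
    verb += VERB_CONVERSIONS.get(verb_suffix, "")

    return (noun if len(noun) >= 2 else None,              # rule 5
            verb if len(verb) >= 2 else None)              # rule 7


if __name__ == "__main__":
    for w in ("aquila", "portat", "portis"):
        noun, verb = schinke_stem(w)
        print(w, noun, verb)
```

On the three example words given earlier, this sketch reproduces the table: schinke_stem("portis") gives ("port", "por"). Note that the verb suffix is matched against the word as it stands after rule 3, not against the noun stem, which is what the portis example implies.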
The linked zip file contains the stemmer,
generated C version, and sample data.
(The stemmer differs slightly from the version in the email above in that
it assembles the noun- and verb-forms of the stem in a single string with
space separation.)
voc.txt is a sample vocabulary, and joined.txt is that vocabulary
joined with the two stemmed forms as three-column output.