Re: [Snowball-discuss] Slovene stemmer

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Tue Apr 19 2005 - 14:56:55 BST


Bostan,

I have, after a long absence, come back to the Snowball work and have been
looking at your stemmer. As promised, I have rewritten it to make proper use
of amongs. Here is the result, very much smaller and very much faster:

integers (
    p1
)

groupings (
    samoglasniki
    crke
    soglasniki
)

stringescapes {}

/* special characters (in ISO-8859-2) */

stringdef sv hex 'B9' // s-hacek
stringdef cv hex 'E8' // c-hacek
stringdef zv hex 'BE' // z-hacek

define crke 'abc{cv}defghijklmnoprs{sv}tuvz{zv}'
define samoglasniki 'aeiou'
define soglasniki crke - samoglasniki

externals (
    stem
)

define stem as (
    $p1 = limit

    backwards (
        do loop 4 (
            try ($p1>8
                [substring] among ('ovski' 'evski' 'anski' (delete))
            )
            try ($p1>7
                [substring] among ('stvo' '{sv}tvo' (delete))
            )
            $p1 = size
            try ($p1>6
                [substring] among (
                    '{sv}en' 'ski' '{cv}ek' 'ovm' 'ega' 'ovi' 'ijo' 'ija'
                    'ema' 'ste' 'ejo' 'ite' 'ila' '{sv}{cv}e' '{sv}ki'
                    'ost' 'ast' 'len' 'ven' 'vna' '{cv}an' 'iti' (delete))
            )
            $p1 = size
            try ($p1>6
                [substring] among (
                    'al' 'ih' 'iv' 'eg' 'ja' 'je' 'em' 'en' 'ev' 'ov' 'jo'
                    'ma' 'mi' 'eh' 'ij' 'om' 'do' 'o{cv}' 'ti' 'il' 'ec'
                    'ka' 'in' 'an' 'at' 'ir' (delete))
            )
            $p1 = size
            try ($p1>5
                [substring] among ('{sv}' 'm' 'c' 'a' 'e' 'i' 'o' 'u'
                    (delete))
            )
            $p1 = size
            try (($p1>6) (
                    [soglasniki] test soglasniki delete
                )
            )
            $p1 = size
            try ($p1>5
                [substring] among ('a' 'e' 'i' 'o' 'u' (delete))
            )
        )
    )
)

I have also assembled a Slovene vocabulary to try it out.

Now that I can see the structure of the stemmer, I am surprised that it
repeats the suffix removal cycle four times. I notice that if I change 4 to
3, I get a different result. I know this is not always an easy question to
answer, but can this be related to Slovene morphology in any way? The various
measures 8, 7, 6, etc. applied to p1 were, I assume, arrived at by
experiment. Do you think using syllable measurement (as in the other
stemmers) might improve the result?
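
For illustration, here is a minimal sketch of the syllable-style measure used
in several other Snowball stemmers, where a region R1 starts after the first
non-vowel that follows a vowel and suffixes are removed only inside it. It is
an assumption about how this might look for Slovene, not a tested
replacement. The routine names mark_regions, R1 and remove_suffix, the use of
p1 as a position rather than a length, and the short suffix list are all
illustrative.

```snowball
routines ( mark_regions R1 remove_suffix )
externals ( stem )
integers ( p1 )
groupings ( samoglasniki )

define samoglasniki 'aeiou'

define mark_regions as (
    // default: R1 is empty (p1 at the end of the word)
    $p1 = limit
    // p1 = position after the first non-vowel following a vowel
    do ( gopast samoglasniki  gopast non-samoglasniki  setmark p1 )
)

backwardmode (
    // true when the cursor lies inside R1
    define R1 as $p1 <= cursor

    define remove_suffix as (
        // hypothetical, shortened suffix list; delete only if the
        // whole suffix lies in R1
        [substring] R1 among (
            'ega' 'ih' 'a' 'e' 'i' 'o'
                (delete)
        )
    )
)

define stem as (
    do mark_regions
    backwards do remove_suffix
)
```

With this definition, p1 for sloven falls after slov, so no suffix removal
restricted to R1 could shorten that word beyond slov, however many times the
removal is repeated.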

There are a few things I must ask you about. Much of the stemming looks very
nice. For example,

telovadbe telovad
telovadcem telovad
telovadcev telovad
telovadi telovad
telovadil telovad
telovaditi telovad
telovadne telovad
telovadni telovad
telovadno telovad
telovnik telovnik
tem tem
tema tema
temacna tema
temacni tema
temacno tema

But I am concerned that, with the character count approach and the 'loop
4', the residual stems are very short. The following illustrates this:

sloven slo
slovenca slo
slovence slo
slovencem slo
slovencev slo
slovenci slo
slovencih slo
slovenec slo
slovenija slov
sloveniji slo
slovenijo slov
slovenko slo
slovenska slo
slovenske slo
slovenskega slo
slovenskem slo
slovenskemu slo
slovenski slov
slovenskih slo
slovenskim slo
slovenskimi slo
slovensko slo
slovenstva slo
slovenšcina slo
slovenšcini slo
slovenšcino slo
slovenščina slo
slovenščini slo
slovenščino slo

Would not sloven (or slov) be a more desirable stem in this case?

Another point: I notice a common -ah suffix, which you have not removed, as
for example here:

besed besed
beseda besed
besedah besedah <------------
besedam besed
besedami besed
besede besed
besedi besed
besedice besed
besedico besed
besedila besed
besedilmiran besed
besedilo besed
besedno besed

Could this be added to the list of suffixes?
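
To make the question concrete, here is a minimal sketch of the two-letter
suffix step with 'ah' added. It assumes that -ah belongs in the two-letter
list under the same length test ($p1>6) as its neighbours. Whether that
threshold suits -ah is untested, and the other steps and the loop are left
out for brevity.

```snowball
externals ( stem )
integers ( p1 )

stringescapes {}
stringdef cv hex 'E8' // c-hacek (ISO-8859-2, as in the listing above)

define stem as (
    $p1 = size
    backwards (
        // two-letter step only; 'ah' is the assumed addition
        try ($p1>6
            [substring] among (
                'ah'
                'al' 'ih' 'iv' 'eg' 'ja' 'je' 'em' 'en' 'ev' 'ov' 'jo'
                'ma' 'mi' 'eh' 'ij' 'om' 'do' 'o{cv}' 'ti' 'il' 'ec'
                'ka' 'in' 'an' 'at' 'ir' (delete))
        )
    )
)
```

Under these assumptions besedah, at 7 characters, passes the length test and
loses -ah, giving besed like the other forms in the list.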

Martin Porter


