[Snowball-discuss] A problem with replacing letters

From: A. Tordai (atordai@science.uva.nl)
Date: Thu Jan 20 2005 - 16:40:46 GMT


Hello,

I'm working on a Hungarian stemmer and have run into a problem I haven't
been able to solve. The code is included below. I have a routine called
v_ending which replaces "a acute" and "e acute" with "a" and "e". If I
simply delete them it works, but when I try replacing them I get an
"a acute" instead of an "a".
For instance, if I test it on the word "hagyásában" I ought to get
"hagyása" (with "ban" removed and the a acute replaced), but I get "hagyásá".
The same thing happens with a word like "kimenetelében". I suspect I am
missing something simple, but I just can't figure out what goes wrong.
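
To make the intended result concrete, here is a small Python sketch of what
the two steps are meant to produce. It is only a model of the expected
output, not Snowball and not an explanation of the fault; the function name
expected_stem and the two tables are invented for illustration, and the
sketch ignores the R1/R2 regions.

```python
# Model of the intended behaviour only; this is not how Snowball runs the rules.
CASE_SUFFIXES = ("ban", "ben")      # inessive endings removed by `case`
V_ENDING = {"á": "a", "é": "e"}     # replacements intended by `v_ending`


def expected_stem(word: str) -> str:
    """Strip an inessive ending, then shorten a final a-acute or e-acute."""
    for suffix in CASE_SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    else:
        # No case ending found: leave the word untouched.
        return word
    if word and word[-1] in V_ENDING:
        word = word[:-1] + V_ENDING[word[-1]]
    return word


print(expected_stem("hagyásában"))     # hagyása
print(expected_stem("kimenetelében"))  # kimenetele
```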

Thank you

Anna Tordai

**************************

// Hungarian stemmer.

routines (
    mark_regions
    v_ending
    R1
    R2
    case
)

externals ( stem )

integers ( p1 p2 )
groupings ( v )

stringescapes {}

/* special characters (hex codes as in ISO Latin 2, where F5 and FB are the double acutes) */

stringdef a' hex 'E1' // a-acute
stringdef e' hex 'E9' // e-acute
stringdef i' hex 'ED' // i-acute
stringdef o' hex 'F3' // o-acute
stringdef o" hex 'F6' // o-umlaut
stringdef oq hex 'F5' // o-double acute
stringdef u' hex 'FA' // u-acute
stringdef u" hex 'FC' // u-umlaut
stringdef uq hex 'FB' // u-double acute

//vowels
define v 'aeiou{a'}{e'}{i'}{o'}{o"}{oq}{u'}{u"}{uq}'

define mark_regions as (

    $p1 = limit
    $p2 = limit

    (gopast v (test substring among('cs' 'gy' 'sz' 'ty') setmark p1)) or
    (goto v gopast non-v setmark p1)
    goto v gopast non-v setmark p2
)

backwardmode (

    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor
   
    define v_ending as (
        [substring] among(
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define case as (
        [substring] among(
            'ban' // inessive
            'ben' // inessive
        )
        delete
        v_ending
    )
)

define stem as (
    do mark_regions
    backwards (
        do case
    )
)


