The Hungarian stemming algorithm

Contributed by Anna Tordai, University of Amsterdam

Links to resources

Snowball main page

The stemmer in Snowball

The ANSI C stemmer

— and its header

Sample Hungarian vocabulary

Its stemmed equivalent

Vocabulary + stemmed equivalent

Tar-gzipped file of all of the above

A stop word list

The isla, Amsterdam page for the Hungarian stemmer

Here is a sample of Hungarian vocabulary, with the stemmed forms that the algorithm generates.

word                 stem                word              stem

babaháznak        => babaház             muattta        => muattt
babakocsi         => babakocs            mukkot         => muk
babakocsijáért    => babakocs            mulandóság     => mulandóság
babakocsit        => babakocs            mulandóságot   => mulandóság
babakocsiért      => babakocs            mulasszátok    => mulasszát
babból            => bab                 mulasztanak    => mulaszt
bab               => bab                 mulasztotta    => mulasztott
babgulyás         => babgulyás           mulasztottam   => mulasztott
babgulyást        => babgulyás           mulasztották   => mulasztotta
babona            => babon               mulaszt        => mulasz
babonákkal        => babona              mulaszthatom   => mulaszthat
babonás           => babonás             mulasztás      => mulasztás
babrálgatta       => babrálgatt          mulasztásban   => mulasztás
babrálni          => babráln             mulasztásból   => mulasztás
babrál            => babrál              mulasztásnál   => mulasztás
babrált           => babrál              mulasztással   => mulasztás
babrálva          => babrálv             mulasztásának  => mulasztás
babusgatnak       => babusgat            mulasztásánál  => mulasztás
baba              => ba                  mulasztásáért  => mulasztás
babái             => baba                mulasztási     => mulasztás
babák             => baba                mulasztásos    => mulasztásos
babákkal          => baba                mulasztó       => mulasztó
babázni           => babázn              mulathatnánk   => mulathatna
babérfa           => babérf              mulathattunk   => mulathatt
babérokat         => babér               mulatna        => mulatn
babért            => bab                 mulat          => mul
bacchánsnők       => bacchánsnő          mulatnak       => mulat
badacsonyi        => badacsony           mulatni        => mulatn
badarság          => badarság            mulattak       => mulatt
badarságok        => badarság            mulattat       => mulatt
baedeker          => baedeker            mulattatta     => mulattatt
baglyokat         => bagly               mulatott       => mulatot
bagolyszemüveges  => bagolyszemüveges    mulatozott     => mulatozot
bagót             => bagó                mulatozáshoz   => mulatozás
bajbajutott       => bajbajutot          mulatozást     => mulatozás
bajbajutottak     => bajbajutott         mulatság       => mulatság
bajbajutottakat   => bajbajutott         mulatságnak    => mulatság
bajbajutottakon   => bajbajutott         mulatságot     => mulatság
bajlódjanak       => bajlód              mulatságos     => mulatságos
bajlódni          => bajlódn             mulatt         => mulat

This stemming algorithm removes the inflectional suffixes of nouns. Nouns are inflected for case, person/possession and number.

Letters in Hungarian include the following accented forms:

á é í ó ö ő ú ü ű

The following letters are vowels:

a á e é i í o ó ö ő u ú ü ű

The following letter groups are digraphs (dzs is strictly a trigraph):

cs dz dzs gy ly ny sz ty zs

A double consonant is defined as:

bb cc ccs dd ff gg ggy jj kk ll lly mm nn nny pp rr ss ssz tt tty vv zz zzs

If the word begins with a vowel, R1 is defined as the region after the first consonant or digraph in the word. If the word begins with a consonant, it is defined as the region after the first vowel in the word. If the word does not contain both a vowel and a consonant, R1 is the null region at the end of the word.

For example:

    t ó b a n           consonant-vowel
       |.....|          R1 is 'b a n'

    a b l a k a n       vowel-consonant
       |.........|      R1 is 'l a k a n'

    a c s o n y         vowel-digraph
         |.....|        R1 is 'o n y'

    c v s
     --->|<---          null R1 region
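
The definition and the four examples above can be checked with a short Python sketch. It is not part of the Snowball distribution: r1_start, VOWELS and DIGRAPHS are names made up for this illustration, the letter groups are copied from the among(...) list in the mark_regions routine of the Snowball source, and the double-acute letters are written as Unicode ő and ű (the source encodes them as hex F5 and FB).

```python
# Vowels, with the double-acute letters written as Unicode ő and ű.
VOWELS = set("aeiouáéíóöőúüű")
# Letter groups skipped as one unit after a leading vowel run
# (assumption: same list as among(...) in mark_regions).
DIGRAPHS = ("dzs", "cs", "gy", "ly", "ny", "sz", "ty", "zs")


def r1_start(word: str) -> int:
    """Return the index where R1 begins; len(word) means R1 is null."""
    n = len(word)
    if n == 0:
        return 0
    if word[0] in VOWELS:
        # Skip the leading vowel run to reach the first consonant.
        i = 1
        while i < n and word[i] in VOWELS:
            i += 1
        if i == n:
            return n  # vowels only: null R1
        for group in DIGRAPHS:
            if word.startswith(group, i):
                return i + len(group)
        return i + 1
    # Consonant first: R1 starts just after the first vowel.
    for i in range(1, n):
        if word[i] in VOWELS:
            return i + 1
    return n  # no vowel at all: null R1


for w in ("tóban", "ablakan", "acsony", "cvs"):
    print(w, repr(w[r1_start(w):]))
```

The loop prints the R1 text for each example word: 'ban', 'lakan', 'ony' and the empty string.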

‘Delete if in R1’ means that the suffix should be removed if it is in region R1 but not if it is outside.

Do steps 1 to 9 in turn.

Step 1: Remove instrumental case

Search for one of the following suffixes and perform the action indicated.

al el: delete if in R1 and preceded by a double consonant, and remove one of the double consonants. (In the case of consonant plus digraph, such as ccs, remove a c).
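
As a rough illustration of step 1, the Python sketch below strips the suffix and then drops one letter of the double consonant. The function name remove_instrumental and its p1 parameter (the index where R1 starts) are assumptions made for this example, not names from the Snowball source.

```python
DOUBLES = tuple(
    "bb cc ccs dd ff gg ggy jj kk ll lly mm nn nny pp rr ss ssz "
    "tt tty vv zz zzs".split()
)


def remove_instrumental(word: str, p1: int) -> str:
    """Step 1 sketch: strip 'al'/'el' after a double consonant, then undouble."""
    for suffix in ("al", "el"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The suffix must start at or after p1, the start of R1.
            if len(stem) >= p1 and stem.endswith(DOUBLES):
                # Drop the second-to-last letter: "kk" -> "k", "ccs" -> "cs".
                return stem[:-2] + stem[-1]
            return word
    return word


# Both words begin consonant-vowel, so R1 starts at index 2.
print(remove_instrumental("babákkal", 2))
print(remove_instrumental("mulasztással", 2))
```

This prints babák and mulasztás. Deleting the second-to-last letter covers plain doubles and the consonant-plus-digraph forms with the same slice.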

Step 2: Remove frequent cases

Search for the longest among the following suffixes and perform the action indicated.

ban ben ba be ra re nak nek val vel tól től ról ről ból ből hoz hez höz nál nél ig at et ot öt ért képp képpen kor ul ül vá vé onként enként anként ként en on an ön n t: delete if in R1; if the remaining word then ends in á, replace it by a; if it ends in é, replace it by e
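
Step 2 can be sketched in Python as a longest-suffix match followed by the vowel shortening. The names remove_case and CASE_SUFFIXES are hypothetical, p1 is assumed to be the index where R1 starts, and ő is written as a Unicode letter. Following the v_ending routine in the Snowball source, the final á or é is only shortened when it also lies in R1.

```python
CASE_SUFFIXES = (
    "ban ben ba be ra re nak nek val vel tól től ról ről ból ből hoz hez höz "
    "nál nél ig at et ot öt ért képp képpen kor ul ül vá vé onként enként "
    "anként ként en on an ön n t"
).split()


def remove_case(word: str, p1: int) -> str:
    """Step 2 sketch: delete the longest case suffix, then shorten á/é."""
    match = max((s for s in CASE_SUFFIXES if word.endswith(s)),
                key=len, default="")
    cut = len(word) - len(match)
    if not match or cut < p1:
        return word  # no suffix, or the longest suffix is not inside R1
    word = word[:cut]
    # Shorten a final long vowel, again only if it lies in R1.
    if word and word[-1] in "áé" and len(word) - 1 >= p1:
        word = word[:-1] + {"á": "a", "é": "e"}[word[-1]]
    return word


print(remove_case("babaháznak", 2))  # suffix 'nak'
print(remove_case("babát", 2))       # suffix 't', then á shortened to a
```

This prints babaház and baba. Note that only the longest matching suffix is tested against R1; a shorter alternative is not tried if the longest one falls outside R1.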

Step 3: Remove special cases

Search for the longest among the following suffixes and perform the action indicated.

án ánként: replace by a if in R1
én: replace by e if in R1

Step 4: Remove other cases

Search for the longest among the following suffixes and perform the action indicated.

astul estül stul stül: delete if in R1
ástul: replace with a if in R1
éstül: replace with e if in R1
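
Steps 3 and 4 share one pattern: find the longest listed suffix and, if it lies in R1, delete it or replace it with a or e. A minimal Python sketch of that pattern follows. The names apply_rules, STEP3 and STEP4 are invented for this illustration, p1 is assumed to be the index where R1 starts, and the inputs are illustrative words rather than entries from the sample vocabulary.

```python
def apply_rules(word: str, p1: int, rules: dict) -> str:
    """Replace the longest matching suffix by rules[suffix] if it is in R1."""
    match = max((s for s in rules if word.endswith(s)), key=len, default="")
    cut = len(word) - len(match)
    if not match or cut < p1:
        return word  # nothing matched, or the suffix starts before R1
    return word[:cut] + rules[match]


# Suffix -> replacement; the empty string means "delete".
STEP3 = {"án": "a", "ánként": "a", "én": "e"}
STEP4 = {"astul": "", "estül": "", "stul": "", "stül": "",
         "ástul": "a", "éstül": "e"}

print(apply_rules("babán", 2, STEP3))
print(apply_rules("babástul", 2, STEP4))
```

Both calls print baba. In the second, 'ástul' wins over the shorter 'stul', which is why the stem ends in a rather than in á.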

Step 5: Remove factive case

Search for one of the following suffixes and perform the action indicated.

á é: delete if in R1 and preceded by a double consonant, and remove one of the double consonants (as in step 1).

Step 6: Remove owned

Search for the longest among the following suffixes and perform the action indicated.

oké öké aké eké ké éi é: delete if in R1
áké áéi: replace with a if in R1
éké ééi éé: replace with e if in R1

Step 7: Remove singular owner suffixes

Search for the longest among the following suffixes and perform the action indicated.

ünk unk nk juk jük uk ük em om am m od ed ad öd d ja je a e o: delete if in R1
ánk ájuk ám ád á: replace with a if in R1
énk éjük ém éd é: replace with e if in R1

Step 8: Remove plural owner suffixes

Search for the longest among the following suffixes and perform the action indicated.

jaim jeim aim eim im jaid jeid aid eid id jai jei ai ei i jaink jeink eink aink ink jaitok jeitek aitok eitek itek jeik jaik aik eik ik: delete if in R1
áim áid ái áink áitok áik: replace with a if in R1
éim éid éi éink éitek éik: replace with e if in R1

Step 9: Remove plural suffixes

Search for the longest among the following suffixes and perform the action indicated.

ák: replace with a if in R1
ék: replace with e if in R1
ök ok ek ak k: delete if in R1
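
To see the nine steps working together, here is a compact Python sketch of the whole procedure. It is an illustration under stated assumptions, not the reference implementation: every function and constant name is invented here, the digraph list is the one used by mark_regions in the Snowball source, ő and ű are written as Unicode letters, and the final á or é in step 2 is only shortened when it lies in R1 (as v_ending does).

```python
VOWELS = set("aeiouáéíóöőúüű")
DIGRAPHS = ("dzs", "cs", "gy", "ly", "ny", "sz", "ty", "zs")
DOUBLES = tuple("bb cc ccs dd ff gg ggy jj kk ll lly mm nn nny pp rr ss ssz "
                "tt tty vv zz zzs".split())
CASES = ("ban ben ba be ra re nak nek val vel tól től ról ről ból ből hoz hez "
         "höz nál nél ig at et ot öt ért képp képpen kor ul ül vá vé onként "
         "enként anként ként en on an ön n t")


def r1_start(word):
    """Index where R1 begins (len(word) if R1 is null)."""
    n = len(word)
    if n and word[0] in VOWELS:
        i = 1
        while i < n and word[i] in VOWELS:
            i += 1
        if i == n:
            return n
        return i + next((len(d) for d in DIGRAPHS if word.startswith(d, i)), 1)
    for i in range(1, n):
        if word[i] in VOWELS:
            return i + 1
    return n


def longest(word, suffixes):
    """Longest suffix from the collection that ends the word, or ''."""
    return max((s for s in suffixes if word.endswith(s)), key=len, default="")


def rules_step(word, p1, delete="", to_a="", to_e=""):
    """Delete or replace the longest listed suffix, if it lies in R1."""
    rules = {s: "" for s in delete.split()}
    rules.update({s: "a" for s in to_a.split()})
    rules.update({s: "e" for s in to_e.split()})
    s = longest(word, rules)
    cut = len(word) - len(s)
    return word[:cut] + rules[s] if s and cut >= p1 else word


def undouble_step(word, p1, suffixes):
    """Steps 1 and 5: suffix after a double consonant; drop one consonant."""
    s = longest(word, suffixes)
    stem = word[:len(word) - len(s)]
    if s and len(stem) >= p1 and stem.endswith(DOUBLES):
        return stem[:-2] + stem[-1]
    return word


def stem(word):
    p1 = r1_start(word)
    word = undouble_step(word, p1, ("al", "el"))                 # step 1
    cut = rules_step(word, p1, delete=CASES)                     # step 2
    if cut != word:  # shorten á/é only after a case suffix was removed
        word = rules_step(cut, p1, to_a="á", to_e="é")
    word = rules_step(word, p1, to_a="án ánként", to_e="én")     # step 3
    word = rules_step(word, p1, "astul estül stul stül",         # step 4
                      "ástul", "éstül")
    word = undouble_step(word, p1, ("á", "é"))                   # step 5
    word = rules_step(word, p1, "oké öké aké eké ké éi é",       # step 6
                      "áké áéi", "éké ééi éé")
    word = rules_step(word, p1,                                  # step 7
                      "ünk unk nk juk jük uk ük em om am m od ed ad öd d "
                      "ja je a e o", "ánk ájuk ám ád á", "énk éjük ém éd é")
    word = rules_step(word, p1,                                  # step 8
                      "jaim jeim aim eim im jaid jeid aid eid id jai jei ai "
                      "ei i jaink jeink eink aink ink jaitok jeitek aitok "
                      "eitek itek jeik jaik aik eik ik",
                      "áim áid ái áink áitok áik",
                      "éim éid éi éink éitek éik")
    return rules_step(word, p1, "ök ok ek ak k", "ák", "ék")     # step 9


for w in ("babakocsijáért", "babákkal", "mukkot", "baglyokat"):
    print(w, "=>", stem(w))
```

For these four entries of the sample vocabulary the sketch prints babakocs, baba, muk and bagly. The R1 start is computed once on the original word and reused by every step, which mirrors how p1 is set in mark_regions before the suffix routines run.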

The full algorithm in Snowball


/*
Hungarian Stemmer
Removes noun inflections
*/

routines (
    mark_regions
    R1
    v_ending
    case
    case_special
    case_other
    plural
    owned
    sing_owner
    plur_owner
    instrum
    factive
    undouble
    double
)

externals ( stem )

integers ( p1 )
groupings ( v )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a'  hex 'E1'  //a-acute
stringdef e'  hex 'E9'  //e-acute
stringdef i'  hex 'ED'  //i-acute
stringdef o'  hex 'F3'  //o-acute
stringdef o"  hex 'F6'  //o-umlaut
stringdef oq  hex 'F5'  //o-double acute
stringdef u'  hex 'FA'  //u-acute
stringdef u"  hex 'FC'  //u-umlaut
stringdef uq  hex 'FB'  //u-double acute

define v 'aeiou{a'}{e'}{i'}{o'}{o"}{oq}{u'}{u"}{uq}'

define mark_regions as (

    $p1 = limit

    (v goto non-v
     among('cs' 'gy' 'ly' 'ny' 'sz' 'ty' 'zs' 'dzs') or next
     setmark p1)
    or

    (non-v gopast v setmark p1)
)

backwardmode (

    define R1 as $p1 <= cursor

    define v_ending as (
        [substring] R1 among(
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define double as (
        test among('bb' 'cc' 'ccs' 'dd' 'ff' 'gg' 'ggy' 'jj' 'kk' 'll' 'lly' 'mm'
        'nn' 'nny' 'pp' 'rr' 'ss' 'ssz' 'tt' 'tty' 'vv' 'zz' 'zzs')
    )

    define undouble as (
        // step back over the last letter, then delete the letter before it
        next [hop 1] delete
    )

    define instrum as(
        [substring] R1 among(
            'al' (double)
            'el' (double)
        )
        delete
        undouble
    )


    define case as (
        [substring] R1 among(
            'ban' 'ben'
            'ba' 'be'
            'ra' 're'
            'nak' 'nek'
            'val' 'vel'
            't{o'}l' 't{oq}l'
            'r{o'}l' 'r{oq}l'
            'b{o'}l' 'b{oq}l'
            'hoz' 'hez' 'h{o"}z'
            'n{a'}l' 'n{e'}l'
            'ig'
            'at' 'et' 'ot' '{o"}t'
            '{e'}rt'
            'k{e'}pp' 'k{e'}ppen'
            'kor'
            'ul' '{u"}l'
            'v{a'}' 'v{e'}'
            'onk{e'}nt' 'enk{e'}nt' 'ank{e'}nt'
            'k{e'}nt'
            'en' 'on' 'an' '{o"}n'
            'n'
            't'
        )
        delete
        v_ending
    )

    define case_special as(
        [substring] R1 among(
            '{e'}n' (<- 'e')
            '{a'}n' (<- 'a')
            '{a'}nk{e'}nt' (<- 'a')
        )
    )

    define case_other as(
        [substring] R1 among(
            'astul' 'est{u"}l' (delete)
            'stul' 'st{u"}l' (delete)
            '{a'}stul' (<- 'a')
            '{e'}st{u"}l' (<- 'e')
        )
    )

    define factive as(
        [substring] R1 among(
            '{a'}' (double)
            '{e'}' (double)
        )
        delete
        undouble
    )

    define plural as (
        [substring] R1 among(
            '{a'}k' (<- 'a')
            '{e'}k' (<- 'e')
            '{o"}k' (delete)
            'ak' (delete)
            'ok' (delete)
            'ek' (delete)
            'k' (delete)
        )
    )

    define owned as (
        [substring] R1 among (
            'ok{e'}' '{o"}k{e'}' 'ak{e'}' 'ek{e'}' (delete)
            '{e'}k{e'}' (<- 'e')
            '{a'}k{e'}' (<- 'a')
            'k{e'}' (delete)
            '{e'}{e'}i' (<- 'e')
            '{a'}{e'}i' (<- 'a')
            '{e'}i'  (delete)
            '{e'}{e'}' (<- 'e')
            '{e'}' (delete)
        )
    )

    define sing_owner as (
        [substring] R1 among(
            '{u"}nk' 'unk' (delete)
            '{a'}nk' (<- 'a')
            '{e'}nk' (<- 'e')
            'nk' (delete)
            '{a'}juk' (<- 'a')
            '{e'}j{u"}k' (<- 'e')
            'juk' 'j{u"}k' (delete)
            'uk' '{u"}k' (delete)
            'em' 'om' 'am' (delete)
            '{a'}m' (<- 'a')
            '{e'}m' (<- 'e')
            'm' (delete)
            'od' 'ed' 'ad' '{o"}d' (delete)
            '{a'}d' (<- 'a')
            '{e'}d' (<- 'e')
            'd' (delete)
            'ja' 'je' (delete)
            'a' 'e' 'o' (delete)
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define plur_owner as (
        [substring] R1 among(
            'jaim' 'jeim' (delete)
            '{a'}im' (<- 'a')
            '{e'}im' (<- 'e')
            'aim' 'eim' (delete)
            'im' (delete)
            'jaid' 'jeid' (delete)
            '{a'}id' (<- 'a')
            '{e'}id' (<- 'e')
            'aid' 'eid' (delete)
            'id' (delete)
            'jai' 'jei' (delete)
            '{a'}i' (<- 'a')
            '{e'}i' (<- 'e')
            'ai' 'ei' (delete)
            'i' (delete)
            'jaink' 'jeink' (delete)
            'eink' 'aink' (delete)
            '{a'}ink' (<- 'a')
            '{e'}ink' (<- 'e')
            'ink' (delete)
            'jaitok' 'jeitek' (delete)
            'aitok' 'eitek' (delete)
            '{a'}itok' (<- 'a')
            '{e'}itek' (<- 'e')
            'itek' (delete)
            'jeik' 'jaik' (delete)
            'aik' 'eik' (delete)
            '{a'}ik' (<- 'a')
            '{e'}ik' (<- 'e')
            'ik' (delete)
        )
    )
)

define stem as (
    do mark_regions
    backwards (
        do instrum
        do case
        do case_special
        do case_other
        do factive
        do owned
        do sing_owner
        do plur_owner
        do plural
    )
)