The Hungarian stemming algorithm



Contributed by Anna Tordai, University of Amsterdam

 





Here is a sample of vocabulary, with the stemmed forms that will be generated with the algorithm.

word              =>  stem                word           =>  stem

babaháznak        =>  babaház             muattta        =>  muattt
babakocsi         =>  babakocs            mukkot         =>  muk
babakocsijáért    =>  babakocs            mulandóság     =>  mulandóság
babakocsit        =>  babakocs            mulandóságot   =>  mulandóság
babakocsiért      =>  babakocs            mulasszátok    =>  mulasszát
babból            =>  bab                 mulasztanak    =>  mulaszt
bab               =>  bab                 mulasztotta    =>  mulasztott
babgulyás         =>  babgulyás           mulasztottam   =>  mulasztott
babgulyást        =>  babgulyás           mulasztották   =>  mulasztotta
babona            =>  babon               mulaszt        =>  mulasz
babonákkal        =>  babona              mulaszthatom   =>  mulaszthat
babonás           =>  babonás             mulasztás      =>  mulasztás
babrálgatta       =>  babrálgatt          mulasztásban   =>  mulasztás
babrálni          =>  babráln             mulasztásból   =>  mulasztás
babrál            =>  babrál              mulasztásnál   =>  mulasztás
babrált           =>  babrál              mulasztással   =>  mulasztás
babrálva          =>  babrálv             mulasztásának  =>  mulasztás
babusgatnak       =>  babusgat            mulasztásánál  =>  mulasztás
baba              =>  ba                  mulasztásáért  =>  mulasztás
babái             =>  baba                mulasztási     =>  mulasztás
babák             =>  baba                mulasztásos    =>  mulasztásos
babákkal          =>  baba                mulasztó       =>  mulasztó
babázni           =>  babázn              mulathatnánk   =>  mulathatna
babérfa           =>  babérf              mulathattunk   =>  mulathatt
babérokat         =>  babér               mulatna        =>  mulatn
babért            =>  bab                 mulat          =>  mul
bacchánsnők       =>  bacchánsnő          mulatnak       =>  mulat
badacsonyi        =>  badacsony           mulatni        =>  mulatn
badarság          =>  badarság            mulattak       =>  mulatt
badarságok        =>  badarság            mulattat       =>  mulatt
baedeker          =>  baedeker            mulattatta     =>  mulattatt
baglyokat         =>  bagly               mulatott       =>  mulatot
bagolyszemüveges  =>  bagolyszemüveges    mulatozott     =>  mulatozot
bagót             =>  bagó                mulatozáshoz   =>  mulatozás
bajbajutott       =>  bajbajutot          mulatozást     =>  mulatozás
bajbajutottak     =>  bajbajutott         mulatság       =>  mulatság
bajbajutottakat   =>  bajbajutott         mulatságnak    =>  mulatság
bajbajutottakon   =>  bajbajutott         mulatságot     =>  mulatság
bajlódjanak       =>  bajlód              mulatságos     =>  mulatságos
bajlódni          =>  bajlódn             mulatt         =>  mulat


This stemming algorithm removes the inflectional suffixes of nouns. Nouns are inflected for case, person/possession and number.

Letters in Hungarian include the following accented forms,
á   é   í   ó   ö   ő   ú   ü   ű
The following letters are vowels:
a   á   e   é   i   í   o   ó   ö   ő   u   ú   ü   ű
The following letters are digraphs:
cs   dz   dzs   gy   ly   ny   ty   zs
(The Snowball source below checks a slightly different list when marking R1: cs, gy, ly, ny, sz, ty, zs and dzs.)
A double consonant is defined as:
bb   cc   ccs   dd   ff   gg   ggy   jj   kk   ll   lly   mm   nn   nny   pp   rr   ss   ssz   tt   tty   vv   zz   zzs
If the word begins with a vowel, R1 is defined as the region after the first consonant or digraph in the word. If the word begins with a consonant, it is defined as the region after the first vowel in the word. If the word does not contain both a vowel and a consonant, R1 is the null region at the end of the word.

For example:
    t ó b a n           consonant-vowel
       |.....|          R1 is 'b a n'

    a b l a k a n       vowel-consonant
       |.........|      R1 is 'l a k a n'

    a c s o n y         vowel-digraph
         |.....|        R1 is 'o n y'

    c v s
     --->|<---          null R1 region
‘Delete if in R1’ means that the suffix should be removed if it is in region R1 but not if it is outside.
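
The definition of R1 can be illustrated with a short Python sketch. It is not the reference implementation: the names r1_start, r1 and in_r1 are invented for this example, and the digraph list is the one used by mark_regions in the Snowball source at the end of this page, on the assumption that the source decides where the two lists differ. The double-acute vowels are written here as ő and ű.

```python
# Vowels of the algorithm; ő and ű are the double-acute letters.
VOWELS = set("aáeéiíoóöőuúüű")
# Digraphs as listed in mark_regions in the Snowball source, longest first.
DIGRAPHS = ("dzs", "cs", "gy", "ly", "ny", "sz", "ty", "zs")


def r1_start(word):
    """Return the index at which R1 begins (len(word) if R1 is null)."""
    n = len(word)
    if n == 0:
        return 0
    if word[0] in VOWELS:
        # Skip the leading vowels to reach the first consonant.
        i = 0
        while i < n and word[i] in VOWELS:
            i += 1
        if i == n:
            return n  # no consonant at all: null R1
        # R1 starts after that consonant, or after the digraph it begins.
        for digraph in DIGRAPHS:
            if word.startswith(digraph, i):
                return i + len(digraph)
        return i + 1
    # The word begins with a consonant: R1 starts after the first vowel.
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return i + 1
    return n  # no vowel at all: null R1


def r1(word):
    """Return the R1 region of word as a string."""
    return word[r1_start(word):]


def in_r1(word, suffix):
    """True if word ends with suffix and the whole suffix lies inside R1."""
    return word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word)


for w in ("tóban", "ablakan", "acsony", "cvs"):
    print(w, repr(r1(w)))
# tóban 'ban'
# ablakan 'lakan'
# acsony 'ony'
# cvs ''
```

With this helper, "delete if in R1" amounts to checking in_r1(word, suffix) before cutting the suffix off.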

Do steps 1 to 9 in turn

Step 1: Remove instrumental case
Search for one of the following suffixes and perform the action indicated.

al   el
delete if in R1 and preceded by a double consonant, and remove one of the double consonants. (In the case of consonant plus digraph, such as ccs, remove a c).
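
A minimal Python sketch of step 1 follows. The function name remove_instrumental is invented for this example, and p1 is assumed to be the index at which R1 starts, computed beforehand according to the definition of R1 given earlier. Removing "one of the double consonants" is done by dropping the second-to-last letter, which is how the undouble routine in the Snowball source appears to behave.

```python
DOUBLE_CONSONANTS = (
    "bb", "cc", "ccs", "dd", "ff", "gg", "ggy", "jj", "kk", "ll", "lly",
    "mm", "nn", "nny", "pp", "rr", "ss", "ssz", "tt", "tty", "vv", "zz", "zzs",
)


def remove_instrumental(word, p1):
    """Step 1 sketch; p1 is the start index of R1 (assumed precomputed)."""
    for suffix in ("al", "el"):
        if word.endswith(suffix):
            cut = len(word) - len(suffix)
            stem = word[:cut]
            if cut >= p1 and stem.endswith(DOUBLE_CONSONANTS):
                # Drop the second-to-last letter: 'kk' -> 'k', 'ccs' -> 'cs'.
                return stem[:-2] + stem[-1]
            return word  # suffix found, but the conditions do not hold
    return word


print(remove_instrumental("babákkal", 2))      # babák
print(remove_instrumental("mulasztással", 2))  # mulasztás
print(remove_instrumental("kulccsal", 2))      # kulcs
```

For the sample word babákkal this step yields babák, which step 9 later reduces to baba. The word kulccsal is not from the sample vocabulary; it is used here only to show the ccs case.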
Step 2: Remove frequent cases
Search for the longest among the following suffixes and perform the action indicated.

ban   ben   ba   be   ra   re   nak   nek   val   vel   tól   től   ról   ről   ból   ből   hoz   hez   höz   nál   nél   ig   at   et   ot   öt   ért   képp   képpen   kor   ul   ül   vá   vé   onként   enként   anként   ként   en   on   an   ön   n   t
delete if in R1
if the remaining word ends in á, replace it by a
if the remaining word ends in é, replace it by e
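
Step 2 can be sketched in Python as below. The function name remove_case is invented for this example and p1 is assumed to be the precomputed start index of R1. Two details follow the Snowball source rather than the wording above and should be read as assumptions: only the longest matching suffix is considered (a shorter one is not tried if the longest lies outside R1), and the final á or é is only shortened when it lies in R1 itself.

```python
CASE_SUFFIXES = (
    "ban", "ben", "ba", "be", "ra", "re", "nak", "nek", "val", "vel",
    "tól", "től", "ról", "ről", "ból", "ből", "hoz", "hez", "höz",
    "nál", "nél", "ig", "at", "et", "ot", "öt", "ért", "képp", "képpen",
    "kor", "ul", "ül", "vá", "vé", "onként", "enként", "anként", "ként",
    "en", "on", "an", "ön", "n", "t",
)
SHORTEN = {"á": "a", "é": "e"}


def remove_case(word, p1):
    """Step 2 sketch; p1 is the start index of R1 (assumed precomputed)."""
    matches = [s for s in CASE_SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)  # the longest suffix is the only candidate
    cut = len(word) - len(suffix)
    if cut < p1:
        return word  # the suffix is not wholly inside R1
    stem = word[:cut]
    # Shorten a final á or é, provided that letter is itself in R1.
    if len(stem) - 1 >= p1 and stem[-1] in SHORTEN:
        stem = stem[:-1] + SHORTEN[stem[-1]]
    return stem


print(remove_case("babaháznak", 2))     # babaház
print(remove_case("baba", 2))           # ba
print(remove_case("mulasztásának", 2))  # mulasztása
```

The last call shows the vowel shortening: nak is removed, leaving mulasztásá, whose final á becomes a. A later step then removes that a, giving the stem mulasztás shown in the sample vocabulary.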
Step 3: Remove special cases:
Search for the longest among the following suffixes and perform the action indicated.

án   ánként
replace by a if in R1

én
replace by e if in R1
Step 4: Remove other cases:
Search for the longest among the following suffixes and perform the action indicated.

astul   estül   stul   stül
delete if in R1

ástul
replace with a if in R1

éstül
replace with e if in R1
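
Steps 3 and 4 share one pattern: find the longest matching suffix, then delete it or replace it by a short vowel if it is in R1. A table-driven Python sketch of that pattern is given below. The names STEP3, STEP4 and apply_table are invented for this example, p1 is assumed to be the precomputed start index of R1, and the example words are illustrative ones that do not come from the sample vocabulary.

```python
# Each table maps a suffix to its replacement; "" means plain deletion.
STEP3 = {"án": "a", "ánként": "a", "én": "e"}
STEP4 = {
    "astul": "", "estül": "", "stul": "", "stül": "",
    "ástul": "a", "éstül": "e",
}


def apply_table(word, p1, table):
    """Apply the longest matching suffix rule if the suffix is in R1."""
    matches = [s for s in table if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    cut = len(word) - len(suffix)
    if cut < p1:
        return word  # the suffix is not wholly inside R1
    return word[:cut] + table[suffix]


print(apply_table("ruhástul", 2, STEP4))      # ruha
print(apply_table("feleségestül", 2, STEP4))  # feleség
print(apply_table("házán", 2, STEP3))         # háza
```

The calls apply each step in isolation. In the full algorithm a word such as házán would already have been reduced by step 2 before step 3 is reached.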
Step 5: Remove factive case
Search for one of the following suffixes and perform the action indicated.

á   é
delete if in R1 and preceded by a double consonant, and remove one of the double consonants (as in step 1).
Step 6: Remove owned
Search for the longest among the following suffixes and perform the action indicated.

oké   öké   aké   eké   ké   éi   é
delete if in R1

áké   áéi
replace with a if in R1

éké   ééi   éé
replace with e if in R1
Step 7: Remove singular owner suffixes
Search for the longest among the following suffixes and perform the action indicated.

ünk   unk   nk   juk   jük   uk   ük   em   om   am   m   od   ed   ad   öd   d   ja   je   a   e   o
delete if in R1

ánk   ájuk   ám   ád   á
replace with a if in R1

énk   éjük   ém   éd   é
replace with e if in R1
replace with e if in R1
Step 8: Remove plural owner suffixes
Search for the longest among the following suffixes and perform the action indicated.

jaim   jeim   aim   eim   im   jaid   jeid   aid   eid   id   jai   jei   ai   ei   i   jaink   jeink   eink   aink   ink   jaitok   jeitek   aitok   eitek   itek   jeik   jaik   aik   eik   ik
delete if in R1

áim   áid   ái   áink   áitok   áik
replace with a if in R1

éim   éid   éi   éink   éitek   éik
replace with e if in R1
Step 9: Remove plural suffixes
Search for the longest among the following suffixes and perform the action indicated.

ák
replace with a if in R1

ék
replace with e if in R1

ök   ok   ek   ak   k
delete if in R1
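
As an illustration of step 9, here is a small Python sketch. The names PLURAL and remove_plural are invented for this example and p1 is assumed to be the precomputed start index of R1. The test words are taken from the sample vocabulary, written with ő for the double-acute letter.

```python
# Plural suffixes mapped to their replacement; "" means plain deletion.
PLURAL = {"ák": "a", "ék": "e", "ök": "", "ok": "", "ek": "", "ak": "", "k": ""}


def remove_plural(word, p1):
    """Step 9 sketch; p1 is the start index of R1 (assumed precomputed)."""
    # Try longer suffixes first so that the longest match decides.
    for suffix in sorted(PLURAL, key=len, reverse=True):
        if word.endswith(suffix):
            cut = len(word) - len(suffix)
            return word[:cut] + PLURAL[suffix] if cut >= p1 else word
    return word


print(remove_plural("babák", 2))        # baba
print(remove_plural("badarságok", 2))   # badarság
print(remove_plural("bacchánsnők", 2))  # bacchánsnő
```

In the last word only the bare k is removed, because the suffix list contains ök but not ők.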



 

The full algorithm in Snowball


/*
Hungarian Stemmer
Removes noun inflections
*/

routines (
    mark_regions
    R1
    v_ending
    case
    case_special
    case_other
    plural
    owned
    sing_owner
    plur_owner
    instrum
    factive
    undouble
    double
)

externals ( stem )

integers ( p1 )
groupings ( v )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a'   hex 'E1'  //a-acute
stringdef e'   hex 'E9'  //e-acute
stringdef i'   hex 'ED'  //i-acute
stringdef o'   hex 'F3'  //o-acute
stringdef o"   hex 'F6'  //o-umlaut
stringdef oq   hex 'F5'  //o-double acute
stringdef u'   hex 'FA'  //u-acute
stringdef u"   hex 'FC'  //u-umlaut
stringdef uq   hex 'FB'  //u-double acute

define v 'aeiou{a'}{e'}{i'}{o'}{o"}{oq}{u'}{u"}{uq}'

define mark_regions as (

    $p1 = limit

    (v goto non-v
     among('cs' 'gy' 'ly' 'ny' 'sz' 'ty' 'zs' 'dzs') or next
     setmark p1)
    or

    (non-v gopast v setmark p1)
)

backwardmode (

    define R1 as $p1 <= cursor

    define v_ending as (
        [substring] R1 among(
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define double as (
        test among('bb' 'cc' 'ccs' 'dd' 'ff' 'gg' 'ggy' 'jj' 'kk' 'll' 'lly'
                   'mm' 'nn' 'nny' 'pp' 'rr' 'ss' 'ssz' 'tt' 'tty' 'vv' 'zz'
                   'zzs')
    )

    define undouble as (
        next [hop 1] delete
    )

    define instrum as(
        [substring] R1 among(
            'al' (double)
            'el' (double)
        )
        delete
        undouble
    )

    define case as (
        [substring] R1 among(
            'ban' 'ben'
            'ba' 'be'
            'ra' 're'
            'nak' 'nek'
            'val' 'vel'
            't{o'}l' 't{oq}l'
            'r{o'}l' 'r{oq}l'
            'b{o'}l' 'b{oq}l'
            'hoz' 'hez' 'h{o"}z'
            'n{a'}l' 'n{e'}l'
            'ig'
            'at' 'et' 'ot' '{o"}t'
            '{e'}rt'
            'k{e'}pp' 'k{e'}ppen'
            'kor'
            'ul' '{u"}l'
            'v{a'}' 'v{e'}'
            'onk{e'}nt' 'enk{e'}nt' 'ank{e'}nt'
            'k{e'}nt'
            'en' 'on' 'an' '{o"}n'
            'n'
            't'
        )
        delete
        v_ending
    )

    define case_special as(
        [substring] R1 among(
            '{e'}n' (<- 'e')
            '{a'}n' (<- 'a')
            '{a'}nk{e'}nt' (<- 'a')
        )
    )

    define case_other as(
        [substring] R1 among(
            'astul' 'est{u"}l' (delete)
            'stul' 'st{u"}l' (delete)
            '{a'}stul' (<- 'a')
            '{e'}st{u"}l' (<- 'e')
        )
    )

    define factive as(
        [substring] R1 among(
            '{a'}' (double)
            '{e'}' (double)
        )
        delete
        undouble
    )

    define plural as (
        [substring] R1 among(
            '{a'}k' (<- 'a')
            '{e'}k' (<- 'e')
            '{o"}k' (delete)
            'ak' (delete)
            'ok' (delete)
            'ek' (delete)
            'k' (delete)
        )
    )

    define owned as (
        [substring] R1 among (
            'ok{e'}' '{o"}k{e'}' 'ak{e'}' 'ek{e'}' (delete)
            '{e'}k{e'}' (<- 'e')
            '{a'}k{e'}' (<- 'a')
            'k{e'}' (delete)
            '{e'}{e'}i' (<- 'e')
            '{a'}{e'}i' (<- 'a')
            '{e'}i' (delete)
            '{e'}{e'}' (<- 'e')
            '{e'}' (delete)
        )
    )

    define sing_owner as (
        [substring] R1 among(
            '{u"}nk' 'unk' (delete)
            '{a'}nk' (<- 'a')
            '{e'}nk' (<- 'e')
            'nk' (delete)
            '{a'}juk' (<- 'a')
            '{e'}j{u"}k' (<- 'e')
            'juk' 'j{u"}k' (delete)
            'uk' '{u"}k' (delete)
            'em' 'om' 'am' (delete)
            '{a'}m' (<- 'a')
            '{e'}m' (<- 'e')
            'm' (delete)
            'od' 'ed' 'ad' '{o"}d' (delete)
            '{a'}d' (<- 'a')
            '{e'}d' (<- 'e')
            'd' (delete)
            'ja' 'je' (delete)
            'a' 'e' 'o' (delete)
            '{a'}' (<- 'a')
            '{e'}' (<- 'e')
        )
    )

    define plur_owner as (
        [substring] R1 among(
            'jaim' 'jeim' (delete)
            '{a'}im' (<- 'a')
            '{e'}im' (<- 'e')
            'aim' 'eim' (delete)
            'im' (delete)
            'jaid' 'jeid' (delete)
            '{a'}id' (<- 'a')
            '{e'}id' (<- 'e')
            'aid' 'eid' (delete)
            'id' (delete)
            'jai' 'jei' (delete)
            '{a'}i' (<- 'a')
            '{e'}i' (<- 'e')
            'ai' 'ei' (delete)
            'i' (delete)
            'jaink' 'jeink' (delete)
            'eink' 'aink' (delete)
            '{a'}ink' (<- 'a')
            '{e'}ink' (<- 'e')
            'ink'
            'jaitok' 'jeitek' (delete)
            'aitok' 'eitek' (delete)
            '{a'}itok' (<- 'a')
            '{e'}itek' (<- 'e')
            'itek' (delete)
            'jeik' 'jaik' (delete)
            'aik' 'eik' (delete)
            '{a'}ik' (<- 'a')
            '{e'}ik' (<- 'e')
            'ik' (delete)
        )
    )
)

define stem as (
    do mark_regions
    backwards (
        do instrum
        do case
        do case_special
        do case_other
        do factive
        do owned
        do sing_owner
        do plur_owner
        do plural
    )
)