French stemming algorithm


 


Here is a sample of French vocabulary, with the stemmed forms that will be generated with this algorithm.

word                    stem           word                 stem
continu             =>  continu        main             =>  main
continua            =>  continu        mains            =>  main
continuait          =>  continu        maintenaient     =>  mainten
continuant          =>  continu        maintenait       =>  mainten
continuation        =>  continu        maintenant       =>  mainten
continue            =>  continu        maintenir        =>  mainten
continué            =>  continu        maintenue        =>  maintenu
continuel           =>  continuel      maintien         =>  maintien
continuelle         =>  continuel      maintint         =>  maintint
continuellement     =>  continuel      maire            =>  mair
continuelles        =>  continuel      maires           =>  mair
continuels          =>  continuel      mairie           =>  mair
continuer           =>  continu        mais             =>  mais
continuera          =>  continu        maïs             =>  maï
continuerait        =>  continu        maison           =>  maison
continueront        =>  continu        maisons          =>  maison
continuez           =>  continu        maistre          =>  maistr
continuité          =>  continu        maitre           =>  maitr
continuons          =>  continuon      maître           =>  maîtr
contorsions         =>  contors        maîtres          =>  maîtr
contour             =>  contour        maîtresse        =>  maîtress
contournait         =>  contourn       maîtresses       =>  maîtress
contournant         =>  contourn       majesté          =>  majest
contourne           =>  contourn       majestueuse      =>  majestu
contours            =>  contour        majestueusement  =>  majestu
contractait         =>  contract       majestueux       =>  majestu
contracté           =>  contract       majeur           =>  majeur
contractée          =>  contract       majeure          =>  majeur
contracter          =>  contract       major            =>  major
contractés          =>  contract       majordome        =>  majordom
contractions        =>  contract       majordomes       =>  majordom
contradictoirement  =>  contradictoir  majorité         =>  major
contradictoires     =>  contradictoir  majorités        =>  major
contraindre         =>  contraindr     mal              =>  mal
contraint           =>  contraint      malacca          =>  malacc
contrainte          =>  contraint      malade           =>  malad
contraintes         =>  contraint      malades          =>  malad
contraire           =>  contrair       maladie          =>  malad
contraires          =>  contrair       maladies         =>  malad
contraria           =>  contrari       maladive         =>  malad



 

The stemming algorithm

Letters in French include the following accented forms,
â   à   ç   ë   é   ê   è   ï   î   ô   û   ù
The following letters are vowels:
a   e   i   o   u   y   â   à   ë   é   ê   è   ï   î   ô   û   ù
Assume the word is in lower case. Then put into upper case u or i preceded and followed by a vowel, and y preceded or followed by a vowel. u after q is also put into upper case. For example,
jouer -> joUer
ennuie -> ennuIe
yeux -> Yeux
quand -> qUand
(The upper case forms are not then classed as vowels — see note on vowel marking.)
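
The marking rule can be sketched in Python as follows. This is only an illustrative reading of the prose rule, not the Snowball prelude itself: the name `mark_vowels` is hypothetical, and the sketch tests the preceding letter in the partially marked word, so it may differ from the Snowball code on unusual runs of adjacent vowels.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def mark_vowels(word: str) -> str:
    """Upper-case u, i and y where the rule treats them as consonants."""
    chars = list(word.lower())
    for i, ch in enumerate(chars):
        # chars[i - 1] may already have been upper-cased (then it is no vowel);
        # chars[i + 1] is still the original letter.
        prev_is_vowel = i > 0 and chars[i - 1] in VOWELS
        next_is_vowel = i + 1 < len(chars) and chars[i + 1] in VOWELS
        if ch in "ui" and prev_is_vowel and next_is_vowel:
            chars[i] = ch.upper()
        elif ch == "y" and (prev_is_vowel or next_is_vowel):
            chars[i] = "Y"
        elif ch == "u" and i > 0 and chars[i - 1] == "q":
            chars[i] = "U"
    return "".join(chars)

for w in ("jouer", "ennuie", "yeux", "quand"):
    print(w, "->", mark_vowels(w))
```

Run on the four words above, the sketch reproduces the marked forms joUer, ennuIe, Yeux and qUand.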

If the word begins with two vowels, RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. (Exceptionally, par, col or tap at the beginning of a word is also taken to define RV as the region to their right.)

For example,
    a i m e r     a d o r e r     v o l e r    t a p i s
         |...|         |.....|       |.....|        |...|
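
As a rough illustration, the start of RV could be computed as below. The function name `rv_start` is an assumption made for this sketch, and the word is assumed to have been lower-cased and vowel-marked already.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word: str) -> int:
    """Return the offset at which RV begins (len(word) if RV is empty)."""
    # Two leading vowels: RV starts after the third letter.
    if len(word) >= 3 and word[0] in VOWELS and word[1] in VOWELS:
        return 3
    # Exception list: par, col, tap define RV as the region to their right.
    if word.startswith(("par", "col", "tap")):
        return 3
    # Otherwise: after the first vowel that is not the first letter.
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    return len(word)

for w in ("aimer", "adorer", "voler", "tapis"):
    print(w, "RV =", w[rv_start(w):])
```

For the four example words this gives the RV regions er, rer, ler and is, in agreement with the diagram.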
R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel. (See note on R1 and R2.)

For example:
    f a m e u s e m e n t
         |......R1.......|
               |...R2....|
Note that R1 can contain RV (adorer), and RV can contain R1 (voler).
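
The definitions of R1 and R2 translate fairly directly into code. The following Python sketch uses a hypothetical helper `r1_r2` that returns the two start offsets; it assumes the word has already been vowel-marked.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def r1_r2(word: str) -> tuple[int, int]:
    """Return (p1, p2), the offsets where R1 and R2 begin."""
    def after_vowel_then_non_vowel(start: int) -> int:
        # Find the first non-vowel that directly follows a vowel,
        # looking only at letters from `start` onwards.
        for i in range(start + 1, len(word)):
            if word[i - 1] in VOWELS and word[i] not in VOWELS:
                return i + 1
        return len(word)  # no such non-vowel: the region is empty

    p1 = after_vowel_then_non_vowel(0)
    p2 = after_vowel_then_non_vowel(p1)
    return p1, p2

p1, p2 = r1_r2("fameusement")
print("R1 =", "fameusement"[p1:])
print("R2 =", "fameusement"[p2:])
```

For fameusement the sketch yields R1 = eusement and R2 = ement, as in the diagram above.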

Below, ‘delete if in R2’ means that a found suffix should be removed if it lies entirely in R2, but not if it overlaps R2 and the rest of the word. ‘delete if in R1 and preceded by X’ means that X itself does not have to come in R1, while ‘delete if preceded by X in R1’ means that X, like the suffix, must be entirely in R1.
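
These two conventions can be made concrete with a pair of small Python helpers. Both names (`suffix_in_region`, `preceded_by`) are invented for this sketch; a region is represented simply by its start offset in the word.

```python
def suffix_in_region(word: str, suffix: str, region_start: int) -> bool:
    """True if `suffix` ends the word and lies entirely inside the region."""
    return word.endswith(suffix) and len(word) - len(suffix) >= region_start

def preceded_by(word: str, suffix: str, x: str, region_start: int = 0) -> bool:
    """True if `x` comes directly before `suffix`.

    With the default region_start=0, x may lie anywhere ('in R1 and
    preceded by X'); pass a region start to require that x itself is
    inside the region ('preceded by X in R1').
    """
    stem = word[: len(word) - len(suffix)]
    return stem.endswith(x) and len(stem) - len(x) >= region_start

# fameusement: R1 starts at offset 3, R2 at offset 6.
print(suffix_in_region("fameusement", "ement", 6))     # ement lies in R2
print(suffix_in_region("fameusement", "eusement", 6))  # overlaps R2 only
print(preceded_by("fameusement", "ement", "eus", 3))   # eus is in R1
```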

Start with step 1

Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the action indicated.

ance   iqUe   isme   able   iste   eux   ances   iqUes   ismes   ables   istes
delete if in R2

atrice   ateur   ation   atrices   ateurs   ations
delete if in R2
if preceded by ic, delete if in R2, else replace by iqU

logie   logies
replace with log if in R2

usion   ution   usions   utions
replace with u if in R2

ence   ences
replace with ent if in R2

ement   ements
delete if in RV
if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
if preceded by eus, delete if in R2, else replace by eux if in R1, otherwise,
if preceded by abl or iqU, delete if in R2, otherwise,
if preceded by ièr or Ièr, replace by i if in RV

ité   ités
delete if in R2
if preceded by abil, delete if in R2, else replace by abl, otherwise,
if preceded by ic, delete if in R2, else replace by iqU, otherwise,
if preceded by iv, delete if in R2

if   ive   ifs   ives
delete if in R2
if preceded by at, delete if in R2 (and if further preceded by ic, delete if in R2, else replace by iqU)

eaux
replace with eau

aux
replace with al if in R1

euse   euses
delete if in R2, else replace by eux if in R1

issement   issements
delete if in R1 and preceded by a non-vowel

amment
replace with ant if in RV

emment
replace with ent if in RV

ment   ments
delete if preceded by a vowel in RV
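
To show how one of these rules plays out, here is a Python sketch of the ité / ités rule only; the other rules of step 1 follow the same pattern. The function names are hypothetical, and the word is assumed to be vowel-marked already.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def region_starts(word: str) -> tuple[int, int]:
    """Return (p1, p2), the offsets where R1 and R2 begin."""
    def scan(start: int) -> int:
        for i in range(start + 1, len(word)):
            if word[i - 1] in VOWELS and word[i] not in VOWELS:
                return i + 1
        return len(word)
    p1 = scan(0)
    return p1, scan(p1)

def strip_ite(word: str) -> str:
    """Apply only the 'ité / ités' rule of step 1."""
    _, p2 = region_starts(word)
    for suffix in ("ités", "ité"):  # longest suffix first
        if word.endswith(suffix):
            break
    else:
        return word
    if len(word) - len(suffix) < p2:
        return word  # the suffix is not in R2: nothing happens
    word = word[: -len(suffix)]
    # Follow-up: abil, ic or iv directly before the removed suffix.
    for inner, replacement in (("abil", "abl"), ("ic", "iqU"), ("iv", None)):
        if word.endswith(inner):
            if len(word) - len(inner) >= p2:
                return word[: -len(inner)]          # in R2: delete
            if replacement is not None:
                return word[: -len(inner)] + replacement
            return word
    return word

for w in ("majorité", "continuité", "probabilité"):
    print(w, "->", strip_ite(w))
```

For majorité and continuité the result already equals the stems listed in the sample vocabulary (major, continu), since no later step changes them.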
In steps 2a and 2b all tests are confined to the RV region.

Do step 2a if either no ending was removed by step 1, or if one of endings amment, emment, ment, ments was found.

Step 2a: Verb suffixes beginning i
Search for the longest among the following suffixes and if found, delete if preceded by a non-vowel.

îmes   ît   îtes   i   ie   ies   ir   ira   irai   iraIent   irais   irait   iras   irent   irez   iriez   irions   irons   iront   is   issaIent   issais   issait   issant   issante   issantes   issants   isse   issent   isses   issez   issiez   issions   issons   it

(Note that the non-vowel itself must also be in RV.)
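
A possible Python rendering of step 2a is sketched below. The name `step_2a` and the convention of passing the RV start offset as `rv` are assumptions of this example; the suffix list is the one given above.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

I_SUFFIXES = (
    "îmes ît îtes i ie ies ir ira irai iraIent irais irait iras irent irez "
    "iriez irions irons iront is issaIent issais issait issant issante "
    "issantes issants isse issent isses issez issiez issions issons it"
).split()

def step_2a(word: str, rv: int) -> str:
    """Remove a verb suffix beginning with i; `rv` is the offset where RV starts."""
    for suffix in sorted(I_SUFFIXES, key=len, reverse=True):
        start = len(word) - len(suffix)
        # The suffix must lie inside RV; the longest such suffix wins.
        if start < rv or not word.endswith(suffix):
            continue
        # The preceding non-vowel must itself be inside RV.
        if start > rv and word[start - 1] not in VOWELS:
            return word[:start]
        return word  # suffix found, but the condition failed
    return word

print(step_2a("maintenir", 2))  # RV of maintenir is "intenir"
print(step_2a("mairie", 2))     # RV of mairie is "irie"
print(step_2a("mais", 2))       # "is" is preceded by a vowel: unchanged
```

The caller can detect whether a suffix was removed by comparing the result with the input.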
Do step 2b if step 2a was done, but failed to remove a suffix.

Step 2b: Other verb suffixes
Search for the longest among the following suffixes, and perform the action indicated.

ions
delete if in R2

é   ée   ées   és   èrent   er   era   erai   eraIent   erais   erait   eras   erez   eriez   erions   erons   eront   ez   iez
delete

âmes   ât   âtes   a   ai   aIent   ais   ait   ant   ante   antes   ants   as   asse   assent   asses   assiez   assions
delete
if preceded by e, delete

(Note that the e that may be deleted in this last step must also be in RV.)
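
Step 2b can be sketched in Python along the same lines. The name `step_2b` and its parameters (`rv` and `p2`, the start offsets of RV and R2) are assumptions of this example.

```python
E_SUFFIXES = (
    "é ée ées és èrent er era erai eraIent erais erait eras erez eriez "
    "erions erons eront ez iez"
).split()
A_SUFFIXES = (
    "âmes ât âtes a ai aIent ais ait ant ante antes ants as asse assent "
    "asses assiez assions"
).split()

def step_2b(word: str, rv: int, p2: int) -> str:
    """Remove other verb suffixes; rv and p2 are the RV and R2 start offsets."""
    candidates = [("ions", "r2")]
    candidates += [(s, "plain") for s in E_SUFFIXES]
    candidates += [(s, "a") for s in A_SUFFIXES]
    candidates.sort(key=lambda c: len(c[0]), reverse=True)  # longest first
    for suffix, kind in candidates:
        start = len(word) - len(suffix)
        if start < rv or not word.endswith(suffix):
            continue  # not a suffix lying inside RV
        if kind == "r2":
            return word[:start] if start >= p2 else word
        stem = word[:start]
        # For the a-group, also drop a preceding e if that e is in RV.
        if kind == "a" and len(stem) > rv and stem.endswith("e"):
            stem = stem[:-1]
        return stem
    return word

print(step_2b("continua", 2, 6))
print(step_2b("contractions", 2, 7))
print(step_2b("mangeait", 2, 8))
```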
If the last step to be obeyed — either step 1, 2a or 2b — altered the word, do step 3

Step 3
Replace final Y with i or final ç with c
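
In Python, step 3 amounts to a two-case check on the last letter. The function name `step_3` is illustrative, and the example inputs are intermediate forms chosen for this sketch.

```python
def step_3(word: str) -> str:
    """Replace a final Y with i, or a final ç with c."""
    if word.endswith("Y"):
        return word[:-1] + "i"
    if word.endswith("ç"):
        return word[:-1] + "c"
    return word

print(step_3("paY"))      # e.g. what is left of paYer once -er is removed
print(step_3("commenç"))  # e.g. what is left of commença once -a is removed
```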
Alternatively, if the last step to be obeyed did not alter the word, do step 4

Step 4: Residual suffix
If the word ends in s, not preceded by a, i, o, u, è or s, delete it.

In the rest of step 4, all tests are confined to the RV region.

Search for the longest among the following suffixes, and perform the action indicated.

ion
delete if in R2 and preceded by s or t

ier   ière   Ier   Ière
replace with i

e
delete

ë
if preceded by gu, delete

(So note that ion is removed only when it is in R2 — as well as being in RV — and preceded by s or t which must be in RV.)
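
The residual-suffix step might be sketched in Python as follows. The name `step_4` and the parameters `rv` and `p2` (start offsets of RV and R2) are assumptions of this example.

```python
def step_4(word: str, rv: int, p2: int) -> str:
    """Residual suffix removal; rv and p2 are the RV and R2 start offsets."""
    # Final s, unless preceded by a, i, o, u, è or s (not confined to RV).
    if len(word) > 1 and word.endswith("s") and word[-2] not in "aiouès":
        word = word[:-1]
    # The remaining tests are confined to RV; the longest suffix wins.
    for suffix in ("Ière", "ière", "Ier", "ier", "ion", "e", "ë"):
        start = len(word) - len(suffix)
        if start < rv or not word.endswith(suffix):
            continue
        if suffix == "ion":
            # Must be in R2 and preceded by s or t, which must be in RV.
            ok = start >= p2 and start > rv and word[start - 1] in "st"
            return word[:start] if ok else word
        if suffix in ("Ière", "ière", "Ier", "ier"):
            return word[:start] + "i"
        if suffix == "e":
            return word[:start]
        # suffix == "ë": delete only if preceded by gu inside RV.
        if start - 2 >= rv and word[:start].endswith("gu"):
            return word[:start]
        return word
    return word

print(step_4("maires", 2, 6))
print(step_4("maïs", 2, 4))
print(step_4("discussion", 2, 6))
```

The first two calls reproduce the stems mair and maï from the sample vocabulary.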
Always do steps 5 and 6.

Step 5: Undouble
If the word ends in enn, onn, ett, ell or eill, delete the last letter.
Step 6: Un-accent
If the word ends in é or è followed by at least one non-vowel, remove the accent from the e.
And finally:
Turn any remaining I, U and Y letters in the word back into lower case.
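
Steps 5 and 6 and the final lower-casing are simple enough to combine in one short Python sketch. The function name `finish` is hypothetical, and the example inputs are intermediate forms chosen for illustration.

```python
import re

NON_VOWEL = "[^aeiouyâàëéêèïîôûù]"

def finish(word: str) -> str:
    """Apply step 5 (undouble), step 6 (un-accent) and the final lower-casing."""
    # Step 5: drop the last letter of a doubled ending.
    if word.endswith(("enn", "onn", "ett", "ell", "eill")):
        word = word[:-1]
    # Step 6: é or è followed only by non-vowels up to the end of the word.
    word = re.sub(rf"[éè]({NON_VOWEL}+)$", r"e\1", word)
    # Finally: turn the marker letters back into lower case.
    return word.replace("I", "i").replace("U", "u").replace("Y", "y")

print(finish("continuell"))  # continuelle after the final e is removed
print(finish("célèbr"))      # célèbre after the final e is removed
print(finish("joU"))         # joUer after -er is removed
```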

 

The same algorithm in Snowball


routines (
           prelude postlude mark_regions
           RV R1 R2
           standard_suffix
           i_verb_suffix
           verb_suffix
           residual_suffix
           un_double
           un_accent
)

externals ( stem )

integers ( pV p1 p2 )

groupings ( v keep_with_s )

stringescapes {}

/* special characters (in ISO Latin I) */

stringdef a^   hex 'E2'  // a-circumflex
stringdef a`   hex 'E0'  // a-grave
stringdef c,   hex 'E7'  // c-cedilla

stringdef e"   hex 'EB'  // e-diaeresis (rare)
stringdef e'   hex 'E9'  // e-acute
stringdef e^   hex 'EA'  // e-circumflex
stringdef e`   hex 'E8'  // e-grave
stringdef i"   hex 'EF'  // i-diaeresis
stringdef i^   hex 'EE'  // i-circumflex
stringdef o^   hex 'F4'  // o-circumflex
stringdef u^   hex 'FB'  // u-circumflex
stringdef u`   hex 'F9'  // u-grave

define v 'aeiouy{a^}{a`}{e"}{e'}{e^}{e`}{i"}{i^}{o^}{u^}{u`}'

define prelude as repeat goto (

    (  v [ ('u' ] v <- 'U') or
           ('i' ] v <- 'I') or
           ('y' ] <- 'Y')
    )
    or
    (  ['y'] v <- 'Y' )
    or
    (  'q' ['u'] <- 'U' )
)

define mark_regions as (

    $pV = limit
    $p1 = limit
    $p2 = limit  // defaults

    do (
        ( v v next )
        or
        among ( // this exception list begun Nov 2006
            'par'  // paris, parie, pari
            'col'  // colis
            'tap'  // tapis
            // extensions possible here
        )
        or
        ( next gopast v )
        setmark pV
    )
    do (
        gopast v gopast non-v setmark p1
        gopast v gopast non-v setmark p2
    )
)

define postlude as repeat (

    [substring] among(
        'I' (<- 'i')
        'U' (<- 'u')
        'Y' (<- 'y')
        ''  (next)
    )
)

backwardmode (

    define RV as $pV <= cursor
    define R1 as $p1 <= cursor
    define R2 as $p2 <= cursor

    define standard_suffix as (
        [substring] among(

            'ance' 'iqUe' 'isme' 'able' 'iste' 'eux'
            'ances' 'iqUes' 'ismes' 'ables' 'istes'
               ( R2 delete )
            'atrice' 'ateur' 'ation'
            'atrices' 'ateurs' 'ations'
               ( R2 delete
                 try ( ['ic'] (R2 delete) or <-'iqU' )
               )
            'logie'
            'logies'
               ( R2 <- 'log' )
            'usion' 'ution'
            'usions' 'utions'
               ( R2 <- 'u' )
            'ence'
            'ences'
               ( R2 <- 'ent' )

            'ement'
            'ements'
            (
                RV delete
                try (
                    [substring] among(
                        'iv'   (R2 delete ['at'] R2 delete)
                        'eus'  ((R2 delete) or (R1<-'eux'))
                        'abl' 'iqU'
                               (R2 delete)
                        'i{e`}r' 'I{e`}r'      //)
                               (RV <-'i')      //)--new 2 Sept 02
                    )
                )
            )
            'it{e'}'
            'it{e'}s'
            (
                R2 delete
                try (
                    [substring] among(
                        'abil' ((R2 delete) or <-'abl')
                        'ic'   ((R2 delete) or <-'iqU')
                        'iv'   (R2 delete)
                    )
                )
            )
            'if' 'ive'
            'ifs' 'ives'
            (
                R2 delete
                try ( ['at'] R2 delete ['ic'] (R2 delete) or <-'iqU' )
            )
            'eaux' (<- 'eau')
            'aux'  (R1 <- 'al')
            'euse'
            'euses'((R2 delete) or (R1<-'eux'))

            'issement'
            'issements'(R1 non-v delete) // verbal

            // fail(...) below forces entry to verb_suffix. -ment typically
            // follows the p.p., e.g 'confus{e'}ment'.

            'amment'   (RV fail(<- 'ant'))
            'emment'   (RV fail(<- 'ent'))
            'ment'
            'ments'    (test(v RV) fail(delete))
                       // v is e,i,u,{e'},I or U
        )
    )

    define i_verb_suffix as setlimit tomark pV for (
        [substring] among (
            '{i^}mes' '{i^}t' '{i^}tes' 'i' 'ie' 'ies' 'ir' 'ira' 'irai'
            'iraIent' 'irais' 'irait' 'iras' 'irent' 'irez' 'iriez'
            'irions' 'irons' 'iront' 'is' 'issaIent' 'issais' 'issait'
            'issant' 'issante' 'issantes' 'issants' 'isse' 'issent' 'isses'
            'issez' 'issiez' 'issions' 'issons' 'it'
                (non-v delete)
        )
    )

    define verb_suffix as setlimit tomark pV for (
        [substring] among (

            'ions'
                (R2 delete)

            '{e'}' '{e'}e' '{e'}es' '{e'}s' '{e`}rent' 'er' 'era' 'erai'
            'eraIent' 'erais' 'erait' 'eras' 'erez' 'eriez' 'erions'
            'erons' 'eront' 'ez' 'iez'

            // 'ons' //-best omitted

                (delete)

            '{a^}mes' '{a^}t' '{a^}tes' 'a' 'ai' 'aIent' 'ais' 'ait' 'ant'
            'ante' 'antes' 'ants' 'as' 'asse' 'assent' 'asses' 'assiez'
            'assions'
                (delete
                 try(['e'] delete)
                )
        )
    )

    define keep_with_s 'aiou{e`}s'

    define residual_suffix as (
        try(['s'] test non-keep_with_s delete)
        setlimit tomark pV for (
            [substring] among(
                'ion'           (R2 's' or 't' delete)
                'ier' 'i{e`}re'
                'Ier' 'I{e`}re' (<-'i')
                'e'             (delete)
                '{e"}'          ('gu' delete)
            )
        )
    )

    define un_double as (
        test among('enn' 'onn' 'ett' 'ell' 'eill') [next] delete
    )

    define un_accent as (
        atleast 1 non-v
        [ '{e'}' or '{e`}' ] <-'e'
    )
)

define stem as (

    do prelude
    do mark_regions
    backwards (

        do (
            (
                 ( standard_suffix or
                   i_verb_suffix or
                   verb_suffix
                 )
                 and
                 try( [ ('Y'   ] <- 'i' ) or
                        ('{c,}'] <- 'c' )
                 )
            ) or
            residual_suffix
        )

        // try(['ent'] RV delete) // is best omitted

        do un_double
        do un_accent
    )
    do postlude
)