The stemming algorithm
Letters in French include the following accented forms:

    â à ç ë é ê è ï î ô û ù
The following letters are vowels:

    a e i o u y â à ë é ê è ï î ô û ù
Assume the word is in lower case. Then put u or i into upper case when it is
both preceded and followed by a vowel, and y when it is preceded or followed
by a vowel. A u after q is also put into upper case. For example,

    jouer   ->  joUer
    ennuie  ->  ennuIe
    yeux    ->  Yeux
    quand   ->  qUand

(The upper case forms are not then classed as vowels — see note on vowel
marking.)
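
The marking rule can be sketched in Python as follows. This is only an illustration: the names `VOWELS` and `mark_vowels` are hypothetical, and the sketch tests each letter against the original lower-case word, which is a simplifying assumption and may differ from a reference implementation on unusual vowel sequences.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def mark_vowels(word):
    """Upper-case u, i and y where the rule above treats them as consonants."""
    chars = list(word)
    n = len(word)
    for k, ch in enumerate(word):
        prev_vowel = k > 0 and word[k - 1] in VOWELS
        next_vowel = k + 1 < n and word[k + 1] in VOWELS
        if ch in "ui" and prev_vowel and next_vowel:
            chars[k] = ch.upper()          # u or i between two vowels
        elif ch == "y" and (prev_vowel or next_vowel):
            chars[k] = "Y"                 # y next to a vowel
        elif ch == "u" and k > 0 and word[k - 1] == "q":
            chars[k] = "U"                 # u after q
    return "".join(chars)

print(mark_vowels("jouer"), mark_vowels("ennuie"),
      mark_vowels("yeux"), mark_vowels("quand"))
```

Because `VOWELS` holds only lower-case letters, the marked forms I, U and Y automatically count as non-vowels in later tests.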
If the word begins with two vowels, RV is the region after the third
letter, otherwise the region after the first vowel not at the beginning of
the word, or the end of the word if these positions cannot be found. (Exceptionally,
par, col or tap at the beginning of a word is also taken to define
RV as the region to its right.)
For example,

    a i m e r     a d o r e r     v o l e r     t a p i s
         |...|         |.....|       |.....|         |...|
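
As a rough sketch of the RV definition, the function below returns the offset at which RV starts, so that `word[rv_start(word):]` is the RV region. The name `rv_start` is hypothetical, and the word is assumed to have been through the vowel-marking step already.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word):
    """Return the index where the RV region begins (len(word) if none)."""
    n = len(word)
    # Exceptional prefixes: RV is the region to their right.
    if word.startswith(("par", "col", "tap")):
        return 3
    # Word begins with two vowels: RV is the region after the third letter.
    if n >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, n)
    # Otherwise: region after the first vowel not at the beginning of the word.
    for k in range(1, n):
        if word[k] in VOWELS:
            return k + 1
    return n

for w in ("aimer", "adorer", "voler", "tapis"):
    print(w, w[rv_start(w):])
```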
R1 is the region after the first non-vowel following a vowel, or the end of
the word if there is no such non-vowel.
R2 is the region after the first non-vowel following a vowel in R1, or the
end of the word if there is no such non-vowel.
(See note on R1 and R2.)
For example:

    f a m e u s e m e n t
         |......R1.......|
               |...R2....|
Note that R1 can contain RV (adorer), and RV can contain R1 (voler).
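
The R1 and R2 definitions can be illustrated with a small Python sketch. The helper name `r1_r2` is hypothetical; it returns the start offsets of the two regions, with `len(word)` standing for "the end of the word".

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def r1_r2(word):
    """Return (r1, r2): the start offsets of regions R1 and R2."""
    def after_vowel_nonvowel(start):
        # Position just past the first non-vowel that follows a vowel,
        # searching from `start`; end of word if there is none.
        for k in range(start + 1, len(word)):
            if word[k - 1] in VOWELS and word[k] not in VOWELS:
                return k + 1
        return len(word)

    r1 = after_vowel_nonvowel(0)
    r2 = after_vowel_nonvowel(r1)
    return r1, r2

r1, r2 = r1_r2("fameusement")
print("fameusement"[r1:], "fameusement"[r2:])   # eusement ement
```

With these offsets, adorer has R1 = "orer" (which contains its RV, "rer"), while voler has R1 = "er" (contained in its RV, "ler").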
Below, ‘delete if in R2’ means that a found suffix should be removed if it
lies entirely in R2, but not if it overlaps R2 and the rest of the word.
‘delete if in R1 and preceded by X’ means that X itself does not have to
come in R1, while ‘delete if preceded by X in R1’ means that X, like the
suffix, must be entirely in R1.
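
The two phrasings can be told apart with a pair of small predicates. This is a sketch under the assumption that regions are represented by their start offsets; the names `suffix_in_region` and `preceded_by` are hypothetical.

```python
def suffix_in_region(word, suffix, region_start):
    """True if word ends with suffix and the suffix lies entirely in the region."""
    return word.endswith(suffix) and len(word) - len(suffix) >= region_start

def preceded_by(word, suffix, x, region_start=0):
    """True if x comes immediately before suffix and x starts at or after region_start."""
    if not word.endswith(suffix):
        return False
    cut = len(word) - len(suffix)
    return word[:cut].endswith(x) and cut - len(x) >= region_start

word, r1, r2 = "fameusement", 3, 6
print(suffix_in_region(word, "ement", r2))      # True: ement lies in R2
print(suffix_in_region(word, "eusement", r2))   # False: overlaps R2 and the rest
print(preceded_by(word, "ement", "eus"))        # True: eus need not be in a region
print(preceded_by(word, "ement", "eus", r2))    # False: eus is not in R2
```

"Delete if in R1 and preceded by X" corresponds to `suffix_in_region(word, suffix, r1) and preceded_by(word, suffix, X)`, whereas "delete if preceded by X in R1" corresponds to `preceded_by(word, suffix, X, r1)`.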
Start with step 1.
Step 1: Standard suffix removal
Search for the longest among the following suffixes, and perform the
action indicated.

- ance iqUe isme able iste eux ances iqUes ismes ables istes
    - delete if in R2
- atrice ateur ation atrices ateurs ations
    - delete if in R2
    - if preceded by ic, delete if in R2, else replace by iqU
- logie logies
    - replace with log if in R2
- usion ution usions utions
    - replace with u if in R2
- ence ences
    - replace with ent if in R2
- ement ements
    - delete if in RV
    - if preceded by iv, delete if in R2 (and if further preceded by at,
      delete if in R2), otherwise,
    - if preceded by eus, delete if in R2, else replace by eux
      if in R1, otherwise,
    - if preceded by abl or iqU, delete if in R2, otherwise,
    - if preceded by ièr or Ièr, replace by i if in RV
- ité ités
    - delete if in R2
    - if preceded by abil, delete if in R2, else replace by abl,
      otherwise,
    - if preceded by ic, delete if in R2, else replace by iqU, otherwise,
    - if preceded by iv, delete if in R2
- if ive ifs ives
    - delete if in R2
    - if preceded by at, delete if in R2 (and if further preceded by ic,
      delete if in R2, else replace by iqU)
- eaux
    - replace with eau
- aux
    - replace with al if in R1
- euse euses
    - delete if in R2, else replace by eux if in R1
- issement issements
    - delete if in R1 and preceded by a non-vowel
- amment
    - replace with ant if in RV
- emment
    - replace with ent if in RV
- ment ments
    - delete if preceded by a vowel in RV
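
To show how the "longest suffix, then action" pattern works, here is a sketch that covers only three of the step 1 rules (logie, usion/ution and ence, each replaced if in R2). It is deliberately partial: the names `region_starts`, `REPLACE_IF_IN_R2` and `step_1_subset` are hypothetical, and the remaining rules of step 1 are not implemented.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def region_starts(word):
    """Return (r1, r2) start offsets as defined for R1 and R2."""
    def after_vowel_nonvowel(start):
        for k in range(start + 1, len(word)):
            if word[k - 1] in VOWELS and word[k] not in VOWELS:
                return k + 1
        return len(word)
    r1 = after_vowel_nonvowel(0)
    return r1, after_vowel_nonvowel(r1)

# Suffix -> replacement, applied only when the suffix lies entirely in R2.
REPLACE_IF_IN_R2 = {
    "logie": "log", "logies": "log",
    "usion": "u", "ution": "u", "usions": "u", "utions": "u",
    "ence": "ent", "ences": "ent",
}

def step_1_subset(word):
    """Apply the longest matching suffix rule from the subset above."""
    _, r2 = region_starts(word)
    matches = [s for s in REPLACE_IF_IN_R2 if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)        # longest match decides the rule
    cut = len(word) - len(suffix)
    if cut >= r2:                         # suffix entirely inside R2
        return word[:cut] + REPLACE_IF_IN_R2[suffix]
    return word                           # found, but condition not met

print(step_1_subset("méthodologie"))   # méthodolog
print(step_1_subset("révolution"))     # révolu
print(step_1_subset("confusion"))      # confusion (usion is not in R2)
```

Note that once the longest suffix is chosen, a failed condition leaves the word unchanged; the sketch does not fall back to a shorter suffix.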
In steps 2a and 2b all tests are confined to the RV region.
Do step 2a if either no ending was removed by step 1, or if one of the
endings amment, emment, ment, ments was found.
Step 2a: Verb suffixes beginning i
Search for the longest among the following suffixes and, if found,
delete if preceded by a non-vowel.

    îmes ît îtes i ie ies ir ira irai iraIent irais irait iras
    irent irez iriez irions irons iront is issaIent issais issait
    issant issante issantes issants isse issent isses issez issiez
    issions issons it

(Note that the non-vowel itself must also be in RV.)
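
A minimal sketch of step 2a follows, assuming the start offset of RV is passed in as `rv` (both `step_2a` and `STEP_2A_SUFFIXES` are hypothetical names). The suffix is searched for inside RV only, and the preceding non-vowel must also lie in RV.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

STEP_2A_SUFFIXES = (
    "îmes", "ît", "îtes", "i", "ie", "ies", "ir", "ira", "irai", "iraIent",
    "irais", "irait", "iras", "irent", "irez", "iriez", "irions", "irons",
    "iront", "is", "issaIent", "issais", "issait", "issant", "issante",
    "issantes", "issants", "isse", "issent", "isses", "issez", "issiez",
    "issions", "issons", "it",
)

def step_2a(word, rv):
    """Delete the longest i-verb suffix in RV if preceded by a non-vowel in RV."""
    region = word[rv:]
    matches = [s for s in STEP_2A_SUFFIXES if region.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    cut = len(word) - len(suffix)
    # The letter before the suffix must be in RV and must not be a vowel.
    if cut - 1 >= rv and word[cut - 1] not in VOWELS:
        return word[:cut]
    return word

print(step_2a("finissons", 2))   # fin
print(step_2a("nuit", 2))        # nuit (the letter before "it" is outside RV)
```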
Do step 2b if step 2a was done, but failed to remove a suffix.
Step 2b: Other verb suffixes
Search for the longest among the following suffixes, and perform the
action indicated.

- ions
    - delete if in R2
- é ée ées és èrent er era erai eraIent erais erait eras erez
  eriez erions erons eront ez iez
    - delete
- âmes ât âtes a ai aIent ais ait ant ante antes ants as asse
  assent asses assiez assions
    - delete
    - if preceded by e, delete
(Note that the e that may be deleted in this last step must also be in
RV.)
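
Step 2b might be sketched like this, assuming the RV and R2 start offsets are supplied as `rv` and `r2`. The names `step_2b`, `DELETE` and `DELETE_WITH_E` are hypothetical.

```python
DELETE = (
    "é", "ée", "ées", "és", "èrent", "er", "era", "erai", "eraIent", "erais",
    "erait", "eras", "erez", "eriez", "erions", "erons", "eront", "ez", "iez",
)
DELETE_WITH_E = (
    "âmes", "ât", "âtes", "a", "ai", "aIent", "ais", "ait", "ant", "ante",
    "antes", "ants", "as", "asse", "assent", "asses", "assiez", "assions",
)

def step_2b(word, rv, r2):
    """Apply the action for the longest step 2b suffix found in RV."""
    region = word[rv:]
    candidates = ("ions",) + DELETE + DELETE_WITH_E
    matches = [s for s in candidates if region.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    cut = len(word) - len(suffix)
    if suffix == "ions":
        return word[:cut] if cut >= r2 else word   # delete only if in R2
    stem = word[:cut]
    # For the third group, also delete a preceding e that lies in RV.
    if suffix in DELETE_WITH_E and cut - 1 >= rv and stem.endswith("e"):
        stem = stem[:-1]
    return stem

print(step_2b("mangeait", 2, 8))   # mang
print(step_2b("parler", 3, 6))     # parl
print(step_2b("nations", 2, 6))    # nations (ions is not in R2)
```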
If the last step to be obeyed (step 1, 2a or 2b) altered the word,
do step 3.
Step 3
Replace final Y with i or final ç with c.
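
In Python, step 3 amounts to a single check on the last letter. The function name `step_3` is hypothetical.

```python
def step_3(word):
    """Replace a final Y with i, or a final ç with c."""
    if word.endswith("Y"):
        return word[:-1] + "i"
    if word.endswith("ç"):
        return word[:-1] + "c"
    return word

print(step_3("essaY"), step_3("commenç"))   # essai commenc
```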
Alternatively, if the last step to be obeyed did not alter the word, do
step 4.
Step 4: Residual suffix
If the word ends with s, not preceded by a, i, o, u, è or s, delete it.
In the rest of step 4, all tests are confined to the RV region.
Search for the longest among the following suffixes, and perform the
action indicated.

- ion
    - delete if in R2 and preceded by s or t
- ier ière Ier Ière
    - replace with i
- e
    - delete
- ë
    - if preceded by gu, delete
(So note that ion is removed only when it is in R2 — as well as being
in RV — and preceded by s or t which must be in RV.)
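
The following is one possible reading of step 4 in Python, assuming `rv` and `r2` are the region start offsets computed before the step. The name `step_4` is hypothetical, and the treatment of a one-letter word "s" (left unchanged because nothing precedes the s) is an assumption of this sketch.

```python
def step_4(word, rv, r2):
    """Residual suffix removal, with suffix tests confined to RV."""
    # Final s not preceded by a, i, o, u, è or s is deleted first.
    if len(word) > 1 and word.endswith("s") and word[-2] not in "aiouès":
        word = word[:-1]
    region = word[rv:]
    suffixes = ("ion", "ier", "ière", "Ier", "Ière", "e", "ë")
    matches = [s for s in suffixes if region.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    cut = len(word) - len(suffix)
    if suffix == "ion":
        # ion must be in R2; the preceding s or t must be in RV.
        ok = cut >= r2 and cut - 1 >= rv and word[cut - 1] in "st"
        return word[:cut] if ok else word
    if suffix in ("ier", "ière", "Ier", "Ière"):
        return word[:cut] + "i"
    if suffix == "e":
        return word[:cut]
    # suffix is ë: delete only if preceded by gu inside RV.
    if cut - 2 >= rv and word[cut - 2:cut] == "gu":
        return word[:cut]
    return word

print(step_4("portes", 2, 6))       # port
print(step_4("discussion", 2, 6))   # discuss
print(step_4("première", 3, 7))     # premi
```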
Always do steps 5 and 6.
Step 5: Undouble
If the word ends with enn, onn, ett, ell or eill, delete the last letter.
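
A short sketch of the undoubling step, using a hypothetical function name `undouble`:

```python
def undouble(word):
    """Drop the last letter of a word ending in enn, onn, ett, ell or eill."""
    if word.endswith(("enn", "onn", "ett", "ell", "eill")):
        return word[:-1]
    return word

print(undouble("appell"), undouble("personn"), undouble("conseill"))
```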
Step 6: Un-accent
If the word ends with é or è followed by at least one non-vowel, remove
the accent from the e.
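
One way to express the un-accent step is to skip back over the trailing non-vowels and inspect the letter before them. The name `unaccent` is hypothetical; as elsewhere, the marked letters I, U and Y count as non-vowels because `VOWELS` contains only lower-case forms.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def unaccent(word):
    """Turn é or è into e when it is followed only by one or more non-vowels."""
    k = len(word)
    while k > 0 and word[k - 1] not in VOWELS:
        k -= 1                       # skip the trailing non-vowels
    # k < len(word) guarantees at least one non-vowel was skipped.
    if 0 < k < len(word) and word[k - 1] in "éè":
        return word[:k - 1] + "e" + word[k:]
    return word

print(unaccent("répét"))   # répet
print(unaccent("été"))     # été (no trailing non-vowel)
```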
And finally:
Turn any remaining I, U and Y letters in the word back into lower case.
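
This last step can be sketched with a character translation table. The function name `restore_case` is hypothetical.

```python
def restore_case(word):
    """Map the marker letters I, U and Y back to lower case."""
    return word.translate(str.maketrans("IUY", "iuy"))

print(restore_case("joUer"), restore_case("Yeux"))   # jouer yeux
```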