[Snowball-discuss] French stemmer, Snowball project

From: Martin Porter (martin_porter@softhome.net)
Date: Mon Sep 09 2002 - 12:24:01 BST

[This message to Fred Brault ought to be posted on snowball-discuss, as it
gives the background to some recent changes to the Snowball site. It was
originally sent 1 Sept 02 - Martin]


I've looked at the algorithm alongside your comments, and can now report back.

Things aren't as bad as I at first thought ...

A) The rule for marking u and i/y as consonants is not explained at all
well, and will have to be improved. The point is that it works from left to
right along the word. So in risquiez (to quote one of your examples), i is a
vowel because it is between consonants, u is a consonant because it is after
q, and the following i is a vowel because it is after U (u as a consonant)
and before e. So the word becomes risqUiez, and not, as you have it, risqUIez.

If you think about it, the i should not be treated as a consonant in this case.

This accounts for most of the differences you noted.
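
To make the left-to-right point in (A) concrete, here is a minimal Python
sketch. It is not the Snowball source. The vowel list and the three marking
rules (u or i between vowels, y next to a vowel, u after q) are assumptions
taken from the published description of the French stemmer, and all the
function names are invented for illustration. The only difference between
the two variants is whether the neighbours are read from the word as already
marked, or from the untouched original.

```python
# Assumed French vowel set, lower case only, so a marked U/I/Y is no longer a vowel.
VOWELS = set("aeiouyâàëéêèïîôûù")


def _mark(chars, context):
    """Upper-case u, i, y that act as consonants; neighbours come from `context`."""
    for k, ch in enumerate(chars):
        prev = context[k - 1] if k > 0 else ""
        nxt = context[k + 1] if k + 1 < len(context) else ""
        if ch in ("u", "i") and prev in VOWELS and nxt in VOWELS:
            chars[k] = ch.upper()   # u or i between two vowels
        elif ch == "y" and (prev in VOWELS or nxt in VOWELS):
            chars[k] = "Y"          # y preceded or followed by a vowel
        elif ch == "u" and prev == "q":
            chars[k] = "U"          # u after q
    return "".join(chars)


def mark_left_to_right(word):
    """Neighbours are read from the word as marked so far (the intended reading)."""
    chars = list(word)
    return _mark(chars, chars)


def mark_all_at_once(word):
    """Neighbours are read from the original word (the misreading)."""
    return _mark(list(word), list(word))


print(mark_left_to_right("risquiez"))  # risqUiez
print(mark_all_at_once("risquiez"))    # risqUIez
```

With the left-to-right reading, the i in risquiez sees U (not a vowel) on its
left and so stays a vowel. The same holds for the i in évanouie.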

B) There is a clear slip in the description of the algorithm, which you have
spotted. For 'eus' before 'ement', it should read: "delete if in R2, or
replace by 'eux' if in R1". I will put that right. ('ement' is tricky
because it is a single form for two endings - see the table of endings for
the Romance languages.)

This explains the 'pieusement' example. I realise the algorithm is failing
to equate pieuse with pieux, but that is a short-stem problem (see below).
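
The corrected rule in (B) can be sketched in Python as follows. This is an
illustration only, under stated assumptions: R1 is taken to be the region
after the first non-vowel following a vowel, R2 is the same rule applied
inside R1, the RV test that guards the removal of 'ement' is left out, and
the function names are hypothetical.

```python
# Assumed French vowel set (lower case only).
VOWELS = set("aeiouyâàëéêèïîôûù")


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel, scanning from `start`."""
    for k in range(start + 1, len(word)):
        if word[k] not in VOWELS and word[k - 1] in VOWELS:
            return k + 1
    return len(word)  # region is empty


def strip_eus_before_ement(word):
    """Remove 'ement', then treat a preceding 'eus' as the corrected rule says.

    Assumes `word` ends in 'ement' and that the removal itself is permitted
    (the RV check is omitted in this sketch).
    """
    r1 = region_start(word)
    r2 = region_start(word, r1)
    stem = word[: -len("ement")]
    if stem.endswith("eus"):
        pos = len(stem) - len("eus")
        if pos >= r2:
            return stem[:pos]           # 'eus' lies in R2: delete it
        if pos >= r1:
            return stem[:pos] + "eux"   # 'eus' lies in R1: replace by 'eux'
    return stem                         # otherwise leave the stem alone


print(strip_eus_before_ement("heureusement"))    # heureux
print(strip_eus_before_ement("courageusement"))  # courag
print(strip_eus_before_ement("pieusement"))      # pieus
```

In pieusement the 'eus' starts before R1, so neither branch applies and the
result stays 'pieus', which is the short-stem case mentioned above.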

But the treatment of -ier etc in step 4 is correct. Remember that the test
operates in RV (this is declared at the front of the step). In crier, RV is
just [er], not [ier]. I realise that -er is an ending in crier, but allowing
for very short stems like 'cri' leads to too many errors generally.
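
A small Python sketch of the RV test may help here. It assumes the usual
French definition of RV (if the word begins with two vowels, the region after
the third letter; otherwise the region after the first vowel not at the
beginning of the word; otherwise the end of the word) and ignores any special
cases. The names rv_start and suffix_in_rv are made up for this example.

```python
# Assumed French vowel set (lower case only).
VOWELS = set("aeiouyâàëéêèïîôûù")


def rv_start(word):
    """Return the index at which RV begins, under the assumed definition."""
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))     # region after the third letter
    for k in range(1, len(word)):
        if word[k] in VOWELS:
            return k + 1             # region after the first non-initial vowel
    return len(word)                 # no such position: RV is empty


def suffix_in_rv(word, suffix):
    """True if `word` ends in `suffix` and the whole suffix lies inside RV."""
    return word.endswith(suffix) and len(word) - len(suffix) >= rv_start(word)


print("crier"[rv_start("crier"):])     # er
print(suffix_in_rv("crier", "ier"))    # False
print(suffix_in_rv("premier", "ier"))  # True
```

So the step 4 test for -ier fails on crier, because the i is outside RV, but
succeeds on premier.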

This is a problem with all the Romance language stemmers, and the definition
of RV, rather complicated for some of them, is trying to get the balance
just right. It is because there are in all these languages certain verbs
with very short stems: crier, prier, rier etc. I have occasionally tried to
establish lists of such verbs, but it is not easy.

C) Now for your final suggestion, to respell i{e`}r as i{e`}re etc after
ement removal. This doesn't quite work since iere is only removed later if
nothing was done in the step that removed ement, but you can get the effect
by just replacing i{e`}r in RV with i, following ement removal.

This leads to the following pattern of changes:

familier famili
familièrement familier -> famili

financier financi
financièrement financier -> financi

foncièrement foncier -> fonci

grossière grossi
grossièrement grossier -> grossi
grossières grossi

irrégulière irréguli
irrégulièrement irrégulier -> irréguli
irréguliers irréguli

particulière particuli
particulièrement particulier -> particuli
particuliers particuli

premier premi
première premi
premièrement premier -> premi
premières premi
premiers premi

régulier réguli
régulière réguli
régulièrement régulier -> réguli

singulier singuli
singulière singuli
singulièrement singulier -> singuli
singulières singuli
singuliers singuli

- a definite improvement so I will put it in. The change in the Snowball
script is

                try (
                    [substring] among(
                        'iv' (R2 delete ['at'] R2 delete)
                        'eus' ((R2 delete) or (R1<-'eux'))
                        'abl' 'iqU' (R2 delete)
                        'i{e`}r'                        <---new
                        'I{e`}r'                        <---new
                            (RV <-'i')                  <---new

The case I{e`}r should be included here, and yet there are no words in the
sample vocabulary that illustrate it. The question is, can you think of a
word ending Vi{e`}rement in French, where V is a vowel!?
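
For anyone implementing the stemmer in another language, the effect of change
(C) can be sketched in Python roughly as below. This is a simplified
illustration, not a translation of the Snowball script: it covers only the
new 'i{e`}r' / 'I{e`}r' case after 'ement' or 'ements' is removed in RV, it
assumes the usual definition of RV, and the function names are invented.

```python
# Assumed French vowel set (lower case only).
VOWELS = set("aeiouyâàëéêèïîôûù")


def rv_start(word):
    """Index at which RV begins (assumed definition, no special cases)."""
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    for k in range(1, len(word)):
        if word[k] in VOWELS:
            return k + 1
    return len(word)


def remove_ement(word):
    """Delete 'ement(s)' in RV, then replace a preceding 'ièr'/'Ièr' in RV by 'i'."""
    rv = rv_start(word)
    for suffix in ("ements", "ement"):
        if word.endswith(suffix) and len(word) - len(suffix) >= rv:
            stem = word[: -len(suffix)]
            for ending in ("ièr", "Ièr"):  # the new cases
                if stem.endswith(ending) and len(stem) - len(ending) >= rv:
                    return stem[: -len(ending)] + "i"
            return stem
    return word  # no 'ement(s)' in RV: nothing to do in this sketch


for w in ("premièrement", "foncièrement", "singulièrement"):
    print(w, "->", remove_ement(w))
# premièrement -> premi
# foncièrement -> fonci
# singulièrement -> singuli
```

The results agree with the right-hand column of the table above for these
three words.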


I will add (A), (B), (C) in soon. Right now the website is being reorganised
(possibly going to a new server), but when Richard Boulton has finished that
I will put the changes in place.


At 12:19 PM 8/31/02 -0400, FREDERICK BRAULT wrote:
>Dear Mr. Porter, dear Mr. Boulton,
>I implemented the French stemmer that you suggest through the Snowball
>project and it functions well indeed. However, I identified a few small
>inconsistencies that I wanted to share with you to contribute to the
>improvement of the Snowball project.
>I suspect there are a few errors in the script that generated the list of
>French words and stems that is provided by the Snowball project in order
>to check the efficiency of any other implementation of the stemmer
>algorithm (or maybe it is just errors in the list itself). The errors
>occur with the suffixes “ier, ière, Ier and Ière” of step 4. According to
>the algorithm, these suffixes should be replaced by “i” but aren’t in the
>list (which is reproduced in the attached file. Open it in Microsoft
>Paint if you can't see it well). In the list below, I
>suspect that these suffixes didn’t work and that the next operation in
>step 4 (with the suffix ‘e’) was carried out and then, by step 6, the
>remaining accent was removed. It is important not to miss step 4
>because then, for example, the masculine “entier” is not associated with
>its feminine counterpart “entière”, as can be seen in the list.
>Another error in the list is the word “pieusement” (also reproduced in
>the attached file) that should give “pieux” by virtue of step 1 in
>the “else” part of the rule of the “ement and ements” suffixes. In the
>list below, the “else” part wasn’t executed and then gave “pieus”.
>I would also suggest adding the suffix “Ie” in step 2a. Because of the
>definition of the vowels, such words as “évanouie” and “réjouie” (not in
>the attached file) give “évanoui” and “réjoui” which are not grouped
>then with the other words in the same class (évanoui, évanouie, évanouir,
>évanouirent, évanouis, évanouissait, évanouissement, évanouit) that
>give “évanou” and “réjou”. The problem with the current suffixes is that
>the “u” and “i” get upper cased because they are between vowels
>giving “évanoUIe” and “réjoUIe”. By adding the suffix “Ie” in step 2a,
>the problem is solved.
>Another suggestion is to add the suffix “Iez” in step 2b along with “é,
>ée, ées, és, ... , ez, iez”. Some words like “risquiez, renvoyiez and
>payiez” give “risqUIez, renvoyIez and payIez” because the “i” is between
>vowels. Maybe I got the rules for manipulating the vowels wrong or the
>suffix “Iez” has been forgotten in the description of the algorithm
>because in the “checking” list provided by the Snowball project, the
>words “risqUIez, renvoyIez and payIez” are correctly stemmed.
>Finally, another suggestion, although I am not sure if everybody would
>agree with it. Let's see! I would suggest adding a rule to step 1,
>about the "ement and ements" suffixes. Here it goes: "if preceded
>by "ièr", replace by "ière" (with no consideration of R1, R2 or RV)". The
>remaining "ière" suffix would also be removed later by step 4. This would
>allow adverbs derived from the feminine adjectives to be together with
>other words that have a close meaning. For example, "premièrement"
>("firstly") would give "premièrement" --> "première" --> "premi". This would
>allow the words "premier" ("first", masculine), "première" ("first",
>feminine) and "premièrement" ("firstly") to be grouped together. However,
>as I said, I am not sure if everybody wants the adverbs to be grouped
>with the adjectives and nouns. The current algorithm separates the
>adjectives and nouns from the adverbs.
>Well, this is it! I hope I didn’t make any mistakes myself. The
>corrections I suggest seem to solve the problems to get the right answers
>according to the checking list. However, I don’t know if the corrections
>would cause trouble with other words that aren’t in the list. That would
>have to be verified.
>Fred Brault
>Attachment Converted: C:\EUDORA\ATTACH\3.gif

This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:43 BST