Marking vowels as consonants
Some of the algorithms begin with a step which puts letters which are
normally classed as vowels into upper case to indicate that they are are to be
treated as consonants (the assumption being that the words are presented to
the stemmers in lower case). Upper case therefore acts as a flag indicating a
consonant.
For example, the English stemmer begins with the step
-
Set initial y, or y after a vowel, to Y,
giving rise to the following changes,
youth | | -> | | Youth
| boy | | -> | | boY
| boyish | | -> | | boYish
| fly | | -> | | fly
| flying | | -> | | flying
| syzygy | | -> | | syzygy
|
This process works from left to right, and
if a word contains Vyy, where V is a vowel, the first y is put
into upper case, but the second y is left alone, since it is preceded by
upper case Y which is a consonant. A sequence Vyyyyy... would be
changed to VYyYyY....
The combination yy never occurs in English, although it might appear in
foreign words:
(A sayyid, my dictionary tells me, is a descendant of Mohammed's daughter
Fatima.) But the left-to-right process is significant in other languages, for
example French. In French the rule for marking vowels as consonants is,
-
Put into upper case u or i preceded and followed by a vowel, and
y preceded or followed by a vowel. Put u after q into upper
case.
which gives rise to,
ennuie | | -> | | ennuIe
| inquiétude | | -> | | inqUiétude
|
In the first word, i is put into upper case since it has a vowel on both
sides of it.
In the second word, u after q is put into upper case, and again the
following i is left alone, since it is preceded by upper case U which
is a consonant.
|