No, I meant it is better, because in UTF-8 encoding the first byte of each
character is often "constant" within a particular alphabet.
(One must not be too influenced by the range of current stemmers; new ones
slowly trickle in.)
I was thinking about this choice of characters 0 to n-1. The original Porter
stemmer was coded up so that a program switch took place on the character
position which gave rise to the largest number of cases. For this optimisation
one would like to pick out the character position with the smallest number
of cases, but only supposing that the corresponding character from the
string being stemmed had an even spread. Clearly for UTF-8 encoded
characters the spread is very eccentric, but taking byte n-1 in the
string-forwards among avoids the constant-character problem.
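
To make the constant-character problem concrete, here is a rough sketch in
Python. The list AMONG and the helper distinct_bytes_at are made up purely for
illustration; they are not taken from any Snowball source. It compares the
byte values seen at position 0 with those seen at position n-1, where n is the
byte length of the shortest string.

```python
# Hypothetical among: a handful of Russian strings, chosen only as an example.
AMONG = ["ая", "ого", "ему", "ыми", "ую"]


def distinct_bytes_at(strings, pos):
    """Return the set of byte values at index `pos` of each UTF-8 encoded string."""
    return {s.encode("utf-8")[pos] for s in strings}


# n is the size in bytes of the smallest string in the among.
n = min(len(s.encode("utf-8")) for s in AMONG)

first = distinct_bytes_at(AMONG, 0)     # lead bytes only: 0xD0 or 0xD1
last = distinct_bytes_at(AMONG, n - 1)  # a trailing byte, which varies more

print(n, sorted(hex(b) for b in first), sorted(hex(b) for b in last))
```

For these five strings byte 0 takes only two values, while byte n-1 takes
four, so a test on byte n-1 separates the cases much better.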
At 18:47 18/09/2006 +0100, Olly Betts wrote:
>On Mon, Sep 18, 2006 at 06:15:22PM +0200, Martin Porter wrote:
>> For string-forward among, surely the byte to take is not byte 0, but byte
>> n-1, where n is the size of the smallest string in the among.
>Are you saying it's currently incorrect?
>Or that taking this byte may give a better optimisation, because it
>avoids the problem with Cyrillic characters always starting with one of
>two bytes in UTF-8?
>Assuming the latter, since we know the cases when we generate the
>shortcut, we could actually look at all the different choices of
>bytes between 0 and n-1 and potentially choose a different strategy
>for each among.
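
A possible shape for that per-among choice, again only as a Python sketch: the
names byte_spread and choose_position are hypothetical, and scoring a position
by its number of distinct byte values (with `prefer` left as max or min) is an
assumption, since which criterion suits the shortcut is exactly what is under
discussion here.

```python
def byte_spread(strings):
    """Map each byte position 0..n-1 to the set of byte values seen there,
    where n is the UTF-8 length of the shortest string."""
    encoded = [s.encode("utf-8") for s in strings]
    n = min(len(b) for b in encoded)
    return {pos: {b[pos] for b in encoded} for pos in range(n)}


def choose_position(strings, prefer=max):
    """Pick a byte position by how many distinct values occur at it.

    `prefer` is max or min (an assumption: the right criterion is left open).
    Ties go to the lowest position.
    """
    spread = byte_spread(strings)
    target = prefer(len(values) for values in spread.values())
    return next(pos for pos in sorted(spread) if len(spread[pos]) == target)


# Two hypothetical amongs: the chosen position can differ from one to the other.
cyrillic = ["ая", "ого", "ему", "ыми", "ую"]
latin = ["ing", "ed", "ly"]
print(choose_position(cyrillic), choose_position(latin))
```

With prefer=max the Cyrillic example picks byte 1 (five distinct values)
rather than byte 0 or byte n-1, while the Latin example stays on byte 0.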