Re: [Snowball-discuss] Local variables for snowball

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Tue Sep 12 2006 - 08:44:31 BST


>In particular, for
>the English stemmer around a third of the time is spent in
>find_among_b()

-- and similarly for the other stemmers. find_among_b is an 'among(...)'
done backwards, which is what the stemmers are doing most of the time.

>so that's a good candidate for optimising (if only I
>understood it!)

-- and it has been optimised accordingly. (Not that I'm saying it could not
be made faster!) If an among contains N strings, an incoming string is
tested against them by a series of byte-to-byte comparisons. The average
number of such tests, f(N), can be compared with N and used as a measure of
the speed of doing the among. One way of attempting further optimisation is
to come up with a scheme which reduces f(N) further.

Most among calls give signal f, that is, usually the end of a word is not
found in the among(...). This simple fact is not utilised in any way in the
implementation of among(...), and I've often thought that it might be.

Of course the among(...) implementation is a general procedure which misses
obvious optimisations for certain classes of string. A test of the form

    among('a','e','i','o','u')

could be done by bit-map look up etc. It is because the hand-coding of the
Porter stemmer spots these things that it runs about twice as fast as the
Snowball equivalent.

Martin



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:48 BST