Re: [Snowball-discuss] Optimising among

From: Olly Betts (olly@survex.com)
Date: Fri Sep 15 2006 - 05:00:47 BST


On Fri, Sep 15, 2006 at 02:05:54AM +0100, Olly Betts wrote:
> It might also be worth special casing when we have 2 or fewer candidates
> since checking "z->p[z->c] == 'a' || z->p[z->v] == 'b'" is likely to be
> faster than the bit shifting and and-ing. And this would also allow us to
> add a check for the among with "'s'", "'s" or "'" in the English stemmer
> which current fails to fit in a 32 character block.
>
> The other case I don't deal with in the English stemmer is one which
> includes an empty string - this just needs a slightly different variant
> of the bitmap check (which I'm just trying to implement now).

Implementing all the above takes the reduction in running time for
the English stemmer to 46%!

That probably means that the snowball version is now comparable to the
hand-coded C (Martin said the difference was about a factor of 2).

I suspect not all languages will benefit as much - for English I'm now
adding shortcut checks for every among, but I'm not able to for every
language.

http://www.oligarchy.co.uk/xapian/patches/snowball-among-speedup-2.patch

I'd be interested to hear how this benchmarks when measuring real time
rather than running under cachegrind.

> Looking at what doesn't get encoded for other languages, it's pretty
> common to have all lower case letters and one other character (typically
> an upper case letter, accented character, or an apostrophe). Perhaps
> that's worth checking for too.

There are 58 amongs which are still not handled of which 31 could be
with a bitmap plus a single additional value check (5 of these can
also be done with 3 compares, which is probably more efficient).

Incidentally, I checked and for the current algorithms not a single
extra case could be handled by allowing the bitmap to cover a
non-aligned 32 value range instead of an aligned one.

Cheers,
    Olly



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:48 BST