Re: [Snowball-discuss] Optimising among

From: richard@lemurconsulting.com
Date: Fri Sep 15 2006 - 00:49:44 BST


On Thu, Sep 14, 2006 at 06:57:25PM +0100, Olly Betts wrote:
> So for each among I look at the first character to be checked in each
> string. If they are all the same mod 32, I add a simple bitmap check
> which in many cases means I can give signal f before even calling
> find_among or find_among_b.

I assume you mean "all the same div 32" (not mod) here, reading the code.

This makes a lot of sense to me. I'll give Martin a couple of days to
comment, since I've not looked at this bit of the code in detail, but if he
has no objections, I'd be happy to commit this.

I wonder if the characters in the among structures used are as nicely
chunked into groups of 32 characters in other languages (particularly
Russian). Taking a quick look at the algorithms makes me think they are,
but if not, the same kind of idea might be applied with a larger bitmap, or
some such. Anyway, they probably are.

> Because the strings are typically composed of lower case unaccented
> latin letters this gets most of them. This reduces the time spent in
> find_among_b by 45% and reduces the overall runtime for running
> stemwords on the English voc.txt by 15% (the "times" are all cycle
> estimates from cachegrind since timing doesn't give stable values on
> this machine). find_among_b still dominates the profile, but down from
> ~33% to ~22% of the time.

That's quite a nice improvement, well done!

-- 
Richard



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:48 BST