Re: [Snowball-discuss] Optimising among

From: Olly Betts (olly@survex.com)
Date: Mon Sep 18 2006 - 18:35:39 BST


On Mon, Sep 18, 2006 at 01:54:42PM +0200, Martin Porter wrote:
> This is very helpful, and many thanks. We will certainly incorparate this
> extension.

I'd suggest not committing it as it stands. I've realised that the
compiler code can be simplified quite significantly - I'd not noticed
that the among strings are sorted at this point, which is actually
very handy.

I've also been looking at treating groupings in a similar way which
gives a nice extra speed boost.

> 0) In general, we must be extremely careful now not to introduce any bugs
> during codegenerator enhancement. There is nothing more demoralising than
> compiler bugs, and now that we are beginning to receive Snowball submissions
> from 3rd parties we must be most careful to avoid them. Interestingly,
> although the earlier submission had a bug in the "z->c == z->l" test
> (missing the possibility of a null string in the among(...)
        
Actually, the first patch didn't attempt to handle any among which
contains a null string (the relevant test is:
    
    if (among_cases[0].size) {

But I agree with the general point that it would be bad to introduce bugs in
the generated code.

> the more
> elaborate 2nd submission is bug free -- as far as I can tell. But we must do
> all the testing we can. (I have a test suite anyway, so does Richard.)

I've been testing on the sample vocabularies so far, though these don't
offer perfect coverage (e.g. Xapian's testsuite checks that random binary
strings don't crash or hang the stemmers). It's possible the current
algorithms don't exercise "among" in all the possible ways either.

If you have more comprehensive testsuite it would be useful to have it
(and maybe hook it up so that "make check" in the source tree verifies
that the snowball compiler has been successfully built).

> 1) A pity in some ways that Richard suggested special code for fewer than 3
> strings in an among, since that must be rare surely?

The English stemmer has 2 amongs with only 2 strings, but it's not just
fewer than 3 strings - it's fewer than 3 different initial (or final for
backwards among) characters. The English stemmer has two examples of
this (both backwards):

            [substring] among (
                '{'}' '{'}s' '{'}s{'}'
                       (delete)
            )

and:

        [substring] atlimit among(
            'inning' 'outing' 'canning' 'herring' 'earring'
            'proceed' 'exceed' 'succeed'

            // ... extensions possible here ...

        )

> No strings is forbidden. For 1 string, among('a' C) is the same as 'a' C,
> and so not useful.

OK - it wasn't clear to me if these corner cases were useful or even valid so
rather than worry about it, I just ensured they still worked if they worked
before.

> For 2 strings, among('a1' (C1) 'a2' (C2)) might happen, but I
> suspect is rare, and in any case one doubts if the double test on a
> character is faster than the bitmap approach (see point 4).

I can check, but I believe this did give a measurable speedup.

> 2) Perhaps ~I3 (at one point in the code) should be ~I3L, since we are up to
> 32 bits, or is that pedantic?

For my own use 16 bit int isn't a concern, but there may be niches where it's
still in use. However, 64 bit long is increasingly common too and it would be
a shame if forcing use of long slowed things down. I'll investigate - if need
be we can have a macro to allow the appropriate 32 bit type to be specified at
C compile time, defaulting to long if not specified.

> 3) The case that doesn't work so well is a string-forward among(...) in utf8
> mode when the strings being tested have leading characters > 128. Then the
> first byte of the utf8 characters will often be the same, and the
> optimisation doesn't pull out any exceptions (utf-8 cyrillic for example).

There are no forward amongs in the Russian stemmer, but all the cyrillic
characters used start 0xd0 or 0xd1 in utf-8. If there were a forward among and
all the cases start wih cyrillic letters in one half of the cyrillic alphabet
then it would still be a useful optimisation (if you ignore the direction,
one of the amongs in the Russian stemmer would allow this).

> It did occur to me that perhaps only the string-backward case is worth
> optimising.

For English at least, the improvement was measurable.

> 4) I think we should dispense with the 'block' idea. The bitmap test is a
> way if discarding unwanted cases, and if a match is found with the bitmap on
> a character that does not begin an among string, the among function then
> finds it can be discarded. This double test will happen only rarely, and you
> save one test in the optimisation, as well as generating the optimisation
> for all amongs, rather than the subset where leading characters of strings
> are in the same block.

It does allow us to avoid the bitmap test, which requires loading a constant
and may be relatively slower (depending on the constant and the architecture).
Certainly worth investigating though. We could certainly discard the block
check (or widen it) for cases which we currently can't cover.

Cheers,
    Olly



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:48 BST