Re: [Snowball-discuss] More patches

From: Olly Betts (olly@survex.com)
Date: Mon Feb 12 2007 - 05:04:49 GMT


On Mon, Feb 12, 2007 at 01:38:26AM +0000, Richard Boulton wrote:
> Olly Betts wrote:
> >This improves the shortcutting of backwards among - if there are fewer
> >characters available than the shortest string in the among, there's
> >no way it can match. It also includes a cosmetic tweak (avoiding
> >generating "z->c - 0" in the output) which makes the generated source
> >a little more readable (of course the C compiler will optimise the "- 0"
> >away anyway):
> >
> >http://oligarchy.co.uk/xapian/patches/snowball-min-length-shortcut-backwards-among.patch
>
> This is the most interesting of the bunch, but I'm not awake enough to
> consider it in detail. I'll take a closer look at this patch tomorrow.

I didn't say before but all the stemmers which have test vocabularies
behave the same way with this patch applied.

> I've applied this (in snowball revision 428), but I also needed to
> change the signature of the sort function (to accept "const void *"
> pointers in the comparison function instead of "void *" pointers) to
> avoid a compiler warning. Of course, you won't see this in Xapian
> because of the change to qsort.

Ah yes, sorry.

> >http://oligarchy.co.uk/xapian/patches/snowball-use-qsort.patch
>
> This seems a reasonable patch, but I'm very slightly concerned about
> whether qsort really is ubiquitous. The man page indicates that it
> conforms to SVr4, 4.3BSD, C99, which is pretty good. If it had C89 as
> well, I'd have no qualms. Performance isn't particularly critical in
> the compiler, anyway, for present usage.

The qsort man page I have says "ISO 9899", which is actually just "the
ISO C standard", not C99 (the "99" is coincidental) - look at the
"Revises:" lists here, and you'll see that there's a 1990 version as
well as the current 1999 revision:

http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29237&showrevision=y

The "89" in "C89" is the date of the ANSI standard (1989). ISO approved
it in 1990.

Anyway, as Martin (not Porter) says, qsort was in the 1989/1990 standard
too.

> Reducing the amount of code needed to be maintained is the main benefit
> of this, and that's a worthy goal given the amount of time we have to
> maintain snowball. I'm going to apply this patch for now - Martin, or
> anyone else, if you can think of any reason not to apply it, I can
> easily revert the change.

The merge sort is (I think) a stable sort, whereas qsort() may not be
(despite the "q" in the name, the standard doesn't require it to be
implemented as a quick sort, but does explicitly say that the order of
two elements which compare equal is unspecified). But it's only used to
sort the strings in an "among", and the snowball manual says "The 'Si'
must all be different" (and adding a quick check to the comparison
function confirms this for the current algorithms).

> (Olly - applied in snowball revision 429 - I removed the prototype of
> "sort" entirely from snowball/compiler/header.h, rather than commenting
> it out, so I've also applied this change to Xapian to avoid confusion.)

Sure.

> I suggest the following patch instead:
> http://www.tartarus.org/~richard/xapian-patches/snowball-disable-java-support2.patch
> which I've applied to snowball.
>
> Xapian's "languages/Makefile.mk" would then be modified to include
> "-DDISABLE_JAVA" on the compile line for the snowball compiler.

Works for me.

Thanks for handling these so promptly!

Cheers,
    Olly



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:49 BST