Re: [Snowball-discuss] New, and a couple of questions

From: Richard Boulton (richard@tartarus.org)
Date: Wed Mar 10 2004 - 13:50:01 GMT


Martin Porter wrote:
> There is no fast way of discovering if a word has been stemmed. You could
> set a flag in the various functions of q/utilities.c that alter z->p, but
> this is not a general solution, since Snowball can use auxiliary strings
> that may be altered while the main string remains unaltered - although none
> of the current stemmers would do that. So you have to use strcmp or equivalent.

As Martin says, you'll have to compare the strings returned. However,
if you're worried about speed, don't use strcmp - you can write your own
comparison routine which is faster for this case. In particular, you
have the two lengths, so the first step is clearly to compare them - if
they differ, the stemmed form is different from the original.

Also, if a stemming operation has occurred, it will typically change the
end of the word rather than the beginning - so compare the strings
starting at the end.

The worst case is still the one where stemming hasn't occurred, since
you then have to compare the whole string. However, you can probably
bring the case where stemming _has_ occurred down to an average of
around 2 comparisons.
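
To make that concrete, here is a minimal sketch of such a comparison
routine. The name stem_changed and its signature are made up for
illustration and are not part of Snowball; it assumes you already hold
both strings together with their lengths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Snowball): returns 1 if the stemmed
 * form differs from the original word, 0 if the two are identical.
 * Both strings carry explicit lengths, so no NUL terminator is needed. */
static int stem_changed(const char *orig, size_t orig_len,
                        const char *stem, size_t stem_len)
{
    size_t i;

    /* Assumption: callers always pass valid pointers. */
    assert(orig != NULL && stem != NULL);

    /* Cheapest test first: different lengths mean different strings. */
    if (orig_len != stem_len)
        return 1;

    /* Stemming usually alters the suffix, so scan from the last
     * character backwards and stop at the first mismatch. */
    for (i = orig_len; i > 0; i--) {
        if (orig[i - 1] != stem[i - 1])
            return 1;
    }

    /* Every character matched: the word was not changed. */
    return 0;
}
```

A call such as stem_changed(word, word_len, stemmed, stemmed_len)
returns after the length test whenever stemming shortened the word; the
backwards loop only runs when the lengths happen to be equal.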

...
> But my machine is fairly slow: on Richard
> Boulton's machine it would take a tenth of that time.

Actually, the original test takes 17 seconds on my machine. (Unless I
turn optimising on, in which case the compiler notices that strcmp is a
pure function (i.e., has no side effects) and that I'm ignoring the
return value, so it doesn't bother to call it at all, and the test takes
0.1 seconds.)
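
One way to stop the optimiser discarding the calls in a timing test is
to make the result matter. The sketch below is only an illustration:
count_unchanged is a hypothetical function, not the original test
program, and it assumes two parallel arrays of NUL-terminated strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical timing-loop body: counts how many words are identical to
 * their stemmed forms. Because the strcmp results feed into the returned
 * count, the compiler cannot simply drop the comparisons as dead code. */
static unsigned long count_unchanged(const char *const *orig,
                                     const char *const *stem,
                                     size_t n)
{
    unsigned long same = 0;
    size_t i;

    /* Assumption: callers always pass valid arrays. */
    assert(orig != NULL && stem != NULL);

    for (i = 0; i < n; i++) {
        if (strcmp(orig[i], stem[i]) == 0)
            same++;
    }
    return same;
}
```

The caller should then use the returned count, for example by printing
it, so that the whole loop stays live under optimisation.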


