[Snowball-discuss] Re: Benchmarking Lingua::Stem

From: Allan Fields (afieldsml@idirect.ca)
Date: Tue Apr 15 2003 - 03:14:01 BST


Hi,
On Mon, Apr 14, 2003 at 09:04:19AM -0700, Benjamin Franz wrote:
> On Mon, 14 Apr 2003, Benjamin Franz wrote:
>
> > While doing a little Googling to find out if/how people are using Perl
> > modules I've written, I ran across the Snowball-discuss list benchmarking
> > of Lingua::Stem last year
> > <URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>
> >
> > I feel that your criticism of the performance of Lingua::Stem is
> > misplaced. Your benchmark used it in its _lowest_ performance mode (one
> > word at a time, caching disabled). If you process _all_ the words in one
> > pass you will find its performance _exceeds_ its competition - by quite a
> > large margin. That is _without_ even turning on the stem caching system
> > (which can multiply the performance several times on large stemming
> > operations).

Yes, that's true.

I guess this should probably be followed up on the list as a point of
clarification. When I did those benchmarks, I assumed there would be only
one word per call and many calls of the subroutine, so it followed that
subroutine call overhead was as much an issue as the implementation of the
algorithm itself. I remember looking at the batch features, but I decided
to benchmark the module the way I expected to call it. In hindsight, that
probably wasn't the fairest benchmarking strategy (and it was a quick
benchmark).

As for my other point: although your stemmer was one of the most featureful,
I just hadn't understood why some of the program structure used symbolic
subroutine references. I'm not against TMTOWTDI, but it seemed to me that
this could be a performance issue. That's of little practical importance if
the speed doesn't suffer, though, and if performance is in fact improved by
this design (call overhead aside), then my benchmark was misleading, and so
was the conclusion I drew about the code from it.
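
To make that concern concrete, here's a toy benchmark of the kind of
call-style difference I was worried about. It's illustrative only and
says nothing about Lingua::Stem's actual internals:

  #!/usr/bin/perl
  # Toy comparison of call styles; not Lingua::Stem's internals.
  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  sub noop { return $_[0] }

  my $name = 'main::noop';  # dispatch by name (symbolic reference)
  my $ref  = \&noop;        # hard code ref, resolved once

  cmpthese(-2, {
      direct   => sub { noop('word') },
      coderef  => sub { $ref->('word') },
      symbolic => sub { no strict 'refs'; &{$name}('word') },
  });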

Since last year I've moved on to just using Snowball-generated stemmers, as
suggested on the list, although I'm tempted to revisit my revised Perl code,
which avoided Perl overhead wherever possible; I managed to get at least
5-10 times the performance by avoiding lexicals and the like. That was my
point: that Perl itself can become the barrier to an efficient stemming
implementation.
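
For what it's worth, the kind of micro-optimization I mean looks something
like the sketch below (a toy suffix rule, not a real stemming algorithm):
one aliased in-place pass over the list instead of copying each word into
a lexical inside a per-word sub.

  #!/usr/bin/perl
  # Toy illustration of per-word sub calls with lexical copies
  # versus a single in-place pass. Not a real stemmer.
  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  my @words = ('running') x 1000;

  sub stem_one {
      my $word = shift;     # lexical copy on every call
      $word =~ s/ing$//;
      return $word;
  }

  cmpthese(-2, {
      per_word => sub { my @s = map { stem_one($_) } @words },
      in_place => sub {
          my @s = @words;
          s/ing$// for @s;  # aliased $_: no sub call, no copy
      },
  });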

There's not much anyone can do about that, except hope that Perl 6/Parrot
offers some performance gains. Any algorithm that is called repeatedly in
Perl should either batch and cache as your module does (both are great ideas
when used properly) or minimize function call overhead and use optimal
regular expressions to glean the most performance (my other approach);
otherwise it might be best to use an XS-based module. That's not to say I'm
not a fan of Perl code.
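
The caching idea itself fits in a few lines of plain Perl. This is a
minimal sketch, assuming some underlying stem_word() function (a name I've
made up here); Lingua::Stem's own stem_caching() is of course more
sophisticated:

  # Minimal memoizing wrapper; stem_word() is a hypothetical
  # single-word stemmer standing in for any real implementation.
  my %stem_cache;

  sub cached_stem {
      my @stems;
      for my $w (@_) {
          $stem_cache{$w} = stem_word($w)
              unless exists $stem_cache{$w};
          push @stems, $stem_cache{$w};
      }
      return \@stems;
  }

On real text, where a small set of words dominates, the cache hit rate is
high enough that the hash lookup wins easily.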

Looking at your recent post to the list, that generalized interface looks
like a good step in the right direction: it would avoid the multiplicity of
stemmer implementations, each with its own specific interface and possibly
its own caching scheme.
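
I imagine it working something like the sketch below, though the names here
are mine, not the ones from your proposal: a thin object that hides the
backend stemmer and its cache behind a single stem() call.

  package Stem::Any;
  use strict;
  use warnings;

  sub new {
      my ($class, %args) = @_;
      # backend: a code ref that stems one word at a time
      return bless { backend => $args{backend}, cache => {} }, $class;
  }

  sub stem {
      my ($self, @words) = @_;
      my $c = $self->{cache};
      return [ map {
          exists $c->{$_} ? $c->{$_}
                          : ($c->{$_} = $self->{backend}->($_))
      } @words ];
  }

  1;

  # Usage, with any single-word stemmer as the backend:
  #   my $stemmer = Stem::Any->new(backend => \&stem_word);
  #   my $stems   = $stemmer->stem(@words);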

In the end, all this performance checking is probably an afterthought;
these days PCs are nice and cheap.

> [snip]
>
> I forgot to include the version using stem caching. So here it is:
>
> ORIGINAL: Average random cross-sectional stem rate for 100 words: 3753.75 Hz (n=200000).
> BATCHED: Average random cross-sectional stem rate for 100 words: 12953.37 Hz (n=200000).
> CACHED: Average random cross-sectional stem rate for 100 words: 89285.71 Hz (n=200000).

Looks good.

> #!/usr/bin/perl
>
> use Benchmark;
> use Lingua::Stem qw (:all :caching);
>
> # Read the word list, one word per line, from STDIN or a file argument
> my @word = grep chomp, <>;
>
> #################################################
> # Preload word list so we have identical runs
> my @word_list = ();
> my $s = 100;
> for (1..$s) {
> my $w = $word[rand @word];  # pick a random word from the list
> push (@word_list,$w);
> }
>
> # Word by word (original benchmark)
> my ($n,$pu,$ps) = (0,0,0);
> foreach my $w (@word_list) {
> my $result;
> my $t = timeit(2000, sub { ($result) = stem($w) } );
> $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
> }
> printf "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> # Processed in batch instead of one by one
> my ($n,$pu,$ps) = (0,0,0);
> my $t = timeit(2000, sub { my ($result) = stem(@word_list) } );
> $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> printf "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> # Processed in batch instead of one by one, with caching turned on
> stem_caching({ -level => 2});
> my ($n,$pu,$ps) = (0,0,0);
> my $t = timeit(2000, sub { my ($result) = stem(@word_list) } );
> $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> printf "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> --
> Jerry

Thanks for your clarification on this,
Allan


