Re: [Snowball-discuss] Lingua::Stem

From: Benjamin Franz (snowhare@nihongo.org)
Date: Tue Apr 15 2003 - 04:31:01 BST


On Tue, 15 Apr 2003, Oleg Bartunov wrote:

> Did you try Lingua::Stem::Snowball ?

No...It doesn't appear to be available on CPAN. ;)

I'm trying to compile the snowball code from sources now, but the
makefiles appear to be 'fragile' - the build is not yet working for me.
Hmmm...You are aware that the module can't be compiled by following the
directions provided if using the 'porter' (not 'porter2') stemmer?

Ok. I've got the 'porter2' stemmer installed as 'english'. Benching...

> It's not pure perl wrapper but uses XS, so should be much faster.
> Also, it doesn't add "additional errors" :-) I recollect Martin
> has worried about many errors in selfmade stemmers claiming they are
> Porter's stemmer.

Not bad. Not great, but not bad. Snowball comes in about 2X faster than
the slowest mode of Lingua::Stem - but substantially slower than either
the batch or the batch+cache mode of the pure Perl Lingua::Stem.

SNOWBALL: Average random cross-sectional stem rate for 100 words: 7930.21 Hz (n=200000).
ORIGINAL: Average random cross-sectional stem rate for 100 words: 3644.98 Hz (n=200000).
BATCHED: Average random cross-sectional stem rate for 100 words: 11848.34 Hz (n=200000).
CACHED: Average random cross-sectional stem rate for 100 words: 86580.09 Hz (n=200000).
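
The ratios behind that comparison can be checked with a few lines of
Perl. This is only a minimal sketch that redoes the arithmetic on the
four rates printed above; the %rate hash and the speedup() helper are
illustrative names, not part of either module. It puts Snowball at
roughly 2.2X the one-word-at-a-time rate and the cached batch mode at
roughly 11X Snowball.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rates in Hz, copied from the benchmark output above.
my %rate = (
    SNOWBALL => 7930.21,
    ORIGINAL => 3644.98,
    BATCHED  => 11848.34,
    CACHED   => 86580.09,
);

# Ratio of one mode's rate to another's (illustrative helper).
sub speedup {
    my ($mode, $baseline) = @_;
    return $rate{$mode} / $rate{$baseline};
}

# Print every mode, slowest first, relative to both baselines.
for my $mode (sort { $rate{$a} <=> $rate{$b} } keys %rate) {
    printf "%-8s %9.2f Hz  %5.2fx ORIGINAL  %5.2fx SNOWBALL\n",
        $mode, $rate{$mode},
        speedup($mode, 'ORIGINAL'), speedup($mode, 'SNOWBALL');
}
```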

I suspect you have underestimated both the performance of well written
Perl and the 'overhead' of the Perl-XS interface. Processing words across
the Perl-XS interface one by one is _EXPENSIVE_ in CPU time.
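
The shape of that cost can be sketched without either module installed.
The following is a rough illustration only, under loud assumptions:
toy_stem_one(), toy_stem_batch() and toy_stem_cached() are hypothetical
pure-Perl stand-ins (they just strip a trailing "s"), so this times plain
Perl sub-call overhead, not the real Perl-XS boundary, and the numbers
will differ from the real stemmers. It only shows the three calling
patterns being compared: one call per word, one call per batch, and one
call per batch with a stem cache in front.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Hypothetical stand-in stemmer: lowercases and strips one trailing "s".
# It is not Lingua::Stem or Snowball; it only gives each call some work.
sub toy_stem_one {
    my ($word) = @_;
    my $stem = lc $word;
    $stem =~ s/s\z//;
    return $stem;
}

# Batch variant: a single call stems the whole list and returns an array ref.
sub toy_stem_batch {
    my @stems;
    for my $word (@_) {
        my $stem = lc $word;
        $stem =~ s/s\z//;
        push @stems, $stem;
    }
    return \@stems;
}

# Cached batch variant: each distinct word is stemmed once, then looked up.
my %stem_cache;
sub toy_stem_cached {
    my @stems;
    for my $word (@_) {
        $stem_cache{$word} = toy_stem_one($word)
            unless exists $stem_cache{$word};
        push @stems, $stem_cache{$word};
    }
    return \@stems;
}

# 100 words with many repeats, so the cache has something to hit.
my @words = (qw(cats dogs runs trees stems)) x 20;
my $iterations = 200;

my %timing = (
    'one call per word' => timeit($iterations, sub { my @r = map { toy_stem_one($_) } @words }),
    'one batched call'  => timeit($iterations, sub { my $r = toy_stem_batch(@words) }),
    'batched + cached'  => timeit($iterations, sub { my $r = toy_stem_cached(@words) }),
);

print "$_: ", timestr($timing{$_}), "\n" for sort keys %timing;
```

With a workload this small Benchmark may warn about too few iterations;
raise $iterations if the timings look noisy.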

########################################################################################

#!/usr/bin/perl

use Benchmark;
use Lingua::Stem qw (:all :caching);
use Lingua::Stem::Snowball;

my @word = grep chomp, <>;

#################################################
# Preload word list so we have identical runs
my @word_list = ();
my $s = 100;
for (1..$s) {
  my $result;
  my $w = $word[rand(scalar(@word))];
  push (@word_list,$w);
}

# Word by word using Snowball
my ($n,$pu,$ps) = (0,0,0);

foreach my $w (@word_list) {
  my $result;
  my $t = timeit(2000, sub { ($result) = snowball('english',$w) } );
  $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
}
printf "SNOWBALL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Word by word (original benchmark)
($n,$pu,$ps) = (0,0,0);
foreach my $w (@word_list) {
  my $result;
  my $t = timeit(2000, sub { ($result) = stem($w) } );
  $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
}
printf "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Processed in batch instead of one by one
($n,$pu,$ps) = (0,0,0);
my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
printf "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Processed in batch instead of one by one, with caching turned on
stem_caching({ -level => 2});
($n,$pu,$ps) = (0,0,0);
$t = timeit(2000, sub { ($result) = stem(@word_list) } );
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
printf "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

-- 
Benjamin Franz

>
>     Oleg
> On Mon, 14 Apr 2003, Benjamin Franz wrote:
>
> > I'm the maintainer of the Perl 'Lingua::Stem' module. I've released a new
> > version that 'wrappers' the Snowball based Perl stemmers (along with some
> > non-Snowball based versions) into the standardized Lingua::Stem API.
> >
> > Something I noticed today while looking for anyone who might be using the
> > Lingua::Stem Perl module was that last year it was mentioned as being a
> > poor performer here on the Snowball-Discuss list. After examining the
> > benchmark code used (see
> > <URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>)
> > I discovered the main reason it performed poorly in the tests was it was
> > being used in its absolutely lowest performing mode (one word at a time,
> > no stem caching). I thought it would be worth re-doing the benchmark using
> > its faster modes. So, here is the benchmark code redone to take full
> > advantage of Lingua::Stem's performance features:
> >
> > #!/usr/bin/perl
> >
> > use Benchmark;
> > use Lingua::Stem qw (:all :caching);
> >
> > my @word = grep chomp, <>;
> >
> > #################################################
> > # Preload word list so we have identical runs
> > my @word_list = ();
> > my $s = 100;
> > for (1..$s) {
> >   my $result;
> >   my $w = @word[rand(scalar(@word))];
> >   push (@word_list,$w);
> > }
> >
> > # Word by word (original benchmark)
> > my ($n,$pu,$ps) = (0,0,0);
> > foreach my $w (@word_list) {
> >   my $result;
> >   my $t = timeit(2000, sub { ($result) = stem($w) } );
> >   $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
> > }
> > printf "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > # Processed in batch instead of one by one
> > my ($n,$pu,$ps) = (0,0,0);
> > my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> > $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> > printf "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > # Processed in batch instead of one by one, with caching turned on
> > stem_caching({ -level => 2});
> > my ($n,$pu,$ps) = (0,0,0);
> > my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> > $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> > printf "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > #################################################################
> >
> > The results of running it on my home Celeron 500 based Redhat 7.3
> > Linux system with Perl 5.6.1 using the voc.txt file are as follows:
> >
> >
> > ORIGINAL: Average random cross-sectional stem rate for 100 words: 3718.16 Hz (n=200000).
> > BATCHED: Average random cross-sectional stem rate for 100 words: 13097.58 Hz (n=200000).
> > CACHED: Average random cross-sectional stem rate for 100 words: 88105.73 Hz (n=200000).
> >
> > Batching alone is about 3.5X improvement. Adding stem caching as well
> > gives a 23.7X improvement over the one word at a time processing (and, I
> > judge, leaves even the fastest performers benchmarked last summer
> > completely in the dust by roughly a factor of 8 to 10x). Since I have
> > wrappered the non-English Snowball stemmers, they should get similar
> > performance improvements from the stem cache when used via the base
> > Lingua::Stem modules in the new 0.60 release when used in a mode where the
> > caching is significant.
> >
> >
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>

-- Jerry

"If the code and the comments disagree, then both are probably wrong." -- Norm Schryer, Bell Labs



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:44 BST