On Tue, 15 Apr 2003, Oleg Bartunov wrote:

> Did you try Lingua::Stem::Snowball ?

No...It doesn't appear to be available on CPAN. ;)

I'm trying to compile the snowball code from sources now, but the make
files appear to be 'fragile' - they are not yet compiling for me.

Hmmm...You are aware that the module can't be compiled by following the
directions provided if using the 'porter' (not 'porter2') stemmer?

Ok. I've got the 'porter2' stemmer installed as 'english'. Benching...

> It's not pure perl wrapper but uses XS, so should be much faster.
> Also, it doesn't add "additional errors" :-) I recollect Martin
> has worried about many errors in selfmade stemmers claiming they are
> Porter's stemmer.

I suspect you have underestimated both the performance of well written
Perl and the 'overhead' of the Perl-XS interface. Processing words across
the Perl-XS interface one by one is _EXPENSIVE_ in CPU time.

Not bad. Not great, but not bad. Snowball comes in about 2X faster than
the slowest mode of Lingua::Stem - but substantially slower than either
the batch or the batch+cache modes of the pure Perl Lingua::Stem.

SNOWBALL: Average random cross-sectional stem rate for 100 words: 7930.21 Hz (n=200000).
ORIGINAL: Average random cross-sectional stem rate for 100 words: 3644.98 Hz (n=200000).
BATCHED: Average random cross-sectional stem rate for 100 words: 11848.34 Hz (n=200000).
CACHED: Average random cross-sectional stem rate for 100 words: 86580.09 Hz (n=200000).

########################################################################################
#!/usr/bin/perl

use Benchmark;
use Lingua::Stem qw (:all :caching);
use Lingua::Stem::Snowball;
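
# Read the word list, one word per line, from the file(s) named on the
# command line (the earlier benchmark run used the Snowball voc.txt file)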
my @word = grep chomp, <>;

#################################################
# Preload word list so we have identical runs
my @word_list = ();
my $s = 100;
for (1..$s) {
my $result;
my $w = @word[rand(scalar(@word))];
push (@word_list,$w);
}

# Word by word using Snowball
my ($n,$pu,$ps) = (0,0,0);
foreach my $w (@word_list) {
my $result;
my $t = timeit(2000, sub { ($result) = snowball('english',$w) } );
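# A Benchmark object is an array: [1] = user CPU, [2] = system CPU,
# [5] = iteration count, so $n/($pu+$ps) below is stems per CPU second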
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
}
printf "SNOWBALL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Word by word (original benchmark)
my ($n,$pu,$ps) = (0,0,0);
foreach my $w (@word_list) {
my $result;
my $t = timeit(2000, sub { ($result) = stem($w) } );
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
}
printf "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Processed in batch instead of one by one
my ($n,$pu,$ps) = (0,0,0);
my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
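# Each timed iteration stems all $s words at once, so scale the
# iteration count by $s to keep the Hz figure comparable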
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
printf "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Processed in batch instead of one by one, with caching turned on
stem_caching({ -level => 2});
my ($n,$pu,$ps) = (0,0,0);
my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
printf "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $
--
Benjamin Franz

>
> Oleg
> On Mon, 14 Apr 2003, Benjamin Franz wrote:
>
> > I'm the maintainer of the Perl 'Lingua::Stem' module. I've released a new
> > version that 'wrappers' the Snowball based Perl stemmers (along with some
> > non-Snowball based versions) into the standardized Lingua::Stem API.
> >
> > Something I noticed today while looking for anyone who might be using the
> > Lingua::Stem Perl module was that last year it was mentioned as being a
> > poor performer here on the Snowball-Discuss list. After examining the
> > benchmark code used (see
> > <URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>)
> > I discovered the main reason it performed poorly in the tests was it was
> > being used in its absolutely lowest performing mode (one word at a time,
> > no stem caching). I thought it would be worth re-doing the benchmark using
> > its faster modes. So, here is the benchmark code redone to take full
> > advantage of Lingua::Stem's performance features:
> >
> > #!/usr/bin/perl
> >
> > use Benchmark;
> > use Lingua::Stem qw (:all :caching);
> >
> > my @word = grep chomp, <>;
> >
> > #################################################
> > # Preload word list so we have identical runs
> > my @word_list = ();
> > my $s = 100;
> > for (1..$s) {
> > my $result;
> > my $w = @word[rand(scalar(@word))];
> > push (@word_list,$w);
> > }
> >
> > # Word by word (original benchmark)
> > my ($n,$pu,$ps) = (0,0,0);
> > foreach my $w (@word_list) {
> > my $result;
> > my $t = timeit(2000, sub { ($result) = stem($w) } );
> > $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
> > }
> > printf "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > # Processed in batch instead of one by one
> > my ($n,$pu,$ps) = (0,0,0);
> > my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> > $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> > printf "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > # Processed in batch instead of one by one, with caching turned on
> > stem_caching({ -level => 2});
> > my ($n,$pu,$ps) = (0,0,0);
> > my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> > $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> > printf "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > #################################################################
> >
> > The results of running it on my home Celeron 500 based Redhat 7.3
> > Linux system with Perl 5.6.1 using the voc.txt file are as follows:
> >
> >
> > ORIGINAL: Average random cross-sectional stem rate for 100 words: 3718.16 Hz (n=200000).
> > BATCHED: Average random cross-sectional stem rate for 100 words: 13097.58 Hz (n=200000).
> > CACHED: Average random cross-sectional stem rate for 100 words: 88105.73 Hz (n=200000).
> >
> > Batching alone is about a 3.5X improvement. Adding stem caching as well
> > gives a 23.7X improvement over the one word at a time processing (and, I
> > judge, leaves even the fastest performers benchmarked last summer
> > completely in the dust by roughly a factor of 8 to 10x). Since I have
> > wrappered the non-English Snowball stemmers, they should get similar
> > performance improvements from the stem cache when used via the base
> > Lingua::Stem module in the new 0.60 release, in a mode where the
> > caching is significant.
> >
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:44 BST