Re: [Snowball-discuss] Lingua::Stem

snowhare@nihongo.org

-- 
Benjamin Franz
> 
> 	Oleg
> On Mon, 14 Apr 2003, Benjamin Franz wrote:
> 
> > I'm the maintainer of the Perl 'Lingua::Stem' module. I've released a new
> > version that 'wrappers' the Snowball based Perl stemmers (along with some
> > non-Snowball based versions) into the standardized Lingua::Stem API.
> >
> > Something I noticed today while looking for anyone who might be using the
> > Lingua::Stem Perl module was that last year it was mentioned as being a
> > poor performer here on the Snowball-Discuss list. After examining the
> > benchmark code used (see
> > <URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>)
> > I discovered the main reason it performed poorly in the tests was it was
> > being used in its absolutely lowest performing mode (one word at a time,
> > no stem caching). I thought it would be worth re-doing the benchmark using
> > its faster modes. So, here is the benchmark code redone to take full
> > advantage of Lingua::Stem's performance features:
> >
> > #!/usr/bin/perl
> >
> > use Benchmark;
> > use Lingua::Stem qw (:all :caching);
> >
> > # Read the word list (one word per line), stripping line endings
> > chomp(my @word = <>);
> >
> > #################################################
> > # Preload word list so we have identical runs
> > my @word_list = ();
> > my $s = 100;
> > for (1..$s) {
> >   # Pick one word at random (scalar element, not a one-element slice)
> >   my $w = $word[rand @word];
> >   push (@word_list,$w);
> > }
> >
> > # Word by word (original benchmark)
> > my ($n,$pu,$ps) = (0,0,0);
> > foreach my $w (@word_list) {
> >   my $result;
> >   my $t = timeit(2000, sub { ($result) = stem($w) } );
> >   # Benchmark object fields: [1] user CPU, [2] system CPU, [5] iterations
> >   $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
> > }
> > printf  "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > # Processed in batch instead of one by one
> > ($n,$pu,$ps) = (0,0,0);   # reset the counters declared above
> > my $t = timeit(2000, sub { my ($result) = stem(@word_list) } );
> > $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> > printf  "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > # Processed in batch instead of one by one, with caching turned on
> > stem_caching({ -level => 2});
> > ($n,$pu,$ps) = (0,0,0);   # reset the counters
> > $t = timeit(2000, sub { my ($result) = stem(@word_list) } );
> > $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> > printf  "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
> >
> > #################################################################
> >
> > The results of running it on my home Celeron 500 based Redhat 7.3
> > Linux system with Perl 5.6.1 using the voc.txt file are as follows:
> >
> >
> > ORIGINAL: Average random cross-sectional stem rate for 100 words:  3718.16 Hz (n=200000).
> > BATCHED:  Average random cross-sectional stem rate for 100 words: 13097.58 Hz (n=200000).
> > CACHED:   Average random cross-sectional stem rate for 100 words: 88105.73 Hz (n=200000).
> >
> > Batching alone is about a 3.5x improvement. Adding stem caching as well
> > gives a 23.7x improvement over one-word-at-a-time processing (and, I
> > judge, leaves even the fastest performers benchmarked last summer
> > completely in the dust by roughly a factor of 8 to 10). Since I have
> > wrapped the non-English Snowball stemmers, they should get similar
> > performance improvements from the stem cache when used via the base
> > Lingua::Stem modules in the new 0.60 release, provided they are run in
> > a mode where caching is significant.
> >
> >
> 
> 	Regards,
> 		Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
> 
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
> 
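For anyone wondering why the CACHED figure quoted above is so much
higher: the benchmark stems the same 100-word list 2000 times, so after
the first pass nearly every word can be answered from a cache instead of
being stemmed again. Here is a minimal sketch of that idea, assuming a
plain hash keyed on the input word. The names toy_stem and
cached_stem_batch are made up for illustration, and the toy stemmer just
strips a trailing "s"; this is not how Lingua::Stem is implemented
internally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a real stemmer: lowercases and strips one
# trailing "s". It only exists to give the cache something to wrap.
my $calls = 0;
sub toy_stem {
    my ($word) = @_;
    $calls++;                       # count how often real work is done
    my $stem = lc $word;
    $stem =~ s/s\z//;
    return $stem;
}

# Memoizing batch wrapper: each distinct word is stemmed once, and every
# later occurrence is served from %stem_cache.
my %stem_cache;
sub cached_stem_batch {
    my @stems;
    for my $word (@_) {
        $stem_cache{$word} = toy_stem($word)
            unless exists $stem_cache{$word};
        push @stems, $stem_cache{$word};
    }
    return \@stems;                 # array reference, one stem per input
}

my @word_list = qw(cats dogs cats birds dogs cats);
my $first  = cached_stem_batch(@word_list);   # 3 distinct words stemmed
my $second = cached_stem_batch(@word_list);   # served entirely from cache

print "stems: @$first\n";
print "toy_stem calls: $calls\n";
```

Twelve words go through the wrapper but toy_stem runs only three times,
once per distinct word. A cache of this kind pays off when the input
repeats words heavily and helps little when most words are unique.
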
-- 
Jerry
"If the code and the comments disagree, then both are probably wrong."
                                        -- Norm Schryer, Bell Labs