[Snowball-discuss] Update on regex approach

From: Allan Fields (afieldsml@idirect.ca)
Date: Wed May 08 2002 - 17:56:11 BST


Hi,

Sorry I haven't dropped by for a while; I've been quite busy. I'll try to get
my updated Perl stemmer out within the next month. More benchmarking to
come. =) The biggest issue is the overhead of processing multiple words --
Perl can be a real beastie performance-wise, as I've witnessed.

My other attempt to speed up the Perl stemmer, which I've also been working
on, is stuck on a few technical details of computing the measure of a word.
One idea I've had is to separate finding the measure from the main transform
stage: derive the measure from a reduced set representation, and do the
transform with a single regular-expression substitution with supporting
inline logic (s///e). The biggest issue with this approach is that at
different points it is necessary to look behind to see whether the new
measure has changed or is past a minimal boundary. If there were a way to use
integers to represent the logic of the {c, v, C, V} sequences, it might
significantly speed up that stage by making the operations integer operations
instead. I would consider this more optimal in that, by accepting larger
memory usage (still paltry on today's computers), it would be possible to
conserve processor time.
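
To make the integer idea a bit more concrete, here is a rough sketch (in
Python rather than Perl, purely for brevity). It assumes the usual Porter
definitions: a, e, i, o, u are vowels, y is a vowel only after a consonant,
and the measure m is the number of VC pairs in [C](VC)^m[V]. The function
names are made up for illustration and are not taken from any existing
stemmer. One version counts on the reduced c/v string with a regex; the other
packs the same information into an integer (one bit per letter) and counts
vowel-to-consonant transitions with shifts and masks.

```python
import re

VOWELS = set("aeiou")

def cv_pattern(word):
    """Reduce a word to a string of 'c'/'v' marks, one per letter."""
    marks = []
    for i, ch in enumerate(word):
        if ch in VOWELS:
            marks.append("v")
        elif ch == "y" and i > 0 and marks[-1] == "c":
            # 'y' counts as a vowel only when preceded by a consonant
            marks.append("v")
        else:
            marks.append("c")
    return "".join(marks)

def measure_regex(word):
    """Measure via the reduced representation: count v+c+ runs."""
    return len(re.findall(r"v+c+", cv_pattern(word)))

def vowel_bits(word):
    """Pack the c/v marks into an int: first letter is the highest bit,
    1 = vowel, 0 = consonant."""
    bits = 0
    for mark in cv_pattern(word):
        bits = (bits << 1) | (mark == "v")
    return bits

def measure_bits(word):
    """Measure via integer operations only.

    (bits >> 1) lines each letter up with its predecessor, so a set bit in
    (bits >> 1) & ~bits marks a consonant directly preceded by a vowel,
    i.e. the end of one V run and the start of one C run.
    """
    bits = vowel_bits(word)
    return bin((bits >> 1) & ~bits).count("1")
```

Both functions should agree on any input; the bit version only touches
integers once the word has been packed, which is the stage where the savings
would have to come from. Whether that actually wins in Perl is something only
benchmarking can answer.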

Also, by inlining all the logic into a single substitution, Perl's larger
overhead should be reduced somewhat. I'm not sure it would compare to the C
version, but I'm postulating it will be significantly faster than most other
approaches in Perl. (It won't read as algorithmically, though, since lots of
the procedural elements move into the regex itself.)
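
As an illustration of the "single substitution with inline logic" shape, here
is a minimal sketch, again in Python, where re.sub with a callback plays the
role of Perl's s///e. The rule table is only a small illustrative subset of
Porter's (m > 0) suffix rules, and the names RULES, SUFFIX_RE and apply_rules
are hypothetical; this is not the actual stemmer.

```python
import re

def measure(stem):
    """Porter measure: number of vowel-run/consonant-run pairs in the stem."""
    marks = []
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and marks[-1] == "c")
        marks.append("v" if is_vowel else "c")
    return len(re.findall(r"v+c+", "".join(marks)))

# Illustrative subset of (m > 0) suffix rules; not the full rule set.
RULES = {
    "ational": "ate",
    "tional": "tion",
    "izer": "ize",
    "fulness": "ful",
}

# One regex for all suffixes. The lazy stem group makes the match pick the
# shortest stem, and therefore the longest matching suffix.
SUFFIX_RE = re.compile(r"^(.*?)(" + "|".join(RULES) + r")$")

def apply_rules(word):
    """Apply the rule table in a single substitution pass."""
    def repl(match):
        stem, suffix = match.group(1), match.group(2)
        # The inline logic: rewrite only if the stem's measure is positive.
        if measure(stem) > 0:
            return stem + RULES[suffix]
        return match.group(0)  # condition failed, leave the word alone
    return SUFFIX_RE.sub(repl, word)
```

For example, apply_rules("relational") gives "relate", while "rational" is
left alone because the stem "r" has measure 0. The measure is still
recomputed from the stem inside the callback here, which is exactly the
look-behind cost described above.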

This has led me to believe that it may be possible to create a Snowball
compiler that generates stemmers using Perl regexes at most and, at the
least, using sed, for instance. There are lots of options for Snowball
compilation currently, but it would have a special geek appeal to make this
in sed. Someone, please do beat me to it! ;)

Allan

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss


