[Snowball-discuss] Java stemmers

From: Richard Boulton (richard@tartarus.org)
Date: Mon Jan 21 2002 - 10:55:41 GMT


I've done a little work recently to make snowball capable of generating
Java output, rather than just C output. I want this for some projects
using Java, since I don't want to have to fiddle with native methods,
etc.

It's got to the stage now where it's producing valid and correct Java
for the English stemming algorithm, and possibly for the other
languages. I've checked the output of the English algorithm, and it
matches the output from the C version, which is a good sign. When I get
time I'll expand the build process so it can compare the output for all
the languages automatically.
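
A rough sketch of what such an automatic comparison could look like, assuming the C and Java stemmers have each written one stemmed word per line to a file. The class name, method name and command-line arguments here are hypothetical and not part of snowball:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class StemOutputCompare {

    // Returns the index of the first line that differs, or -1 if both
    // outputs are identical. If one output is a strict prefix of the
    // other, the index just past the shorter one is returned.
    static int firstMismatch(List<String> expected, List<String> actual) {
        int common = Math.min(expected.size(), actual.size());
        for (int i = 0; i < common; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i;
            }
        }
        return expected.size() == actual.size() ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: StemOutputCompare <c-output> <java-output>");
            System.exit(2);
        }
        // ISO-8859-1 maps every byte to a character, so this compares the
        // files byte for byte whatever encoding they were written in.
        List<String> fromC = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        List<String> fromJava = Files.readAllLines(Paths.get(args[1]), StandardCharsets.ISO_8859_1);

        int mismatch = firstMismatch(fromC, fromJava);
        if (mismatch < 0) {
            System.out.println("outputs match (" + fromC.size() + " lines)");
        } else {
            System.out.println("first difference at line " + (mismatch + 1));
            System.exit(1);
        }
    }
}
```

The non-zero exit status on a mismatch is only there so that a build script could run this once per language and stop at the first failure.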

The bad side of this is that the Java stemmers are much slower than the
C stemmers at the moment. For example, the English stemmer in C takes
0.2 seconds to process the test vocabulary: the Java stemmer takes 5
seconds for the same task. I'm sure that the Java stemmer performance
can be improved immensely; it's probably doing something silly such as
converting a String to a StringBuffer and back millions of times. I've
done no profiling or anything similar, as yet.
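
To illustrate the kind of thing I suspect (this is only a guess, not taken from the generated code), here is a minimal sketch contrasting a suffix replacement that copies String to StringBuffer and back on every call with one that edits a single StringBuffer in place. The class and method names are made up for the example:

```java
public class SuffixReplaceSketch {

    // Suspected slow pattern: every call allocates a new StringBuffer
    // from the String and then a new String from the StringBuffer.
    static String replaceSuffixCopying(String word, String suffix, String replacement) {
        if (!word.endsWith(suffix)) {
            return word;
        }
        StringBuffer buf = new StringBuffer(word);
        buf.replace(buf.length() - suffix.length(), buf.length(), replacement);
        return buf.toString();
    }

    // Alternative: the caller keeps one mutable buffer for the whole
    // stemming run and each step edits it in place.
    // Returns true if the suffix was present and has been replaced.
    static boolean replaceSuffixInPlace(StringBuffer word, String suffix, String replacement) {
        int start = word.length() - suffix.length();
        if (start < 0 || word.indexOf(suffix, start) != start) {
            return false;
        }
        word.replace(start, word.length(), replacement);
        return true;
    }

    public static void main(String[] args) {
        // Copying version: a fresh String comes back from each step.
        System.out.println(replaceSuffixCopying("ponies", "ies", "i"));

        // In-place version: the same buffer is reused between steps.
        StringBuffer word = new StringBuffer("ponies");
        replaceSuffixInPlace(word, "ies", "i");
        System.out.println(word);
    }
}
```

Whether this is actually where the time goes can only be settled by profiling; the sketch just shows the two shapes the code could take.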

Shall I commit the work now, anyway? I'd like to get it into revision
control reasonably soon. It won't interfere with the existing code
generator: it will simply add another few parameters to snowball to
specify the name of the output file.

-- 
Richard
