[Snowball-discuss] Re: Possible memory leak in Snowballs Java stemmer (Richard Boulton)

From: Chris Cleveland (ccleveland@dieselpoint.com)
Date: Thu Jun 03 2004 - 23:57:02 BST


Richard,

I missed the original message about Java and memory problems, but I've done a fair amount of thinking about how the Java code could be re-architected. Here are the difficulties with the current system:

1. Multithreading. In a multithreaded app, like a web app, you have to create a new instance of a stemmer for each thread. This generates garbage for each new thread. The reason is that SnowballProgram.java keeps its working state in instance variables, so two simultaneous calls to stem() on the same instance will interfere with each other. Declaring stem() to be synchronized solves the threading problem, but it kills performance.

2. Reflection. Among.java relies upon reflection to select a stemmer. Reflection is slow and causes big problems for obfuscators.

3. The relationship between Among, SnowballProgram, and the individual stemmers is complicated.

A better approach would be to eliminate Among entirely. Don't use Class.forName() at all. Just put the code that is common to all stemmers in SnowballProgram, and have each stemmer inherit from it.
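
To make the inheritance idea concrete, here is a minimal sketch. All names in it (BaseStemmer, ToyPluralStemmer, endsWith) are hypothetical stand-ins, not the Snowball-generated classes, and the stemming rule is a toy. It only shows shared code living in a base class and a stemmer being constructed and called directly, with no reflection involved.

```java
// Hypothetical names throughout; this is not the Snowball-generated code.
abstract class BaseStemmer {
    // Shared helper: does buf[0..len) end with the given suffix?
    protected static boolean endsWith(char[] buf, int len, String suffix) {
        int n = suffix.length();
        if (n > len) return false;
        for (int i = 0; i < n; i++) {
            if (buf[len - n + i] != suffix.charAt(i)) return false;
        }
        return true;
    }

    // Stems buf[0..len) in place and returns the new length.
    abstract int stem(char[] buf, int len);
}

class ToyPluralStemmer extends BaseStemmer {
    @Override
    int stem(char[] buf, int len) {
        // Toy rule for illustration only: strip a trailing "s" (but not "ss").
        if (len > 2 && endsWith(buf, len, "s") && !endsWith(buf, len, "ss")) {
            return len - 1;
        }
        return len;
    }
}

class DirectDispatchDemo {
    public static void main(String[] args) {
        // Ordinary construction and a virtual call: no Class.forName(), no Method.invoke().
        BaseStemmer stemmer = new ToyPluralStemmer();
        char[] word = "stemmers".toCharArray();
        int len = stemmer.stem(word, word.length);
        System.out.println(new String(word, 0, len));
    }
}
```

Because the concrete class is named in the source, an obfuscator can rename it freely, which is not the case when a class or method is looked up by its string name at run time.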

If you modify the stem() method so that it doesn't access any variables defined outside the method itself, then the multithreading problem will go away.
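
As a rough illustration of that point, the sketch below uses a hypothetical LocalStateStemmer with no fields at all, so every piece of working state is a parameter or a local on the calling thread's stack. The "ing" rule is a toy, and the five-argument stem() signature is an assumption based on the proposal in this message. The demo uses a lambda, so it needs Java 8 or later.

```java
// Hypothetical class; the stemming rule is a toy, not a Snowball algorithm.
final class LocalStateStemmer {
    // No fields: everything stem() touches is a parameter or a local variable.
    int stem(char[] in, int inOffset, int inLength, char[] out, int outOffset) {
        int length = inLength; // working state lives on the caller's stack
        // Toy rule: drop a trailing "ing" from words longer than five chars.
        if (length > 5
                && in[inOffset + length - 3] == 'i'
                && in[inOffset + length - 2] == 'n'
                && in[inOffset + length - 1] == 'g') {
            length -= 3;
        }
        System.arraycopy(in, inOffset, out, outOffset, length);
        return length;
    }
}

class SharedStemmerDemo {
    public static void main(String[] args) throws InterruptedException {
        LocalStateStemmer shared = new LocalStateStemmer(); // one instance for all threads
        String[] words = {"stemming", "threading"};
        String[] results = new String[words.length];
        Thread[] threads = new Thread[words.length];
        for (int t = 0; t < words.length; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                // Buffers are per-thread; only the stemmer object is shared.
                char[] in = words[id].toCharArray();
                char[] out = new char[64];
                int n = shared.stem(in, 0, in.length, out, 0);
                results[id] = new String(out, 0, n);
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join(); // join() makes the results visible here
        System.out.println(String.join(" ", results));
    }
}
```

No synchronized keyword is needed, because the threads never write to shared state: each one owns its buffers and its own slot in the results array.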

Another way to make things *much* more efficient is to eliminate all use of Strings and StringBuffers. Strings always generate garbage and StringBuffers have a lot of synchronized methods. Instead, pass stem() char[] arrays that hold the input and receive the output.

Here's some sample code:

// EnglishStemmer inherits from SnowballProgram and can be shared by multiple threads
SnowballProgram stemmer = new EnglishStemmer();

// Copy the word into a reusable input buffer
String input = "hello";
int inLength = input.length();
int inOffset = 0;
char[] in = new char[64];
input.getChars(0, inLength, in, inOffset);

// stem() writes the result into out and returns its length
char[] out = new char[64];
int outOffset = 0;
int outLength = stemmer.stem(in, inOffset, inLength, out, outOffset);

The in and out buffers are reusable, making it possible to stem many words without generating any garbage at all. Of course, this scheme is only possible if every input word and every stem is guaranteed to fit within some known maximum length, like 64 chars.
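
The reuse pattern, including the length caveat, might look roughly like this. ToyArrayStemmer, stemAll and MAX_WORD are hypothetical names, the stemming rule is a toy, and skipping over-long words is just one possible policy. The StringBuilder sink is there only so the demo has something to print; it is not part of the garbage-free path.

```java
// Hypothetical stemmer with the proposed array-based signature; toy rule only.
final class ToyArrayStemmer {
    // Copies in[inOffset..inOffset+inLength) to out, dropping one trailing 's'.
    int stem(char[] in, int inOffset, int inLength, char[] out, int outOffset) {
        int length = inLength;
        if (length > 1 && in[inOffset + length - 1] == 's') length--;
        System.arraycopy(in, inOffset, out, outOffset, length);
        return length;
    }
}

class BufferReuseDemo {
    static final int MAX_WORD = 64; // assumed upper bound on word length

    // Stems every word that fits in the buffers; returns how many were stemmed.
    static int stemAll(String[] words, ToyArrayStemmer stemmer, StringBuilder sink) {
        char[] in = new char[MAX_WORD];  // allocated once, reused for every word
        char[] out = new char[MAX_WORD];
        int stemmed = 0;
        for (String word : words) {
            int inLength = word.length();
            if (inLength > MAX_WORD) continue; // would overflow the fixed buffers
            word.getChars(0, inLength, in, 0);
            int outLength = stemmer.stem(in, 0, inLength, out, 0);
            sink.append(out, 0, outLength).append(' ');
            stemmed++;
        }
        return stemmed;
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        int count = stemAll(new String[] {"cats", "dog"}, new ToyArrayStemmer(), sink);
        System.out.println(count + ": " + sink.toString().trim());
    }
}
```

Checking the length before calling getChars() matters: without it, a word longer than the buffer throws an ArrayIndexOutOfBoundsException instead of being handled deliberately.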

Chris


