[Snowball-discuss] Re: Possible memory leak in Snowballs Java stemmer

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Mon May 17 2004 - 09:51:19 BST


Wolfram,

Thank you for your carefully researched email. The Java generator is the
work of Richard Boulton and we must refer it to him. I hope we'll get an
answer back to you shortly. (I know he is rather busy at the moment.)

I am posting this to snowball-discuss@lists.tartarus.org. Please post
subsequent replies to that address.

Martin

At 16:43 14/05/2004 +0200, Wolfram Esser wrote:
>Hello Mr. Porter!
>
>I really appreciate your work in the field of stemming. I'm using
>the German stemmer extensively to find typing mistakes in large
>electronic encyclopedias.
>
>I am using the Java stemming engine which is provided by this link:
> http://snowball.tartarus.org/snowball_java.tgz
>Which is - to my knowledge - the current version of the Java stemmer.
>
>_*Problem:*_
>When stemming about 500,000 words and generating a Java HashMap which
>maps all the stems to their corresponding words, I get OutOfMemory
>errors - even with about 700MB of Java heap space and with about 1GB
>of machine RAM. This is strange, because the raw data needed should be
>something like 6MB+6MB+small(X), i.e. about 20-30 MB of RAM.
>
>_*Analysis:*_
>According to your Java TestApp (delivered with the above archive), after
>calling stem() one has to use SnowBallProgram.getCurrent() to get the
>stem of the stemmed word. This method does the following:
>    public String getCurrent()
>    {
>        return current.toString();
>    }
>
>So it converts the StringBuffer current to a String - but:
>StringBuffer.toString() does a so-called "lazy copy" - it does NOT
>create a fresh new String which is returned, but instead it creates a
>"hollow" String object whose data buffer points to the storage of the
>existing StringBuffer. So the StringBuffer and the new String point to
>the exact same memory block.
>
>So when the StringBuffer has allocated 2MB (with only 10 bytes used, which
>is OK for a StringBuffer!), then the new String also points to a 2MB
>memory block with only 10 bytes used.
>Java remembers this fact, and when the StringBuffer changes its value,
>the actual copy of the memory is made - but too late! The String
>object occupies 2MB - and always will - even if only 10 bytes contain
>useful characters!
>
>As people call Snowball's getCurrent() method often, they almost
>always get String objects that occupy a lot of useless memory. This is
>OK if they only use these Strings, for example (as in your
>TestApp), to do a System.out.println() and discard them afterwards. Then
>the memory will be freed by Java's garbage collector. But when you keep
>references to those Strings (e.g. as keys in a HashMap, as in my
>case!), the machine's memory runs out lightning fast! Actually I could
>only store about 300,000 stems in my HashMap, which occupied 600MB of RAM
>at that time!
>
>So, reusing StringBuffers is actually a use case which maybe was not
>intended by the developers of the StringBuffer class.
>
>
>_*Solution:*_
>
>Either the user or your library can do something like this:
>    String myStem = new String(germanStemmer.getCurrent());
>
>
>or (which I would prefer): rewrite the getCurrent method like one of the
>following (this prevents library users from using the library in a
>possibly dangerous way):
>
>    public String getCurrent()
>    {
>        return new String(current);
>    }
>
>or
>    public String getCurrent()
>    {
>        return current.substring(0);
>    }
>
>In both cases only the actual number of occupied characters is stored in
>the new String object.
>
>
>
>I don't know who actually maintains the Java part of Snowball. But
>I'm sure you can forward this email to him/her.
>I really would like to hear from your team whether you could reproduce my
>problem and find the solution helpful.
>Or did I overlook some other (memory-saving) means of getting the
>desired stem?
>
>Anyway: Thank you for your great work
>and greetings from Germany:
> Wolfram
>
>
>
>--
>
>---
>
> o Wolfram Esser (Dipl.-Inform.), Lehrstuhl fuer Informatik II
> / \ Universitaet Wuerzburg, Am Hubland, D-97074 Wuerzburg
>infoII o Phone: +49 (0)931-888-6614 Fax: +49 (0)931-888-6603
> / \ mailto:esser@informatik.uni-wuerzburg.de
> o o http://www2.informatik.uni-wuerzburg.de/staff/wolfram/
>
>
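
For readers of the archive, the copying approach proposed in the quoted
message can be sketched roughly as follows. This is only an illustration
under stated assumptions, not Snowball's code: the class StemCopyDemo, its
toy stem() rule and the group() helper are all hypothetical names made up
for this example. Whether toString() really shares the StringBuffer's
storage depends on the JDK version in use, so the sketch sidesteps the
question by copying exactly current.length() characters into a fresh array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a stemmer that reuses one large buffer. */
final class StemCopyDemo {

    /** Reusable work buffer, deliberately over-allocated (about one million chars). */
    private final StringBuffer current = new StringBuffer(1 << 20);

    /** Toy rule (an assumption, not a real stemmer): lower-case, drop a trailing "en". */
    void stem(String word) {
        current.setLength(0);
        current.append(word.toLowerCase(Locale.ROOT));
        int n = current.length();
        if (n > 4 && current.charAt(n - 2) == 'e' && current.charAt(n - 1) == 'n') {
            current.setLength(n - 2);
        }
    }

    /** Returns a String whose storage is sized to the stem and independent of the buffer. */
    String getCurrent() {
        char[] exact = new char[current.length()];
        current.getChars(0, current.length(), exact, 0);
        return new String(exact);
    }

    /** Maps each stem to the words that produced it, keeping only compact Strings as keys. */
    static Map<String, List<String>> group(List<String> words) {
        StemCopyDemo stemmer = new StemCopyDemo();
        Map<String, List<String>> byStem = new TreeMap<>();
        for (String word : words) {
            stemmer.stem(word);
            byStem.computeIfAbsent(stemmer.getCurrent(), k -> new ArrayList<>()).add(word);
        }
        return byStem;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of("Laufen", "laufen", "Haus")));
    }
}
```

The design choice here is the same as in the quoted proposals: the copy is
made inside getCurrent(), so callers who keep the returned Strings (for
example as map keys) never hold a reference to the large work buffer.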


