Re: [Snowball-discuss] Java stemmers

From: Richard Boulton (richard@tartarus.org)
Date: Wed Jan 23 2002 - 19:39:10 GMT


On Mon, 2002-01-21 at 14:23, Martin Porter wrote:
> Commit whenever you like, it sounds good to me.

Committed.

> There are a few queries: would it not be best to sort out the unicode issue
> if we are supporting Java?

It's no more of a problem than with C. The main difference with Java is
that chars are already 16 bits, so we don't need to change the
generated code to deal with that. The main thing, as you said, is to
ensure that groupings work with 16-bit character codes.
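
To make the grouping point concrete, here is a rough sketch of a grouping
test over 16-bit character codes. It assumes a grouping is stored as a bitmap
offset by its lowest member; the class and method names (GroupingSketch,
buildBitmap, inGrouping) are made up for illustration and are not taken from
the generated code.

```java
final class GroupingSketch {
    // Builds a bitmap with one bit per character code in min..max (inclusive).
    // Assumption: every character in `members` lies within that range.
    static byte[] buildBitmap(String members, int min, int max) {
        byte[] bits = new byte[(max - min) / 8 + 1];
        for (int i = 0; i < members.length(); i++) {
            int offset = members.charAt(i) - min;
            bits[offset >> 3] |= (byte) (1 << (offset & 7));
        }
        return bits;
    }

    // Tests membership for any 16-bit char; codes outside min..max are never members.
    static boolean inGrouping(byte[] bits, int min, int max, char ch) {
        if (ch < min || ch > max) return false;
        int offset = ch - min;
        return (bits[offset >> 3] & (1 << (offset & 7))) != 0;
    }

    public static void main(String[] args) {
        String vowels = "aeiou\u00e9";          // includes e-acute (code 0xE9)
        int min = 'a';
        int max = 0xE9;
        byte[] bits = buildBitmap(vowels, min, max);
        System.out.println(inGrouping(bits, min, max, 'e'));       // true
        System.out.println(inGrouping(bits, min, max, 'b'));       // false
        System.out.println(inGrouping(bits, min, max, '\u00e9'));  // true
        System.out.println(inGrouping(bits, min, max, '\u0430'));  // false (Cyrillic a, out of range)
    }
}
```

The range check comes first, so a code well above the grouping's range (such
as a Cyrillic letter) is rejected without indexing past the bitmap.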

> Is there just another codegenerator module that gets linked in with the rest
> of Snowball? Is it in ANSI C?

Just another codegenerator, yes: generator_java.c
Snowball now has a -java parameter which causes it to produce a .java
file. I'll write full documentation for the Java generator once it's
stabilised a bit more.

> I wonder how you got around the use of goto's in my codegenerator ...

Named breaks:

    foo;
    if (a) break lab1;
    bar;
    if (b) break lab1;
    baz;
 lab1:

becomes:

    foo;
    lab1: do {
        if (a) break lab1;
        bar;
        if (b) break lab1;
        baz;
    } while (false);
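
For anyone who wants to try the pattern, here is a small self-contained
sketch. Note that Java requires the do block to be closed with
"while (false);" so that the body runs exactly once. The class name
LabelledBreakDemo, the method run and the trace strings are invented for the
example; foo, bar and baz stand in for arbitrary statements.

```java
final class LabelledBreakDemo {
    // Records which of the steps foo, bar, baz executed for the given conditions.
    static String run(boolean a, boolean b) {
        StringBuilder trace = new StringBuilder();
        trace.append("foo;");
        lab1: do {
            if (a) break lab1;      // plays the role of "goto lab1"
            trace.append("bar;");
            if (b) break lab1;
            trace.append("baz;");
        } while (false);            // the body runs once, never loops
        // Execution continues here, where the C label lab1 would sit.
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(false, false)); // foo;bar;baz;
        System.out.println(run(false, true));  // foo;bar;
        System.out.println(run(true, false));  // foo;
    }
}
```

Since a labelled break can only jump forward out of an enclosing block, this
only covers gotos that jump forward to the end of a region, which is the case
shown in the example above.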

> I hit a speed issue with the Java version of the Porter stemmer that had the
> same order-of-magnitude difference from the C version that you report. I
> found that all the time was being lost in IO. You can easily test that by
> calling the stemmer up twice per word, to see how much time is spent in the
> central stemming bit. It would be interesting if you had the same problem.

That's certainly possible. :(
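
The check described above could be sketched roughly like this: run the same
read/stem/write pass once with one stemmer call per word and once with two,
and treat the difference as the time spent in stemming. Everything here is a
placeholder for illustration: toyStem is a trivial stand-in, not the
generated stemmer, and timedPass and stemmingShare are made-up names. Timings
from such a short run are noisy (JIT warm-up and so on), so this shows the
shape of the test and is not a proper benchmark.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

final class StemTimingSketch {
    // Trivial stand-in for a real stemmer: strips a single trailing "s".
    static String toyStem(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Reads one word per line, stems each word callsPerWord times, writes the
    // result, and returns the elapsed time in nanoseconds.
    static long timedPass(Reader in, Writer out, int callsPerWord) throws IOException {
        long start = System.nanoTime();
        BufferedReader reader = new BufferedReader(in);
        String word;
        while ((word = reader.readLine()) != null) {
            String stem = word;
            for (int i = 0; i < callsPerWord; i++) {
                stem = toyStem(word);
            }
            out.write(stem);
            out.write('\n');
        }
        out.flush();
        return System.nanoTime() - start;
    }

    // If once = IO + S and twice = IO + 2S, then S = twice - once.
    // Returns S as a fraction of the single-call run.
    static double stemmingShare(long onceNanos, long twiceNanos) {
        return (double) (twiceNanos - onceNanos) / onceNanos;
    }

    public static void main(String[] args) throws IOException {
        String input = "cats\ndogs\nrunning\nstems\n";
        long once = timedPass(new StringReader(input), new StringWriter(), 1);
        long twice = timedPass(new StringReader(input), new StringWriter(), 2);
        System.out.println("estimated stemming share: " + stemmingShare(once, twice));
    }
}
```

If the share comes out small, most of the time is going into IO and not into
the stemming itself.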

-- 
Richard

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss