Re: [Snowball-discuss] problems with Finnish

From: Richard Boulton (richard@tartarus.org)
Date: Sat Sep 21 2002 - 12:00:01 BST


On Sat, 2002-09-21 at 09:50, Martin Porter wrote:
> Richard Boulton's Java generated code does not at the moment implement these
> supplementary function calls, since they went into Snowball after he had
> written the Java code generator. I had quite forgotten this when releasing
> the Finnish stemmer. So apologies, Alex, for the time you've spent
> discovering that.
>
> We'll have to wait for a reply from Richard to see whether he's prepared to
> do more work here, but meanwhile we must remember that Finnish stemming is
> hardly in great demand! (I might have a go at adding it in, but it means
> getting into Java again.)

I've just had a quick look at this. As far as I can tell, the function
call stuff _is_ implemented in the Java version: compare, for example,
the "find_among" function in net/sf/snowball/SnowballProgram.java with
that in q/utilities.c. The function call stuff in the Java version
involves storing the name of the routine in the Among class, and using
introspection to call the appropriate routine when desired.
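
A minimal sketch of that mechanism follows. It is not the actual Snowball
code: the names AmongSketch, DemoStemmer, check, r_accept and r_reject are
made up for illustration, and the real find_among also does the suffix
search, which is left out here. It only shows a routine name being stored
in an Among entry and looked up by reflection when needed.

```java
import java.lang.reflect.Method;

public class AmongSketch {
    /** One entry of an 'among' table; the field names are illustrative. */
    public static final class Among {
        final String s;            // string to match
        final int result;          // value returned when this entry is chosen
        final String methodName;   // name of the supplementary routine, or null
        final Object methodObject; // object on which that routine is invoked

        public Among(String s, int result, String methodName, Object methodObject) {
            this.s = s;
            this.result = result;
            this.methodName = methodName;
            this.methodObject = methodObject;
        }
    }

    /** Stand-in for a generated stemmer with two supplementary routines. */
    public static final class DemoStemmer {
        public boolean r_accept() { return true; }
        public boolean r_reject() { return false; }
    }

    /**
     * Returns the entry's result if it has no routine attached or if the
     * routine succeeds; returns 0 if the routine fails.
     */
    public static int check(Among a) throws ReflectiveOperationException {
        if (a.methodName == null) {
            return a.result;
        }
        // Look the routine up by name on the target object and call it.
        Method m = a.methodObject.getClass().getMethod(a.methodName);
        boolean ok = (Boolean) m.invoke(a.methodObject);
        return ok ? a.result : 0;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        DemoStemmer p = new DemoStemmer();
        System.out.println(check(new Among("lla", 1, "r_accept", p)));
        System.out.println(check(new Among("lla", 2, "r_reject", p)));
        System.out.println(check(new Among("lla", 3, null, p)));
    }
}
```

If the stored name does not match any method on the object, getMethod throws
NoSuchMethodException, so a mistake of that kind would show up as soon as
the entry is used.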

I've never tested it carefully though, so it may not be working
properly.

I'm afraid I haven't got the time to look into this in more detail: my
Java system is giving me strange errors due to a bad installation and I
haven't the time to sort it out.

> But [this is to Richard] it is not too hard to implement. The routine that
> interprets the 'among' structure contains a call back into the generated
> code corresponding to a call of the supplementary function. You just need to
> add this in to the code which you hand-translated into java - and you told
> me that was done very easily.
>
> Regarding Russian, the java and C systems have been tested, and match, so
> the issue must be the character set. Are you using Unicode without 'symbol'
> set to two-bytes?

Actually, this isn't quite true: only the output of the English version
has been carefully checked. This is because, when I implemented the
Java stuff, the character set handling code hadn't been suitably
implemented, and I therefore couldn't easily run the tests. I haven't
had time since. So I'm pleased to hear that all the other foreign
languages check out correctly. :)

One of the original reasons for implementing the Java stemmers was so I
could use them with Lucene. In the end, we used a different system
though, so I'm glad someone else is finding them of use.

-- 
Richard


