Re: [Snowball-discuss] 16 bit characters in Snowball

From: Richard Boulton (richard@tartarus.org)
Date: Sat May 25 2002 - 14:37:44 BST


On Fri, 2002-05-24 at 20:47, Andreas Jung wrote:
> Seems that the problem is still not solved.
> I re-created all stemmers with and without -w option and in
> both cases snowball produced identical sources. Any ideas why?

Yes, -w doesn't change the output. What it does is allow snowball
programs to use character values in the range 0-65535 instead of 0-255.

A snowball program which can be generated successfully without -w will
not be affected by use of -w. However, a snowball program which uses
characters outside the range 0-255 will not be generated successfully
without -w.

If you're using -w to generate snowball output, you must also set
the typedef of "symbol" in api.h to a type wide enough for 16-bit
values when you compile the sources: see the comment at the start of
api.h.
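
As a rough illustration of that step, the 16-bit setting might look like
the sketch below. This is not a copy of api.h: the exact wording there may
differ, and symbol_can_hold is a hypothetical helper added only to show
what the wider type buys you. It is not part of the Snowball runtime.

```c
/* Sketch: a 16-bit "symbol" type, as one might set it in api.h when
 * compiling sources generated with -w. Assumption: the 8-bit build
 * uses an unsigned char typedef in the same place. */
typedef unsigned short symbol;

/* Hypothetical helper (not part of Snowball): returns 1 if the chosen
 * symbol type can represent the given character code, 0 otherwise. */
static int symbol_can_hold(unsigned long code)
{
    /* (symbol)(-1) is the largest value the unsigned type can store. */
    return code <= (unsigned long)(symbol)(-1);
}
```

With the typedef left at an 8-bit type, the same check would reject any
code above 255, which is why the typedef has to change along with -w.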

Note that using -w and setting the size of symbol still doesn't
guarantee that the snowball program is using a 16-bit character set.
The russian/stem.sbl file is an example: by default it uses KOI8-R (in
which all the character codes fit in one byte), but if you swap which
definitions are commented out you can make it use Unicode instead.
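
To make the KOI8-R versus Unicode point concrete, here is a small sketch
using the Cyrillic small letter "a", which is 0xC1 in KOI8-R and U+0430 in
Unicode. The array names and needs_wide_symbols are hypothetical, made up
for this illustration; nothing here comes from the stemmer sources.

```c
#include <stddef.h>

/* The same letter (Cyrillic small "a") in the two encodings. */
static const unsigned int koi8r_codes[]   = { 0xC1 };   /* KOI8-R  */
static const unsigned int unicode_codes[] = { 0x0430 }; /* Unicode */

/* Hypothetical helper: returns 1 if any code lies outside 0-255,
 * i.e. if a program using these codes would need -w, else 0. */
static int needs_wide_symbols(const unsigned int *codes, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (codes[i] > 255u)
            return 1;
    }
    return 0;
}
```

The KOI8-R code fits in one byte, so a program using only such codes
generates without -w; the Unicode code does not, so it needs -w.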

-- 
Richard
