Re: [Snowball-discuss] 16 bit characters in Snowball

From: Andreas Jung (andreas@andreas-jung.com)
Date: Sat May 25 2002 - 17:58:42 BST


Why does Snowball return its result as ASCII when I pass it a UTF-16 string?
Here is my test program:

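#include <stdio.h>   /* printf */
#include "stem.h"
#include "header.h"

int main(int argc, char **argv) {
    int i;
    struct SN_env *z;
    /* "aargauer" as byte pairs, low byte first (UTF-16LE) */
    char b[16] = {'a',0, 'a',0, 'r',0, 'g',0,
                  'a',0, 'u',0, 'e',0, 'r',0 };

    z = german_create_env();
    /* 8 symbols; the cast relies on a little-endian host */
    SN_set_current(z, 8, (unsigned short *)b);
    german_stem(z);
    printf("%d\n",z->l);   /* length of the stemmed word in symbols */

    /* print each symbol as a number and as a character */
    for (i=0;i<z->l;i++) printf("%d %c\n",z->p[i],z->p[i]);
    german_close_env(z);
    return 0;
}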

Output:

yetix@/develop/REPOSITORY/snowball/website/german(80)% ./a.out
6
97 a
97 a
114 r
103 g
97 a
117 u

symbol is defined in api.h as unsigned short.
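
For comparison, here is a minimal standalone sketch (no Snowball headers needed) of what the cast above amounts to on a little-endian machine. The typedef only mirrors what api.h is reported to contain, and the helper names utf16le_to_symbols and print_symbols are made up for illustration; they are not part of Snowball.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: mirrors the 16-bit typedef of "symbol" reported from api.h. */
typedef unsigned short symbol;

/* Hypothetical helper: combine byte pairs (low byte first) into 16-bit
   symbols. Returns the number of symbols written to out. */
size_t utf16le_to_symbols(const unsigned char *bytes, size_t nbytes,
                          symbol *out)
{
    size_t i;
    assert(bytes != NULL && out != NULL);
    for (i = 0; i + 1 < nbytes; i += 2)
        out[i / 2] = (symbol)(bytes[i] | (bytes[i + 1] << 8));
    return nbytes / 2;
}

/* Hypothetical helper: print each symbol as a number, followed by the
   character itself when the value is below 128. */
void print_symbols(const symbol *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (s[i] < 128)
            printf("%u %c\n", (unsigned)s[i], (char)s[i]);
        else
            printf("%u\n", (unsigned)s[i]);
    }
}
```

A 16-bit symbol holding 'a' has the value 97, the same number as the ASCII code, so printing each symbol with %d and %c cannot distinguish a 16-bit result from an 8-bit one for letters in this range. Unlike the pointer cast, the helper does not depend on the byte order or alignment of the host.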

Andreas


----- Original Message -----
From: "Richard Boulton" <richard@tartarus.org>
To: "Andreas Jung" <andreas@zope.com>
Cc: "Snowball discussion list" <snowball-discuss@lists.sourceforge.net>
Sent: Saturday, May 25, 2002 09:37
Subject: Re: [Snowball-discuss] 16 bit characters in Snowball

> On Fri, 2002-05-24 at 20:47, Andreas Jung wrote:
> > Seems that the problem is still not solved.
> > I re-created all stemmers with and without -w option and in
> > both cases snowball produced identical sources. Any ideas why?
>
> Yes, -w doesn't change the output. What it does is allow snowball
> programs to use character values in the range 0-65535 instead of 0-255.
>
> A snowball program which can be generated successfully without -w will
> not be affected by use of -w. However, a snowball program which uses
> characters out of the range 0-255 will not be generated successfully
> without -w.
>
> If you're using -w to generate snowball output, you must also set
> the typedef of "symbol" in api.h to something appropriate when you
> compile the sources: see the comment at the start of api.h
>
> Note that using -w and setting the size of symbol still doesn't
> guarantee that the snowball program is using a 16 bit character set: see
> the russian/stem.sbl file for an example: by default it uses KOI8-R (in
> which all the character codes fit in one byte), but if you change the
> comments around you can make it use Unicode instead.
>
> --
> Richard
>
> _______________________________________________________________
>
> Don't miss the 2002 Sprint PCS Application Developer's Conference
> August 25-28 in Las Vegas -- http://devcon.sprintpcs.com/adp/index.cfm
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/snowball-discuss
>
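
The distinction between the 0-255 and 0-65535 ranges in the quoted reply can be illustrated with a small standalone sketch. It assumes a 16-bit symbol typedef like the one in api.h; the helper name fits_in_one_byte is hypothetical and not part of Snowball.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors a 16-bit typedef of "symbol" in api.h. */
typedef unsigned short symbol;

/* Hypothetical helper: returns 1 if every symbol lies in the range 0-255,
   and 0 if at least one symbol needs more than one byte. */
int fits_in_one_byte(const symbol *s, size_t n)
{
    size_t i;
    assert(s != NULL);
    for (i = 0; i < n; i++)
        if (s[i] > 255)
            return 0;
    return 1;
}
```

For example, the Cyrillic letter "а" is 0xC1 in KOI8-R, which fits in one byte, but U+0430 in Unicode, which does not. This is why a stemmer written against KOI8-R codes stays within the 0-255 range even when symbol is 16 bits wide.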

This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:42 BST