Hello,
Consider the following Snowball program (test.sbl):
stringescapes {}
externals ( main )
stringdef a hex '03b1' // Greek Small Letter Alpha
stringdef i` hex '03af' // Greek Small Letter Iota With Tonos
define main as '{a}{i`}'
When test.sbl is compiled with
snowball test.sbl -u -o test
the compiler generates
...
static symbol s_0[] = { 0xCE, 0xB1, 0xCE, 0xAF };
...
as expected (the byte sequence 0xCE 0xB1 is the UTF-8 encoding of U+03B1, Greek Small Letter Alpha).
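For reference, the mapping from the code points in test.sbl to the bytes above can be checked with a small C sketch. This is not taken from the Snowball sources; the function name utf8_encode_bmp is hypothetical, and the sketch only covers code points up to U+FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Snowball: encode one code point in the
   range U+0000..U+FFFF as UTF-8. Writes up to 3 bytes into out and returns
   the number of bytes written, or 0 if cp is outside the supported range.
   Surrogate code points are not rejected in this sketch. */
size_t utf8_encode_bmp(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;                             /* outside the range of this sketch */
}
```

Calling utf8_encode_bmp(0x03B1, buf) fills buf with 0xCE 0xB1, and 0x03AF gives 0xCE 0xAF, which matches the s_0 array generated with -u.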
However, when compiled with
snowball test.sbl -w -o test
the generated code reads
...
static symbol s_0[] = { 0xB1, 0xAF };
...
I am running Snowball on a Win2K server. I have built the Snowball compiler
with both (cygwin) gcc 3.4.4 and Microsoft C/C++ Compiler 13.10.3077; the
results are identical in both cases.
The following modification to the function wlitarray (line 94 of source file generator.c)
for (j=8*sizeof(symbol)-4; j>=0; j-=4) wh(g, ch >> j & 0x0f);
along with a redefinition of symbol as
typedef wchar_t symbol; // unsigned short works as well
fixed the problem: now
snowball test.sbl -w -o test
generates
...
static symbol s_0[] = { 0x03B1, 0x03AF };
...
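To illustrate why the width of the loop matters, here is a minimal standalone sketch of the same nibble-by-nibble hex output. It is not the actual Snowball code: hex_digits is a hypothetical stand-in for the wh(g, ...) calls, and the bits parameter plays the role of 8*sizeof(symbol). With 8 bits only the low byte of each code point survives, which is consistent with the { 0xB1, 0xAF } output above; with 16 bits all four hex digits are written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Snowball: write the low `bits` bits of ch
   as upper-case hex digits, most significant nibble first, into buf.
   `bits` must be a positive multiple of 4 no greater than 32, and buf must
   have room for bits/4 digits plus the terminating '\0'. */
void hex_digits(unsigned long ch, int bits, char *buf)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t n = 0;
    int j;

    assert(buf != NULL);
    assert(bits > 0 && bits <= 32 && bits % 4 == 0);

    /* Same loop shape as the modified line in wlitarray quoted above. */
    for (j = bits - 4; j >= 0; j -= 4)
        buf[n++] = digits[(ch >> j) & 0x0f];
    buf[n] = '\0';
}
```

For example, hex_digits(0x03B1, 8, buf) yields "B1", while hex_digits(0x03B1, 16, buf) yields "03B1".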
However, I am wondering whether UTF-8 is the preferred internal encoding for ANSI C stemmers.
Best regards,
Marios.
This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:47 BST