[Snowball-discuss] ANSI C Generator feature

From: Marios Sintichakis (mms@archetypon.com)
Date: Fri Jul 15 2005 - 13:32:50 BST


Hello,
 
Consider the following snowball program (test.sbl).
 
stringescapes {}
externals ( main )
stringdef a hex '03b1' // Greek Small Letter Alpha
stringdef i` hex '03af' // Greek Small Letter Iota With Tonos
define main as '{a}{i`}'
 
When test.sbl is compiled with
 
snowball test.sbl -u -o test
 
the generated code reads
 
...
static symbol s_0[] = { 0xCE, 0xB1, 0xCE, 0xAF };
...
 
as expected (the byte sequence 0xCE 0xB1 is Greek Small Letter Alpha in UTF-8).
However, when compiled with
 
snowball test.sbl -w -o test
 
the generated code reads
 
...
static symbol s_0[] = { 0xB1, 0xAF };
...
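
To make the two outputs above easier to compare, here is a minimal C sketch. It is not taken from the Snowball sources; the helper names utf8_encode_bmp and low_byte are hypothetical, and it assumes code points no larger than U+FFFF. Encoding U+03B1 and U+03AF as UTF-8 gives the bytes seen with -u, while keeping only the low 8 bits of each code point gives 0xB1 and 0xAF, which matches the -w output and is consistent with each code point being cut down to a single byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point (at most U+FFFF, surrogates
   not checked) as UTF-8. Writes 1 to 3 bytes to out, returns the count. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char *out)
{
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: what survives if only the low 8 bits are kept. */
unsigned char low_byte(unsigned int cp)
{
    return (unsigned char)(cp & 0xFF);
}
```

Calling utf8_encode_bmp(0x03B1, buf) fills buf with 0xCE 0xB1, and low_byte(0x03B1) returns 0xB1.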

I am running Snowball on a Win2K server. I have built the Snowball compiler
with both (Cygwin) gcc 3.4.4 and Microsoft C/C++ Compiler 13.10.3077. The
results are identical in both cases.
 
The following modification in function wlitarray (line 94 of source file generator.c)
 
for (j=8*sizeof(symbol)-4; j>=0; j-=4) wh(g, ch >> j & 0x0f);
 
along with a redefinition of symbol as
 
typedef wchar_t symbol; // unsigned short works as well
 
fixed the problem: now
 
snowball test.sbl -w -o test
 
generates
 
...
static symbol s_0[] = { 0x03B1, 0x03AF };
...
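
The effect of the loop bound can be checked in isolation. The sketch below is not the generator's code: hex_digits is a hypothetical stand-in that assumes wh emits one hexadecimal digit per call, and it takes the symbol width in bytes as a parameter instead of using sizeof(symbol), so that both widths can be tried side by side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the modified loop: write ch as hex digits,
   high nibble first, two digits per byte of symbol width.
   buf must have room for 2 * symbol_bytes + 1 characters. */
void hex_digits(unsigned int ch, size_t symbol_bytes, char *buf)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t n = 0;
    int j;

    assert(symbol_bytes >= 1 && symbol_bytes <= sizeof(unsigned int));

    /* Same bounds as the loop in the message, with sizeof(symbol)
       replaced by the symbol_bytes parameter. */
    for (j = (int)(8 * symbol_bytes) - 4; j >= 0; j -= 4)
        buf[n++] = digits[(ch >> j) & 0x0f];
    buf[n] = '\0';
}
```

With symbol_bytes set to 2 the value 0x03B1 is written as 03B1; with symbol_bytes set to 1 only the low byte is written, giving B1.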

However, I am wondering whether UTF-8 is the preferred internal encoding for ANSI C stemmers.
 
 
Best regards,
Marios.


