At 08:43 PM 5/15/02 -0400, Andreas Jung wrote:
>Do you speak about 16 bit fixed encoding? I only know USC-2 that
>fulfills this requirement. Is it that what you mean?
>
>Andreas
Andreas,
I take it you mean UCS-2, not USC-2.
Yes, Snowball expects a typedef of 'symbol' to 'unsigned char' (one byte), or
'unsigned short' (two bytes or more), or 'unsigned long' (4 bytes or more)
... so 'unsigned char' can be used for UCS-1, 'unsigned short' for UCS-2.

But of course none of the Snowball stemmers recognise Unicode characters
above 32K, let alone 64K, so you can encode high-value characters as a
sequence of two-byte characters, and pass them into Snowball compiled with
'symbol' as 'unsigned short'.

This is precisely what UTF-16 does. Characters over 0xFFFF are split into
two, each of which is in a spare range of Unicode. Snowball would therefore
handle UTF-16 characters okay in this scheme.

Essentially, each Snowball stemmer has a fixed list of vowels, and anything
else is assumed to be a consonant. A character above 0xFFFF would therefore
be treated as a pair of consonants.
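
To make the scheme concrete, here is a minimal sketch in C, assuming a build
with 'symbol' as 'unsigned short'. The helper names encode_utf16 and is_vowel
are hypothetical and are not part of Snowball; the vowel list is a toy
stand-in for the fixed list a real stemmer carries.

```c
#include <stddef.h>

/* Assumption: Snowball compiled with 'symbol' as a two-byte type. */
typedef unsigned short symbol;

/* Hypothetical helper: encode code point cp as UTF-16 into out, which must
   have room for two symbols. Returns the number of symbols written, or 0
   if cp cannot be encoded. */
static size_t encode_utf16(unsigned long cp, symbol *out)
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0; /* reserved for surrogates */
    if (cp < 0x10000UL) {                           /* fits in one symbol */
        out[0] = (symbol)cp;
        return 1;
    }
    if (cp > 0x10FFFFUL) return 0;                  /* beyond the UTF-16 range */
    cp -= 0x10000UL;
    out[0] = (symbol)(0xD800UL + (cp >> 10));       /* high surrogate */
    out[1] = (symbol)(0xDC00UL + (cp & 0x3FFUL));   /* low surrogate */
    return 2;
}

/* Toy vowel test standing in for a stemmer's fixed vowel list. */
static int is_vowel(symbol ch)
{
    static const symbol vowels[] = { 'a', 'e', 'i', 'o', 'u', 'y' };
    size_t i;
    for (i = 0; i < sizeof vowels / sizeof vowels[0]; i++)
        if (vowels[i] == ch) return 1;
    return 0;
}
```

Encoding a character above 0xFFFF with this helper yields two symbols in the
surrogate range, neither of which appears in the vowel list, so a stemmer
built this way would step over them as non-vowels.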
It would have been possible to generate stemmers in C that use UTF-8
directly, but (a) this would not have extended to Java, with its 16-bit
characters, and (b) the slowness of character cursor movement (currently
implemented as a simple z->c++; or z->c--;) would probably have made the
final stemmers slower than bearing the overhead of translating UTF-8 to and
from UCS-2 for each call, always assuming that is what you have to do.
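
As a rough illustration of point (b), here is a sketch of what moving the
cursor by one character would involve over UTF-8 bytes, compared with a plain
increment or decrement for fixed-width symbols. The functions utf8_next and
utf8_prev are hypothetical, not Snowball code, and they assume well-formed
UTF-8 input.

```c
/* Hypothetical: advance byte offset c by one UTF-8 character within the
   buffer p, never going past the limit l. */
static int utf8_next(const unsigned char *p, int c, int l)
{
    if (c >= l) return l;
    c++;                                        /* step over the lead byte */
    while (c < l && (p[c] & 0xC0) == 0x80) c++; /* skip continuation bytes */
    return c;
}

/* Hypothetical: move byte offset c back by one UTF-8 character, never
   going below the backward limit lb. */
static int utf8_prev(const unsigned char *p, int c, int lb)
{
    if (c <= lb) return lb;
    c--;                                         /* step onto the previous byte */
    while (c > lb && (p[c] & 0xC0) == 0x80) c--; /* back up to the lead byte */
    return c;
}
```

Each step needs a loop and a mask test per byte, where the fixed-width case
needs only the single increment or decrement quoted above.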
Martin