Re: [Snowball-discuss] Unicode support

From: Martin Porter (martin_porter@softhome.net)
Date: Thu May 16 2002 - 09:39:18 BST


At 08:43 PM 5/15/02 -0400, Andreas Jung wrote:
>Do you speak about 16 bit fixed encoding? I only know USC-2 that
>fulfills this requirement. Is it that what you mean?
>
>Andreas

Andreas,

I take it you mean UCS-2, not USC-2.

Yes, Snowball expects a typedef of 'symbol' to 'unsigned char' (one byte), or
'unsigned short' (two bytes or more), or 'unsigned long' (4 bytes or more)
... so 'unsigned char' can be used for UCS-1, 'unsigned short' for UCS-2.

But of course none of the Snowball stemmers recognise Unicode characters
above 32K, let alone 64K, so you can encode high-value characters as a
sequence of two-byte characters, and pass them into Snowball compiled with
'symbol' as 'unsigned short'.

This is precisely what UTF-16 does. Characters over 0xFFFF are split into
two 16-bit units (a surrogate pair), each of which lies in a reserved range
of Unicode. Snowball would therefore handle UTF-16 characters okay in this
scheme.
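
As a rough illustration of the split described above, here is a minimal
sketch in C. The helper name encode_utf16 is hypothetical and is not part of
Snowball; the typedef assumes a build with 'symbol' as 'unsigned short'.

```c
#include <stddef.h>

/* Assumption: Snowball compiled with 16-bit (or wider) symbols. */
typedef unsigned short symbol;

/* Hypothetical helper, not part of Snowball: write one Unicode code point
   as UTF-16 code units into out[]. Returns the number of symbols written
   (1 or 2), or 0 if cp is not a valid Unicode scalar value. */
static size_t encode_utf16(unsigned long cp, symbol *out)
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0; /* reserved surrogate range */
    if (cp > 0x10FFFFUL) return 0;                  /* beyond Unicode */
    if (cp <= 0xFFFFUL) {
        out[0] = (symbol)cp;                        /* fits in one unit */
        return 1;
    }
    cp -= 0x10000UL;                                /* 20 bits remain */
    out[0] = (symbol)(0xD800UL + (cp >> 10));       /* high surrogate: top 10 bits */
    out[1] = (symbol)(0xDC00UL + (cp & 0x3FFUL));   /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1D11E would come out as the two symbols 0xD834 and 0xDD1E,
both of which can be passed to a stemmer like any other 16-bit symbol.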

Essentially, each Snowball stemmer has a fixed list of vowels, and anything
else is assumed to be a consonant. A character above 0xFFFF, split into two
symbols, would therefore be treated as a pair of consonants.
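
The vowel test can be pictured with a small sketch like the one below. This
is an illustration only, not the code Snowball generates; the vowel list and
the function names is_vowel and is_consonant are assumptions made for the
example.

```c
#include <stddef.h>

/* Assumption: Snowball compiled with 16-bit (or wider) symbols. */
typedef unsigned short symbol;

/* Hypothetical fixed vowel list, for illustration only; each real stemmer
   defines its own list. */
static const symbol vowels[] = { 'a', 'e', 'i', 'o', 'u', 'y' };

/* Returns 1 if ch is in the fixed vowel list, 0 otherwise. */
static int is_vowel(symbol ch)
{
    size_t i;
    for (i = 0; i < sizeof vowels / sizeof vowels[0]; i++) {
        if (vowels[i] == ch) return 1;
    }
    return 0;
}

/* Anything that is not a listed vowel counts as a consonant, including
   both halves of a UTF-16 surrogate pair. */
static int is_consonant(symbol ch)
{
    return !is_vowel(ch);
}
```

Under this scheme the two surrogates 0xD834 and 0xDD1E are simply two more
non-vowel symbols.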

It would have been possible to generate stemmers in C that use UTF-8
directly, but (a) this would not have extended to Java, with its 16-bit
characters, and (b) the slowness of character cursor movement (currently
implemented as a simple z->c++; or z->c--;) would probably have made the
final stemmers slower than bearing the overhead of translating UTF-8 to and
from UCS-2 for each call, always assuming that is what you have to do.
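
To show why cursor movement gets more expensive with UTF-8, here is a
minimal sketch of what stepping one character forward or backward could look
like over a UTF-8 byte buffer. The function names utf8_next and utf8_prev
are hypothetical, the input is assumed to be well-formed UTF-8, and this is
not taken from the Snowball sources.

```c
#include <stddef.h>

/* Hypothetical helper: given byte offset c in buffer p of length l, return
   the offset of the start of the next character. Continuation bytes have
   the bit pattern 10xxxxxx, so they are skipped in a loop. */
static size_t utf8_next(const unsigned char *p, size_t c, size_t l)
{
    if (c >= l) return l;
    c++;
    while (c < l && (p[c] & 0xC0) == 0x80) c++;
    return c;
}

/* Hypothetical helper: return the offset of the start of the character
   that precedes offset c, or 0 if c is already at the beginning. */
static size_t utf8_prev(const unsigned char *p, size_t c)
{
    if (c == 0) return 0;
    c--;
    while (c > 0 && (p[c] & 0xC0) == 0x80) c--;
    return c;
}
```

With fixed-width symbols each of these collapses to a single increment or
decrement, which is the z->c++; and z->c--; mentioned above.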

Martin
