Re: [Snowball-discuss] Unicode representation (formerly: Regarding a unicode version of Snowball)

From: Martin Porter (martin_porter@softhome.net)
Date: Thu Nov 29 2001 - 11:12:37 GMT


Okay, let's summarise the issues (Richard Boulton will be posting this to a
related mailing list).

Richard and I have no experience working with Unicode. In computing terms
the ideas are quite simple, but we need to know we are doing it the standard
way.

A Unicode character sequence can be represented in ANSI C by an array of
type 'short' - more precisely, of 'unsigned short', since, as with ASCII
characters, there seems to be no advantage in allowing Unicode characters to
have negative values. USHRT_MAX in limits.h is required by the ANSI C
Standard to be at least 65,535, so we know 16-bit characters will always be
representable in an unsigned short array. A question does arise: will a
Unicode character ever exceed 16 bits? UTF-8 encoding certainly allows for
the representation of codes with more than 16 bits, which I suppose is an
advantage in certain circumstances, and 64K is only just adequate to
represent Chinese (50,000 or more characters) and everything else currently
assigned. Whether Unicode assignments might exceed 64K I do not know, but it
is not impossible.
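
To fix ideas, here is a minimal sketch, not taken from the Snowball sources
(the 'symbol' typedef and 'sample' array are purely illustrative names), of
holding such a sequence in an unsigned short array:

    #include <limits.h>

    #if USHRT_MAX < 0xFFFF
    #error "unsigned short narrower than 16 bits"  /* ruled out by ANSI C */
    #endif

    /* Illustrative name only: one 16-bit Unicode code per array element. */
    typedef unsigned short symbol;

    /* 'abc' followed by three Cyrillic letters (codes 0x0430-0x0432). */
    static symbol sample[] = { 'a', 'b', 'c', 0x0430, 0x0431, 0x0432 };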

The arrangement of data in an 'unsigned short' array is machine dependent (I
had forgotten that in the original question to Archie). Certainly that is
true at the theoretical level, and at the practical level the killer is that
on the 80x86 range each short is two bytes with the least significant byte
first, while on the 68000 range each short is two bytes with the least
significant byte second. To keep the data portable in the form it has on
hard disk, it has to be converted into unsigned short array form at some
point after being read in.
And unsigned short arrays can't be portably represented in C strings of the type

     char * unicode0 = "\0" "a" "\0" "b" "\0" "c" ...;

In any case there is an alignment problem: C strings don't necessarily begin
on an even byte boundary.
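
As an illustration of that conversion step, here is a minimal sketch (the
function name is hypothetical, not part of Snowball), assuming the file
stores each character with the most significant byte first; working byte by
byte keeps the result independent of the host's own byte order:

    #include <stddef.h>

    /* 'in' holds 2 * n_chars bytes as read from disk, most significant
       byte first for each character; 'out' receives n_chars unsigned
       shorts in whatever layout the host uses. */
    static void bytes_to_shorts(const unsigned char * in, size_t n_chars,
                                unsigned short * out)
    {
        size_t i;
        for (i = 0; i < n_chars; i++)
            out[i] = (unsigned short)((in[2 * i] << 8) | in[2 * i + 1]);
    }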

There are therefore two ways of altering Snowball to handle Unicode.

a) Go for a 2-byte representation, and replace the byte arrays by short
arrays. To do this, literal strings in the generated code have to be
eliminated. "abc" needs to be represented as

    static unsigned short word_abc[] = {'a', 'b', 'c'};

and word_abc is used in place of the literal string - or something along
those lines.
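
Purely as an illustration of what 'something along those lines' might look
like (none of these names are generated by Snowball), the arrays would then
be compared against the buffer explicitly, since the implicit terminator and
length of a literal string are lost:

    /* Compare len symbols at s against the buffer p at cursor c, limit l. */
    static int eq_symbols(const unsigned short * s, int len,
                          const unsigned short * p, int c, int l)
    {
        int i;
        if (l - c < len) return 0;
        for (i = 0; i < len; i++)
            if (p[c + i] != s[i]) return 0;
        return 1;
    }

A test against word_abc would then be written with an explicit length of 3.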

b) Go for a UTF-8 representation. The code generated from Snowball is then
almost exactly the same. The only tricky bit is the concept of advancing the
cursor by 1 place ('next') or n places ('hop n'). Advancing the cursor 1
place is used implicitly in 'gopast' and 'goto'. The code 'z->c++;' or 'z->c
+= n;' needs to be elaborated. In Snowball, most string processing is done
backwards, but UTF-8 encoded data can also be processed backwards (I
believe), since continuation bytes are distinguishable from the first byte
of a character.
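
For example, a minimal sketch of 'next' in both directions (hypothetical
names, not the Snowball runtime): a UTF-8 continuation byte always has the
bit pattern 10xxxxxx, so the cursor can be moved by stepping one byte and
then skipping any continuation bytes:

    /* Advance the cursor c by one character, limit l; returns the new
       cursor, or -1 if there is no character to move over. */
    static int skip_utf8(const unsigned char * p, int c, int l)
    {
        if (c >= l) return -1;
        c++;
        while (c < l && (p[c] & 0xC0) == 0x80) c++;  /* skip 10xxxxxx bytes */
        return c;
    }

    /* The same, moving backwards towards the lower limit lb. */
    static int skip_utf8_backward(const unsigned char * p, int c, int lb)
    {
        if (c <= lb) return -1;
        c--;
        while (c > lb && (p[c] & 0xC0) == 0x80) c--;
        return c;
    }

'hop n' is then just n repetitions of this.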

Unfortunately the definition of Snowball changes slightly. The implication
is that

    setmark X hop n setmark Y

leaves Y-X = n, but that would not be true if the character size varied. A
few of the stemmers would need adjusting. However, that is not too important.

The UTF-8 representation is very attractive if that is the form the data has
on disk.

----

There is an issue with bitmaps. If you set up a character class, it
establishes a bitmap for the range of possible characters, of size M - m + 1
bits, where M is the largest code and m the smallest code of the characters
in the class. So testing for 'letter' in Russian gives a bitmap of size 32
bits (4 bytes), assuming the 32 letters occur together in their code table.
With Unicode you could get bigger bitmaps, because M may be much bigger than
m. But I had thought this through some months back and decided that, at
least for the stemmer programs written so far, the bitmaps would not be of
excessive size, since the codes we are dealing with occur in the lower
regions of the Unicode tables.
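
For concreteness, a minimal sketch of such a bitmap test (the names are
hypothetical): the bitmap holds one bit per code in the range m to M, and a
character is in the class if its bit is set:

    /* bits holds (max - min + 1) bits, one per code in the class's range. */
    static int in_class(const unsigned char * bits,
                        unsigned short min, unsigned short max,
                        unsigned short ch)
    {
        if (ch < min || ch > max) return 0;
        ch -= min;
        return (bits[ch >> 3] & (1 << (ch & 7))) != 0;
    }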

Martin Porter

Archie's last message:

>Thanks for the response Martin.
>
>I believe that short* would be more convenient then UTF-8 encoding. I must
>admit that all of my Unicode coding experience is limited to Windows2000.
>Here, almost all the functions that take Unicode characters are expecting a
>short* buffer. And it also expects the most significant byte to be first in
>the way you assumed.
>
>Archie

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss



