Re: [Snowball-discuss] 8-bit and 16-bit characters support

From: Eugen Bushuev (bu@cisarte.com)
Date: Wed Jun 04 2003 - 13:32:01 BST


Btw, here are the Cyrillic letters in UTF-8, each written as the decimal
value of its two bytes taken as one 16-bit symbol:

stringdef a decimal '45264'
stringdef b decimal '45520'
stringdef v decimal '45776'
stringdef g decimal '46032'
stringdef d decimal '46288'
stringdef e decimal '46544'
stringdef zh decimal '46800'
stringdef z decimal '47056'
stringdef i decimal '47312'
stringdef i` decimal '47568'
stringdef k decimal '47824'
stringdef l decimal '48080'
stringdef m decimal '48336'
stringdef n decimal '48592'
stringdef o decimal '48848'
stringdef p decimal '49104'
stringdef r decimal '32977'
stringdef s decimal '33233'
stringdef t decimal '33489'
stringdef u decimal '33745'
stringdef f decimal '34001'
stringdef kh decimal '34257'
stringdef ts decimal '34513'
stringdef ch decimal '34769'
stringdef sh decimal '35025'
stringdef shch decimal '35281'
stringdef " decimal '35537'
stringdef y decimal '35793'
stringdef ' decimal '36049'
stringdef e` decimal '36305'
stringdef iu decimal '36561'
stringdef ia decimal '36817'
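
A small sketch of where these decimals appear to come from, assuming a build with 2-byte symbols on a little-endian machine: each value is the letter's two UTF-8 bytes read as one 16-bit integer. The helper name utf8_pair_as_le16 is made up for illustration and is not part of Snowball.

```python
def utf8_pair_as_le16(ch: str) -> int:
    """Return the two UTF-8 bytes of ch read as one little-endian 16-bit integer."""
    raw = ch.encode("utf-8")
    if len(raw) != 2:
        raise ValueError(f"{ch!r} is not a 2-byte UTF-8 character")
    return int.from_bytes(raw, "little")


# Russian lowercase alphabet without 'ё', in the same order as the stringdef list.
LETTERS = "абвгдежзийклмнопрстуфхцчшщъыьэюя"

for ch in LETTERS:
    # Columns: code point, UTF-8 bytes in hex, decimal value for stringdef.
    print(f"U+{ord(ch):04X}  {ch.encode('utf-8').hex()}  {utf8_pair_as_le16(ch)}")
```

By this calculation the hard sign ъ (U+044A, bytes D1 8A) comes out as 35537 and the soft sign ь (U+044C, bytes D1 8C) as 36049. On a big-endian machine the byte order, and so every decimal, would differ.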

Martin Porter wrote:

>>Date: Wed, 04 Jun 2003 10:08:30 +0300
>>From: Eugen Bushuev <bu@lucky.net>
>>To: Martin Porter <martin_porter@SoftHome.net>
>>CC: Oleg Bartunov <oleg@sai.msu.su>
>>Subject: Re: [Snowball-discuss] 8-bit and 16-bit characters support
>>
>>Hi.
>>This question was raised by me. You can get a bit of Russian UTF-8 text at
>>http://ox.carrier.kiev.ua/~bu/test/fetch/novosti_utf8.html.
>>
>>About your advice: I can't find anything similar to "goto v[owel]"
>>directives in either the Russian or the English .sbl. I tried adding the
>>current character size to the SN_env structure and replacing sizeof(symbol)
>>with z->sizeOfChar in all memory allocation procedures. I also tried playing
>>with incrementing z->c, but it gave me nothing, since I barely understand
>>how it works.
>>
>>I'm trying to figure this out because I need tsearch to work with UTF-8.
>>UTF-8 is used because postgres uses it as "Unicode", and besides that I
>>need to process data in several languages, at least English, Russian
>>and Ukrainian.
>>
>>And, btw, why 2 or 3 bytes per character? I thought that English text
>>uses 1 byte and Russian 2 bytes...
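
For reference, a quick check of how many bytes UTF-8 spends per character. This is only a sketch; the sample characters are my own choice and utf8_len is a made-up helper name.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters chosen for illustration only.
samples = {
    "Latin 'a'": "a",
    "Russian 'ж'": "ж",
    "Ukrainian 'ї'": "ї",
    "Euro sign": "€",
}

for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

ASCII letters take 1 byte and the Cyrillic letters used in Russian and Ukrainian take 2, because they sit below U+0800. Three bytes are needed only from U+0800 upwards, for example the euro sign.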
>>
>>Martin Porter wrote:
>>
>>
>>
>>>Oleg,
>>>
>>>No, Snowball is either set up for 1 byte character use, or 2 byte character
>>>use, but it has occurred to me that implementing the stemmers on utf-8 data
>>>may not be so difficult, even with no changes to the Snowball compiler.
>>>
>>>If you treat utf-8 data as a pure byte stream of characters (so one utf-8
>>>character corresponds to 2 or 3 bytes) the stemmers almost work, but the
>>>thing that goes wrong is the single character tests for characters in a
>>>certain class. So one would have to replace
>>>
>>> goto vowel // vowel defined by 'define vowel '...'
>>>
>>>by
>>>
>>> goto among ('a' 'e' 'i' 'o' 'u')
>>>
>>>or more precisely
>>>
>>> goto among ('[a]' '[e]' ... )
>>>
>>>where [a] etc are macros defining the vowels as UTF-8 encoded byte sequences.
>>>
>>>Perhaps that is how all the stemmers should have been written.
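
A rough model of the difference described above, written in Python rather than Snowball and assuming the stemmer sees UTF-8 as a plain byte stream. The function names in_class_bytewise and among_at are invented for this sketch and do not correspond to anything in the Snowball runtime.

```python
VOWELS = "аеиоуыэюя"  # Russian vowels

# 'define vowel ...' style: a set of single symbols. On a byte stream the
# symbols are bytes, so the set ends up holding lead and trail bytes.
vowel_byte_set = {b for v in VOWELS for b in v.encode("utf-8")}

# 'among (...)' style: one whole byte sequence per vowel.
vowel_sequences = [v.encode("utf-8") for v in VOWELS]


def in_class_bytewise(data: bytes, pos: int) -> bool:
    """Single-symbol class test applied to the one byte at pos."""
    return data[pos] in vowel_byte_set


def among_at(data: bytes, pos: int) -> int:
    """Length of the vowel byte sequence starting at pos, or 0 if there is none."""
    for seq in vowel_sequences:
        if data.startswith(seq, pos):
            return len(seq)
    return 0


consonant = "р".encode("utf-8")  # bytes D1 80
vowel = "у".encode("utf-8")      # bytes D1 83

print(in_class_bytewise(consonant, 0))  # lead byte D1 is shared with 'у', 'ы', ...
print(among_at(consonant, 0))
print(among_at(vowel, 0))
```

The bytewise test accepts the consonant 'р' because its lead byte D1 also starts several vowels, while matching whole sequences rejects it and reports 2 bytes for 'у'.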
>>>
>>>Can you point me to some plain text somewhere on the web that gives a bit of
>>>Russian in UTF-8 encoded Unicode? I might play around with this idea.
>>>
>>>Martin
>>>
>>>
>>>
>>>
>>>_______________________________________________
>>>Snowball-discuss mailing list
>>>Snowball-discuss@lists.tartarus.org
>>>http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>--
>>Best regards, E. Bushuev.
>

-- 
Best regards, E. Bushuev.
