[Snowball-discuss] Problem with PySnowballStemmer

From: Patrick Mézard (pmezard@gmail.com)
Date: Sat Jan 21 2006 - 18:09:50 GMT


Hello,
First, thank you, Weongyo Jeong, for providing updated Python bindings;
I was definitely looking for them.

However, I cannot get them to work with UTF-8 input:
"""
# -*- coding: iso-8859-1 -*-
import SnowballStemmer

encodings = [
     ('UTF_8', 'utf8'),
     ('ISO_8859_1', 'iso-8859-1'),
]

for sn_enc, py_enc in encodings:
     s = SnowballStemmer.SnowballStemmer().new('french', sn_enc)
     #This is a 'latin small letter e acute' at the end of the word.
     u = unicode('pitié', 'iso-8859-1').encode(py_enc)
     print sn_enc, ':', repr(u), '=>', repr(s.stem_str(u))
"""

outputs:
"""
UTF_8 : 'piti\xc3\xa9' => 'piti\xc3'
ISO_8859_1 : 'piti\xe9' => pit
"""
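
The UTF-8 result above can be checked for validity without the
bindings. The following is only a minimal Python 3 sketch (the test
script above is Python 2); the byte strings are copied from the output,
and is_valid_utf8 is a hypothetical helper name, not part of
SnowballStemmer:

```python
# Byte strings copied from the reported output; nothing here calls the bindings.
utf8_input = "piti\u00e9".encode("utf-8")   # b'piti\xc3\xa9'
utf8_output = b"piti\xc3"                    # what stem_str() returned


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(utf8_input))          # input is well-formed
print(is_valid_utf8(utf8_output))         # output ends in a lone lead byte
print(utf8_input[:-1] == utf8_output)     # output is the input minus one byte
```

The output is exactly the input with its last byte removed, which would
be consistent with the suffix being stripped byte by byte rather than
character by character, although this is only a guess.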

The UTF-8 version returns an invalid UTF-8 sequence. I am completely new
to Snowball and have only just seen the announcement that Unicode
support was added last year. So far I have failed to find reliable
information about how this is done, even by looking at the code:

1- There is a bunch of stemming files in the bindings sources, including
"stem_UTF_8_french.c". I suppose it was generated from a Snowball
stemming file. Does "UTF_8" mean that the input strings are UTF-8 byte
sequences? I suppose so.

2- Reading the ML, I thought UTF-8 was implemented by translating inputs
to UCS-2 first and then stemming them. I cannot find anything that looks
like a UTF-8 decoder/encoder. Besides, "symbol" is defined as an
"unsigned char". Are the bindings interpreting UTF-8 strings directly?

3- If [2] is the case, then AFAIK UTF-8 is nothing more than an encoding
layer on top of Unicode code points. How does the stemmer handle
normalization forms? Are there any expectations about them? I tried
sending the same UTF-8 word in NFD form instead of the precomposed form
Python produces here (which is already NFC), but it changed nothing.
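
For reference, the two normalization forms mentioned in [3] differ at
the byte level as follows. This is a minimal Python 3 sketch using only
the standard unicodedata module (the test script above is Python 2); it
does not call the bindings, so it says nothing about how the stemmer
treats either form:

```python
import unicodedata

word = "piti\u00e9"  # precomposed 'latin small letter e with acute'

# NFC keeps the single code point U+00E9; NFD splits it into
# 'e' followed by U+0301 (combining acute accent).
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

print(len(nfc), repr(nfc.encode("utf-8")))
print(len(nfd), repr(nfd.encode("utf-8")))
```

In NFC the word ends with the two bytes 0xc3 0xa9, while in NFD it ends
with a plain "e" followed by the two bytes 0xcc 0x81, so a stemmer that
matches suffixes on the precomposed character would see two different
inputs.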

The bindings were compiled and tested with:
"""
ActivePython 2.4.2 Build 248 (ActiveState Corp.) based on
Python 2.4.2 (#67, Oct 30 2005, 16:11:18) [MSC v.1310 32 bit (Intel)] on
win32
"""

Did I miss something obvious?
Thank you for any idea about this.

--
Patrick Mézard


