[Snowball-discuss] Unicode and python bindings

From: Patrick Mézard (pmezard@gmail.com)
Date: Tue May 16 2006 - 13:39:05 BST


While trying to solve the issues I raised in a previous post
(<http://thread.gmane.org/gmane.comp.search.snowball/772/focus=772>), I
ended up rewriting parts of Weongyo Jeong's original Python bindings to
fit my needs. The main change is that the module interface now consumes
Python Unicode strings (UTF-16) instead of native strings. The idea is
that code dealing with multiple languages usually first converts the
documents' encodings to Unicode before passing them to other modules,
including stemming. With the original bindings, since I could not get
the UTF-8 interface to work, I had to convert back from Unicode to
specific encodings, which was at best a pain and at worst impossible.
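
The workflow described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: it is written for Python 3
(where str is already Unicode), and placeholder_stem is a hypothetical
stand-in, not the interface of the bindings. It shows documents in
different encodings being unified to Unicode once, and why converting
back to a single-byte encoding can fail outright.

```python
# Sketch only: placeholder_stem is a hypothetical stand-in for a
# Unicode-aware stemming call; it is NOT the pysnowball API.

def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode raw document bytes into a Unicode string."""
    return raw.decode(encoding)

def placeholder_stem(word: str) -> str:
    """Hypothetical stemmer stand-in: lowercases, does no real stemming."""
    return word.lower()

def can_encode(text: str, encoding: str) -> bool:
    """Report whether text survives conversion back to a given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Documents arrive in different encodings and are unified once, up front.
documents = [
    ("Café".encode("latin-1"), "latin-1"),
    ("Книга".encode("utf-8"), "utf-8"),
]
unified = [to_unicode(raw, enc) for raw, enc in documents]

# A Unicode-consuming interface can take the unified strings directly.
stems = [placeholder_stem(word) for word in unified]

# Converting back to a single-byte encoding is not always possible:
# Cyrillic text cannot be represented in Latin-1.
latin1_ok = [can_encode(word, "latin-1") for word in unified]
```

With a Unicode-consuming interface the last step is unnecessary; with an
encoding-specific interface it must succeed for every document, which it
cannot when the text falls outside the target character set.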

The new version is temporarily available here:
<http://perso.wanadoo.fr/patrick.mezard/dev/pysnowball-0.0.2.zip>, and I
can provide a copy of the darcs (<http://abridgegame.org/darcs/>)
repository I used for my branch.

I think it still needs to be reviewed before any release (I am far from
being a Python C extension expert), even though it passes the few tests
I could think of.

What's your opinion about this?

Patrick Mézard
