[Snowball-discuss] Minor bug in utf-8 handling

From: Olly Betts (olly@survex.com)
Date: Tue Feb 13 2007 - 17:01:45 GMT

Next message: Martin Porter: "Re: [Snowball-discuss] Minor bug in utf-8 handling"
Previous message: Olly Betts: "Re: [Snowball-discuss] More patches"
Reply: Martin Porter: "Re: [Snowball-discuss] Minor bug in utf-8 handling"
Reply: Richard Boulton: "Re: [Snowball-discuss] Minor bug in utf-8 handling"
Reply: Martin Porter: "Re: [Snowball-discuss] Minor bug in utf-8 handling"
Reply: Richard Boulton: "Re: [Snowball-discuss] Minor bug in utf-8 handling"

While reading the code, I think I've spotted a bug in the handling of
3-byte UTF-8 sequences. Both get_utf8 and get_b_utf8 fetch the third
byte with *p (that is, p[0]) when they should use p[c]:

http://oligarchy.co.uk/xapian/patches/snowball-3byte-utf8-bugfix.patch
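
To illustrate the kind of mistake involved, here is a minimal sketch of a
forward 3-byte decoder. It is not the Snowball runtime source: the function
names decode_utf8_at and decode_utf8_at_buggy are hypothetical, and the
parameter names (p for the buffer, c for the current offset, l for the limit,
slot for the result) are assumptions chosen to match the description above.

```c
/* Hypothetical sketch, not the actual Snowball code.
 * Decode the UTF-8 sequence (up to 3 bytes) starting at offset c in p,
 * where l is the end offset. Stores the code point in *slot and returns
 * the number of bytes consumed, or 0 if c is already at the end. */
static int decode_utf8_at(const unsigned char *p, int c, int l, int *slot)
{
    int b0, b1;
    if (c >= l) return 0;
    b0 = p[c++];
    if (b0 < 0xC0 || c == l) {          /* 1-byte (or truncated) sequence */
        *slot = b0;
        return 1;
    }
    b1 = p[c++];
    if (b0 < 0xE0 || c == l) {          /* 2-byte (or truncated) sequence */
        *slot = (b0 & 0x1F) << 6 | (b1 & 0x3F);
        return 2;
    }
    /* Third byte is read at the current offset: p[c]. */
    *slot = (b0 & 0x0F) << 12 | (b1 & 0x3F) << 6 | (p[c] & 0x3F);
    return 3;
}

/* Same as above, except the third byte is fetched with *p, which is
 * p[0] regardless of the current offset c. */
static int decode_utf8_at_buggy(const unsigned char *p, int c, int l, int *slot)
{
    int b0, b1;
    if (c >= l) return 0;
    b0 = p[c++];
    if (b0 < 0xC0 || c == l) {
        *slot = b0;
        return 1;
    }
    b1 = p[c++];
    if (b0 < 0xE0 || c == l) {
        *slot = (b0 & 0x1F) << 6 | (b1 & 0x3F);
        return 2;
    }
    *slot = (b0 & 0x0F) << 12 | (b1 & 0x3F) << 6 | (*p & 0x3F);  /* wrong byte */
    return 3;
}
```

With the bytes 0x61 0xE2 0x82 0xAC ("a" followed by the euro sign) and c = 1,
the first function yields U+20AC, while the second mixes in the low bits of
the leading 'a' and yields a different code point.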

In the current stemmers this is probably harmless, since the characters
used by the languages Snowball has stemmers for all encode as one- or
two-byte UTF-8 sequences.
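
As a rough check of that reasoning, the sketch below computes how many bytes
UTF-8 needs for a given code point. The helper name utf8_encoded_length is
hypothetical. Anything below U+0800, which covers the Latin and Cyrillic
letters, fits in at most two bytes, so the 3-byte branch would only be reached
for code points from U+0800 upwards.

```c
/* Hypothetical helper: number of bytes UTF-8 uses to encode code point cp.
 * Assumes cp is a valid code point (at most 0x10FFFF). */
static int utf8_encoded_length(unsigned long cp)
{
    if (cp < 0x80) return 1;      /* ASCII */
    if (cp < 0x800) return 2;     /* e.g. accented Latin, Cyrillic */
    if (cp < 0x10000) return 3;   /* rest of the Basic Multilingual Plane */
    return 4;                     /* supplementary planes */
}
```

For example, U+00E9 (e with acute accent) and U+0436 (Cyrillic zhe) both need
two bytes, whereas U+20AC (the euro sign) needs three.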

I also improved the comment before skip_utf8.

Cheers,
Olly

This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:49 BST