RE: [Snowball-discuss] More patches

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Fri Feb 16 2007 - 13:05:18 GMT


As you point out, some calling code will already convert its input to lower case. Lowercasing should therefore not be a standard part of the stemmer, primarily for performance reasons: those callers would pay for the conversion twice. It would be "nice to have" if Snowball provided a lowercasing feature that could be optionally invoked.
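
A minimal sketch of what "optionally invoked" could look like, in Python. Nothing here is Snowball's actual API: `stem_word`, its `lowercase` flag, and `toy_stem` are hypothetical names used only for illustration, and `toy_stem` is a stand-in, not a real stemming algorithm.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing
    # lowercase "ing" from words longer than five characters.
    if len(word) > 5 and word.endswith("ing"):
        return word[:-3]
    return word


def stem_word(word, stem=toy_stem, lowercase=False):
    # Lowercasing is opt-in, so callers whose input is already
    # lowercase do not pay for a second conversion.
    if lowercase:
        word = word.lower()
    return stem(word)
```

With this shape, `stem_word("RUNNING")` leaves the word unstemmed because the toy stemmer only matches lowercase suffixes, while `stem_word("RUNNING", lowercase=True)` gives the same result as `stem_word("running")`.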

Hopefully helpfully yours,
Steve

-- 
Steve Tolkin    Steve . Tolkin at FMR dot COM   508-787-9006
Fidelity Investments   82 Devonshire St. M3L     Boston MA 02109 
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates. 

-----Original Message-----
From: snowball-discuss-bounces@lists.tartarus.org [mailto:snowball-discuss-bounces@lists.tartarus.org] On Behalf Of Olly Betts
Sent: Friday, February 16, 2007 7:06 AM
To: Richard Boulton
Cc: snowball-discuss@lists.tartarus.org
Subject: Re: [Snowball-discuss] More patches

[some snipped]

I wonder if the algorithms should perform lowercasing for you. In general it's a required preprocessing step for the stemmers to work correctly, so most users will need to implement the lowercasing themselves (except perhaps for applications where the input is always lowercase already).

The problem I can see is that to do it correctly for all non-ASCII characters requires fairly large tables, and doing it just for ASCII letters probably isn't really sufficient. Perhaps it's only necessary for characters the stemmers check for though. Thoughts?
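
One way to picture the "only the characters the stemmers check for" option is a small translation table, sketched here in Python. The letter set in `STEMMER_LETTERS` is purely illustrative and is not taken from any real Snowball algorithm; `build_lower_table` and `lowercase_for_stemmer` are hypothetical names.

```python
import string

# Hypothetical: uppercase forms of the non-ASCII letters that one
# particular stemmer tests for. Illustrative only.
STEMMER_LETTERS = "ÀÂÇÉÈÊËÎÏÔÙÛÜ"


def build_lower_table(extra_upper):
    # Start with the 26 ASCII letters, then add only the listed extras.
    table = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}
    for c in extra_upper:
        lower = c.lower()
        # Skip letters whose lowercase form is not a single code point.
        if len(lower) == 1:
            table[ord(c)] = ord(lower)
    return table


LOWER_TABLE = build_lower_table(STEMMER_LETTERS)


def lowercase_for_stemmer(word, table=LOWER_TABLE):
    # Characters outside the table pass through unchanged.
    return word.translate(table)
```

Under these assumptions the table has a few dozen entries rather than full Unicode coverage: `lowercase_for_stemmer("ÉCOLE")` returns `"école"`, while a letter outside the set, such as the `Ñ` in `"ÑANDU"`, is left as it is.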

Cheers, Olly

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss


