Re: [Snowball-discuss] New, and a couple of questions

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Wed Mar 10 2004 - 09:52:02 GMT


Sandy,

Your assumptions about memory allocation are correct. To change the initial
creation size from 1 to 1000 (say) you alter

#define CREATE_SIZE 1

to

#define CREATE_SIZE 1000

in the q/utilities.c module. But you must not imagine size changes happen
often. The buffers are increased in increase_size(...) in q/utilities.c, and
this is never called more than twice for any of the sample vocabularies I
use in the tests.

There is no fast way of discovering if a word has been stemmed. You could
set a flag in the various functions of q/utilities.c that alter z->p, but
this is not a general solution, since Snowball can use auxiliary strings
that may be altered while the main string remains unaltered - although none
of the current stemmers would do that. So you have to use strcmp or equivalent.
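
As a rough illustration of "strcmp or equivalent", here is a minimal sketch
of such a check. The name word_changed and its signature are made up for
this example and are not part of Snowball; it assumes you have kept a copy
of the word as it was before stemming, and that you know both lengths.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Snowball: returns 1 if the stemmed
   form differs from the original word, 0 if the two are identical. */
static int word_changed(const char *before, size_t before_len,
                        const char *after, size_t after_len)
{
    /* Different lengths mean the word was altered; no need to look
       at the characters. */
    if (before_len != after_len)
        return 1;
    /* Same length: compare the bytes. memcmp is used rather than strcmp
       so the buffers need not be null-terminated. */
    return memcmp(before, after, before_len) != 0;
}
```

Comparing the lengths first will often settle the question without any
character comparison, since a stemmer that alters a word usually shortens it.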

I find on my machine,

    for (i = 0; i < one_hundred_million; i++)
        strcmp("honorificabilitudinitatibus",
               "honorificabilitudinitatibus");

takes about 24 secs, of which 1 sec is spent in the mechanics of the loop. I
suppose the words are compared from the beginning, so that a mismatch in the
first character ends the comparison at once, and

    for (i = 0; i < one_hundred_million; i++)
        strcmp("honorificabilitudinitatibus",
               "Honorificabilitudinitatibus");

by contrast takes about 4 secs. But my machine is fairly slow: on Richard
Boulton's machine it would take a tenth of that time.
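
If you want to repeat the measurement, something along the following lines
might do. It is only a sketch: time_strcmp is a name invented for this
example, clock() measures CPU time rather than elapsed time, and the
volatile qualifiers are there on the assumption that an optimising compiler
would otherwise evaluate strcmp on two literals at compile time and remove
the loop.

```c
#include <string.h>
#include <time.h>

/* Hypothetical helper: run strcmp(a, b) n times and return the CPU
   time used, in seconds, as reported by clock(). */
static double time_strcmp(const char *a, const char *b, long n)
{
    /* volatile pointers are re-read on every iteration, so the compiler
       cannot treat the arguments as known constants */
    const char *volatile pa = a;
    const char *volatile pb = b;
    /* storing the result in a volatile keeps the call from being dropped */
    volatile int sink = 0;
    clock_t start = clock();
    long i;

    for (i = 0; i < n; i++)
        sink = strcmp(pa, pb);
    (void)sink;

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling it once with two identical strings and once with strings that
differ in the first character, with n set to 100000000, corresponds to the
two loops above. The figures will of course depend on the machine, the
compiler and the C library.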

Martin


