Sandy,
Your assumptions about memory allocation are correct. To change the initial
creation size from 1 to 1000 (say) you alter
    #define CREATE_SIZE 1
to
    #define CREATE_SIZE 1000
in the q/utilities.c module. But you must not imagine size changes happen
often. The buffers are increased in increase_size(...) in q/utilities.c, and
this is never called more than twice for any of the sample vocabularies I
use in the tests.
There is no fast way of discovering if a word has been stemmed. You could
set a flag in the various functions of q/utilities.c that alter z->p, but
this is not a general solution, since Snowball can use auxiliary strings
that may be altered while the main string remains unaltered - although none
of the current stemmers would do that. So you have to use strcmp or equivalent.
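As a minimal sketch of the "strcmp or equivalent" test: keep a copy of the word as it was before stemming and compare it with the result afterwards. The helper name word_was_changed and its parameters are hypothetical and not part of Snowball. It assumes you hold both words as a buffer plus a length, so it uses memcmp rather than strcmp and does not depend on a terminating NUL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Snowball: returns 1 if the stemmed
   word differs from the original word, 0 if the two are identical. */
static int word_was_changed(const char *before, size_t before_len,
                            const char *after, size_t after_len)
{
    /* Different lengths mean the stemmer removed or added something. */
    if (before_len != after_len) return 1;

    /* Same length: compare the bytes themselves. */
    return memcmp(before, after, before_len) != 0;
}
```

Checking the lengths first settles the common case, where a suffix has been removed, without looking at any characters.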
I find on my machine,
    for (i = 0; i < one_hundred_million; i++)
        strcmp("honorificabilitudinitatibus",
               "honorificabilitudinitatibus");
takes about 24 secs, of which 1 sec is spent in the mechanics of the loop. I
suppose strcmp compares the words from the beginning and stops at the first
difference, since
    for (i = 0; i < one_hundred_million; i++)
        strcmp("honorificabilitudinitatibus",
               "Honorificabilitudinitatibus");
by contrast takes about 4 secs. But my machine is fairly slow: on Richard
Boulton's machine it would take a tenth of that time.
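If you want to repeat the measurement on your own machine, something along these lines should do. It is only a sketch: the name time_strcmp is hypothetical, and the call through a volatile function pointer is an assumption about what is needed to stop an optimising compiler from working out the comparison of two constant strings at compile time.

```c
#include <string.h>
#include <time.h>

/* Hypothetical helper: run strcmp(a, b) `iterations` times and return
   the elapsed CPU time in seconds, as measured by clock(). */
static double time_strcmp(const char *a, const char *b, long iterations)
{
    /* Calling through a volatile function pointer forces a real call
       on every pass instead of a result folded at compile time. */
    int (*volatile cmp)(const char *, const char *) = strcmp;
    volatile int sink = 0;   /* keeps each result "used" */
    long i;
    clock_t start = clock();

    for (i = 0; i < iterations; i++)
        sink = cmp(a, b);

    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling it once with the two identical words and once with the pair that differs in the first letter, each with 100000000L iterations, gives the two timings to compare. The absolute figures will depend on the machine, the compiler and the C library.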
Martin