[Snowball-discuss] FW: [Xapian-discuss] Some performance questions

From: Arjen van der Meijden (arjen@glas.its.tudelft.nl)
Date: Fri May 30 2003 - 11:05:01 BST


Hi list,

While monitoring the logs of our Xapian/Omega-based search engine I
noticed a few odd things in the number of results. We use the Dutch
Snowball stemmer.

What alerted me was that a query for 'dane-elec' (which is split into
'dane' and 'elec') got stemmed to 'dan' and 'elec' and, run as an
OR-query, returned over 400k results.
This is a bit odd, since 'dan' is a stopword (you can find it on your
own stoplist: http://snowball.tartarus.org/dutch/stop.txt; it means both
'than' and 'then'), so stemming to it ruins the query results.
Luckily the above term is translated into a phrase search.

Another, less important, example is 'ene' (de ene -> the one,
die ene -> that one), which gets stemmed to 'en' -> a/an.

I noticed some other odd stuff. As you might know, we Dutch people put
many different meanings into single or similar words. That
behaviour results in pretty bad transformations every now and then:

Helen -> to heal, to buy stolen stuff from the thief
Ik heel -> I heal, I'm buying stolen stuff.
'Heel' can also mean 'whole' or 'all'.
De hele appel -> the whole apple
Heel de tijd -> all the time

Anyway, those words are all stemmed to 'hel', which is Dutch for 'hell'.
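
To make this kind of conflation visible, one could group the words of a
vocabulary by their stem and list every stem shared by more than one
surface form. The following is only a rough sketch: toy_stem is a
hypothetical stand-in that reproduces the stems reported above, not the
real Dutch Snowball stemmer, and conflation_groups is a name made up for
this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for the Dutch stemmer: it only knows the stems
# reported in this message and returns every other word unchanged.
REPORTED_STEMS = {"heel": "hel", "hele": "hel", "helen": "hel"}


def toy_stem(word: str) -> str:
    return REPORTED_STEMS.get(word, word)


def conflation_groups(
    words: Iterable[str], stem: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Map each stem to the sorted surface forms that share it.

    Stems reached by only one surface form are left out, so the result
    contains only the cases where different words collapse together.
    """
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {s: sorted(forms) for s, forms in groups.items() if len(forms) > 1}


vocabulary = ["heel", "hele", "helen", "hel", "appel"]
groups = conflation_groups(vocabulary, toy_stem)
print(groups)  # {'hel': ['heel', 'hel', 'hele', 'helen']}
```

Run over a real index vocabulary with the real stemmer plugged in for
toy_stem, such a listing would show which unrelated words end up sharing
one term.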

For the second problem, I don't think there is any solution, unless you
manage to write a piece of software that can really understand a
language and thereby remove stopwords (in context, of course) and
stem the words correctly where necessary.

For the first problem, I think the obvious correction is not to stem
a word when the result is a stopword. I just don't know if it's a
correct solution ;)
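
The idea for the first problem could be sketched as a small wrapper
around the stemmer: stem the word, and if the result is on the stoplist,
keep the original word instead. This is only an illustration under
assumptions: toy_stem is a hypothetical stand-in that reproduces the
stems reported in this message, STOPWORDS holds just the two stopwords
mentioned here, and stopword_safe is a made-up name, not part of Snowball
or Xapian.

```python
from typing import Callable, Iterable

# Hypothetical stand-in for the Dutch stemmer: it only knows the stems
# reported in this message and returns every other word unchanged.
REPORTED_STEMS = {"dane": "dan", "ene": "en", "helen": "hel"}

# Only the two stopwords mentioned in this message; the real list is at
# http://snowball.tartarus.org/dutch/stop.txt
STOPWORDS = {"dan", "en"}


def toy_stem(word: str) -> str:
    return REPORTED_STEMS.get(word, word)


def stopword_safe(
    stem: Callable[[str], str], stopwords: Iterable[str]
) -> Callable[[str], str]:
    """Wrap a stemmer so that it never turns a non-stopword into a stopword."""
    stopset = frozenset(stopwords)

    def safe_stem(word: str) -> str:
        if word in stopset:
            # Already a stopword: leave it for the caller's stopword filter.
            return word
        stemmed = stem(word)
        # Fall back to the unstemmed word if stemming lands on a stopword.
        return word if stemmed in stopset else stemmed

    return safe_stem


safe_stem = stopword_safe(toy_stem, STOPWORDS)
terms = [safe_stem(t) for t in "dane-elec".split("-")]
print(terms)  # ['dane', 'elec']
```

One consequence of this approach is that 'dane' is then indexed and
searched unstemmed, so the same wrapper would have to be used at index
time and at query time for the terms to match.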

Regards,

Arjen

> Olly Betts wrote:
>
> > > > msi nforce2 dane-elec
> > > > > dane should perhaps not be stemmed to 'dan' which is a
> > > > > stopword in dutch
> > >
> > > Is this with the Dutch stemmer?
>
> > I certainly hope so, I replace all the "english" with "dutch"
> > before compiling.
>
> I've just made that a little easier, pending making it
> properly configurable - Omega's query.cc now has "english"
> once, rather than 4 times!
>
> > > What do "dan" and "dane" mean? Is the issue that "dane" has
> > > multiple meanings?
>
> > No, 'dan' is both the english 'then' and 'than' (more or less a
> > stopword), 'dane' is from dane-elec:
> > http://www.dane-elec.fr/index_en.htm which is a pretty large
> > memory-module seller, at least they are in Holland :)
> >
> > But this might just be an unfortunate coincidence, I haven't
> > run into that many stemming-problems.
>
> It's probably worth reporting the issue to the snowball list
> (our stemmers come from the snowball project):
>
> snowball-discuss@lists.tartarus.org
>
> Martin Porter may be able to offer more insight.


