[Snowball-discuss] problems with Finnish

From: Alex Murzaku (lists@lissus.com)
Date: Sat Sep 21 2002 - 00:10:01 BST


Hello!

I am trying to integrate the Java Snowball stemmers into Lucene (a Java
search engine library). All the languages except Finnish and Russian
are working perfectly. I know that I have character set issues with
Russian, which I will eventually resolve, but the problem with Finnish
seems to be more subtle: it works for most words but fails on some of
them. The first word that fails in the unit tests is aarteeseen. This
is what JUnit reports:
    expected:<aart> but was:<aartees>

The output file contains other differences from the reference output
file. This doesn't happen with the C executable, so the problem seems
to show up only in Java. Has anyone had similar experiences? I don't
know any Finnish besides the fact that it is an agglutinative language,
but as far as I can tell the correct result is that all of the
following words should stem to "aart", which most of them do:

aarteeksi
aarteen
aarteena
aarteensa
aarteeseen
aarteet
aarteiden
aarteilla
aarteillaan
aarteineen
aarteisiin
aarteista
aarteita
aarteitaan
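
For anyone who wants to reproduce the comparison, here is a minimal
sketch of the kind of check the unit test performs. The class name
StemCheck, the method mismatches and the placeholder stemmer are made up
for illustration; the placeholder only mimics the failure reported above
and is not the Snowball API. A real Finnish stemmer would be plugged in
where the placeholder is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class StemCheck {

    /**
     * Stems every word and returns one JUnit-style message per word
     * whose stem differs from the expected stem.
     */
    public static List<String> mismatches(List<String> words,
                                          String expected,
                                          UnaryOperator<String> stemmer) {
        List<String> failures = new ArrayList<>();
        for (String word : words) {
            String actual = stemmer.apply(word);
            if (!expected.equals(actual)) {
                failures.add(word + ": expected:<" + expected
                        + "> but was:<" + actual + ">");
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "aarteeksi", "aarteen", "aarteena", "aarteensa",
                "aarteeseen", "aarteet", "aarteiden", "aarteilla",
                "aarteillaan", "aarteineen", "aarteisiin", "aarteista",
                "aarteita", "aarteitaan");

        // Placeholder (assumption): stands in for the real stemmer and
        // only reproduces the single failure quoted in this message.
        UnaryOperator<String> placeholder =
                w -> w.equals("aarteeseen") ? "aartees" : "aart";

        for (String failure : mismatches(words, "aart", placeholder)) {
            System.out.println(failure);
        }
    }
}
```

Running the same word list through both the Java and the C stemmers and
comparing the two outputs should narrow down which words diverge.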

-- 
Alex Murzaku
_______________________________________________________
 LISSUS llc  alex at lissus.com  http://www.lissus.com            


