[Snowball-discuss] Turkish Stemmer

From: Evren Kapusuz (evren.kapusuz@gmail.com)
Date: Wed Jan 17 2007 - 14:11:40 GMT


Hello,
I am working on a project for indexing and searching documents written in
Turkish. I couldn't find a stemmer for the Turkish language, so I decided to
develop one. Because of the agglutinative nature of the language, developing
a stemmer for Turkish isn't an easy task. I found Snowball very useful for
developing stemmers for languages with complex morphological structure. I was
able to learn the features of the Snowball language in a very short time and
develop a stemmer for Turkish.
I'd like to contribute it to Snowball.
In the attachments, you can find:
1. The Snowball program I developed for stemming Turkish words
   (TurkishStemmer.sbl)
2. The Java code generated by the Snowball compiler (TurkishStemmer.java)
3. Unit test code for testing TurkishStemmer.stem() (TestTurkishStemmer.java,
   runnable in the Lucene framework)
4. A paper describing how I developed the stemmer (StemmingTurkishWords.doc)
5. An output file showing the stemmer's output for some common cases in
   Turkish (test.out)

P.S. Thanks to Gulsen Eryigit and Esref Adali for their paper. The stemming
algorithm is based on "An Affix Stripping Morphological Analyzer for Turkish"
(Proceedings of the IASTED International Conference on Artificial Intelligence
and Applications, February 16-18, 2004, Innsbruck, Austria).

Best regards,
Evren Kapusuz Cilden
