[Snowball-discuss] Re: malay stemming

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Wed Aug 18 2004 - 08:15:36 BST


Iskandar,

I am posting your query and this reply to
snowball-discuss@lists.tartarus.org, just in case anyone else had something
useful to add.

I myself know nothing about Malay. Work has been done on stemming in Malay,
see for example the reference at

http://citeseer.ist.psu.edu/681191.html

I have not seen this paper, but you have to be warned that only rarely do
such reports describe the stemming process precisely enough for it to be
copied as an algorithm, coded in PHP or whatever. Even so, it might be
useful for you to contact the authors.

A stemming algorithm could be developed to remove prefixes and suffixes in
equal measure. http://snowball.tartarus.org has many examples of techniques
in stemming, but non-European languages are not covered. You should bear in
mind that developing a stemmer is a project in itself, and to do it as part
of a piece of work to build a search engine might be undertaking too much.
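
As a rough illustration of removing prefixes and suffixes together, here is a
minimal sketch in Python (it should port to PHP without difficulty). The affix
lists, the minimum stem length and the function names are all assumptions made
for this example; they cover only the five words quoted below and are not a
real inventory of Malay affixes, so treat this as a starting point rather than
a working Malay stemmer.

```python
# Illustrative affix lists taken only from the example words in this thread.
# They are assumptions, not a complete description of Malay morphology.
PREFIXES = ("ter", "pe", "di")
SUFFIXES = ("an",)
MIN_STEM = 4  # assumed guard so that "makan" is not cut down to "mak"


def strip_prefix(word):
    """Remove at most one listed prefix, keeping at least MIN_STEM letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            return word[len(prefix):]
    return word


def strip_suffix(word):
    """Remove at most one listed suffix, keeping at least MIN_STEM letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Strip a prefix first, then a suffix (the order is a design choice)."""
    return strip_suffix(strip_prefix(word.lower()))


WORDS = ["makanan", "pemakan", "dimakan", "pemakanan", "termakan"]
for w in WORDS:
    print(w, "->", stem(w))  # each of these words reduces to "makan"
```

The length guard is a crude stand-in for the dictionary of root words that a
serious stemmer would probably need: without it, "makan" itself would lose its
final "an".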

If you make progress, I would be most interested to hear how you get on,

Martin Porter

At 13:53 18/08/2004 +0800, Iskandar wrote:
>
>I am a student in the Faculty of Computer Science at Universiti Teknologi Malaysia. I need to build
a search engine for my final project, and I have a problem with stemming. If
my search engine used English there would be no problem, but for this
project I need to build a search engine for the Malay language.
>
>The use of affixes in English (and similar languages) is far less complex
than in a language such as Malay, where stripping suffixes alone would not
be sufficient for retrieval purposes. For example, the root of
the words makanan (food), pemakan (eater), dimakan (being eaten), pemakanan
(nutrition) and termakan (accidentally eaten) is makan (eat), but a simple
stemmer that removed only suffixes would yield makan, pemakan, dimakan,
pemakan and termakan respectively: four different stems instead of one.
>
>It is clear that it is not possible to stem Malay text effectively without
considering the removal of prefixes as well as suffixes.
>
>So can you give me an idea of how to strip prefixes in Malay? I am using
PHP to develop the search engine, with a MySQL database.
>
>Regards, and thank you.
>


