[Snowball-discuss] RE: Snowball-discuss digest, Vol 1 #75 - 1 msg

From: Sven Neumann (webmaster@yellowpagesmalta.com)
Date: Tue Jun 24 2003 - 09:54:01 BST


Dear Lemma,

When I started doing research into stemming (and eventually found the
Porter stemmer), I had a good run at CiteSeer.com searching for documents
mentioning and citing the Porter stemmer. Just about any modern paper on
stemming will mention it. I would urge you to start reading the papers
on existing research, so you get a solid feel for the methodologies and
techniques.

From your message, you seem to mention both a "suffix file", which
sounds like look-up based stemming, and context sensitivity. I believe
the rule-based setup of the Porter stemmer may well be one of the most
effective for a variety of languages. I can only agree with Martin that
you should try to code your suffixes and context-sensitive rules in
Snowball, although personally I found the Snowball syntax a bit clunky
and perhaps hard to get started with. It will be well worth it. Try
setting out the rules in pseudocode first. Also try to identify what
your contextual rules have in common, and minimise the information you
need about the context to a few things like the measure, the word
length, and the preceding vowel or consonant, so you don't have to
hand-code long, complex checks for each rule.
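
To illustrate the point about keeping the context small, here is a
minimal sketch in C that computes those three pieces of information once
for a candidate suffix, so that each rule only has to compare numbers.
It assumes a plain a/e/i/o/u vowel set (Porter's English stemmer also
treats "y" specially, and another language will need its own test). The
names context and context_of are illustrative only and do not come from
Porter's code or from Snowball.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: only a, e, i, o, u count as vowels. */
static int is_vowel(char c)
{
    return c != '\0' && strchr("aeiou", c) != NULL;
}

/* Porter-style measure of the first len characters of word: the number
   of vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]. */
int measure(const char *word, size_t len)
{
    int m = 0;
    int prev_vowel = 0;
    size_t i;

    assert(word != NULL);
    for (i = 0; i < len; i++) {
        int v = is_vowel(word[i]);
        if (prev_vowel && !v)
            m++;                /* one VC sequence completed */
        prev_vowel = v;
    }
    return m;
}

/* The context a rule is allowed to look at (hypothetical layout). */
struct context {
    int    m;            /* measure of the candidate stem       */
    size_t stem_len;     /* length of the candidate stem        */
    int    after_vowel;  /* 1 if the candidate stem ends in a vowel */
};

/* Describe what would remain if the last suffix_len characters of
   word were removed. */
struct context context_of(const char *word, size_t suffix_len)
{
    struct context cx;
    size_t len = strlen(word);

    assert(suffix_len <= len);
    cx.stem_len = len - suffix_len;
    cx.m = measure(word, cx.stem_len);
    cx.after_vowel = cx.stem_len > 0 && is_vowel(word[cx.stem_len - 1]);
    return cx;
}
```

For "hopping" with the suffix "ing", this gives a stem length of 4, a
measure of 1, and a stem that ends in a consonant. A rule can then be
written as a suffix plus a small test on these fields.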

I wish you the best of luck with your stemmer. I cannot further help you
with implementation (which is my stronger side) as I don't have the
rules, but I strongly suggest you try to use the snowball syntax as your
"rule-design" language.

Best Regards,

Sven Neumann
---------------------------------
webmaster@goldenpagesmalta.com
http://www.GoldenPagesMalta.com
---------------------------------

-----Original Message-----
From: snowball-discuss-admin@lists.tartarus.org
[mailto:snowball-discuss-admin@lists.tartarus.org] On Behalf Of
snowball-discuss-request@lists.tartarus.org
Sent: 24 June 2003 08:15
To: snowball-discuss@lists.tartarus.org
Subject: Snowball-discuss digest, Vol 1 #75 - 1 msg

Today's Topics:

   1. Re: Reques of Advise (Martin Porter)

--__--__--

Message: 1
To: lemma lessa <lemmalessa1974@yahoo.com>
From: martin_porter@softhome.net (Martin Porter)
Cc: snowball-discuss@lists.tartarus.org
Date: Mon, 23 Jun 2003 02:45:13 -0600
Subject: [Snowball-discuss] Re: Reques of Advise

Lemma Lessa,

I am glad to hear you are making progress.

The Porter stemmer, as written, takes its input from the list of files
on the command line, and sends its output to stdout. These ideas will be
familiar to you if you are using Unix or Linux or Windows NT, but you
may need to adapt the program slightly for other operating systems. What
computer are you developing your work on?

So if you compile the program to a module called STEM, you can run it by
typing

    STEM source.txt >stemmed.txt

(>stemmed.txt redirects stdout to the file stemmed.txt)
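
As a rough illustration of the behaviour described above (this is a
sketch, not the code of the distributed Porter stemmer), the following C
functions open each file named on the command line and write the result
to an output stream. stem_word is a placeholder that only lowercases,
and the names stem_stream and stem_files are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define MAX_WORD 128

/* Placeholder for the real stemming step (assumption): lowercases only. */
static void stem_word(char *word)
{
    for (; *word != '\0'; word++)
        *word = (char)tolower((unsigned char)*word);
}

/* Copy in to out, passing each run of letters through stem_word(). */
static void stem_stream(FILE *in, FILE *out)
{
    char word[MAX_WORD];
    size_t n = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        if (isalpha(c)) {
            if (n < MAX_WORD - 1)
                word[n++] = (char)c;    /* overlong words are truncated */
        } else {
            if (n > 0) {
                word[n] = '\0';
                stem_word(word);
                fputs(word, out);
                n = 0;
            }
            putc(c, out);               /* keep spacing and punctuation */
        }
    }
    if (n > 0) {                        /* word at end of file */
        word[n] = '\0';
        stem_word(word);
        fputs(word, out);
    }
}

/* Process every named file; returns how many could not be opened. */
int stem_files(int count, char **names, FILE *out)
{
    int failures = 0;
    int i;

    assert(out != NULL);
    for (i = 0; i < count; i++) {
        FILE *in = fopen(names[i], "r");
        if (in == NULL) {
            perror(names[i]);           /* e.g. "No such file or directory" */
            failures++;
            continue;
        }
        stem_stream(in, out);
        fclose(in);
    }
    return failures;
}
```

A main() for this sketch would simply call
stem_files(argc - 1, argv + 1, stdout). Note that a name without a
directory part, such as source.txt, is looked up in the directory the
program is run from, so the file needs to be there or be given with its
full path.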

Reading through your note, it seems that the problem you are having is
encoding the stemming algorithm proper, which, although you describe it
as non-functional, is in fact a set of rules, very much like the Porter
stemmer, from which the algorithm might be coded up in a functional way.
The ANSI C version of the Porter stemmer might be a useful model to
follow, but that is only one of several approaches.
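
For illustration only, here is one way the flow described in the quoted
message below (skip stopwords, then strip suffixes iteratively, checking
a condition on the remaining stem) might be coded in C. The word lists
and the minimum stem length are hypothetical stand-ins: the actual
suffixes, stopwords and conditions are not reproduced in this post, and
in practice the lists would be loaded from the suffix and stopword
files.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical data, for illustration only. */
static const char *const stopwords[] = { "the", "and", "of" };
static const char *const suffixes[]  = { "ing", "ness", "ed", "s" };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))
#define MIN_STEM 3  /* assumed condition: keep at least 3 letters */

static int is_stopword(const char *word)
{
    size_t i;

    for (i = 0; i < COUNT(stopwords); i++)
        if (strcmp(word, stopwords[i]) == 0)
            return 1;
    return 0;
}

/* Remove the longest suffix that matches and leaves a long enough stem.
   Returns 1 if something was removed, 0 otherwise. */
static int strip_one(char *word)
{
    size_t len = strlen(word);
    size_t best = 0;
    size_t i;

    for (i = 0; i < COUNT(suffixes); i++) {
        size_t n = strlen(suffixes[i]);
        if (n > best && len >= n + MIN_STEM &&
            strcmp(word + len - n, suffixes[i]) == 0)
            best = n;
    }
    word[len - best] = '\0';
    return best > 0;
}

/* Stem word in place. Returns 0 for a stopword (left untouched),
   1 otherwise. */
int stem(char *word)
{
    assert(word != NULL);
    if (is_stopword(word))
        return 0;
    while (strip_one(word))
        ;                   /* iterate until no rule applies */
    return 1;
}
```

With these lists, "singings" is reduced in two passes to "sing", and the
second "ing" is kept because removing it would leave fewer than three
letters. Context-sensitive conditions would replace the simple length
test in strip_one.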

The impression I have is that you are a bit short of technical
assistance in this work, and it would be a pity if it got stuck when you
have put so much effort into the actual algorithm. Have you seen the
stemming algorithm rules at snowball.tartarus.org? For example, see the
pages

http://snowball.tartarus.org/french/stemmer.html
http://snowball.tartarus.org/russian/stemmer.html

If you could provide the algorithm in this exact form, plus a sample
vocabulary, I might be able to help you by developing a Snowball
stemmer, and could then send you an ANSI C module that did what you
required. You should however talk this idea over with your research
supervisor.

I hope you will not mind if I post this email on snowball-discuss -
other replies can often be useful. I have edited out the sections that
describe your stemming rules, since this is your own research work.

 
At 23:19 22/06/2003 -0700, lemma lessa wrote:
>
>Dear sir,
>
>I am a student at Addis Ababa University, Ethiopia. I am doing research
>on developing a stemming algorithm for one of the local languages. My
>stemmer follows an iterative approach. I used ANSI C to code my
>algorithm. I have decided to adopt the Porter stemmer, but I faced the
>problems described below and came to you hoping that you will help me.
>I would like to present my questions as follows:
>
>I have found the Porter algorithm in an ANSI C version, but when I run
>it, it responds 'File Not Found'. Assuming that the name of the file to
>be stemmed is "source.txt", where shall I save this file so that the
>program can find it?
>
>After stemming the file, how does it save the stem dictionary? (Under
>what name, and where?)
>
>How my stemmer works: My stemmer reads information from three files: a
>suffix file, a stopword file and a source file. It has three modules:
>one takes care of all matters of the suffix file, the second deals with
>stopwords, and the third deals with the actual suffix stripping task.
>It first reads an unstemmed word from the source file; then it reads
>entries from the stopword file and compares them with the word read
>from the source file. If the word exists in the stopword list, the
>program reads the next word from the source file. Otherwise, the suffix
>file is opened and suffixes are stripped, if any. Finally, conditions
>are checked against the resulting stem and the necessary action is
>taken, if applicable. (Please see the conditions/actions below.)
>
>Problem faced: The third module (the one that deals with the suffix
>stripping) is not functional. Please help me as to how to adapt the
>Porter stemmer based on the conditions/actions given below. Assume the
>name of the suffix file is "Suffix.txt" and the stopword file is
>"stopword.txt".
>
>
>
>The only conditions/rules considered by the stemmer
>
>The stemmer developed for stemming Wolaytta text is a context-sensitive
>one. This decision was made mainly to get better performance from the
>stemmer. Two context-sensitive actions are employed in the stemmer.
>These are:
>

[Lemma Lessa's stemming rules follow]

>Please, when this message arrives at your desk, let me know that it
>has reached you.
>
>For your favorable reply, I remain.
>
>
>
>Yours,
>
>Lemma

--__--__--

End of Snowball-discuss Digest



This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:45 BST