In 2006 we received, in swift succession, two stemmers for Romanian
written in Snowball.
Here is the original correspondence:
FROM Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
TO snowball-discuss@lists.tartarus.org
ON Wed Jun 07 08:56:44 2006
[Snowball-discuss] romanian stemmer
Hello everyone,
my name is Erwin Glockner, I'm a student of computational linguistics in
Heidelberg, Germany. Together with my fellow students Doina Gliga and
Marina Stegarescu we started to write a romanian stemmer in Snowball.
We planned to finish the stemmer until end of this month. We would be
happy if the stemmer would be accepted as part of the Snowball-distribution.
There is still some work to do, e.g. evaluating the stemmer, making a
stopwords-list, unicode support, etc. After finishing this we will send
you our stemmer with the corresponding files, but I couldn't find any
email adress to whom the stemmer should be sent to.
Could please someone tell me the address(es)?
With kind regards,
E. Glockner, D. Gliga, M. Stegarescu.
FROM Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
TO richard@lemurconsulting.com
TO martin.porter@grapeshot.co.uk
ON Tue Jul 18 19:43:39 2006
romanian stemmer
Dear Mr. Porter, dear Mr. Boulton,
we finally finished the Romanian stemmer. Unfortunately evaluation took
more time than expected.
However, it was an interesting experience creating the stemmer, and we
are happy to send you the result of our work.
The attachment-file is a Tarball-zipped file with (hopefully) all files
needed. The files and the stemmer as well are encoded in UTF-8. Please
inform us if something is missing.
We would be happy if the Romanian stemmer would be accepted and
integrated into the official Snowball distribution. We agree of course
to license the stemmer under the same terms as the existing snowball
software.
We're looking forward to hear from you soon.
With kind regards,
Marina, Doina and Erwin.
Attachment: [romanian1.tgz]
FROM Irina Tirdea <irina.tirdea@gmail.com>
TO snowball-discuss@lists.tartarus.org
TO richard@lemurconsulting.com
TO martin.porter@grapeshot.co.uk
ON Mon Jul 31 10:19:51 2006
Romanian stemmer
Hello,
My name is Irina Tirdea and I have developed a Romanian stemmer in Snowball
as part of my bachelor thesis, in Bucharest, Romania. I am sending you the
code attached (with vocabulary and stop word list files) and I hope you will
accept and integrate it as a part of the Snowball project. I am ready to
release the stemmer under the BSD license, just as the Snowball software.
The files have been written in UTF-8 encoding (on a Linux system).
Looking forward to hear from you.
Kind regards,
Irina Tirdea
Attachment: [romanian2.tgz]
FROM martin.porter@grapeshot.co.uk (Martin Porter)
TO snowball-discuss@lists.tartarus.org
ON Mon Jul 31 10:43:00 2006
COPIED TO atordai@science.uva.nl, eglockne@ix.urz.uni-heidelberg.de,
irina.tirdea@gmail.com
Tardy response to submissions to Snowball
I am sending this general email as a kind of apology, for having done nothing
so far on the following generously sent Snowball submissions:
7 June, from E. Glockner: a Romanian stemmer
8 June, from A. Tordai: a Hungarian stemmer
and this morning another Romanian stemmer arrived,
31 July, from I. Tirdea, a Romanian stemmer
After the first submission I promised to look at it "next week", so Mr Glockner
has probably been wondering what has happened. [. . .] I will make a point of
looking at these submissions this week.
More soon,
Martin
FROM martin.porter@grapeshot.co.uk (Martin Porter)
TO snowball-discuss@lists.tartarus.org
ON Wed Sep 06 12:39:13 2006
COPIED TO irina.tirdea@gmail.com, eglockne@ix.urz.uni-heidelberg.de,
mstegare@hotmail.com, doina_gliga@yahoo.co.uk, eglockner@hotmail.com
Romanian stemmer
To the originators of the Romanian stemmers,
I have now found time to do some preliminary work on the Romanian stemmer. I
should explain that part of the complication has been the receipt, no more
than ten days apart, of two Romanian stemmers in Snowball, the first
(romanian1) from [Glockner, Gliga, and Stegarescu] in Heidelberg, the second
(romanian2) from Tirdea in Bucharest.
[. . . .]
I have put together a vocabulary by combining the vocabularies provided with
romanian1 and romanian2. This appears in column 1. Column 2 is the stemmed
form produced by romanian1, and column 3 the stemmed form produced by
romanian2. If the entry in column 3 is blank, both stemmers are producing the
same result.
You might care to compare the two approaches.
My own feeling is that romanian1 does a more thorough job of ending removal,
but unlike romanian2 has a habit of discarding too much from short words.
aberant->ab, abatere->ab, aburi->ab are examples of this. In romanian1 the R2
test is rarely used (it seems to me that 'R1 or R2' is equivalent to 'R1',
since p2 is never to the left of p1.)
I might have a go at making some modifications here. Needless to say, I am
not familiar with Romanian, but the similarity to the other Romance
languages, especially Italian, enables one to grasp the essential features of
the morphology.
What we would like to do is to have a single stemmer for release from the
snowball site, if that is possible, and giving all necessary credits, along
the lines of the recent addition,
http://snowball.tartarus.org/algorithms/hungarian/stemmer.html
Hope to hear from you,
Martin Porter
In the end, we decided to produce our own Romanian stemmer, as described on
the Romanian stemmer page. Both submitted stemmers include stop word lists,
which are available inside the tarballs.
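The remark in the last message that 'R1 or R2' is equivalent to 'R1' follows from the usual Snowball definition of the regions: R2 is computed inside R1, so p2 can never lie to the left of p1. The Python sketch below illustrates this. It is not taken from either submitted stemmer; the vowel set and the helper names (region_start, r1_r2, in_region) are assumptions made for illustration only.

```python
# Illustrative sketch only: the vowel set is an assumption, and none of
# these helpers come from romanian1 or romanian2.
VOWELS = set("aăâeiîou")


def region_start(word, start=0):
    """Index just after the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return (p1, p2), the start offsets of R1 and R2."""
    p1 = region_start(word)
    p2 = region_start(word, p1)  # R2 is found by searching inside R1
    return p1, p2


def in_region(word, suffix, p):
    """True if `suffix` ends `word` and lies wholly at or after offset p."""
    return word.endswith(suffix) and len(word) - len(suffix) >= p


for word in ["aberant", "abatere", "aburi"]:
    p1, p2 = r1_r2(word)
    print(word, "R1 =", repr(word[p1:]), "R2 =", repr(word[p2:]))
```

Because p2 >= p1 for every word, any suffix that passes the R2 test also passes the R1 test, so a condition of the form "in R1 or in R2" accepts exactly the same suffixes as "in R1" alone. For the three short words quoted in the message, this sketch gives p1 = 2, so almost the whole word lies in R1, which is consistent with the observation that too much is removed from short words when R1 is the only guard.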