Two Romanian stemmers

Links to resources

Snowball main page
romanian1.tgz, from Glockner and others
romanian2.tgz, from Tirdea


In 2006 we received, in swift succession, two stemmers for Romanian written in Snowball. Here is the original correspondence:

FROM Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
TO snowball-discuss@lists.tartarus.org
ON Wed Jun 07 08:56:44 2006

[Snowball-discuss] romanian stemmer

 Hello everyone,

 my name is Erwin Glockner, I'm a student of computational linguistics in
 Heidelberg, Germany. Together with my fellow students Doina Gliga and
 Marina Stegarescu we started to write a romanian stemmer in Snowball.
 We planned to finish the stemmer until end of this month. We would be
 happy if the stemmer would be accepted as part of the Snowball-distribution.
 There is still some work to do, e.g. evaluating the stemmer, making a
 stopwords-list, unicode support, etc. After finishing this we will send
 you our stemmer with the corresponding files, but I couldn't find any
 email adress to whom the stemmer should be sent to.
 Could please someone tell me the address(es)?

 With kind regards,
 E. Glockner, D. Gliga, M. Stegarescu.



FROM Erwin Glockner <eglockne@ix.urz.uni-heidelberg.de>
TO richard@lemurconsulting.com
TO martin.porter@grapeshot.co.uk
ON Tue Jul 18 19:43:39 2006

romanian stemmer

 Dear Mr. Porter, dear Mr. Boulton,

 we finally finished the Romanian stemmer. Unfortunately evaluation took
 more time than expected.
 However, it was an interesting experience creating the stemmer, and we
 are happy to send you the result of our work.
 The attachment-file is a Tarball-zipped file with (hopefully) all files
 needed. The files and the stemmer as well are encoded in UTF-8. Please
 inform us if something is missing.

 We would be happy if the Romanian stemmer would be accepted and
 integrated into the official Snowball distribution. We agree of course
 to license the stemmer under the same terms as the existing snowball
 software.

 We're looking forward to hear from you soon.


 With kind regards,

 Marina, Doina and Erwin.

 Attachment: [romanian1.tgz]


FROM Irina Tirdea <irina.tirdea@gmail.com>
TO snowball-discuss@lists.tartarus.org
TO richard@lemurconsulting.com
TO martin.porter@grapeshot.co.uk
ON Mon Jul 31 10:19:51 2006

Romanian stemmer

 Hello,

 My name is Irina Tirdea and I have developed a Romanian stemmer in Snowball
 as part of my bachelor thesis, in Bucharest, Romania. I am sending you the
 code attached (with vocabulary and stop word list files) and I hope you will
 accept and integrate it as a part of the Snowball project. I am ready to
 release the stemmer under the BSD license, just as the Snowball software.
 The files have been written in UTF-8 encoding (on a Linux system).

 Looking forward to hear from you.

 Kind regards,
 Irina Tirdea

 Attachment: [romanian2.tgz]


FROM martin.porter@grapeshot.co.uk (Martin Porter)
TO snowball-discuss@lists.tartarus.org
ON Mon Jul 31 10:43:00 2006
COPIED TO atordai@science.uva.nl, eglockne@ix.urz.uni-heidelberg.de, irina.tirdea@gmail.com

Tardy response to submissions to Snowball


 I am sending this general email as a kind of apology, for having done nothing
 so far on the following generously sent Snowball submissions:

 7 June, from E. Glockner: a Romanian stemmer
 8 June, from A. Tordai: a Hungarian stemmer

 and this morning another Romanian stemmer arrived,

 31 July, from I. Tirdea, a Romanian stemmer

 After the first submission I promised to look at it "next week", so Mr Glockner
 has probably been wondering what has happened. [. . .] I will make a point of
 looking at these submissions this week,

 More soon,

 Martin



FROM martin.porter@grapeshot.co.uk (Martin Porter)
TO snowball-discuss@lists.tartarus.org
ON Wed Sep 06 12:39:13 2006
COPIED TO irina.tirdea@gmail.com, eglockne@ix.urz.uni-heidelberg.de, mstegare@hotmail.com, doina_gliga@yahoo.co.uk, eglockner@hotmail.com

Romanian stemmer


 To the originators of the Romanian stemmers,

 I have now found time to do some preliminary work on the Romanian stemmer. I
 should explain that part of the complication has been the receipt, no more
 than two weeks apart, of two Romanian stemmers in Snowball, the first
 (romanian1) from [Glockner, Gliga, and Stegarescu] in Heidelberg, the second
 (romanian2) from Tirdea in Bucharest.

 [. . . .]

 I have put together a vocabulary by combining the vocabularies provided with
 romanian1 and romanian2. This appears in column 1. Column 2 is the stemmed
 form produced by romanian1, and column 3 the stemmed form produced by
 romanian2. If the entry in column 3 is blank, both stemmers are producing the
 same result.

 You might care to compare the two approaches.

 My own feeling is that romanian1 does a more thorough job of ending removal,
 but unlike romanian2 has a habit of discarding too much from short words.
 aberant->ab, abatere->ab, aburi->ab are examples of this. In romanian1 the R2
 test is rarely used (it seems to me that 'R1 or R2' is equivalent to 'R1',
 since p2 is never to the left of p1.)

 I might have a go at making some modifications here. Needless to say, I am
 not familiar with Romanian, but the similarity to the other Romance
 languages, especially Italian, enables one to grasp the essential features of
 the morphology.

 What we would like to do is to have a single stemmer for release from the
 snowball site, if that is possible, and giving all necessary credits, along
 the lines of the recent addition,

 http://snowball.tartarus.org/algorithms/hungarian/stemmer.html

 Hope to hear from you,

 Martin Porter
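
The three-column comparison described in the message above can be sketched in a few lines of Python. This is only an illustration of the layout, not the script that was actually used: the function names are hypothetical, and the two toy stemmers are stand-ins that have nothing to do with the real romanian1 and romanian2 algorithms.

```python
def compare_stemmers(vocabulary, stem1, stem2):
    """Build rows of (word, stem1 output, stem2 output).

    Column 3 is left blank when both stemmers agree, as in the
    comparison described in the correspondence.
    """
    rows = []
    for word in vocabulary:
        s1, s2 = stem1(word), stem2(word)
        rows.append((word, s1, "" if s1 == s2 else s2))
    return rows


def format_rows(rows, width=14):
    """Render the rows as fixed-width text columns."""
    return "\n".join(
        f"{word:<{width}}{s1:<{width}}{s2}".rstrip() for word, s1, s2 in rows
    )


# Toy stand-ins (assumptions for illustration only, NOT the submitted stemmers).
def toy_stem_a(word):
    # Over-aggressive: keeps only the first two letters.
    return word[:2]


def toy_stem_b(word):
    # Conservative: drops a single final vowel, if there is one.
    return word[:-1] if word[-1] in "aeiou" else word


if __name__ == "__main__":
    print(format_rows(compare_stemmers(["ab", "abatere"], toy_stem_a, toy_stem_b)))
```

Leaving column 3 blank makes the disagreements easy to spot when scanning a long vocabulary.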

In the end we decided to produce our own Romanian stemmer, as described on the Romanian stemmer page. Both submitted stemmers include stop word lists, which are available inside the tarballs.
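
The remark in the last message, that 'R1 or R2' is equivalent to 'R1', follows from the usual Snowball definitions: R1 starts after the first non-vowel that follows a vowel, and R2 is found by applying the same rule again inside R1, so p2 can never lie to the left of p1. The Python sketch below illustrates this. The vowel set and the helper names are assumptions made for the example and are not taken from either submitted stemmer.

```python
# Assumed Romanian vowel set, for illustration only.
VOWELS = set("aăâeiîou")


def mark_after_vowel_consonant(word, start=0):
    """Index just after the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return (p1, p2), the start positions of R1 and R2."""
    p1 = mark_after_vowel_consonant(word)
    p2 = mark_after_vowel_consonant(word, p1)  # same rule, applied inside R1
    return p1, p2


def suffix_in_region(word, suffix, region_start):
    """True if `suffix` ends the word and lies wholly inside the region."""
    return word.endswith(suffix) and len(word) - len(suffix) >= region_start


if __name__ == "__main__":
    for word in ("aberant", "abatere", "aburi"):
        p1, p2 = r1_r2(word)
        print(word, "R1 =", repr(word[p1:]), "R2 =", repr(word[p2:]))
```

Because p1 <= p2 always holds, any suffix that lies in R2 also lies in R1, so a test of the form 'R1 or R2' accepts exactly the suffixes that 'R1' alone accepts.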