Hi everyone, thank you for your replies! The way I would like to use the
stemmer is as an additional tool along with an inflection dictionary, to get
base forms of words unknown to the dictionary. I am aware of the difficulty
of building a suffix stripper for Polish: besides changes in the suffix we
also have, for instance, changes in the root, so I imagine the rule set
would be complex. I think I will give it a try one day. I don't expect it
to get even close to decent effectiveness, but I'm hoping it would be good
enough to reduce the problem of multiple forms of unknown words in the
collection index. I noticed the Stempelator stemmer has problems with such
words, so I wonder whether a simpler suffix stripper would suffice.
Thanks for your comments!
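
To make the idea concrete, here is a minimal sketch of the lookup order I
have in mind: dictionary first, suffix stripper only for unknown words. The
dictionary entries, the suffix list and the minimum stem length are all
made-up assumptions for illustration, not a real Polish rule set and not
how Stempelator or Snowball work.

```python
# Toy inflection dictionary: surface form -> base form.
# Hypothetical entries, for illustration only.
DICTIONARY = {
    "psa": "pies",
    "psem": "pies",
    "kotem": "kot",
}

# Hypothetical suffix list, ordered longest first so that "ami" is tried
# before "a". This is NOT a real rule set for Polish.
SUFFIXES = ("ami", "ach", "owi", "em", "om", "ów", "a", "u", "y")

# Assumed guard: never strip a word down to fewer than this many characters.
MIN_STEM_LENGTH = 3


def strip_suffix(word):
    """Remove the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def base_form(word):
    """Use the dictionary when possible, otherwise fall back to stripping."""
    word = word.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    return strip_suffix(word)


if __name__ == "__main__":
    for form in ("psa", "komputerami", "komputerach", "komputerem", "psu"):
        print(form, "->", base_form(form))
```

With these toy rules the unknown forms "komputerami", "komputerach" and
"komputerem" all collapse to one index term, which is the effect I am
after. A form like "psu", if missing from the dictionary, stays unchanged,
since a root change such as pies/psa is out of reach of plain suffix
stripping.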
On 8/29/07, Dawid Weiss <email@example.com> wrote:
> Ok, maybe that was a bit of an overstatement -- I don't think Polish is
> more complex compared to Russian (don't know about Finnish). It's just my
> feeling that rule-based stemmers don't work too well for Polish (quite
> combinations at the morphology level). Now, having said that, the
> stemmer I mentioned is built using inflected-form-generation rules (from
> forms), so it should be possible to reuse this knowledge somehow if one
> were to create a Snowball stemmer. If you're willing to undertake such an
> effort,
> Agnieszka, don't let anyone discourage you (and in particular don't let me
> discourage you).
> I would actually be very curious about the level of quality such a stemmer
> would achieve (manually constructed rules). I know for a fact a number of
> people would benefit from it.
> Martin Porter wrote:
> > On Wed, 2007-08-29 at 08:16 +0200, Dawid Weiss wrote:
> >> Hi Agnieszka,
> >> (I am not a snowball developer, but...) It won't be easy to handle the
> >> complexity of Polish in a set of Snowball rules.
> > Dawid,
> > Do you have any strong evidence for that? I would not have thought
> > Polish was more complex than Finnish, or Russian for example.
> > Martin