Re: [Snowball-discuss] Japanese stemmer?

From: Olly Betts (olly@survex.com)
Date: Wed Feb 07 2007 - 07:20:44 GMT


On Thu, Feb 01, 2007 at 01:49:10PM -0600, Micah Bly wrote:
> For example, if we start with the word:
> xxxx-sareta (it was xxxx-ed)
>
> We want to get down to the xxxx word
>
> Other things we might run into:
> xxxx-shita
> xxxx-saseta ([i] forced [him/it/her] to xxxx.)
> xxxx-sasemashita (same as above, but polite verb ending)
> xxxx-saserareta (I was forced to xxxx)
> xxxx-saseraremashita (same as above, but polite verb ending)
> xxxx-suru (I will xxxx)
> xxxx-site-iru (i am xxxx'ing)
> xxxx-site-ita (I was xxxx'ing)
> xxxx-sasete-iru (I am forcing [him] to xxxx)
> xxxx-saserarete-ita (I was being forced to xxxx)
> etc etc.
> plus
> xxxx-da, xxxx-desu: [it is a xxxx]
>
> Is it enough to simply put together a big list of possible verb
> endings, and remove them all?

One problem with this approach is that some words look like a verb form
but aren't. For example, in English "herring" looks like a verb form but
isn't, "comply" looks like an adverb but isn't, etc.

Another is that sometimes words have the same linguistic root but the
meanings are now sufficiently different that you don't want to conflate
them. An English example is "probe" and "probable".
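
To make the first problem concrete, here is a rough Python sketch (not
Snowball). The suffix list, the exception list, and the function names
are all made up for illustration; a real stemmer would need a far more
careful treatment.

```python
# Hypothetical endings and exception list, for illustration only.
SUFFIXES = ("ing", "ly")
PROTECTED = {"herring", "comply"}


def naive_strip(word):
    """Remove the first matching suffix, with no further checks."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def guarded_strip(word):
    """Like naive_strip, but leave known look-alike words untouched."""
    if word in PROTECTED:
        return word
    return naive_strip(word)


print(naive_strip("walking"))    # a genuine verb form
print(naive_strip("herring"))    # over-stemmed by the naive version
print(guarded_strip("herring"))  # left alone by the exception list
```

An exception list like this only helps with words you already know
about, so it reduces the problem rather than removing it.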

It's probably worth doing a search for existing academic papers on the
subject - there are a number of specialised search sites: Citebase,
CiteSeer, Google Scholar, etc. - and see if someone has already analysed
the situation.

> Is there a smart way to do something like that?

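In Snowball, you could do something like:

    // Backward mode: endings are matched from the end of the word.
    backwardmode (
        define remove as (
            // substring/among finds the longest matching ending.
            [substring] among (
                'sareta' 'shita' 'saseta' 'sasemashita' 'saserareta' (delete)
            )
        )
    )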

If you understand how the endings build up, you can probably remove the
compound ones piece-by-piece to avoid combinatorial explosion of the
number of endings in the list.
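
As a rough illustration of the piece-by-piece idea, here is a Python
sketch (not Snowball). The layering below is only my guess from the
romanised examples quoted above, not a real analysis of Japanese, and
the names LAYERS and strip_layers are made up for this example.

```python
# Assumed layers of endings, outermost first, inferred only from the
# quoted examples. Within a layer, longer endings are listed first.
LAYERS = [
    ("mashita", "ta"),        # polite past, plain past
    ("rare",),                # passive marker after the causative
    ("sase", "sare", "shi"),  # causative, passive, plain stem
]


def strip_layers(word):
    """Remove at most one ending per layer, working inwards."""
    for endings in LAYERS:
        for ending in endings:
            if word.endswith(ending):
                word = word[: -len(ending)]
                break  # at most one ending from each layer
    return word


for form in ("xxxxsareta", "xxxxshita", "xxxxsaseta",
             "xxxxsasemashita", "xxxxsaserareta",
             "xxxxsaseraremashita"):
    print(form, "->", strip_layers(form))
```

Six short pieces cover all six of these forms, including
"saseraremashita", which never has to be listed explicitly. In Snowball
the same thing could probably be done with several small among lists
applied one after another.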

Cheers,
    Olly


