Re: [Snowball-discuss] Japanese stemmer?

From: Micah Bly (micah.j.bly@medtronic.com)
Date: Thu Feb 01 2007 - 19:49:10 GMT


Word-splitting is a real problem if you need to do it. In my
case though, I'm working with a set of pre-broken words (a
terminology list), which I need to compare to a blob of text
(usually a sentence). My goal is solely to determine whether the
words in the first list are present in the text blob. So I basically
have a free pass on word-splitting.

When I do this with English, I have to stem both sets of strings. But
with Japanese, I think it will be enough to stem the terminology list
words, since we ignore whitespace in Japanese anyway.
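
To make the comparison concrete, here is a minimal Python sketch of
what I have in mind, assuming the terminology entries have already
been stemmed. The function name terms_in_text and the sample strings
are made up for illustration; this is a plain substring check, not a
real matcher.

```python
def terms_in_text(stems, text):
    """Return the stems that occur anywhere in the text blob.

    The text is not split into words: whitespace is dropped and each
    stem is looked up as a plain substring.
    """
    compact = "".join(text.split())  # remove all whitespace
    return [stem for stem in stems if stem in compact]


# Hypothetical sample data: two stemmed terms and one sentence.
stems = ["確認", "削除"]
sentence = "設定を 確認した"
found = terms_in_text(stems, sentence)
```

One obvious weakness is that a short stem can match inside a longer,
unrelated word, so the shorter entries in the list would probably
need extra care.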

For example, if we start with the word:
xxxx-sareta (it was xxxx-ed)

We want to get down to the xxxx word.

Other things we might run into:
xxxx-shita
xxxx-saseta ([I] forced [him/it/her] to xxxx)
xxxx-sasemashita (same as above, but polite verb ending)
xxxx-saserareta (I was forced to xxxx)
xxxx-saseraremashita (same as above, but polite verb ending)
xxxx-suru (I will xxxx)
xxxx-site-iru (I am xxxx'ing)
xxxx-site-ita (I was xxxx'ing)
xxxx-sasete-iru (I am forcing [him] to xxxx)
xxxx-saserarete-ita (I was being forced to xxxx)
etc etc.
plus
xxxx-da, xxxx-desu: [it is a xxxx]

Is it enough to simply put together a big list of possible verb
endings, and remove them all? Is there a smart way to do something
like that?
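
For what it's worth, the naive "big list" approach would look roughly
like the Python sketch below. The ENDINGS list is just the romanized
examples from this message, and strip_ending is a made-up name; the
hyphen is only my notation here and is dropped after stripping.
Longest endings are tried first, because "shita" is itself the tail
of "sasemashita".

```python
# Endings copied from the romanized examples above (not a complete list).
ENDINGS = [
    "saseraremashita", "saserarete-ita", "saserareta", "sasemashita",
    "sasete-iru", "saseta", "sareta", "shita", "suru",
    "site-iru", "site-ita", "desu", "da",
]

# Sort once, longest first, so the longest matching ending wins.
_BY_LENGTH = sorted(ENDINGS, key=len, reverse=True)


def strip_ending(word: str) -> str:
    """Remove the longest listed ending from word, if any."""
    for ending in _BY_LENGTH:
        # Never strip the whole word away.
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)].rstrip("-")
    return word


stem = strip_ending("xxxx-sasemashita")
```

The catch is over-stripping: any word that merely happens to end in
one of these strings gets cut as well (a noun ending in "da" would
lose its last syllable), which is part of why I'm asking whether
there is a smarter way.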

Micah Bly

On Jan 29, 2007, at 4:00 AM, Martin Porter wrote:
>
> At least in principle, I'm interested myself in collaborating to make a
> Japanese stemmer. However I must add a few caveats. I am currently
> rather busy with other work, and I tried a little while ago to get into
> Arabic sufficiently to try coding up a stemmer, and eventually abandoned
> it. I found the language too difficult. So I'm not sure how well I'd get
> on with Japanese.
>
> And what about the problem of word-splitting?


