Re: [Snowball-discuss] Mismatch between vocab.txt and output.txt

From: Olly Betts (olly@survex.com)
Date: Mon Oct 14 2002 - 15:38:01 BST


On Mon, Oct 14, 2002 at 06:08:14PM +0400, Oleg Bartunov wrote:
> On Mon, 14 Oct 2002, Olly Betts wrote:
>
> > There was a mismatch in the order of the stemmers in a table in my own
> > code - I had "french" and "finnish" switched, so I was stemming finnish
> > with the french stemmer (and vice versa).
>
> Olly, it's interesting how do you decide which stemmer to use.
> As I understand, a stemmer by definition understands any word!

The programmer decides - this was in the Xapian wrappers for Snowball.
When the programmer created a stemming object:

    OmStemmer stemmer("finnish");

this would actually create an OmStemmer object set up with the French
stemming algorithm.
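
One way to rule out this kind of ordering mismatch is to store each language name next to its stemmer in a single table, rather than relying on two lists staying in the same order. The following is only a sketch of that idea, not the Xapian wrapper code: `StemFn`, `stem_finnish`, `stem_french` and `lookup_stemmer` are hypothetical names, and the "stemmers" are stand-ins that merely tag the word.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for real stemmers: they only tag the word so that
// the dispatch can be checked.
using StemFn = std::string (*)(const std::string &);

inline std::string stem_finnish(const std::string &w) { return "fi:" + w; }
inline std::string stem_french(const std::string &w) { return "fr:" + w; }

// Each name is stored beside its function, so the pairing does not depend
// on the order of the entries.
inline StemFn lookup_stemmer(const std::string &language) {
    static const std::map<std::string, StemFn> table = {
        {"finnish", stem_finnish},
        {"french", stem_french},
    };
    std::map<std::string, StemFn>::const_iterator it = table.find(language);
    if (it == table.end())
        throw std::invalid_argument("unknown language: " + language);
    return it->second;
}
```

With this layout, `lookup_stemmer("finnish")("talo")` goes to the Finnish stand-in wherever the entry sits in the table.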

> So, I don't see any chance to stem bilingual documents.

To do it properly you need to identify the language and stem appropriately,
or rely on markup in the source data to tell you - for example, HTML
allows this:

    Pardon my <span lang="fr">Fran&ccedil;ais</span>.

Language identification is probably reliable enough at the per-paragraph
level to improve retrieval in a document collection with multiple
languages per document.

You also need to address how to stem queries - you could just say that
user entered queries are always in one language, and ask them to select
which, or you could try to automatically identify the language (tricky
for such a small piece of text), or you could try performing several
searches with the query stemmed in different ways.

> Luckily, we could distinguish russian and english using character
> code, but in french-english case it's impossible.

You can feed the text through a language identifier prior to stemming.
N-gram matching is simple and works surprisingly well - for example:

http://odur.let.rug.nl/~vannoord/TextCat/Demo/
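
As a rough illustration of N-gram matching, the sketch below ranks the character trigrams of a text by frequency and compares rank positions between a sample and each language's profile (an "out-of-place" distance). It is a simplified sketch under stated assumptions, not TextCat's code: it uses trigrams only, treats bytes as characters, and the names `trigram_profile`, `out_of_place` and `guess_language_by_ngrams` are made up for the example. Real profiles would be built from much larger training texts.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> Count;

inline bool more_frequent(const Count &a, const Count &b) {
    return a.second > b.second;
}

// Rank the trigrams of `text` by descending frequency, keeping at most
// `limit` of them. Bytes are treated as characters (a simplification).
inline std::vector<std::string> trigram_profile(const std::string &text,
                                                std::size_t limit = 300) {
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + 3 <= text.size(); ++i)
        ++counts[text.substr(i, 3)];
    std::vector<Count> ranked(counts.begin(), counts.end());
    std::stable_sort(ranked.begin(), ranked.end(), more_frequent);
    std::vector<std::string> profile;
    for (std::size_t i = 0; i < ranked.size() && i < limit; ++i)
        profile.push_back(ranked[i].first);
    return profile;
}

// Sum, over the sample's trigrams, of how far each one's rank is from its
// rank in the language profile. A trigram missing from the language
// profile costs the profile's full length.
inline std::size_t out_of_place(const std::vector<std::string> &sample,
                                const std::vector<std::string> &language) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < sample.size(); ++i) {
        std::vector<std::string>::const_iterator it =
            std::find(language.begin(), language.end(), sample[i]);
        if (it == language.end()) {
            total += language.size();
        } else {
            std::size_t j = static_cast<std::size_t>(it - language.begin());
            total += i > j ? i - j : j - i;
        }
    }
    return total;
}

// Return the name of the training text whose profile is closest to the
// profile of `text`. `training` maps a language name to sample text.
inline std::string guess_language_by_ngrams(
        const std::string &text,
        const std::map<std::string, std::string> &training) {
    std::vector<std::string> sample = trigram_profile(text);
    std::string best;
    std::size_t best_distance = 0;
    bool have_best = false;
    std::map<std::string, std::string>::const_iterator it;
    for (it = training.begin(); it != training.end(); ++it) {
        std::size_t d = out_of_place(sample, trigram_profile(it->second));
        if (!have_best || d < best_distance) {
            best = it->first;
            best_distance = d;
            have_best = true;
        }
    }
    return best;
}
```

The shorter the input, the fewer trigrams there are to rank, which is one reason short queries are harder to identify than paragraphs.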

For Muscat 3.6, Martin came up with the approach of keying off common
words in each language (in a way this is like using the stop words to
identify the language). You can think of this as an N-gram approach
where every N-gram has to begin and end with a word boundary.
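
The common-word idea can be sketched as counting how many words of the text appear in each language's list and picking the language with the most hits. This is not the Muscat 3.6 implementation: the word lists below are tiny and chosen only for illustration, and `common_words` and `guess_by_common_words` are hypothetical names.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Illustrative lists only (an assumption for this sketch); real lists
// would be much longer.
inline const std::map<std::string, std::set<std::string> > &common_words() {
    static const std::map<std::string, std::set<std::string> > lists = {
        {"english", {"the", "and", "of", "is", "to", "in"}},
        {"french", {"le", "la", "et", "les", "des", "est"}},
    };
    return lists;
}

// Return the language whose common words occur most often in `text`,
// or an empty string if no listed word occurs at all.
inline std::string guess_by_common_words(const std::string &text) {
    std::map<std::string, int> hits;
    std::istringstream words(text);
    std::string raw;
    while (words >> raw) {
        // Lowercase ASCII letters and drop ASCII punctuation.
        std::string word;
        for (std::size_t i = 0; i < raw.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(raw[i]);
            if (!std::ispunct(c))
                word += static_cast<char>(std::tolower(c));
        }
        std::map<std::string, std::set<std::string> >::const_iterator it;
        for (it = common_words().begin(); it != common_words().end(); ++it)
            if (it->second.count(word))
                ++hits[it->first];
    }
    std::string best;
    int best_hits = 0;
    std::map<std::string, int>::const_iterator h;
    for (h = hits.begin(); h != hits.end(); ++h) {
        if (h->second > best_hits) {
            best = h->first;
            best_hits = h->second;
        }
    }
    return best;
}
```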

As you suggest, you can also consider the characters used in each language
(if it uses the Cyrillic alphabet, it's not English, though it isn't
necessarily Russian either). This can be thought of as using an N-gram
approach with 1-grams (the characters themselves).
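
A character-level check of this kind might look like the sketch below. It assumes the text is valid UTF-8, which is an assumption of this example (Russian text may well arrive in another encoding such as KOI8-R, where the test would be different). In UTF-8 the Cyrillic block U+0400 to U+04FF is encoded with lead bytes 0xD0 to 0xD3; `contains_cyrillic` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Assumes `text` is valid UTF-8. Returns true if it contains any character
// from the Cyrillic block (U+0400..U+04FF), whose two-byte encodings start
// with a lead byte in the range 0xD0..0xD3.
inline bool contains_cyrillic(const std::string &text) {
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        unsigned char next = static_cast<unsigned char>(text[i + 1]);
        if (lead >= 0xD0 && lead <= 0xD3 && (next & 0xC0) == 0x80)
            return true;
    }
    return false;
}
```

A positive result only says the text is not English or French; it does not distinguish Russian from other languages written in Cyrillic.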

Cheers,
    Olly


