RE: [Snowball-discuss] Stop word lists

From: Oleg Bartunov
Date: Fri Oct 18 2002 - 20:09:02 BST

On Fri, 18 Oct 2002, Martin Porter wrote:

> Oleg,
> I have incorporated your corrections into the Russian stop word list. The
> reason they arose was that Pat Miles prepared the list in the transliterated
> form, and I never showed him the Cyrillic equivalent. It is of course easy
> to make mistakes if you're not used to the transliteration scheme.
> (I might say that I now regard the Library of Congress transliteration
> scheme as very unnatural. Even so, I can't think up anything better that
> guarantees two-way translation.)

Martin, could you try a virtual keyboard?

> I've looked at the list you sent me, and it seems to contain paradigm forms
> only - at least for some of the words. So kak is there, but not kakai^a,
> kakoi`, kakov, kakovo. My list also omits some of these. Actually, it is not
> easy for me to put together a more complete list. I am beginning to suffer
> by not being able to input Cyrillic at the keyboard, which is a great
> nuisance. So if you would like to take control of the Russian stop word list
> for the Snowball site you are more than welcome!

I now have a list of Russian words ranked by frequency. I got it from a
recent crawl of 10 million pages. Unfortunately, I'm very busy, but I'll
try to do something for the Snowball site.
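
For illustration only, here is a minimal sketch of how a frequency-ranked
list could suggest stop word candidates. The function name top_words and
the tiny sample corpus are hypothetical; this is not the script used for
the crawl, and a real list would still need manual review.

```python
import re
from collections import Counter


def top_words(texts, n):
    """Return the n most frequent lowercased words across the given texts."""
    counts = Counter()
    for text in texts:
        # In Python 3, \w+ matches Unicode word characters (letters, digits,
        # underscore), so Cyrillic words are kept whole.
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(n)]


# Hypothetical three-line corpus standing in for the crawled pages.
sample = ["как и в", "и как же", "и в то"]
candidates = top_words(sample, 3)
```

The most frequent words are only candidates: inflected forms such as
kakai^a or kakov would rank much lower than kak and have to be added by
hand.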

> A few questions:
> Is KOI8-R fairly universal in Russia for representing Cyrillic? Other
> codings are mentioned in the browsers: ISO-8859-5, CP-866 etc - I've no idea
> what they mean. Are any of them ever used?

Almost all of them are in use! We have a special module for the Apache web
server to convert between encodings. KOI8-R is used mostly in Unix
environments and in mail, CP-1251 in Windows, CP-866 in DOS, ...

read about koi8-r
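
As a rough sketch of what converting between these encodings involves
(this is not the Apache module mentioned above), the bytes can be decoded
to Unicode and re-encoded using Python's standard codecs. The helper name
convert_encoding is an assumption made for the example.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


word = "как"
koi8 = word.encode("koi8_r")                            # Unix and mail
cp1251 = convert_encoding(koi8, "koi8_r", "cp1251")     # Windows
cp866 = convert_encoding(koi8, "koi8_r", "cp866")       # DOS
iso = convert_encoding(koi8, "koi8_r", "iso8859_5")     # ISO-8859-5
```

The same word ends up as different byte sequences in each encoding, which
is why a stop word list has to state its encoding explicitly.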

> I've added a note in the stopword list that e" (e with two dots) is
> translated to e, as you advise. But is e" ever used outside dictionaries and
> grammars in Russian? I know what it means (-e- pronounced heavy as o, as in
> 'Gorbache"v'), but I thought it was always printed as 'e'.

In practice, e" is used in printed forms, and most search engines just
translate it to 'e'. I've even forgotten where e" is on my keyboard :)
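
A minimal sketch of that folding step, assuming the text has already been
decoded to Unicode. The function name fold_yo is hypothetical; the NFC
normalization is added so that a plain 'e' followed by a combining
diaeresis is handled as well.

```python
import unicodedata

# Map U+0451 and U+0401 (e with two dots) to plain U+0435 and U+0415.
_YO_TABLE = str.maketrans({"ё": "е", "Ё": "Е"})


def fold_yo(text: str) -> str:
    """Replace e-with-two-dots by plain e, as search engines commonly do."""
    # NFC composes "е" + U+0308 into the single code point "ё" first.
    return unicodedata.normalize("NFC", text).translate(_YO_TABLE)


folded = fold_yo("Горбачёв")
```

Applying the same function to both the stop word list and the indexed
text keeps the two consistent.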

> I know that some languages that use Cyrillic (but not of course Russian)
> have accented Cyrillic letters. Is there a standard way of encoding these in
> KOI8-R?

There is a page about this problem.

Its main conclusion is to use Unicode.
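
To illustrate why Unicode is the practical answer, the sketch below checks
whether a string can be represented in a given encoding. The helper name
encodable is an assumption made for the example; the Ukrainian word is
used only because it contains a letter that KOI8-R lacks.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


russian = "как"
ukrainian = "їжак"  # starts with U+0457, i with two dots, absent from KOI8-R

russian_in_koi8 = encodable(russian, "koi8_r")
ukrainian_in_koi8 = encodable(ukrainian, "koi8_r")
ukrainian_in_utf8 = encodable(ukrainian, "utf-8")
```

Russian text fits in KOI8-R, but the Ukrainian word does not, while UTF-8
represents both.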

> Martin

Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
phone: +007(095)939-16-83, +007(095)939-23-83
