Re: [Snowball-discuss] Stop word lists

From: Olly Betts (
Date: Tue Oct 08 2002 - 21:04:01 BST

On Tue, Oct 08, 2002 at 01:38:19PM -0600, Martin Porter wrote:
> The Google stopword list is very interesting. The basic list for English
> is, in my experience,
> { the a and of to in an }
> which works well on titles of technical papers.
> I rather doubt the 'en' is there because it is a French/Spanish word. It is
> not all that common - much less common than 'de' for example. Could it be
> connected with the language code for English do you think?

Just checked and "de" is also a Google stopword. This might be new, or
might be because I based my tests on an English wordlist
(/usr/share/dict/words on Linux), so "en" was in but "de" wasn't.
I thought I'd also checked all 1-3 letter combinations, but it
was a while ago, and my memory is hazy.
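
A minimal sketch of how a list of all 1-3 letter combinations could be
enumerated. It assumes lowercase ASCII letters only (the post does not say
which alphabet was used), and the function name short_candidates is
hypothetical:

```python
from itertools import product
from string import ascii_lowercase


def short_candidates(max_len=3, alphabet=ascii_lowercase):
    """Yield every string of length 1..max_len over the given alphabet."""
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


candidates = list(short_candidates())
print(len(candidates))  # 26 + 26**2 + 26**3 = 18278
```

Unlike a dictionary-based list, this set contains both "en" and "de", so it
does not depend on which words an English wordlist happens to include.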

[To try this yourself, search for a valid word and up to 9 stopword
candidates - e.g. "test de en et le la les der die das" - 10 terms at most,
because Google truncates queries at 10 non-stopwords.]
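
The batching described above can be sketched as follows. This only builds
the query strings and does not contact Google. The 10-term limit is taken
from the note above, and the function name build_probe_queries is
hypothetical:

```python
from itertools import islice


def build_probe_queries(anchor, candidates, max_terms=10):
    """Group candidates into queries of one anchor word plus up to
    max_terms - 1 stopword candidates each."""
    per_query = max_terms - 1
    remaining = iter(candidates)
    queries = []
    while True:
        batch = list(islice(remaining, per_query))
        if not batch:
            break
        queries.append(" ".join([anchor] + batch))
    return queries


words = ["de", "en", "et", "le", "la", "les", "der", "die", "das", "el", "un"]
queries = build_probe_queries("test", words)
for q in queries:
    print(q)
# test de en et le la les der die das
# test el un
```

Each query is then submitted by hand, and the search engine's own report of
ignored words is read off to see which candidates are stopwords.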

So Google also stops "de" and "la" (but not "le", oddly). There may be
others, of course.

