RE: [Snowball-discuss] Stop word lists

From: Oleg Bartunov (oleg@sai.msu.su)
Date: Tue Oct 08 2002 - 16:00:01 BST


I agree with Alex: a stop list is very domain-specific!
But as a "better than nothing" option, I have attached a Russian stop list in
KOI8-R encoding.

        Oleg

On Tue, 8 Oct 2002, Alex Murzaku wrote:

> The sources for those lists are either SMART, Oracle, or a Russian
> search engine that I don't remember (maybe mnogosearch or mysql).
>
> Anyway, I haven't spent time studying them: I needed a solution and this
> was much faster than what I initially started doing (sieving through
> high frequency wordlists and removing or adding words depending on my
> judgment.)
>
> I think that the content of a stopword list is really application
> dependent. Apparently, the English stopwords I provided are derived from
> some business text corpora and were intended for that audience, which
> coincided with what I needed. Of course, depending on the intelligence
> of indexing (e.g. some kind of context-aware indexing), stopword lists
> might not be needed at all because, in that case, the difference between
> A BOOK and THE BOOK would mean a lot.
>
> In any case, the lists I sent were only offered as "better than nothing"
> and as a possible starting point. I didn't know you had them already.
> Maybe native speakers of the other languages might also have preexisting
> lists and/or suggestions to be used for perfecting the word sets.
> Sometimes these discussions help, but when judgments are too subjective,
> they risk creating more confusion... :)
>
> -----Original Message-----
> From: snowball-discuss-admin@lists.tartarus.org
> [mailto:snowball-discuss-admin@lists.tartarus.org] On Behalf Of Martin
> Porter
> Sent: Tuesday, October 08, 2002 6:34 AM
> To: Snowball discuss
> Cc: lists@lissus.com
> Subject: RE: [Snowball-discuss] Stop word lists
>
>
>
> Alex,
>
>
> I have now looked at the stopword lists you sent yesterday, and they
> have increased my confidence in the quality of the Snowball ones. I have
> looked at the English one very carefully, and can report on the
> findings.
>
> If x is the Snowball stopword list for English, and y is the English
> stopword list you sent me, we can look at the various sets x, y, x-y,
> y-x, x or y, x and y. Their sizes are as follows:
>
> | x | = 119
> | y | = 76
> | x-y | = 59
> | y-x | = 16
> | x or y | = 135
> | x and y | = 60
>
> and the sets themselves are,
>
> x = { a about above after again against all am an and any are as at be
> because been before being below between both but by did do does doing
> down during each few for from further had has have having he her here
> hers herself him himself his how i if in into is it its itself me more
> most my myself no nor not of off on once only or other our ours
> ourselves out over own same she so some such than that the their theirs
> them themselves then there these they this those through to too under
> until up very was we were what when where which while who whom why with
> you your yours yourself yourselves }
>
> y = { a about after all also an and any are as at be because been but by
> can co corp could for from had has have he her his if in inc into is it
> its last more most mr mrs ms mz no not of on one only or other out over
> s says she so some such than that the their there they this to up was we
> were when which who will with would }
>
> x-y = { above again against am before being below between both did do
> does doing down during each few further having here hers herself him
> himself how i itself me my myself nor off once our ours ourselves own
> same theirs them themselves then these those through too under until
> very what where while whom why you your yours yourself yourselves }
>
> y-x = { also can co corp could inc last mr mrs ms mz one s says will
> would }
>
> As you can see, x is substantially larger than y, and the terms in x-y
> are plausible stopwords. But if you take the 16 terms in y-x, 6 are
> mentioned in the comments in the source of x, and so could always be
> picked up by users working from the source:
>
> auxiliaries: can could will would
> common words: also says
>
> 7 are components of names of people and organisations and should only be
> treated as stopwords in rather special circumstances:
>
> co corp inc mr mrs ms mz
>
> which leaves
>
> s one last.
>
> 's' is the second component in words like John's, boy's ... and is not
> really a stopword, assuming the indexing is done intelligently. 'last' I
> don't think should be a stopword ('The Last Detail', 'The cobbler's
> last' ...). 'one' on the other hand is an omission from x, even if it
> should only be mentioned in the notes. I will fix it up. (I can see how
> 'one' came to be omitted, but won't bore you with the
> details.)
>
> I will look more closely at the other stopword lists in due course.
>
> Where did they come from? I would like to put the Finnish one in place
> in the interim.
>
> ------
>
> Actually the English stopword list is the only one I did not make up
> myself. It derives from a list which was once used in IR experiments
> in Cambridge and which I have modified over the years. An early form of
> it can be found on pp.18-19 of van Rijsbergen's 'Information Retrieval',
> Butterworths, 1975. Interestingly, that list contains 'co', which I
> remember removing many years ago. I am still doubtful about some of the
> entries: 'very' and 'further' for example.
>
>
> Martin
>
>
>
>
> _______________________________________________
> Snowball-discuss mailing list Snowball-discuss@lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>
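
For anyone who wants to re-check the figures in Martin's message, the
comparison can be reproduced with ordinary set operations. What follows is
only a minimal sketch in Python (a choice made here, not something used in
the thread). The two word lists are copied from the quoted message; the
variable names (x, y, sizes, leftover) and the grouping of y-x into three
named sets are illustrative assumptions, not part of Snowball.

```python
# Word lists copied verbatim from the quoted message.
X_WORDS = """
a about above after again against all am an and any are as at be
because been before being below between both but by did do does doing
down during each few for from further had has have having he her here
hers herself him himself his how i if in into is it its itself me more
most my myself no nor not of off on once only or other our ours
ourselves out over own same she so some such than that the their theirs
them themselves then there these they this those through to too under
until up very was we were what when where which while who whom why with
you your yours yourself yourselves
"""

Y_WORDS = """
a about after all also an and any are as at be because been but by
can co corp could for from had has have he her his if in inc into is it
its last more most mr mrs ms mz no not of on one only or other out over
s says she so some such than that the their there they this to up was we
were when which who will with would
"""

x = set(X_WORDS.split())  # the Snowball English list
y = set(Y_WORDS.split())  # the list Alex sent

# The six sets discussed in the message, keyed by the names used there.
sizes = {
    "x": len(x),
    "y": len(y),
    "x-y": len(x - y),
    "y-x": len(y - x),
    "x or y": len(x | y),
    "x and y": len(x & y),
}
for name, n in sizes.items():
    print(f"| {name} | = {n}")

# Grouping of y-x as described in the message (the set names are hypothetical).
auxiliaries = {"can", "could", "will", "would"}
common_words = {"also", "says"}
name_parts = {"co", "corp", "inc", "mr", "mrs", "ms", "mz"}

# Whatever is in y-x but in none of the three groups.
leftover = (y - x) - auxiliaries - common_words - name_parts
print(sorted(leftover))  # ['last', 'one', 's']
```

Run as is, it prints the six sizes and then the three leftover terms.
Swapping in another pair of lists gives the same comparison for any of the
other languages.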

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83





This archive was generated by hypermail 2.1.3 : Thu Sep 20 2007 - 12:02:43 BST