Re: [Snowball-discuss] modifying Java English stemmer to accept new exceptions

From: Martin Porter (martin.porter@grapeshot.co.uk)
Date: Sat Jun 24 2006 - 10:42:15 BST


Brian,

I can't comment on the Java code, since Richard Boulton wrote the
code generator for it and undertakes its support (I don't know if Richard
cares to comment?), but I would advise modifying the Snowball scripts and
generating the Java afresh. In fact, you pretty well have to do that,
since the 'among's compile into optimized table structures that you can't
really modify by hand (add an extra item and the whole table changes).
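
To illustrate why hand-editing such a table is fragile, here is a small
Python sketch. It is a simplified model assumed for illustration, not the
actual Snowball code generator or its output format: the entries are kept
sorted, and each one stores the index of the longest other entry that is a
proper suffix of it. The name build_table and the tuple layout are
hypothetical.

```python
def build_table(suffixes):
    """Return a list of (suffix, link) tuples.

    link is the index of the longest other entry that is a proper suffix
    of this entry, or -1 if there is none. This layout is an assumption
    made for illustration only.
    """
    # Sort by the reversed string so entries with a shared ending are adjacent.
    ordered = sorted(set(suffixes), key=lambda s: s[::-1])
    table = []
    for s in ordered:
        link, best_len = -1, 0
        for j, t in enumerate(ordered):
            if t != s and s.endswith(t) and len(t) > best_len:
                link, best_len = j, len(t)
        table.append((s, link))
    return table


before = build_table(["ed", "ing", "s", "es"])
after = build_table(["ed", "ing", "s", "es", "d"])

print(before)
print(after)
```

In this model, adding the single item "d" moves every other entry to a new
index and changes two of the stored links, which is the kind of knock-on
effect that makes regenerating from the script the safer route.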

I did at one time collect all the main irregularities of English verbs, with
a view to doing what you are now attempting, although I never put a new
stemmer based on it into service. You have to be very careful: the past of
'see' is 'saw', but a saw is also a cutting tool, and so on. So you might
find the table below, which I put together many years ago, useful. You can
find these tables in dictionaries and grammars, but they are rarely
complete, and are often cluttered with archaic forms that are no longer useful.
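
As a rough sketch of the kind of care needed, an exception lookup placed in
front of a stemmer might look like the Python below. The names EXCEPTIONS,
AMBIGUOUS and normalize are hypothetical, the entries are a small sample
taken from the table in this message, and the stem argument stands in for
whatever stemmer is actually used (here it defaults to the identity
function).

```python
# Hypothetical sample of irregular past/pp forms mapped to their base verb.
EXCEPTIONS = {
    "drank": "drink",
    "went": "go",
    "gone": "go",
    "bought": "buy",
    "taught": "teach",
    "seen": "see",
    "saw": "see",
}

# Forms that are also ordinary words in their own right are left alone.
# 'saw' (past of 'see', but also a cutting tool) is the example from the text.
AMBIGUOUS = {"saw"}


def normalize(word, stem=lambda w: w):
    """Map an irregular form to its base verb, then apply the stemmer.

    stem is a placeholder for a real stemming function (an assumption).
    """
    w = word.lower()
    if w in EXCEPTIONS and w not in AMBIGUOUS:
        w = EXCEPTIONS[w]
    return stem(w)


print(normalize("went"))
print(normalize("saw"))
```

Keeping the ambiguous forms in a separate set makes the decision to skip
them explicit and easy to review, rather than silently omitting them from
the exception table.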

Martin

 
Paradigm form            verb list
----------------------------------------------------------------
SMELL SMELT (r)          burn learn spell smell spill spoil dwell(a)
BEND BENT                bend build lend send spend rend(a) gird(a)
HIT HIT (*)              bet burst cast cost(r) cut hit hurt let put
                         quit rid set shed shut slit split spread
                         thrust upset wet(r)
SEW SEWED SEWN           sew sow show hew(a) mow(r) saw(r) strew(r)
                         shave(r)
BEAT BEAT BEATEN         beat
DRINK DRANK DRUNK        begin drink ring shrink sing sink spring stink
                         swim
WIN WON                  cling dig fling sling spin stick sting string
                         swing win wring slink
SIT SAT                  sit spit
BLEED BLED               bleed breed feed lead meet read speed
GET GOT                  get
HANG HUNG                hang
FIND FOUND               bind find grind wind
LIGHT LIT                light slide
SHINE SHONE              shine
FIGHT FOUGHT             fight
STRIKE STRUCK            strike
HOLD HELD                hold
SHOOT SHOT               shoot
COME CAME COME           come become
RUN RAN RUN              run
KEEP KEPT                creep keep leap sweep sleep weep
SELL SOLD                sell tell
FLEE FLED                flee
HEAR HEARD               hear
SAY SAID                 say
SHOE SHOD                shoe
MEAN MEANT               deal dream feel kneel lean mean
BUY BOUGHT               buy
LEAVE LEFT               leave bereave(a)
LOSE LOST                lose
RIDE RODE RIDDEN         drive ride rise arise strive write smite(a)
STRIDE STRODE -          stride
FLY FLEW FLOWN           fly
STEAL STOLE STOLEN       freeze speak steal weave
BREAK BROKE BROKEN       break wake awake
FORGET FORGOT FORGOTTEN  forget tread
BEAR BORE BORNE          bear tear swear wear
LIE LAY LAIN             lie
BITE BIT BITTEN          bite hide
CHOOSE CHOSE CHOSEN      choose
SEE SAW SEEN             see
EAT ATE EATEN            eat
FORBID FORBADE FORBIDDEN forbid forgive give bid(a)
TAKE TOOK TAKEN          forsake(a) shake take
FALL FELL FALLEN         fall
DRAW DREW DRAWN          draw
GROW GREW GROWN          blow grow know throw
SLAY SLEW SLAIN          slay(a)
SWELL SWELLED SWOLLEN    swell(r)
SHEAR SHEARED SHORN      shear(r)
MAKE MADE                make
BRING BROUGHT            bring think
TEACH TAUGHT             teach beseech(a) seek(a)
CATCH CAUGHT             catch
STAND STOOD              stand understand
GO WENT GONE             go
DO DID DONE              do

Verbs marked (r) also have regular forms. Verbs marked (a) are archaic. Verbs
marked (*) are irregular, but not in a way that causes difficulties to a
stemming algorithm.

The past participle (pp) of `hang' is `hanged' or `hung', depending on the
sense. `lie' is irregular when it means `lying down', regular when it means
`telling falsehoods'. `stride' has no pp in normal use.
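
To show how one row of the table can be turned into stemmer exceptions,
here is a Python sketch for the DRINK DRANK DRUNK class only. The rule used
(replace the last 'i' with 'a' for the past and 'u' for the pp) is my
assumption about how the paradigm applies to that particular verb list; it
does not generalize to the other rows, which would need their own rules or
hand-written forms. The function names are hypothetical.

```python
# Verb list for the DRINK DRANK DRUNK paradigm, as given in the table.
DRINK_CLASS = ["begin", "drink", "ring", "shrink", "sing",
               "sink", "spring", "stink", "swim"]


def swap_last_i(verb, vowel):
    """Replace the last 'i' in verb with the given vowel."""
    pos = verb.rfind("i")
    if pos < 0:
        raise ValueError(f"no 'i' in {verb!r}")
    return verb[:pos] + vowel + verb[pos + 1:]


def expand_drink_class(verbs):
    """Map each derived past and pp form back to its base verb."""
    forms = {}
    for verb in verbs:
        forms[swap_last_i(verb, "a")] = verb  # past: drink -> drank
        forms[swap_last_i(verb, "u")] = verb  # pp:   drink -> drunk
    return forms


FORMS = expand_drink_class(DRINK_CLASS)
for form in sorted(FORMS):
    print(form, "->", FORMS[form])
```

The generated forms still need the same scrutiny as any hand-written list:
`rung', for instance, is also the rung of a ladder.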

We are left with 135 verbs with irregularities in the past or pp forms:

 arise awake bear beat become begin bend bind bite bleed blow
 break breed bring build burn buy catch choose cling come creep
 deal dig do draw dream drink drive eat fall feed feel fight find
 flee fling fly forbid forget forgive freeze get give go grind
 grow hang hear hide hold keep kneel know lead lean leap learn
 leave lend lie light lose make mean meet mow read ride ring rise
 run saw say see sell send sew shake shave shear shine shoe shoot
 show shrink sing sink sit sleep slide sling slink smell sow
 speak speed spell spend spill spin spit spoil spring stand steal
 stick sting stink strew stride strike string strive swear sweep
 swell swim swing take teach tear tell think throw tread
 understand wake wear weave weep win wind wring write

plus these 20 invariant forms:

 bet burst cast cost cut hit hurt let put quit rid set shed shut
 slit split spread thrust upset wet

and these 11 archaic forms, which may or may not be worth including:

 bereave beseech bid dwell forsake gird hew rend seek slay smite
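
The three lists above can be kept as separate sets so that the invariant
and archaic groups can be switched in or out. A minimal Python sketch
follows; the function name exception_candidates and its parameters are
hypothetical, and the word lists are copied from this message.

```python
IRREGULAR = """
arise awake bear beat become begin bend bind bite bleed blow
break breed bring build burn buy catch choose cling come creep
deal dig do draw dream drink drive eat fall feed feel fight find
flee fling fly forbid forget forgive freeze get give go grind
grow hang hear hide hold keep kneel know lead lean leap learn
leave lend lie light lose make mean meet mow read ride ring rise
run saw say see sell send sew shake shave shear shine shoe shoot
show shrink sing sink sit sleep slide sling slink smell sow
speak speed spell spend spill spin spit spoil spring stand steal
stick sting stink strew stride strike string strive swear sweep
swell swim swing take teach tear tell think throw tread
understand wake wear weave weep win wind wring write
""".split()

INVARIANT = """
bet burst cast cost cut hit hurt let put quit rid set shed shut
slit split spread thrust upset wet
""".split()

ARCHAIC = """
bereave beseech bid dwell forsake gird hew rend seek slay smite
""".split()


def exception_candidates(include_invariant=False, include_archaic=False):
    """Return the set of base verbs to consider for exception handling."""
    verbs = set(IRREGULAR)
    if include_invariant:
        verbs |= set(INVARIANT)
    if include_archaic:
        verbs |= set(ARCHAIC)
    return frozenset(verbs)


print(len(exception_candidates()))
print(len(exception_candidates(include_invariant=True, include_archaic=True)))
```

The invariant verbs are off by default in this sketch because their past
and pp forms are identical to the base form, so they give a stemmer nothing
extra to map.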


