Generalizing Subcategorization Frames Acquired from Corpora(3)
Published: 2021-06-08
This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con
$s_j$ for $w_i$, which expresses how reliably a word $w_i$ has SCF $s_j$. We should note that the confidence value is not the probability that a word $w_i$ appears with SCF $s_j$ but the probability of existence of SCF $s_j$ for the word $w_i$. In this study, we assume that a word $w_i$ can have each SCF $s_j$ with a certain (non-zero) probability $\theta_{ij}$ ($= p(s_{ij} \mid w_i) > 0$, where $\sum_j \theta_{ij} = 1$), but only SCFs whose probabilities exceed a certain threshold are recognized as SCFs for the word in the lexicon. We hereafter call this threshold the recognition threshold. Figure 2 exemplifies a probability distribution of SCFs for apply. In this context, we can regard the confidence value of each SCF as the possibility that the probability of the SCF exceeds the recognition threshold.
One intuitive way to estimate a confidence value is to assume that an observed probability, i.e., relative frequency, is equal to the probability $\theta_{ij}$ of SCF $s_j$ for a word $w_i$ ($\theta_{ij} = freq_{ij} / \sum_j freq_{ij}$, where $freq_{ij}$ is the frequency count with which a word $w_i$ has the SCF $s_j$ in corpora¹). We simply assign 1 to the confidence value $conf_{ij}$ when the relative frequency of $s_j$ for a word $w_i$ exceeds the recognition threshold, and otherwise assign 0. However, an observed probability is totally unreliable for infrequent words. For example, when we use a confidence value derived from a relative frequency as above, we cannot distinguish the case where a word $w_1$ appears once with an SCF $s_j$ from the case where a word $w_2$ appears 100 times, always with the SCF $s_j$; both have the relative frequency 1. Moreover, even when we would like to encode confidence values of reliable SCFs in the target lexicalized grammar, it is also problematic to distinguish the confidence values of those SCFs from the confidence values of acquired SCFs.
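The weakness of this relative-frequency baseline can be seen in a minimal sketch. The function name `relative_frequency_confidence`, the SCF labels, the counts, and the threshold value 0.05 are all hypothetical and chosen only for illustration; none of them is taken from the experiments.

```python
def relative_frequency_confidence(freq, threshold):
    """Binary confidence from relative frequency (hypothetical helper).

    freq maps each SCF label to its frequency count for one word.
    Returns 1 for SCFs whose relative frequency exceeds the threshold, else 0.
    """
    total = sum(freq.values())
    return {scf: 1 if count / total > threshold else 0
            for scf, count in freq.items()}


# Illustrative counts only: w1 is seen once, w2 is seen 100 times.
w1 = {"NP": 1}
w2 = {"NP": 100}
threshold = 0.05  # assumed recognition threshold

conf_w1 = relative_frequency_confidence(w1, threshold)
conf_w2 = relative_frequency_confidence(w2, threshold)
print(conf_w1, conf_w2)  # both words receive confidence 1 for "NP"
```

Both words end up with the same confidence value even though the evidence for the second word is far stronger, which is the problem described above.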
The other promising way to estimate the true probability $\theta_{ij}$ is to regard it as a stochastic variable in the context of Bayesian statistics (Gelman et al., 1995). In this context, the a posteriori distribution of the probability $\theta_{ij}$ of an SCF $s_j$ for a word $w_i$ is given by:
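$$p(\theta_{ij} \mid D) = \frac{P(\theta_{ij})\,P(D \mid \theta_{ij})}{P(D)} = \frac{P(\theta_{ij})\,P(D \mid \theta_{ij})}{\int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij}}, \tag{1}$$

where $P(\theta_{ij})$ is the a priori distribution, and $D$ is the data we have observed. Since every occurrence of SCFs in the data $D$ is independent of each other, the data $D$ can be regarded as Bernoulli trials in this case. When we observe the data $D$ that a word $w_i$ appears $n$ times and has SCF $s_j$ $x$ ($\le n$) times, its conditional distribution is therefore represented by the binomial distribution:

$$P(D \mid \theta_{ij}) = \binom{n}{x}\,\theta_{ij}^{x}\,(1-\theta_{ij})^{(n-x)}. \tag{2}$$

To calculate this a posteriori distribution, we need to define the a priori distribution $P(\theta_{ij})$. The question is which probability distribution of $\theta_{ij}$ can appropriately reflect prior knowledge. In other words, it should encode knowledge we use to estimate SCFs for an unknown word $w_i$. We simply determine it from distributions of probability values of $s_j$ for known words. We use distributions of observed probability values of $s_j$ for all words acquired from the corpus by using a method described in (Tsuruoka and Chikayama, 2001). In their study, they assume the a priori distribution to be the beta distribution defined as:

$$p(\theta_{ij} \mid \alpha, \beta) = \frac{\theta_{ij}^{\alpha-1}\,(1-\theta_{ij})^{\beta-1}}{B(\alpha, \beta)}, \tag{3}$$

where $B(\alpha, \beta) = \int_0^1 \theta_{ij}^{\alpha-1}\,(1-\theta_{ij})^{\beta-1}\,d\theta_{ij}$. The values of $\alpha$ and $\beta$ are determined by moment estimation.² By substituting Equations 2 and 3 into Equation 1, we finally obtain the a posteriori distribution $p(\theta_{ij} \mid D)$ as:

$$p(\theta_{ij} \mid \alpha, \beta, D) = \frac{\frac{1}{B(\alpha, \beta)}\,\theta_{ij}^{\alpha-1}\,(1-\theta_{ij})^{\beta-1}\,\binom{n}{x}\,\theta_{ij}^{x}\,(1-\theta_{ij})^{(n-x)}}{\int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij}} = c \cdot \theta_{ij}^{x+\alpha-1}\,(1-\theta_{ij})^{n-x+\beta-1}, \tag{4}$$

where $c = \binom{n}{x} \Big/ \left(B(\alpha, \beta) \int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij}\right)$.

When we determine the value of the recognition threshold as $t$, we can calculate a confidence value $conf_{ij}$ that a word $w_i$ can have $s_j$ by integrating the a posteriori distribution $p(\theta_{ij} \mid D)$ from the threshold $t$ to 1:

$$conf_{ij} = \int_t^1 c \cdot \theta_{ij}^{x+\alpha-1}\,(1-\theta_{ij})^{n-x+\beta-1}\,d\theta_{ij}. \tag{5}$$

By using this confidence value, we can express an SCF confidence-value vector $v_i$ for a word $w_i$ in the acquired SCF lexicon ($v_{ij} = conf_{ij}$).³

In order to combine SCF confidence-value vectors for words acquired from corpora and those for words in the

¹ We used values of FREQCNT to obtain frequency counts of SCFs.

² The expectation value and variance of the beta distribution are made equal to those of the observed probability values.

³ By using the fact that $\int_0^1 P(\theta_{ij} \mid \alpha, \beta)\,d\theta_{ij} = 1$, we can calculate $conf_{ij}$ as follows.

$$conf_{ij} = \frac{\int_t^1 c \cdot \theta_{ij}^{x+\alpha-1}\,(1-\theta_{ij})^{n-x+\beta-1}\,d\theta_{ij}}{\int_0^1 c \cdot \theta_{ij}^{x+\alpha-1}\,(1-\theta_{ij})^{n-x+\beta-1}\,d\theta_{ij}} = \frac{\int_t^1 \theta_{ij}^{x+\alpha-1}\,(1-\theta_{ij})^{n-x+\beta-1}\,d\theta_{ij}}{\int_0^1 \theta_{ij}^{x+\alpha-1}\,(1-\theta_{ij})^{n-x+\beta-1}\,d\theta_{ij}} \tag{6}$$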