Generalizing Subcategorization Frames Acquired from Corpora(3)

Published: 2021-06-08

This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con

$s_j$ for $w_i$, which expresses how reliably a word $w_i$ has SCF $s_j$. We should note that the confidence value is not the probability that a word $w_i$ appears with SCF $s_j$ but a probability of existence of SCF $s_j$ for the word $w_i$. In this study, we assume that a word $w_i$ can have each SCF $s_j$ with a certain (non-zero) probability $\theta_{ij}$ ($= p(s_{ij} \mid w_i) > 0$ where $\sum_j \theta_{ij} = 1$), but only SCFs whose probabilities exceed a certain threshold are recognized as SCFs for the word in the lexicon. We hereafter call this threshold the recognition threshold. Figure 2 exemplifies a probability distribution of SCFs for apply. In this context, we can regard a confidence value of each SCF as the possibility that the probability of an SCF exceeds the recognition threshold.

One intuitive way to estimate a confidence value is to assume that an observed probability, i.e., relative frequency, is equal to the probability $\theta_{ij}$ of SCF $s_j$ for a word $w_i$ ($\theta_{ij} = \mathit{freq}_{ij} / \sum_j \mathit{freq}_{ij}$, where $\mathit{freq}_{ij}$ is a frequency count of the word $w_i$ having the SCF $s_j$ in corpora¹). We simply assign 1 to a confidence value $\mathit{conf}_{ij}$ when the relative frequency of $s_j$ for a word $w_i$ exceeds the recognition threshold, and otherwise assign 0 to $\mathit{conf}_{ij}$. However, an observed probability is totally unreliable for infrequent words. For example, when we use a confidence value derived from a relative frequency as above, we cannot distinguish the case where a word $w_1$ appears once with an SCF $s_j$ from the case where a word $w_2$ appears 100 times, always with the SCF $s_j$, since both have the relative frequency 1. Moreover, even when we would like to encode confidence values of reliable SCFs in the target lexicalized grammar, it is also problematic to distinguish the confidence values of those SCFs from the confidence values of acquired SCFs.
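The weakness of the relative-frequency estimate can be illustrated with a minimal Python sketch. The function name, the SCF labels `s_j` and `s_k`, the counts for `w3`, and the threshold 0.5 are hypothetical choices made for illustration only; they do not come from the paper.

```python
def relative_frequency_confidence(freqs, threshold):
    """Assign confidence 1 to every SCF whose relative frequency
    exceeds the recognition threshold, and 0 otherwise."""
    total = sum(freqs.values())
    if total == 0:
        # No observations at all: no SCF can exceed the threshold.
        return {scf: 0 for scf in freqs}
    return {scf: 1 if count / total > threshold else 0
            for scf, count in freqs.items()}


t = 0.5                      # hypothetical recognition threshold
w1 = {"s_j": 1}              # seen once, with SCF s_j
w2 = {"s_j": 100}            # seen 100 times, always with SCF s_j
w3 = {"s_j": 3, "s_k": 1}    # hypothetical word with two observed SCFs

conf_w1 = relative_frequency_confidence(w1, t)
conf_w2 = relative_frequency_confidence(w2, t)
conf_w3 = relative_frequency_confidence(w3, t)

# Both words have relative frequency 1 for s_j, so the binary
# confidence values cannot tell the two cases apart.
print(conf_w1 == conf_w2)
```

Under these assumptions the single observation of `w1` receives the same confidence as the 100 observations of `w2`, which is the problem the text describes.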

The other promising way to estimate a true probability $\theta_{ij}$ is to regard it as a stochastic variable in the context of Bayesian statistics (Gelman et al., 1995). In this context, the a posteriori distribution of the probability $\theta_{ij}$ of an SCF $s_j$ for a word $w_i$ is given by:

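$$p(\theta_{ij} \mid D) = \frac{P(\theta_{ij})\,P(D \mid \theta_{ij})}{P(D)} = \frac{P(\theta_{ij})\,P(D \mid \theta_{ij})}{\int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij}}, \tag{1}$$

where $P(\theta_{ij})$ is the a priori distribution, and $D$ is the data we have observed. Since every occurrence of SCFs in the data $D$ is independent of each other, the data $D$ can be regarded as Bernoulli trials in this case. When we observe the data $D$ that a word $w_i$ appears $n$ times and has SCF $s_j$ $x$ ($\le n$) times, its conditional distribution is therefore represented by the binomial distribution:

$$P(D \mid \theta_{ij}) = \binom{n}{x}\,\theta_{ij}^{x}\,(1 - \theta_{ij})^{n - x}. \tag{2}$$

To calculate this a posteriori distribution, we need to define the a priori distribution $P(\theta_{ij})$. The question is which probability distribution of $\theta_{ij}$ can appropriately reflect prior knowledge. In other words, it should encode knowledge we use to estimate SCFs for an unknown word $w_i$. We simply determine it from distributions of probability values of $s_j$ for known words. We use distributions of observed probability values of $s_j$ for all words acquired from the corpus by using a method described in (Tsuruoka and Chikayama, 2001). In their study, they assume the a priori distribution to be the beta distribution defined as:

$$p(\theta_{ij} \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,\theta_{ij}^{\alpha - 1}\,(1 - \theta_{ij})^{\beta - 1}, \tag{3}$$

where $B(\alpha, \beta) = \int_0^1 \theta_{ij}^{\alpha - 1}\,(1 - \theta_{ij})^{\beta - 1}\,d\theta_{ij}$. The values of $\alpha$ and $\beta$ are determined by moment estimation.² By substituting Equations 2 and 3 into Equation 1, we finally obtain the a posteriori distribution $p(\theta_{ij} \mid D)$ as:

$$p(\theta_{ij} \mid \alpha, \beta, D) = \frac{\frac{1}{B(\alpha, \beta)}\,\theta_{ij}^{\alpha - 1}\,(1 - \theta_{ij})^{\beta - 1}\,\binom{n}{x}\,\theta_{ij}^{x}\,(1 - \theta_{ij})^{n - x}}{\int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij}} = c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}, \tag{4}$$

where $c = \binom{n}{x} \Big/ \left( B(\alpha, \beta) \int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij} \right)$.

When we determine the value of the recognition threshold as $t$, we can calculate a confidence value $\mathit{conf}_{ij}$ that a word $w_i$ can have $s_j$ by integrating the a posteriori distribution $p(\theta_{ij} \mid D)$ from the threshold $t$ to 1:

$$\mathit{conf}_{ij} = \int_t^1 c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}. \tag{5}$$

By using this confidence value, we can express an SCF confidence-value vector $v_i$ for a word $w_i$ in the acquired SCF lexicon ($v_{ij} = \mathit{conf}_{ij}$).³

In order to combine SCF confidence-value vectors for words acquired from corpora and those for words in the

¹ We used values of FREQCNT to obtain frequency counts of SCFs.

² The expectation value and variance of the beta distribution are made equal to those of the observed probability values.

³ By using the fact that $\int_0^1 P(\theta_{ij} \mid \alpha, \beta)\,d\theta_{ij} = 1$, we can calculate $\mathit{conf}_{ij}$ as follows.

$$\mathit{conf}_{ij} = \frac{\int_t^1 c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}}{\int_0^1 c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}} = \frac{\int_t^1 \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}}{\int_0^1 \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}} \tag{6}$$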
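As a rough illustration of Equations 5 and 6 and of the moment estimation mentioned in footnote 2, the following Python sketch fits a beta prior to a list of observed probability values and then evaluates the ratio of integrals in Equation 6. It is only a sketch under stated assumptions: the function names, the sample prior values, and the threshold 0.5 are hypothetical, and a plain midpoint rule stands in for whatever integration procedure the authors actually used.

```python
from statistics import mean, pvariance


def fit_beta_by_moments(observed):
    """Pick alpha and beta so that the beta distribution's mean and
    variance equal those of the observed probability values."""
    m = mean(observed)
    v = pvariance(observed)
    if not 0.0 < v < m * (1.0 - m):
        raise ValueError("need 0 < variance < mean * (1 - mean)")
    k = m * (1.0 - m) / v - 1.0
    return m * k, (1.0 - m) * k


def confidence(x, n, alpha, beta, t, steps=20000):
    """Approximate Equation 6: the posterior mass above threshold t
    when SCF s_j was seen x times in n occurrences of the word."""
    a = x + alpha - 1.0       # exponent of theta
    b = n - x + beta - 1.0    # exponent of (1 - theta)
    if a < 0.0 or b < 0.0:
        # The midpoint rule below is unreliable for singular integrands.
        raise ValueError("this sketch assumes non-negative exponents")

    def integrate(lo, hi):
        h = (hi - lo) / steps
        return h * sum(
            (lo + (k + 0.5) * h) ** a * (1.0 - (lo + (k + 0.5) * h)) ** b
            for k in range(steps)
        )

    # The constant c cancels between numerator and denominator.
    return integrate(t, 1.0) / integrate(0.0, 1.0)


# Hypothetical observed probabilities of s_j for known words.
alpha, beta = fit_beta_by_moments([0.1, 0.2, 0.3, 0.4])
t = 0.5  # hypothetical recognition threshold
conf_w1 = confidence(1, 1, alpha, beta, t)      # w1: seen once with s_j
conf_w2 = confidence(100, 100, alpha, beta, t)  # w2: seen 100 times with s_j
print(conf_w1 < conf_w2)
```

Unlike the relative-frequency estimate, this confidence value separates a word seen once with $s_j$ from a word seen 100 times with $s_j$. As a sanity check on the sketch, a uniform prior ($\alpha = \beta = 1$) with $x = n = 1$ and $t = 0.5$ gives $\int_{0.5}^{1} \theta\,d\theta \big/ \int_0^1 \theta\,d\theta = 0.75$ analytically.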
