Generalizing Subcategorization Frames Acquired from Corpora(4)

Published: 2021-06-08

This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con

Input:  a set of SCF confidence-value vectors V = {v_1, v_2, ..., v_n} ⊆ R^m
        a distance function d: R^m × Z^m → R
        a function to compute a centroid µ: {v_j1, v_j2, ..., v_jl} → R^m
Output: a set of clusters C_j

while cluster members are not stable do
    foreach cluster C_j
        C_j = {v_i | ∀c_l, d(v_i, c_j) ≤ d(v_i, c_l)}
    end foreach
    foreach clusters C_j
        c_j = µ(C_j)
    end foreach
end while
return C_j

After every assignment, we determine the next centroid $c_m$ of each cluster $C_m$ as follows:

\[
c_{mj} =
\begin{cases}
1 & \text{when } \prod_{v_i \in C_m} v_{ij} > \prod_{v_i \in C_m} (1 - v_{ij}) \\
0 & \text{otherwise}
\end{cases}
\tag{9}
\]

We then address the way to determine the number of clusters and the initial assignments of objects. In this paper, we assume that most of the possible sets of SCFs for words are included in the target lexicalized grammar, and make use of the existing sets of SCFs for the words in the lexicon of the target grammar to determine the possible sets of SCFs for words outside the lexicon. We first extract SCF confidence-value vectors from the lexicon of the target grammar by setting $\varepsilon = 0$ in Equation 7. By eliminating duplicates from them, we obtain SCF centroid-value vectors $c_m$. We then initialize the number of clusters $k$ to the number of $c_m$ and use them as initial centroids.⁴
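The initialization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `initial_centroids`, the dictionary layout of the lexicon, and the toy SCF names are all assumptions made for the example.

```python
def initial_centroids(lexicon, scf_types):
    """Build binary SCF vectors (Equation 7 with epsilon = 0) and deduplicate.

    lexicon   -- hypothetical layout: dict mapping a word to the set of
                 SCF names it has in the target grammar's lexicon
    scf_types -- ordered list of all SCF names (fixes the vector dimensions)
    """
    seen = set()
    centroids = []
    for word in sorted(lexicon):
        # With epsilon = 0, each element is exactly 1 or 0.
        vec = tuple(1 if s in lexicon[word] else 0 for s in scf_types)
        if vec not in seen:          # eliminate duplicate vectors
            seen.add(vec)
            centroids.append(vec)
    return centroids


# Toy lexicon with made-up SCF names, for illustration only.
scf_types = ["intransitive", "transitive", "ditransitive"]
lexicon = {
    "give": {"transitive", "ditransitive"},
    "send": {"transitive", "ditransitive"},
    "sleep": {"intransitive"},
}
centroids = initial_centroids(lexicon, scf_types)
k = len(centroids)  # the number of clusters is the number of distinct vectors
```

Here "give" and "send" share one SCF set, so they contribute a single initial centroid.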

We finally update the acquired SCFs using each element's value in the centroid of each cluster and the confidence value of SCFs, in this order. We first eliminate SCF $s_j$ for $w_i$ in a cluster $m$ when the value $c_{mj}$ of the centroid $c_m$ is 0, and add SCF $s_j$ for $w_i$ in a cluster $m$ when the value $c_{mj}$ of the centroid $c_m$ is 1. This is because $c_{mj}$ represents whether the words in that class can have SCF $s_j$. We then eliminate implausible SCFs $s_j$ for $w_i$ from the resulting SCFs according to the corresponding confidence value $conf_{ij}$. We call this elimination centroid cut-off. In the following experiments, we compare this cut-off with naive frequency cut-off, which uses only relative frequencies to eliminate SCFs, and confidence cut-off, which uses only confidence values to eliminate SCFs. Note that frequency cut-off and confidence cut-off use only corpus-based statistics to eliminate SCFs.
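The two-stage update can be sketched as below. The text does not say how the confidence value is used to eliminate SCFs, so a simple threshold is assumed here; the function name `centroid_cutoff` and the `threshold` parameter are hypothetical.

```python
def centroid_cutoff(acquired, centroid, conf, threshold):
    """Sketch of centroid cut-off for one word w_i assigned to cluster m.

    acquired  -- set of SCF indices j acquired from the corpus for w_i
    centroid  -- binary tuple c_m of the word's cluster
    conf      -- list of confidence values conf_ij, one per SCF index
    threshold -- ASSUMPTION: SCFs with conf_ij below this value are dropped
    """
    zeros = {j for j, c in enumerate(centroid) if c == 0}
    ones = {j for j, c in enumerate(centroid) if c == 1}
    # Stage 1: eliminate SCFs where c_mj = 0, add SCFs where c_mj = 1.
    generalized = (set(acquired) - zeros) | ones
    # Stage 2: eliminate implausible SCFs by their confidence value.
    return {j for j in generalized if conf[j] >= threshold}


result = centroid_cutoff({0, 2}, (1, 1, 0), [0.9, 0.05, 0.8], 0.1)
```

In this toy call, SCF 2 is removed by the centroid, SCF 1 is added by the centroid and then removed again because its confidence is below the assumed threshold, leaving only SCF 0.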

Figure 3: Clustering algorithm for SCF confidence-value distributions

lexicon of the target grammar, we also represent SCF confidence-value vectors for the words in the target grammars. In this paper, we express SCF confidence-value vectors $v_i$ for words in the SCF lexicon of the target grammar by:

\[
v_{ij} =
\begin{cases}
1 - \varepsilon & w_i \text{ has } s_j \text{ in the lexicon} \\
\varepsilon & \text{otherwise}
\end{cases}
\tag{7}
\]

where $\varepsilon$ expresses an unreliability of the lexicon. In this study, we simply set it to the machine epsilon. In other words, we trust the lexicon as much as possible.

3.2 Clustering Algorithm for SCF Confidence-Value Distributions

We next present a k-Means-like clustering algorithm for SCF confidence-value vectors, as shown in Figure 3. Given an initial assignment of data objects to $k$ clusters, our algorithm computes a representative value for each cluster, called a centroid. Our algorithm then iteratively updates clusters by assigning each object to its closest centroid and recomputing centroids until cluster members become stable.
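The loop can be sketched as follows. This is an illustrative sketch under stated assumptions: it uses the probability-style distance and the binary centroid rule that the paper gives as Equations 8 and 9, treats the centroid with the highest probability as the closest one, and starts from given initial centroids. The function names, the `max_iter` guard, and the handling of empty clusters are assumptions, not part of the paper.

```python
import math


def distance(v, c):
    """Equation 8: probability that v has exactly the SCF set encoded by c."""
    return math.prod(vj if cj == 1 else 1.0 - vj for vj, cj in zip(v, c))


def centroid(members):
    """Equation 9: element j is 1 when prod(v_ij) > prod(1 - v_ij)."""
    dims = len(members[0])
    return tuple(
        1 if math.prod(v[j] for v in members)
        > math.prod(1.0 - v[j] for v in members) else 0
        for j in range(dims)
    )


def cluster(vectors, centroids, max_iter=100):
    """Alternate assignment and centroid updates until members are stable."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):  # ASSUMPTION: iteration cap as a safety guard
        # Assign each vector to the centroid with the highest probability.
        new_assignment = [
            max(range(len(centroids)), key=lambda m: distance(v, centroids[m]))
            for v in vectors
        ]
        if new_assignment == assignment:  # cluster members are stable
            break
        assignment = new_assignment
        for m in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == m]
            if members:  # ASSUMPTION: an empty cluster keeps its old centroid
                centroids[m] = centroid(members)
    return assignment, centroids


vectors = [(0.9, 0.1, 0.8), (0.8, 0.2, 0.9), (0.1, 0.9, 0.2)]
assignment, final = cluster(vectors, [(1, 0, 1), (0, 1, 0)])
```

For long clusters the plain products may underflow; comparing sums of logarithms instead is one possible variation, at the cost of clamping values away from exactly 0 and 1.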

Although our algorithm is roughly based on the k-Means algorithm, it is different in an important respect. We define the elements of the centroid values of the obtained clusters as a discrete value of 0 or 1 because we want to obtain clusters which include words that have exactly the same set of SCFs. We then derive a distance function $d$ to calculate the distance from a data object $v_i$ to each centroid $c_m$. Since the distance function is used to determine the closest cluster for $v_i$, we define the function $d$ to output the probability that $v_i$ has the SCF set expressed by centroid $c_m$ as follows:

\[
d(v_i, c_m) = \prod_{c_{mj}=1} v_{ij} \cdot \prod_{c_{mj}=0} (1 - v_{ij}). \tag{8}
\]

By using this function, we can determine the closest cluster as $\arg\max_{C_m} d(v_i, c_m)$.

4 Experiments

We applied our method to an SCF lexicon acquired from 135,, which is the same data used in (Carroll and Fang, 2004). The number of the resulting SCFs is 14,783 for 3,864 word stems. We then translated them to an SCF lexicon for the XTAG English grammar (XTAG Research Group, 2001) by using a translation map manually defined by Ted Briscoe. It defines a mapping from 23 out of 163 possible SCF types into 13 out of 57 XTAG SCFs, called tree families, listed in Table 1. The number of resulting SCFs for the XTAG English grammar was 6,742 for 2,860 word stems.

⁴ When a lexicon of the grammar is not comprehensive or is less accurate, we should determine the number of clusters using other algorithms (Bischof et al., 1999; Hamerly, 2003).
