Generalizing Subcategorization Frames Acquired from Corpora
Published: 2021-06-08
This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con
(#S(EPATTERN :TARGET |ftp|
 :SUBCAT (VSUBCAT NONE) :CLASSES (22 2985) :RELIABILITY 0
 :FREQSCORE 0.01640195 :FREQCNT 2
 :TLTL (VVD VV0)
 :SLTL (((|ssh| NN1))) :OLT1L NIL :OLT2L NIL
 :OLT3L NIL :LRL 0))
Figure 1: An acquired SCF for a verb “ftp”
the lexicon of the XTAG English grammar, and then compared the results with those obtained by naive frequency cut-off.
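As a rough illustration of how an entry like the one in Figure 1 might be read programmatically, the sketch below pulls out the three fields this study relies on: the word stem (TARGET), the SCF ID (the first number in CLASSES), and the frequency count (FREQCNT). The whitespace in the entry string, the function name `read_scf_entry`, and the regular-expression approach are all assumptions made for illustration; they are not part of the acquisition system described in the paper.

```python
import re

# Entry text from Figure 1. The spacing is an assumption: the field layout
# was reconstructed from the printed figure.
ENTRY = """(#S(EPATTERN :TARGET |ftp|
 :SUBCAT (VSUBCAT NONE) :CLASSES (22 2985) :RELIABILITY 0
 :FREQSCORE 0.01640195 :FREQCNT 2
 :TLTL (VVD VV0)
 :SLTL (((|ssh| NN1))) :OLT1L NIL :OLT2L NIL
 :OLT3L NIL :LRL 0))"""


def read_scf_entry(text):
    """Hypothetical helper: extract word stem, SCF ID and frequency count."""
    # TARGET holds the word stem between vertical bars.
    target = re.search(r":TARGET\s*\|([^|]+)\|", text).group(1)
    # The first number inside CLASSES is taken as the SCF ID.
    scf_id = int(re.search(r":CLASSES\s*\(\s*\(?\s*(\d+)", text).group(1))
    # FREQCNT is the number of times the stem was observed with this SCF.
    freqcnt = int(re.search(r":FREQCNT\s*(\d+)", text).group(1))
    return {"target": target, "scf_id": scf_id, "freqcnt": freqcnt}


entry = read_scf_entry(ENTRY)
```

Here `entry` holds the stem `ftp`, SCF ID 22, and a frequency count of 2, matching the values the figure reports.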
Figure 2: Probability distributions of SCFs for “apply”
2 Background
2.1 Acquisition of SCFs for Lexicalized Grammars
We start by acquiring SCFs for a lexicalized grammar from corpora by the method described in (Carroll and Fang, 2004). In their study, they first acquire fine-grained SCFs by the method proposed by (Briscoe and Carroll, 1997; Korhonen, 2002). Figure 1 shows an example of one acquired SCF entry for a verb “ftp.” Each acquired SCF entry has several fields about the observed SCF; we explain here only the portion related to this study. The TARGET field is a word stem (|ftp| in Figure 1), the first number in the CLASSES field indicates an SCF ID (22 in Figure 1), and FREQCNT shows how often words derivable from the word stem appeared with the SCF identified by that SCF ID in the training corpus (2 times in Figure 1). The obtained SCFs comprise 163 relatively fine-grained SCF types in total, which are originally based on the SCFs in the ANLT (Boguraev and Briscoe, 1987) and COMLEX (Grishman et al., 1994) dictionaries. In this example, SCF ID 22 corresponds to the SCF of an intransitive verb.
They then obtain SCFs for the target lexicalized grammar (the LinGO English Resource Grammar (Flickinger, 2000) in their study) by using a handcrafted translation map from each of these 163 types to one of the SCF types in the target grammar. They report a coverage improvement of 4.5 percentage points (52.7% to 57.2%), at the cost of parsing time that more than doubled (9.78 sec. to 21.78 sec.).
This approach is easily extensible to any lexicalized grammar, provided the grammar has an organized lexicon architecture that derives the possible lexical entries from each SCF the grammar defines. Existing lexicalized grammars are usually equipped with this kind of organization, e.g., lexical types in the LinGO ERG and tree families in the XTAG English grammar.
2.2 Clustering of Verb SCF Distributions
There is some related work on clustering of SCF probability distributions (Schulte im Walde and Brew, 2002; Korhonen et al., 2003). These studies aim at obtaining verb semantic classes, which are closely related to the syntactic behavior of argument selection.
Schulte im Walde and Brew (2002) employed clustering of verb SCF distributions to induce verb semantic classes. They first represent the SCF distribution of each verb as an n-dimensional vector, in which each element is the probability that the verb appears with the corresponding SCF. They then perform k-Means clustering (Forgy, 1965) on these vectors in order to obtain verb semantic classes.
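The clustering step described above can be sketched as follows. This is a minimal plain k-means written for illustration, not the implementation used in the cited work: the four verbs, their three-dimensional SCF probability vectors, and the fixed initial centroids are all made-up assumptions (initial centroids are often chosen at random from the data; they are fixed here so the run is reproducible).

```python
def kmeans(vectors, initial_centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    centroids = [list(c) for c in initial_centroids]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to the centroid with the
        # smallest squared Euclidean distance.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(v, centroids[k])),
            )
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        # A cluster with no members keeps its previous centroid.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical SCF distributions; columns are the probabilities of an
# intransitive, a transitive and a ditransitive frame (toy numbers).
verbs = ["sleep", "arrive", "eat", "read"]
distributions = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]

# Fixed starting centroids (an assumption made for determinism).
assignment, centroids = kmeans(distributions, [distributions[0], distributions[2]])
```

On this toy data the two mostly intransitive verbs end up in one cluster and the two mostly transitive verbs in the other, and each centroid is itself a probability distribution over SCFs.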
Korhonen et al. (2003) also clustered verb SCF distributions, using different clustering methods including nearest neighbors clustering and Information Bottleneck clustering (Tishby et al., 1999). They investigated the effect of polysemic verbs on clustering.
Although these studies demonstrated that a certain classification of verbs can be obtained by clustering verb SCF distributions, they do not focus on improving the quality of the SCF lexicon. In this paper, we focus on the problem of identifying whether a word can have each SCF, and we try to obtain word classes whose element words have the same set of SCFs.
3 Method
The basic idea of our method is first to obtain word classes whose element words have the same set of SCFs, using not only acquired SCFs but also existing SCFs in the target grammar. We then eliminate implausible acquired SCFs and add plausible unseen SCFs according to the set of SCFs represented by the centroids of the resulting clusters.
3.1 Representation of Confidence Values for SCFs
We represent an SCF confidence-value vector of each word w_i with a vector v_i, an object for clustering. Each element v_ij in v_i represents the confidence value of SCF