Meta-classifier approach to reliable text classification (13)
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Chapter 2
The CBS Data
The CBS has provided three datasets for this research, which are used to assess the performance of the meta-classifier approach. Section 2.1 gives some background information about the task and the datasets. In section 2.2 we describe the contents and the characteristics of these datasets. The data preparation process, which includes selection of informative fields and feature reduction, is discussed in section 2.3. Finally, in section 2.4 we summarize the chapter.
2.1 Background
The CBS is the Dutch national statistics institute. Among other things, the CBS collects data by interviewers using questionnaires. The provided datasets contain the answers on open-ended questions that can be found in these questionnaires. They concern the education or the profession of individuals. These data typically have a textual format and are in Dutch. Since the CBS wishes to convert the data into useful statistical information, the textual data have to be transformed into classes. Our task is to assign a symbolic code from a predefined set of such codes to the text records containing the answers given in the questionnaire. The codes for professions and educations are defined by the CBS through so-called SBC and SOI+ codes respectively [CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning, 2003, 2001].
So far most of the data is manually classified, like the provided datasets. In previous research the possibilities for automatic text classification have been explored [Smirnov et al., 2003a]. On the whole, little research has been carried out to classify answers to open-ended questions in questionnaires. Some dictionary-based approaches have been developed. They use specialized dictionaries that assign a text fragment automatically to a specific category if it contains words matching those in the dictionary relevant to the category. The text fragments can be preprocessed using different parsing techniques such as deletion of trivial words, suffix removal, and definition of synonyms [Macchia and D’Orazio, 2000]. Drawbacks of the dictionary-based approaches are that they are static and specialized dictionaries have to be developed manually by experts for each category. A text-classification approach based on supervised machine learning is a better solution in this case, since there are no specialized dictionaries available but a large body of manually coded training data is on
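To make the dictionary-based approach mentioned in this section concrete, the following is a minimal sketch in Python, not the method of any cited work. The stop words, the suffix list, the synonym table, and the category codes TEACHER and NURSE are all hypothetical placeholders; they are not taken from the CBS data or from the SBC and SOI+ code lists.

```python
import re

# All word lists and codes below are hypothetical illustrations,
# not actual CBS dictionaries or SBC/SOI+ codes.
STOP_WORDS = {"de", "het", "een", "in", "van"}            # trivial words to delete
SUFFIXES = ("en", "s")                                     # crude suffix removal
SYNONYMS = {"leraar": "docent", "onderwijzer": "docent"}   # synonym definitions

CATEGORY_WORDS = {
    "TEACHER": {"docent"},
    "NURSE": {"verpleegkundige"},
}


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least four characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop trivial words, strip suffixes, map synonyms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [strip_suffix(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]


def classify(text):
    """Return a category code if exactly one dictionary matches, else None."""
    tokens = set(normalize(text))
    matches = [code for code, words in CATEGORY_WORDS.items() if tokens & words]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    for answer in ["Docenten in het basisonderwijs", "onderwijzers", "timmerman"]:
        print(answer, "->", classify(answer))
```

An answer such as "timmerman" stays uncoded until an expert adds it to a dictionary by hand, which illustrates the static nature of the approach that the text names as its main drawback.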