Meta-classifier approach to reliable text classification (18)
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Chapter 3
Text Classification
In the previous chapter we described the classification tasks of the CBS and the data preparation process. In the rest of the thesis we focus on reliable text classification for the CBS. The first step in the meta-classifier approach is to generate a base classifier. This chapter studies the base text classifiers derived on the CBS data. We answer our first research question, that is concerned with finding an efficient and accurate text classifier. Section 3.1 elaborates on text representation and section 3.2 discusses classification algorithms. In section 3.3 experiments with text classifiers on the CBS data are described. Finally, in section 3.4 we give a summary of this chapter.
3.1 Text Representation
Text documents typically consist of strings of characters that cannot be directly interpreted by a classifier. Each document needs to be transformed into a compact representation of its content appropriate for automatic classification. The vector representation is the most frequently used document representation [Mladenić, 1999]. A document will be represented as a vector of weights $d_j = \langle w_{1j}, \ldots, w_{mj} \rangle$, where $M$ is the set of features that occur at least once in at least one document in the data set, with $m = |M|$, and the weights $0 \le w_{kj} \le 1$ represent how much feature $f_k$ contributes to the semantics of document $d_j$. Different ways to define features and to compute feature weights lead to different text representation approaches [Sebastiani, 2002]. Four possibilities to define features are as follows.
Words. Information retrieval research suggests that words work well as features and that their ordering in a record is of minor importance for many tasks [Lewis, 1992]. When the weights of the features are binary, identifying features with words is called the bag-of-words approach.
Phrases. Using phrases as features has proven to be not very successful. Different ways of defining phrases were investigated by Lewis [1992].
Word n-grams. A feature is defined for each word sequence containing up to n consecutive words. This representation works best if the sequence of words is important for the context and the documents are short [Mladenić, 1998].
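The word and word n-gram representations described above can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: it assumes lowercasing and whitespace tokenization (neither is specified in the text), uses binary weights only, and the function names `word_ngrams`, `build_feature_index` and `binary_vector` are hypothetical. With n = 1 it reduces to the bag-of-words approach.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, ...]


def word_ngrams(tokens: List[str], n: int) -> List[Feature]:
    """Return every sequence of 1 up to n consecutive words in tokens."""
    grams: List[Feature] = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(tuple(tokens[start:start + size]))
    return grams


def tokenize(doc: str) -> List[str]:
    """Assumed tokenization: lowercase, then split on whitespace."""
    return doc.lower().split()


def build_feature_index(docs: List[str], n: int) -> Dict[Feature, int]:
    """Map each feature that occurs at least once in at least one
    document to a position k in the weight vector."""
    features = sorted({g for doc in docs for g in word_ngrams(tokenize(doc), n)})
    return {feature: k for k, feature in enumerate(features)}


def binary_vector(doc: str, index: Dict[Feature, int], n: int) -> List[int]:
    """Represent doc as a vector of binary weights, one per feature.
    Features not present in the index are ignored."""
    vector = [0] * len(index)
    for gram in set(word_ngrams(tokenize(doc), n)):
        if gram in index:
            vector[index[gram]] = 1
    return vector


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat"]
    index = build_feature_index(docs, n=2)
    for doc in docs:
        print(doc, binary_vector(doc, index, n=2))
```

Binary weights satisfy the constraint 0 ≤ w_kj ≤ 1 trivially. A weighting scheme that reflects how often a feature occurs would replace the assignment `vector[index[gram]] = 1` with a normalized score.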