Meta-classifier approach to reliable text classification(19)
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Character n-grams. Features are formed by all sequences of n characters
in a text. A common value for n is 3. The character n-gram representation
was established quite some years ago [Shannon, 1951]. An advantage of
this representation is that it copes better with spelling mistakes.
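To make the robustness claim concrete, the following minimal sketch extracts character n-grams and compares a word with a misspelled variant. The function names and the use of Jaccard overlap as a similarity measure are illustrative assumptions, not part of the original text.

```python
def char_ngrams(text, n=3):
    """Return all overlapping sequences of n characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of two n-gram sets (an illustrative measure only)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A spelling mistake changes only the n-grams that touch the error.
print(char_ngrams("text"))
print(ngram_overlap("classifier", "clasifier"))
```

As whole-word features, "classifier" and "clasifier" share nothing, whereas most of their trigrams coincide.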
Besides the features, also the feature weights have to be determined. When
a feature occurs in a certain document, a measure has to determine the value
of this feature for the document. Some early naïve Bayes approaches use a
binary-valued vector representation [Robertson and Sparck-Jones, 1976]. The
feature value equals 1 if the feature occurs in the document, and 0 otherwise.
Information inherent in the frequencies of features is lost when only the absence
or presence of features is considered. Therefore new approaches that take into
account frequency of features in a document have been developed [Lewis, 1998].
The most straightforward adaptation is to consider the weight of a feature equal
to the number of occurrences of the feature in the document. This weighting
function is referred to as term frequency (tf) [Lewis, 1992]. The tf function
encodes the intuition that the more often a feature occurs in a document the
more it is representative of its content. The second popular weighting function
is the tfidf weighting function. It is the product of the term frequency and the
inverse document frequency (idf). The document frequency (df) of a feature
is the number of documents in the dataset in which the feature occurs at least
once. The inverse document frequency can be calculated from the document
frequency as follows:
$$\mathrm{idf}(f_i) = \log \frac{|N|}{\mathrm{df}(f_i)},$$
where |N| is the total number of documents in the dataset, and df(f_i) is the document
frequency of feature f_i. The idf function encodes the intuition that the more
documents a feature occurs in, the less discriminating it is [Sebastiani, 2002,
Smirnov et al., 2003b]. Variations on the tfidf weighting function can be
created through applying different logarithms, normalizations, and other correction
factors.
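The weighting functions described above can be sketched as follows. This is a minimal illustration, assuming documents are already split into lists of features; the function names, the toy documents, and the choice of the natural logarithm are assumptions (the text notes that other logarithms and normalizations are possible).

```python
import math
from collections import Counter


def binary_weights(tokens):
    """Binary representation: 1 if the feature occurs in the document."""
    return {f: 1 for f in set(tokens)}


def term_frequency(tokens):
    """tf: number of occurrences of each feature in one document."""
    return Counter(tokens)


def document_frequency(docs):
    """df: number of documents in which each feature occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def inverse_document_frequency(docs):
    """idf(f) = log(|N| / df(f)); natural logarithm is an assumption."""
    n_docs = len(docs)
    return {f: math.log(n_docs / count)
            for f, count in document_frequency(docs).items()}


def tfidf(tokens, idf):
    """tfidf: product of tf and idf. Assumes every feature has an idf value,
    i.e. the document belongs to the dataset the idf was computed from."""
    return {f: count * idf[f] for f, count in term_frequency(tokens).items()}


# Hypothetical toy dataset of three tokenized documents.
docs = [["nurse", "hospital", "nurse"], ["teacher", "school"], ["nurse", "school"]]
idf = inverse_document_frequency(docs)
print(tfidf(docs[0], idf))
```

A feature that occurs in every document gets idf = log(1) = 0, so its tfidf weight vanishes regardless of its term frequency.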
In this research we use the following text representation. To convert a text
document into an instance suitable for classification, we take the set of all words
that occur in the dataset as the feature space. Previous CBS research shows
that words work well as features for the Profession1 dataset; therefore, in this
research we use words as features [Smirnov et al., 2003a]. The feature weights of
the instances are computed by a tf or a tfidf weighting function. We only use
weighting functions that take the frequencies of features into account, in order
not to lose the information inherent in those frequencies.
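As a rough illustration of this representation, the sketch below builds a feature space from all words in a small dataset and converts one document into a tf-weighted instance. The tokenizer (lowercasing, ASCII letters only), the function names, and the example texts are assumptions made for the example, not details taken from the research.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_feature_space(texts):
    """Sorted list of all words that occur anywhere in the dataset."""
    return sorted({word for text in texts for word in tokenize(text)})


def to_tf_vector(text, feature_space):
    """One instance: the tf weight of each feature, in feature-space order."""
    index = {word: i for i, word in enumerate(feature_space)}
    vector = [0] * len(feature_space)
    for word in tokenize(text):
        if word in index:  # words outside the dataset vocabulary are ignored
            vector[index[word]] += 1
    return vector


# Hypothetical two-document dataset.
texts = ["Nurse in a hospital", "School teacher, school nurse"]
features = build_feature_space(texts)
print(features)
print(to_tf_vector(texts[1], features))
```

Every instance has the same length as the feature space, so most entries are zero for any single document.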
3.2 Classification Algorithms
When the text of a document is transformed into a compact representation, in
principle any machine-learning algorithm can be applied. Some special text
classification algorithms that incorporate feature definitions and weighting functions
have been proposed, for example the Rocchio classifier [Rocchio, 1971]. The
classification algorithms that are examined in this research, i.e., naïve Bayes and
nearest neighbour, can use either the tf or the tfidf weighting function.
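To show how a classifier can stay agnostic about the weighting function, here is a minimal nearest-neighbour sketch over sparse weight vectors. The choice of cosine similarity, the single-neighbour rule, and all names and labels are assumptions for illustration; the text does not specify the similarity measure or the number of neighbours used in this research.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbour(instance, training_set):
    """Return the label of the most similar training instance (1-NN).

    training_set is a non-empty list of (weight_vector, label) pairs; the
    weights may come from a tf or a tfidf weighting function.
    """
    _, best_label = max(
        training_set, key=lambda pair: cosine_similarity(instance, pair[0])
    )
    return best_label


# Hypothetical tf-weighted training instances with made-up labels.
training = [
    ({"nurse": 2, "hospital": 1}, "health"),
    ({"teacher": 1, "school": 1}, "education"),
]
print(nearest_neighbour({"school": 3}, training))
```

Swapping tf for tfidf only changes the numbers stored in the vectors; the classifier code is unchanged.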