Meta-classifier approach to reliable text classification
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
3.2. CLASSIFICATION ALGORITHMS
In this research we look for practical solutions that have small space and
time complexity. We only consider classification algorithms that meet these
requirements; support vector machines, for example, are not considered, although
they are a promising text classification technique [Joachims, 1998]. The following
three subsections discuss the naïve Bayes algorithm, the nearest neighbour
algorithm and a hierarchical approach to text classification.
3.2.1 Naïve Bayes
The naïve Bayes classifier assumes that all attributes are mutually independent
given the class. The parameters for each attribute can therefore be learned
separately, which simplifies the task considerably. This is especially beneficial
in text classification tasks, which typically have a large number of attributes.
During training of the naïve Bayes classifier, the prior probabilities of the
classes, and the probabilities of observing attribute values given the class, are
estimated from their frequencies in the training data. A new instance is
classified by using Bayes' theorem to identify the most probable class
[Mitchell, 1997].
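The training and classification steps described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the thesis: it assumes each instance is a list of words, and the function names `train_naive_bayes` and `classify_naive_bayes`, the add-one smoothing, and the use of log probabilities are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count class frequencies and per-class word frequencies in the training data."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> frequency
    vocabulary = set()
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify_naive_bayes(model, words):
    """Return the most probable class for a list of words using Bayes' theorem."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Log prior P(class), estimated from the class frequencies.
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing is an assumption of this sketch, not taken from the text.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because the attributes are treated as independent given the class, the score is simply a sum of per-word log probabilities, which is why each word's parameters can be estimated separately.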
3.2.2 Nearest Neighbour
The nearest neighbour classifier does not build an explicit, declarative
representation of the classes, but simply stores all the training instances. To
classify a new instance it relies on the class values attached to the training
instances that are similar to the new instance. The k nearest training instances
take a vote to settle on the class value when the class is nominal. The main
drawback of this approach is its inefficiency at classification time. To classify
a new instance, the entire training set needs to be ranked for similarity with
the new instance [Mitchell, 1997].
In the nearest neighbour classifier, k is the only important parameter that
needs to be set. Experiments on the CBS datasets show that the value k = 1
performs best, so in the remainder of this thesis we will use the 1-nearest
neighbour classifier for the text classifiers.
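A minimal sketch of the voting scheme, with k = 1 as the default, might look like this. The similarity measure is not specified in this section, so the Jaccard word overlap used here is purely an assumption, as are the names `overlap_similarity` and `classify_nearest_neighbour` and the representation of the training set as (words, label) pairs.

```python
from collections import Counter


def overlap_similarity(words_a, words_b):
    """Jaccard overlap between two word lists (a hypothetical similarity measure)."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def classify_nearest_neighbour(training_set, words, k=1):
    """Rank all stored instances by similarity and let the k nearest vote."""
    if not training_set:
        raise ValueError("training set is empty")
    # The whole training set is ranked for every query, which is the
    # classification-time cost mentioned in the text.
    ranked = sorted(
        training_set,
        key=lambda instance: overlap_similarity(instance[0], words),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

There is no training step beyond storing the instances; all the work happens in the ranking at query time.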
3.2.3 Hierarchical Classifier
A difficulty of the data provided by the CBS is the large number of classes.
The Education1 dataset has almost 6,000 classes. This is one of the reasons
that classification is rather time and space consuming. To overcome this
problem we have constructed a hierarchical classifier that is computationally
more efficient. The class attribute can be divided into multiple subclasses that
can be classified separately. Each of these subclasses has a small number of
class values. Assume that the class value can be divided into k subclasses and
that each subclass has n distinct class values. The hierarchical classifier then
classifies k times n classes, while a plain classifier classifies n^k classes,
which for most classifiers is computationally more demanding.
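The decomposition can be illustrated with a short sketch. It assumes that each class value is given as a tuple of k components (for example level and field) and uses a simple word-overlap 1-nearest-neighbour learner per component as a stand-in; all function names are hypothetical and the thesis's actual construction may differ.

```python
def train_component(instances, component_labels):
    """Hypothetical stand-in learner: store instances with one label component."""
    return list(zip(instances, component_labels))


def classify_component(model, words):
    """Return the label of the stored instance sharing the most words with `words`."""
    best = max(model, key=lambda pair: len(set(pair[0]) & set(words)))
    return best[1]


def train_hierarchical(instances, labels):
    """Train one classifier per class component instead of one over full class values."""
    k = len(labels[0])
    return [train_component(instances, [label[i] for label in labels]) for i in range(k)]


def classify_hierarchical(models, words):
    """Predict each component separately and combine them into the full class value."""
    return tuple(classify_component(model, words) for model in models)


def label_space_sizes(k, n):
    """Class values handled: k * n for the hierarchical classifier, n ** k for a plain one."""
    return k * n, n ** k
```

For instance, with k = 4 components of n = 10 values each, the component classifiers together distinguish 40 class values, whereas a single plain classifier would face 10,000. Any other learner could be substituted for the stand-in used here.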
Most hierarchical classifiers require a strict concept hierarchy, as in Chuang
et al. [2000]. However, our dataset contains several hierarchies rather than one.
The education level is a more general concept than the education sublevel, and
likewise the education field is more general than the education subfield.