Meta-classifier approach to reliable text classification
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
section.
The text classification tasks of the CBS differ from a common text
classification task in two aspects. Firstly, the number of classes is very
large. The Education1 dataset has approximately 6,000 classes, the profession
datasets have almost 1,600 classes. Secondly, the size of the instances is
smaller. While most classification tasks consider whole documents, in these
tasks an instance consists of only a few words. The instances do not contain
complete phrases. These two characteristics have considerable consequences on
the performance and efficiency of the text classification algorithms.
2.3 Data Preparation
Besides sampling, other techniques are used to further ease the computational
complexity. In the following subsections the selection of informative fields
and different feature reduction techniques is discussed.
2.3.1 Selection of Informative Fields
Records in the CBS datasets consist of multiple fields that might not all be
informative. One of the tasks in the previous research consisted of assessing
the contribution of the text fields to the classification performance. In turn
each text field of the Profession1 dataset was left out to determine the effect
on the accuracy of the text classifiers. The conclusion of this task was that
only two text fields are required for the classification of professions. These
text fields contain the profession and the daily activities of an individual
[Smirnov et al., 2003a]. We use the same two text fields to classify the
records in the Profession2 dataset, as the Profession2 dataset contains the
same fields as the Profession1 dataset.
To identify which fields are informative in the education datasets, each text
field was, also in turn, left out to determine the effect on the accuracy. It
turned out that leaving out certain text fields did not lead to a significant
improvement in classification accuracy. Hence all text fields are used for
classification of the Education1 and Education2 datasets. An explanation for
this result is that only a few fields are filled for each record in the
education datasets; most fields are empty for the majority of records. Fields
can be empty because they are irrelevant or unknown for a certain individual.
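
The leave-one-field-out procedure described above can be sketched as follows. This is a minimal illustration, not the setup used in the study: the function names (`tokens`, `accuracy`, `field_ablation`), the record layout (a dict with a `label` key) and the toy word-vote classifier are all assumptions made for the example, and any real text classifier could take the place of `accuracy`.

```python
from collections import Counter, defaultdict

def tokens(record, fields):
    """Collect the lowercased words from the selected text fields of one record."""
    words = []
    for f in fields:
        # Empty or missing fields contribute no words.
        words.extend((record.get(f) or "").lower().split())
    return words

def accuracy(train, test, fields):
    """Accuracy of a toy word-vote classifier (a stand-in for a real classifier)."""
    votes = defaultdict(Counter)  # word -> how often it occurs with each class
    for rec in train:
        for w in set(tokens(rec, fields)):
            votes[w][rec["label"]] += 1
    correct = 0
    for rec in test:
        tally = Counter()
        for w in set(tokens(rec, fields)):
            if w in votes:
                tally.update(votes[w])
        predicted = tally.most_common(1)[0][0] if tally else None
        correct += predicted == rec["label"]
    return correct / len(test)

def field_ablation(train, test, fields):
    """Leave each field out in turn and report the change in accuracy."""
    baseline = accuracy(train, test, fields)
    return {
        f: accuracy(train, test, [g for g in fields if g != f]) - baseline
        for f in fields
    }
```

A clearly negative value for a field suggests that the field is informative; a value near zero suggests it could be dropped without hurting accuracy.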
2.3.2 Feature-Reduction
A major problem in text categorization is the high dimensionality of the
feature space. In our case the feature space consists of all unique words that
occur in the datasets. For example, in a sample of 20,000 records, around
15,000 unique words are present. An instance is represented by all words that
occur in the informative fields of the text record. Feature reduction not only
reduces time and space complexity, but can also increase the classification
performance by removing irrelevant and uninformative features. We will use
feature-reduction based on document-frequency, removal of stop words and
stemming to reduce the number of features.
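
The three reduction steps named above can be combined roughly as in the sketch below. Everything specific in it is an assumption made for illustration: the stop-word list, the crude English suffix stripper `naive_stem` (a placeholder for a real stemmer), the function name `reduce_features`, and the choice to drop words that occur in fewer than `min_df` records. The text does not state which threshold, word list or stemmer was actually used.

```python
from collections import Counter

# Illustrative stop-word list; not the list used in the study.
STOP_WORDS = {"the", "of", "and", "a", "in"}

def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ers", "er", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def reduce_features(records, min_df=2):
    """Return (reduced records, vocabulary) after the three reduction steps."""
    # Steps 1 and 2: tokenise, remove stop words, stem the remaining words.
    processed = [
        {naive_stem(w) for w in rec.lower().split() if w not in STOP_WORDS}
        for rec in records
    ]
    # Step 3: document frequency = number of records that contain a feature.
    df = Counter(w for words in processed for w in words)
    vocabulary = {w for w, n in df.items() if n >= min_df}
    # Keep only the features that survive the document-frequency threshold.
    return [words & vocabulary for words in processed], vocabulary
```

Each instance is kept as a set of words, which matches the short records described earlier, where an instance consists of only a few words.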