Meta-classifier approach to reliable text classification
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Remove suffixes (-e, -en)
If (String.length > 1)
    Remove suffixes (-ig, -ing, -baar, -bar, -heid, -etj, -tj)
If (String.length > 1)
    Replace (-opleiding → -opl, administratief → adm)
If (String.length > 1)
    Replace suffixes (-v → -f, -z → -s)
    Replace suffixes (-dd → -d, -ff → -f, -gg → -g, -kk → -k,
                      -ll → -l, -mm → -m, -nn → -n, -pp → -p, -rr → -r,
                      -ss → -s, -tt → -t)
Figure 2.1: Pseudocode of the stemming algorithm.
occurring terms (administratief and opleiding) to their abbreviations (adm. and opl.), which also frequently occur.
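The stemming rules in Figure 2.1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: it assumes lower-cased input, removes at most one suffix per step (longest match first), treats administratief as a whole-word match, and reads each length check as guarding the step that follows it. All function and constant names are hypothetical.

```python
# Suffix lists from Figure 2.1, ordered longest first (an assumption).
STEP1_SUFFIXES = ("en", "e")
STEP2_SUFFIXES = ("baar", "heid", "ing", "etj", "bar", "ig", "tj")
VOICED_RULES = (("v", "f"), ("z", "s"))
# Doubled final consonants listed in the figure: dd, ff, gg, ..., tt.
DOUBLE_RULES = tuple((c * 2, c) for c in "dfgklmnprst")


def strip_suffix(word, suffixes):
    """Remove the first suffix in `suffixes` that the word ends with."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word


def replace_suffix(word, rules):
    """Apply the first (old, new) rule whose `old` suffix matches."""
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word


def stem(word):
    """Stem one lower-cased Dutch token following the steps of Figure 2.1."""
    word = strip_suffix(word, STEP1_SUFFIXES)
    if len(word) > 1:
        word = strip_suffix(word, STEP2_SUFFIXES)
    if len(word) > 1:
        if word == "administratief":  # whole-word match (assumption)
            word = "adm"
        else:
            word = replace_suffix(word, (("opleiding", "opl"),))
    if len(word) > 1:
        word = replace_suffix(word, VOICED_RULES)
        word = replace_suffix(word, DOUBLE_RULES)
    return word


if __name__ == "__main__":
    for token in ("werken", "mannen", "huizen", "bloemetje", "administratief"):
        print(token, "->", stem(token))
```

Under this literal reading the -ing rule of the second step applies before the -opleiding rule can match, so only administratief reaches its abbreviation here; the figure does not show how the original implementation orders these rules.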
The reduction in the number of attributes as a result of stemming is given in Table 2.2. The Education 1 dataset and Education 2 dataset contain the same attributes except for the class attribute. Therefore the results of the feature reduction are the same, and the table shows only one Education column that represents both datasets. For both profession files the reduction is around 10%; for the education files the reduction is 6.4%. Besides this reduction, an increase in classification performance has also been observed. The performance increase depends on the algorithm and the dataset used; it lies between 0.15% and 1.60%.

Dataset     Education   Profession 1   Profession 2
Sample 1    6.40%       10.00%         10.57%
Sample 2    6.40%       10.14%         10.14%
Sample 3    6.40%       9.97%          10.99%
Average     6.40%       10.04%         10.57%

Table 2.2: Reduction in the number of features as a result of stemming.
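As a quick check of the arithmetic behind Table 2.2, the sketch below recomputes the per-dataset averages from the three sample values reported in the table. The helper `reduction_percentage` shows how such a percentage could be derived from attribute counts; its name and the counts in its usage example are hypothetical, since the source reports only the percentages.

```python
# Per-sample reductions (in percent) as reported in Table 2.2.
REDUCTIONS = {
    "Education": [6.40, 6.40, 6.40],
    "Profession 1": [10.00, 10.14, 9.97],
    "Profession 2": [10.57, 10.14, 10.99],
}


def reduction_percentage(before, after):
    """Percentage of attributes removed, given counts before and after."""
    return 100.0 * (before - after) / before


def averages(reductions):
    """Mean reduction per dataset, rounded to two decimals."""
    return {
        name: round(sum(values) / len(values), 2)
        for name, values in reductions.items()
    }


if __name__ == "__main__":
    print(averages(REDUCTIONS))
    # Hypothetical counts: 1000 attributes before stemming, 900 after.
    print(reduction_percentage(1000, 900))
```

The recomputed means agree with the Average values in the table (6.40%, 10.04%, and 10.57%).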
2.4 Chapter Summary
The four CBS datasets provide us with challenging text classification tasks, due to the large numbers of records, attributes, and classes. Random samples of 10,000 and 20,000 records are taken from the datasets to reduce the computational complexity. Furthermore, only the informative fields have been selected and feature-reduction techniques have been applied. The feature-reduction techniques include document-frequency thresholding, removal of stop words, and stemming. Together they lead to a feature reduction of approximately 60%. The results of the data preparation are four datasets that are reduced in size and can be used for classification.
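The first two feature-reduction techniques named above can be illustrated with a short Python sketch. It assumes that records are already tokenized, that stop words are removed before document frequencies are counted, and that a term is kept when it occurs in at least `min_df` records. The function name, the stop-word list, the threshold, and the toy records are all hypothetical; the source does not state the values that were used.

```python
from collections import Counter


def select_features(documents, stop_words, min_df):
    """Return the terms that survive stop-word removal and DF thresholding.

    documents  -- list of token lists, one per record
    stop_words -- set of terms to discard outright
    min_df     -- minimum number of records a term must occur in
    """
    document_frequency = Counter()
    for tokens in documents:
        # Count each term once per record, ignoring stop words.
        document_frequency.update(set(tokens) - stop_words)
    return {term for term, df in document_frequency.items() if df >= min_df}


if __name__ == "__main__":
    records = [
        ["de", "leraar", "basisschool"],
        ["de", "leraar", "wiskunde"],
        ["de", "verpleegkundige"],
    ]
    kept = select_features(records, stop_words={"de", "het", "een"}, min_df=2)
    print(sorted(kept))
```

In this toy example only "leraar" occurs in at least two records after stop-word removal, so it is the single term that is kept.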