Meta-classifier approach to reliable text classification
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Dataset        Profession1   Profession2   Education1   Education2
# instances    192,466       63,823        49,092       49,092
# classes      1,563         1,578         5,991        8
# fields       5             5             11           11

Per-sample rows (values in extracted order):
sample size: ~19,150, ~9,450, ~19,000, ~9,800, ~19,600, ~9,800
# attributes per sample: ~5,400, ~3,300, ~5,200, ~2,900, ~4,650, ~2,900
# classes per sample: ~1,250, ~1,050, ~1,250, ~2,800, ~4,050, 8

Table 2.1: Description of the datasets.
hand [Giorgetti and Sebastiani, 2003]. A disadvantage of the text-classification approach is that when the definition of the codes changes, the training data is no longer relevant, and a new body of manually coded training data has to be provided.
2.2 Data Description
The CBS has provided three datasets for this research: Profession1, Profession2, and Education1. Datasets Profession1 and Profession2 contain text records describing professions and have similar structures. The text records in the profession datasets contain five fields. Each field contains the answer to a particular question, for example the company someone works for or someone's daily activities. Class values of these datasets are SBC codes (see CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning [2001] for a description of the SBC codes).
Education1 is the third dataset; it contains text records describing educations. The text records in this dataset consist of eleven fields. These fields contain, among other things, the name and level of education. Class values of this dataset are SOI+ codes (see footnote 1).
A fourth dataset is derived from the Education1 dataset. In all instances of this dataset the original class value of the Education1 dataset has been replaced with the first digit of this class value; all other fields remain the same. The result is the Education2 dataset, which has only 8 class values.
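The derivation of Education2 can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this research: the record layout (a dict with a `fields` list and a `label` string), the field contents, and the function name `to_education2` are all assumptions made for the example.

```python
def to_education2(record):
    """Return a copy of an Education1 record whose class value is
    reduced to its first digit; the text fields are left unchanged."""
    # Hypothetical record layout: {"fields": [...], "label": "<SOI+ code>"}
    return {"fields": list(record["fields"]), "label": record["label"][0]}


# Made-up text fields; the label is the example SOI+ code from footnote 1.
education1 = [
    {"fields": ["doctoraal", "universiteit"], "label": "603150-1-NVT-DOCTORAAL"},
]
education2 = [to_education2(r) for r in education1]
print(education2[0]["label"])  # 6
```

Because only the label is rewritten, both datasets share the same instances, which is consistent with the identical instance counts reported for Education1 and Education2.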
The basic statistics of the different datasets can be found in Table 2.1.
All four datasets are randomly sampled for testing to reduce the computational requirements. From the datasets, samples of approximately 10,000 and 20,000 instances are taken. The number of attributes in the table is the number of attributes after the feature reduction that will be described in the following
1 The SOI+ code consists of four parts, e.g., 603150-1-NVT-DOCTORAAL. The first six digits are a so-called SOI-code. The first digit of the SOI-code represents the level of education, the second digit represents the sublevel of education. The third digit and fourth digit together represent the education field, and the fifth digit and sixth digit together represent the education subfield. The second part of the SOI+ code is a serial number following the SOI-code, designed to map the SOI-codes unambiguously onto the international ISCED (International Standard Classification of Education) codes. The third part is the education track and the fourth part is the university phase [CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning, 2003].
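The structure of the SOI+ code described in this footnote can be illustrated with a small parser. This is a sketch under stated assumptions: the hyphen separator is inferred from the single example code, and the names `SoiPlusCode` and `parse_soi_plus` are hypothetical, not part of any CBS tooling.

```python
from typing import NamedTuple


class SoiPlusCode(NamedTuple):
    level: str      # first digit of the SOI-code
    sublevel: str   # second digit
    field: str      # third and fourth digits
    subfield: str   # fifth and sixth digits
    serial: str     # serial number used for the mapping onto ISCED
    track: str      # education track
    phase: str      # university phase


def parse_soi_plus(code: str) -> SoiPlusCode:
    # Assumption: the four parts are separated by hyphens, as in the example.
    parts = code.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected four hyphen-separated parts, got {code!r}")
    soi, serial, track, phase = parts
    if len(soi) != 6 or not soi.isdigit():
        raise ValueError(f"expected a six-digit SOI-code, got {soi!r}")
    return SoiPlusCode(soi[0], soi[1], soi[2:4], soi[4:6], serial, track, phase)


parsed = parse_soi_plus("603150-1-NVT-DOCTORAAL")
print(parsed.level, parsed.field, parsed.subfield)  # 6 31 50
```

The `level` component is the value that remains as the class label in the Education2 dataset.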