Meta-classifier approach to reliable text classification(14)

Date: 2026-01-21

A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl

Dataset       Profession1   Profession2   Education1   Education2
#instances    192,466       63,823        49,092       49,092
#classes      1,563         1,578         5,991        8
#fields       5             5             11           11

sample size: ~19,150, ~9,450, ~19,000, ~9,800, ~19,600, ~9,800
#attributes per sample: ~5,400, ~3,300, ~5,200, ~2,900, ~4,650, ~2,900
#classes per sample: ~1,250, ~1,050, ~1,250, ~2,800, ~4,050, 8

Table 2.1: Description of the datasets.

hand [Giorgetti and Sebastiani, 2003]. A disadvantage of the text-classification approach is that when the definition of the codes changes, the training data is no longer relevant, and a new body of manually coded training data has to be provided.

2.2 Data Description

The CBS has provided three datasets for this research: Profession1, Profession2, and Education1. Datasets Profession1 and Profession2 contain text records describing professions and have similar structures. The text records in the profession datasets contain five fields. Each field contains the answer to a particular question, for example the company someone works for or someone's daily activities. Class values of these datasets are SBC codes (see CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning [2001] for a description of the SBC codes).

Education1 is the third dataset; it contains text records describing educations. The text records in this dataset consist of eleven fields. These fields contain, among other things, the name and level of education. Class values of this dataset are SOI+ codes¹.

A fourth dataset is derived from the Education1 dataset. In all instances of this dataset the original class value of the Education1 dataset has been replaced with the first digit of this class value; all other fields remain the same. The result is the Education2 dataset, which has only 8 class values.
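
The derivation of Education2 amounts to a relabelling step. A minimal Python sketch of it follows, assuming each instance is held as a (fields, class value) pair and that the class value is a string beginning with the SOI-code digits. The function names and the field texts are hypothetical and are not taken from the CBS data; the only real class value used is the example code given in the footnote.

```python
def first_digit_label(class_value: str) -> str:
    """Return the first digit of an Education1 class value."""
    label = class_value.strip()[:1]
    if not label.isdigit():
        raise ValueError(f"class value does not start with a digit: {class_value!r}")
    return label


def derive_education2(instances):
    """Replace each class value by its first digit; the fields are kept unchanged."""
    return [(fields, first_digit_label(value)) for fields, value in instances]


# Hypothetical instances: the field texts are invented for illustration only.
education1 = [
    (("example field text A",), "603150-1-NVT-DOCTORAAL"),
    (("example field text B",), "603150-1-NVT-DOCTORAAL"),
]
education2 = derive_education2(education1)
```

Only the class value changes in this sketch, so the text fields of every instance are the same in both datasets, as the description requires.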

The basic statistics of the different datasets can be found in Table 2.1.

All four datasets are randomly sampled for testing to reduce the computational requirements. From the datasets, samples of approximately 10,000 and 20,000 instances are taken. The number of attributes in the table is the number of attributes after the feature reduction that will be described in the following

¹ The SOI+ code consists of four parts, e.g., 603150-1-NVT-DOCTORAAL. The first six digits are a so-called SOI-code. The first digit of the SOI-code represents the level of education; the second digit represents the sublevel of education. The third and fourth digits together represent the education field, and the fifth and sixth digits together represent the education subfield. The second part of the SOI+ code is a serial number following the SOI-code, designed to map the SOI-codes unambiguously onto the international ISCED (International Standard Classification of Education) codes. The third part is the education track and the fourth part is the university phase [CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning, 2003].
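
The structure described in this footnote can be made concrete with a small parser. This is a sketch only, assuming the four parts are separated by hyphens as in the example 603150-1-NVT-DOCTORAAL. The class name and attribute names are hypothetical, chosen to mirror the description rather than any CBS software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoiPlusCode:
    level: str               # first digit of the SOI-code
    sublevel: str            # second digit
    education_field: str     # third and fourth digits
    education_subfield: str  # fifth and sixth digits
    serial: str              # second part: serial number used for the ISCED mapping
    track: str               # third part: education track
    phase: str               # fourth part: university phase


def parse_soi_plus(code: str) -> SoiPlusCode:
    """Split an SOI+ code into its components (assumes hyphen-separated parts)."""
    parts = code.strip().split("-", 3)
    if len(parts) != 4:
        raise ValueError(f"expected four hyphen-separated parts: {code!r}")
    soi, serial, track, phase = parts
    if len(soi) != 6 or not soi.isdigit():
        raise ValueError(f"SOI-code must be six digits: {soi!r}")
    return SoiPlusCode(soi[0], soi[1], soi[2:4], soi[4:6], serial, track, phase)


parsed = parse_soi_plus("603150-1-NVT-DOCTORAAL")
```

For the example code, `parsed.level` is the digit that becomes the class value in the Education2 dataset.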
