Meta-classifier approach to reliable text classification(16)

Date: 2026-01-21

A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl

2.3. DATA PREPARATION

Document-frequency thresholding and removal of stop words are among the simplest techniques for feature reduction. The document frequency of a word is the number of documents in which the word occurs. Each word that has a document frequency less than some predefined threshold is removed from the feature space [Yang and Pedersen, 1997]. In our research all words that occur in only one text record are removed, which leads to an average reduction in the number of attributes of around 50%.
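The thresholding step can be illustrated with a minimal Python sketch. The function names, the `min_df` parameter, and the toy Dutch questions are assumptions made for illustration and do not come from the original implementation; `min_df=2` corresponds to removing every word that occurs in only one text record.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each word, the number of documents in which it occurs."""
    df = Counter()
    for tokens in documents:
        # set() makes sure a word is counted once per document
        df.update(set(tokens))
    return df


def prune_rare_words(documents, min_df=2):
    """Drop every word whose document frequency is below min_df."""
    df = document_frequencies(documents)
    vocabulary = {word for word, count in df.items() if count >= min_df}
    pruned = [[w for w in tokens if w in vocabulary] for tokens in documents]
    return pruned, vocabulary


# Toy text records (hypothetical, already tokenized)
docs = [
    ["hoe", "kan", "ik", "een", "huis", "kopen"],
    ["ik", "wil", "een", "auto", "kopen"],
    ["hoe", "verkoop", "ik", "een", "auto"],
]
pruned_docs, vocab = prune_rare_words(docs, min_df=2)
```

In this toy example four of the nine distinct words occur in a single record and are dropped, so the vocabulary shrinks to five words.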

The removal of function or stop words is another common technique that is applied. Stop words are topic-neutral words such as articles, prepositions, conjunctions, and abbreviations, which do not contribute to classification performance [Sebastiani, 2002]. The list of stop words contains 89 topic-neutral words. The reduction in the number of features is not so big, but the number of instances in which one or more stop words are removed is large.
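A small Python sketch of stop-word filtering follows. The 89-word list used in this research is not reproduced in the text, so `STOP_WORDS` below is a short, hypothetical subset of Dutch function words; the helper names are likewise assumptions. The second helper counts how many instances are affected, which is the quantity described above as large.

```python
# Hypothetical subset of Dutch function words, NOT the 89-word list of the text
STOP_WORDS = frozenset({"de", "het", "een", "en", "van", "in", "op", "of"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [w for w in tokens if w.lower() not in stop_words]


def count_affected(documents, stop_words=STOP_WORDS):
    """Count the documents that contain at least one stop word."""
    return sum(
        1 for tokens in documents
        if any(w.lower() in stop_words for w in tokens)
    )


records = [
    ["de", "prijs", "van", "het", "huis"],
    ["prijs", "auto"],
]
filtered = [remove_stop_words(tokens) for tokens in records]
affected = count_affected(records)
```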

A less frequently used pre-processing technique that we apply is stemming. The goal of stemming is to improve the performance of information-retrieval techniques such as text categorization by bringing under one heading variant forms of a word which share a common meaning. Stemming can reduce the number of features and improve classification performance at the same time. In general there are two stemming techniques. The first technique is algorithmic: stemming is done according to certain rules which strip suffixes, prefixes, or infixes. The second technique is dictionary-based stemming. In this case the stemmed version of a word can be looked up in a special dictionary. Since dictionary-based stemming is slower than algorithmic stemming and a Dutch stemming dictionary is not readily available, we will use algorithmic stemming in this research [Porter, 2001].
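The difference between the two techniques can be sketched in Python as follows. The suffix list, the minimum stem length, and the dictionary entries are hypothetical and chosen only for illustration; they are not the rules of any published Dutch stemmer.

```python
# Hypothetical suffix rules, longest first; not taken from a published stemmer
SUFFIXES = ("heden", "ingen", "en", "e", "s")
MIN_STEM_LENGTH = 3  # assumption: never strip a word down to fewer than 3 letters


def algorithmic_stem(word):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Hypothetical stemming dictionary; a real one would need an entry per word form
STEM_DICTIONARY = {"kopen": "koop", "liep": "loop"}


def dictionary_stem(word):
    """Look the stem up; fall back to the word itself when it is not listed."""
    return STEM_DICTIONARY.get(word, word)
```

The rule-based function handles any word without a lexical resource but may produce stems that are not valid words, such as `kop` for `kopen`. The lookup can return `koop`, including for irregular forms, but only for words that are listed.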

Kraaij and Pohlmann [1995] developed a Dutch version of the well-known Porter stemmer. A simplified version of this algorithm has been implemented for this project. The following two rules of the original algorithm have been omitted.

Removal of the affix ‘ge’ which occurs as a prefix or infix in Dutch participles. In the data concerned here participles are rare, attributable to the formulation of the questions. Most of the verbs are in their complete form. Removal of the affix ‘ge’ would be inappropriate in most of the cases.

Duplication of a vowel in a closed syllable. In Dutch long vowels are spelled single in open syllables, e.g., kopen, and double in closed syllables, e.g., koop. After removal of some affixes, e.g., lop(-en), the stem vowel needs to be doubled to render an orthographically correct stem. Although this procedure can lead to the stemming of more words to the same stem, it is not necessary for the automatic classifier that the words are orthographically correct. In addition, the procedure does not always work correctly: when e is the vowel in an unstressed syllable, it is never doubled. For example, the correct stem of kantelen can be either kanteel or kantel. Without word stress information it is impossible to predict the status of e consistently.
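To make the problem with the second omitted rule concrete, the sketch below gives a naive rendering of vowel duplication in Python. It is an assumption-laden illustration, not the rule as formulated by Kraaij and Pohlmann: it doubles a single a, e, o, or u that stands before a single final consonant, and it is meant to be applied only to a stem from which a suffix has just been stripped.

```python
VOWELS = "aeiou"
DOUBLING_VOWELS = "aeou"  # assumption: 'i' is left alone in this sketch


def double_final_vowel(stem):
    """Double a single vowel that stands before a single final consonant."""
    if len(stem) < 2:
        return stem
    vowel, final = stem[-2], stem[-1]
    # A vowel that follows another vowel (e.g. 'oo', 'ui') is already long
    preceded_by_vowel = len(stem) >= 3 and stem[-3] in VOWELS
    if vowel in DOUBLING_VOWELS and final not in VOWELS and not preceded_by_vowel:
        return stem[:-1] + vowel + final
    return stem
```

With this sketch `kop` and `lop` become `koop` and `loop`, but `kantel` becomes `kanteel`: the function has no access to word stress, so it cannot tell that the e is unstressed and should stay single.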

The resulting algorithm consists of a series of suffix strippers; the pseudocode of the algorithm can be found in figure 2.1. The string in this pseudocode is the word that is stemmed. Besides the suffix strippers of Kraaij and Pohlmann, two strippers specific to this domain are added. They strip two frequently
