Meta-classifier approach to reliable text classification(16)
Date: 2026-01-21
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
2.3. DATA PREPARATION
Document-frequency thresholding and removal of stop words are among the
simplest techniques for feature reduction. The document frequency of a word
is the number of documents in which the word occurs. Each word that has a
document frequency less than some predefined threshold is removed from the
feature space [Yang and Pedersen, 1997]. In our research all words that occur
in only one text record are removed, which leads to an average reduction in
the number of attributes of around 50%.
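The thresholding step can be sketched as follows. This is a minimal illustration, assuming that each text record has already been tokenized into a list of words; the function names, the `min_df` parameter, and the sample records are hypothetical and not taken from the original implementation.

```python
from collections import Counter


def document_frequencies(records):
    """Count, for each word, the number of records in which it occurs."""
    df = Counter()
    for tokens in records:
        # set() makes a word count once per record, however often it repeats
        df.update(set(tokens))
    return df


def prune_rare_words(records, min_df=2):
    """Remove every word whose document frequency is below min_df."""
    df = document_frequencies(records)
    return [[w for w in tokens if df[w] >= min_df] for tokens in records]


# Hypothetical sample records (not from the data set described in the text)
records = [
    ["printer", "werkt", "niet"],
    ["printer", "print", "niet"],
    ["wachtwoord", "werkt", "niet", "niet"],
]
pruned = prune_rare_words(records)  # min_df=2 drops words seen in one record only
```

With `min_df=2` the sketch removes exactly the words that occur in a single record, which corresponds to the threshold described above.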
The removal of function words or stop words is another common technique that
is applied. Stop words are topic-neutral words such as articles, prepositions,
conjunctions, and abbreviations, which do not contribute to classification
performance [Sebastiani, 2002]. The list of stop words contains 89
topic-neutral words. The reduction in the number of features is small, but
the number of instances from which one or more stop words are removed is
large.
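A minimal sketch of stop-word removal is given here. The 89-word list itself is not reproduced in the text, so `STOP_WORDS` is a small hypothetical stand-in, and the helper names and sample records are likewise assumptions made for illustration.

```python
# Hypothetical stand-in for the 89-word stop list, which the text does not give
STOP_WORDS = frozenset({"de", "het", "een", "en", "van", "in", "op", "is"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [w for w in tokens if w.lower() not in stop_words]


def count_affected(records, stop_words=STOP_WORDS):
    """Count the records that contain at least one stop word."""
    return sum(
        any(w.lower() in stop_words for w in tokens) for tokens in records
    )


# Hypothetical sample records
records = [
    ["de", "printer", "is", "kapot"],
    ["wachtwoord", "vergeten"],
    ["inloggen", "op", "het", "netwerk"],
]
cleaned = [remove_stop_words(tokens) for tokens in records]
affected = count_affected(records)
```

Counting affected records separately from removed features mirrors the distinction made above: a short stop list removes few features, yet it can touch many instances.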
A less frequently used pre-processing technique that we apply is stemming.
The goal of stemming is to improve the performance of information-retrieval
techniques such as text categorization by bringing under one heading variant
forms of a word which share a common meaning. Stemming can reduce the
number of features and improve classification performance at the same time.
In general there are two stemming techniques. The first technique is
algorithmic: stemming is done according to certain rules which strip
suffixes, prefixes, or infixes. The second technique is dictionary-based
stemming. In this case the stemmed version of a word can be looked up in a
special dictionary. Since dictionary-based stemming is slower than
algorithmic stemming and a Dutch stemming dictionary is not readily
available, we will use algorithmic stemming in this research [Porter, 2001].
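The difference between the two techniques can be sketched as follows. The suffix list, the minimum stem length, and the lookup table are hypothetical and chosen only for illustration; they are not the rule set of any published Dutch stemmer.

```python
# Hypothetical suffix rules, tried in this order (longest first)
SUFFIXES = ("heden", "ing", "en", "e", "s")
MIN_STEM_LENGTH = 3  # assumption: never strip a word down to under 3 letters


def algorithmic_stem(word):
    """Strip the first matching suffix that leaves a long enough stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Hypothetical stemming dictionary covering a few forms of "kopen" (to buy)
STEM_DICTIONARY = {"kopen": "koop", "kocht": "koop", "gekocht": "koop"}


def dictionary_stem(word):
    """Look the stem up; unknown words are returned unchanged."""
    return STEM_DICTIONARY.get(word, word)
```

In this sketch the rule-based function handles any word without extra resources, for example `algorithmic_stem("boeken")` yields `"boek"`, while only the dictionary can map an irregular form such as `kocht` to `koop`.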
Kraaij and Pohlmann [1995] developed a Dutch version of the well-known
Porter stemmer. A simplified version of this algorithm has been implemented
for this project. The following two rules of the original algorithm have been
omitted.
- Removal of the affix 'ge', which occurs as a prefix or infix in Dutch
  participles. In the data concerned here participles are rare, which is
  attributable to the formulation of the questions. Most of the verbs are in
  their complete form. Removal of the affix 'ge' would be inappropriate in
  most of the cases.
- Duplication of a vowel in a closed syllable. In Dutch, long vowels are
  spelled single in open syllables, e.g., kopen, and double in closed
  syllables, e.g., koop. After removal of some affixes, e.g., lop(-en), the
  stem vowel needs to be doubled to render an orthographically correct stem.
  Although this procedure can lead to the stemming of more words to the same
  stem, it is not necessary for the automatic classifier that the words are
  orthographically correct. In addition, the procedure does not always work
  correctly: when e is the vowel in an unstressed syllable, it is never
  doubled. For example, the correct stem of kantelen can be either kanteel
  or kantel. Without word stress information it is impossible to predict the
  status of e consistently and correctly.
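The effect of the omitted vowel-duplication rule can be illustrated with a deliberately naive sketch. Both functions are hypothetical simplifications written for this illustration: the suffix stripper handles only '-en', and the doubling rule looks only at spelling, which is exactly why it cannot tell a stressed vowel from an unstressed e.

```python
VOWELS = set("aeiou")


def strip_en(word):
    """Strip the suffix '-en' from words longer than four letters."""
    return word[:-2] if word.endswith("en") and len(word) > 4 else word


def double_final_vowel(stem):
    """Naive duplication: a final consonant-vowel-consonant pattern gets
    its vowel doubled, e.g. 'lop' becomes 'loop'."""
    if (
        len(stem) >= 3
        and stem[-1] not in VOWELS
        and stem[-2] in VOWELS
        and stem[-3] not in VOWELS
    ):
        return stem[:-1] + stem[-2] + stem[-1]
    return stem


without_rule = strip_en("lopen")                   # stays distinct from 'loop'
with_rule = double_final_vowel(strip_en("lopen"))  # merges with 'loop'
ambiguous = double_final_vowel(strip_en("kantelen"))
```

Without duplication, `lopen` is reduced to `lop`, which is not orthographically correct and does not coincide with the form `loop`, but it is still a consistent feature for the classifier. With duplication, the spelling-only rule also turns `kantel` into `kanteel`, the case the text describes as unpredictable without word stress information.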
The resulting algorithm consists of a series of suffix strippers; the
pseudocode of the algorithm can be found in figure 2.1. The string in this
pseudocode is the word that is stemmed. Besides the suffix strippers of
Kraaij and Pohlmann, two strippers specific to this domain are added. They
strip two frequently