数据挖掘取样方法研究_胡文瑜(4)
发布时间:2021-06-05
发布时间:2021-06-05
48计算机研究与发展 2011,48(1)
6.WeightedSampling(加权取样)
在近似聚集查询处理中,当聚集属性的分布是偏斜的(skewed)以及低选择性时,加权取样能克服
均匀取样的局限性.加权取样是带权值的偏倚取样方法,使用率高的小数据集中的元组权重更大.权值的获得需借助工作负载信息.
Table1 ComparisonandCharacterizationofRepresentativeSamplingMethods
表1 代表性取样方法特性比较
RefYear[1]1985
Whether
AlgorithmsReservoirSampling
K
UniformBiasedK
Characteristics
One-passalgorithm.Itworkswellwhentheincomingdatacontainsonlyinsertsandupdatesbutrunsintodifficultiesifthedatacontainsdeletions.
BasedonBernoulliSchemes,itcanbeusedtoconstructweightedsamples.
Yes
Yes
SupportDeletionNo
WhethertheErrorBoundedYes
UsedtoStreamModel?Yes
[5]1989[6]1995[7]1998[7]1998[8]2000
APRSampling
ConciseSamplingCountSamplingCongressionalSamples
K
K
K
K
Theconcisesamplinghasloweroverheadsbutthecountsamplingismoreaccurate.Bothconciseandcountingsamplesprovidemoreaccurateestimationsforthesamefootprintthanreservoirsamples.Theirapplicationbackgroundishotlistqueries.
Overcomethelimitationsofuniformsamplingwhichleadstopooraccuracyofgroup-byqueriesforgroupswithveryfewdataitems.
No Few No
Yes No Yes
Yes Yes Yes
[21]2001[9]2001[10]2001[11]2002
StratifiedSamplingWeightedSamplingDVSamplingBackingSample
KKK
AkindoftechniqueofapproximatequeryprocessingapplicabletotraditionalDB.
Beingabletoextendthisideaevenfurthertostreamqueryprocessing.
Caneasilybedeployedforstreamanalysisorstreamminingininversedistributionsondatastreams.Fortheuseofincrementalmaintenanceofapproximatehistograms.Thebackingsampleresideonthediskandisaccessedveryrarely.
Thesamplecanptdirectlyhandlearbitrarysequenceofinsertionsanddeletions.
Randomizedalgorithm,miningfrequentitemsoverstream.
Deterministicalgorithm.
Thisalgorithmperforms
No
No
KK
FewYesYes
FewYes
[12]2002[13]2002[13]2002
Chain-SamplingStickSamplingLossyCounting
KKK
FewYes
NoYesYes
NoYesYes
betterthanStickySamplinginpracticealthough
theoretically,itsworst-casespacecomplexityisworse.
K
ADV-samplingalgorithmwhichcanhandlearbitrarysequenceofinsertionsanddeletions.
Yes
Yes
[14]2005DynamicInverseSampling
7.DistinctSampling(DV-sampling)
一类能对流查询中的唯一值(distinctvalue)进行聚集的取样技术的统称.Gibbons等人在单遍扫描中对数据中的DV值进行取样,与其他取样方法相比,DV-Sampling不容易遗漏关系表中稀少出现的属性值,在一定情况下,能更正确地评估唯一值
[11]
的数目并增量维护对数据的插入和删除.此外,一个已构建并存储好的DV-Sampling能被用于正确评估范围查询中唯一值的数目,或者用于带谓词查询
的唯一值数目估计.取样是基于掷硬币的随机过程,使用Hash函数来映射项目值所属级别,达到一定级别的元组才被取样.相较于DV-Sampling,简单随