数据挖掘取样方法研究_胡文瑜(4)

发布时间:2021-06-05

48计算机研究与发展 2011,48(1)

6.WeightedSampling(加权取样)

在近似聚集查询处理中,当聚集属性的分布是偏斜的(skewed)以及低选择性时,加权取样能克服

均匀取样的局限性.加权取样是带权值的偏倚取样方法,使用率高的小数据集中的元组权重更大.权值的获得需借助工作负载信息.

Table1 ComparisonandCharacterizationofRepresentativeSamplingMethods

表1 代表性取样方法特性比较

RefYear[1]1985

Whether

AlgorithmsReservoirSampling

K

UniformBiasedK

Characteristics

One-passalgorithm.Itworkswellwhentheincomingdatacontainsonlyinsertsandupdatesbutrunsintodifficultiesifthedatacontainsdeletions.

BasedonBernoulliSchemes,itcanbeusedtoconstructweightedsamples.

Yes

Yes

SupportDeletionNo

WhethertheErrorBoundedYes

UsedtoStreamModel?Yes

[5]1989[6]1995[7]1998[7]1998[8]2000

APRSampling

ConciseSamplingCountSamplingCongressionalSamples

K

K

K

K

Theconcisesamplinghasloweroverheadsbutthecountsamplingismoreaccurate.Bothconciseandcountingsamplesprovidemoreaccurateestimationsforthesamefootprintthanreservoirsamples.Theirapplicationbackgroundishotlistqueries.

Overcomethelimitationsofuniformsamplingwhichleadstopooraccuracyofgroup-byqueriesforgroupswithveryfewdataitems.

No Few No

Yes No Yes

Yes Yes Yes

[21]2001[9]2001[10]2001[11]2002

StratifiedSamplingWeightedSamplingDVSamplingBackingSample

KKK

AkindoftechniqueofapproximatequeryprocessingapplicabletotraditionalDB.

Beingabletoextendthisideaevenfurthertostreamqueryprocessing.

Caneasilybedeployedforstreamanalysisorstreamminingininversedistributionsondatastreams.Fortheuseofincrementalmaintenanceofapproximatehistograms.Thebackingsampleresideonthediskandisaccessedveryrarely.

Thesamplecanptdirectlyhandlearbitrarysequenceofinsertionsanddeletions.

Randomizedalgorithm,miningfrequentitemsoverstream.

Deterministicalgorithm.

Thisalgorithmperforms

No

No

KK

FewYesYes

FewYes

[12]2002[13]2002[13]2002

Chain-SamplingStickSamplingLossyCounting

KKK

FewYes

NoYesYes

NoYesYes

betterthanStickySamplinginpracticealthough

theoretically,itsworst-casespacecomplexityisworse.

K

ADV-samplingalgorithmwhichcanhandlearbitrarysequenceofinsertionsanddeletions.

Yes

Yes

[14]2005DynamicInverseSampling

7.DistinctSampling(DV-sampling)

一类能对流查询中的唯一值(distinctvalue)进行聚集的取样技术的统称.Gibbons等人在单遍扫描中对数据中的DV值进行取样,与其他取样方法相比,DV-Sampling不容易遗漏关系表中稀少出现的属性值,在一定情况下,能更正确地评估唯一值

[11]

的数目并增量维护对数据的插入和删除.此外,一个已构建并存储好的DV-Sampling能被用于正确评估范围查询中唯一值的数目,或者用于带谓词查询

的唯一值数目估计.取样是基于掷硬币的随机过程,使用Hash函数来映射项目值所属级别,达到一定级别的元组才被取样.相较于DV-Sampling,简单随

精彩图片

热门精选

大家正在看