Estimating the quality of data in relational databases (7)
Published: 2021-06-07
a view map or a relation map, as appropriate. Now, the task is to partition this two-dimensional array into areas in which elements are distributed homogeneously with respect to our quality measures.
Note that the correctness of a particular nonkey attribute value can be determined only in reference to the key attribute of that tuple, i.e., in determining whether a specific cell should be 0 or 1, we consider the correctness of the pair (key value, nonkey value) when determining the correctness of an attribute value. The pair is correct if and only if both elements of the pair are correct. This means, in particular, that if a key attribute value is incorrect, then all pairs corresponding to this key attribute value are considered incorrect.
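The pair rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_view_map` and the two inputs (a per-tuple flag for key correctness and a per-cell flag for nonkey-value correctness) are hypothetical and are not taken from the paper.

```python
from typing import List


def build_view_map(key_correct: List[bool],
                   value_correct: List[List[bool]]) -> List[List[int]]:
    """Build a 0/1 array: one row per tuple, one column per nonkey attribute.

    Hypothetical inputs (assumptions for this sketch):
      key_correct[i]      -- whether the key value of tuple i is correct
      value_correct[i][j] -- whether nonkey value j of tuple i is correct
    """
    view_map = []
    for key_ok, row in zip(key_correct, value_correct):
        # A (key value, nonkey value) pair is correct iff both elements are
        # correct, so an incorrect key zeroes out the entire row.
        view_map.append([1 if (key_ok and value_ok) else 0 for value_ok in row])
    return view_map


if __name__ == "__main__":
    key_correct = [True, False, True]
    value_correct = [
        [True, True],
        [True, True],   # values are fine, but the key is wrong
        [False, True],
    ]
    print(build_view_map(key_correct, value_correct))
    # prints [[1, 1], [0, 0], [0, 1]]
```

The second tuple shows the special case from the text: its nonkey values are individually correct, yet every cell in its row is 0 because the key value is incorrect.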
The technique we use for partitioning the view map is a nonparametric statistical method called CART (Classification and Regression Trees) [2]. This method has been widely used for data analysis in biology, social science, environmental research, and pattern recognition. Closer to our area, this method was used in [3] for estimating the selectivity of selection queries. We assume that tuples and attributes of a relation are ordered uniquely.
4.2 Homogeneity Measure
Intuitively, a view is perfectly homogeneous with respect to a given property if every subview of the view contains the same proportion of pairs with this property as the view itself. Moreover, the more homogeneous a view, the closer its distribution of the pairs with the given property is to the distribution in the perfectly homogeneous view. Hence, the difference between the proportion of the pairs with the given property in the view itself and in each of its subviews can be used to measure the degree of homogeneity of the given view.
Specifically, let $\bar{v}$ denote an extension of a view of a relation in a stored database, let $v_1, \ldots, v_N$ be the set of all possible projection-selection views of $\bar{v}$, let $s(\bar{v})$ and $s(v_i)$ denote the proportion of pairs in views $\bar{v}$ and $v_i$ ($i = 1, \ldots, N$), respectively, that occur in their corresponding ideal representations (i.e., proportions of correct pairs in these views). Then

$$\frac{1}{N} \sum_{v_i \subseteq \bar{v}} \left( s(\bar{v}) - s(v_i) \right)^2$$

measures the homogeneity of the view $\bar{v}$ with respect to soundness. The homogeneity with respect to completeness is defined analogously. Similar measures of homogeneity were proposed in [6, 3].
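As a rough illustration, the measure can be computed by brute force on a small 0/1 array. The sketch below makes two assumptions that are not stated in the text: the view is given as a 0/1 array in which 1 marks a correct pair, and "all possible projection-selection views" is taken to mean every nonempty subset of rows combined with every nonempty subset of columns. The function names are hypothetical.

```python
from itertools import combinations
from typing import Iterator, List, Sequence, Tuple


def proportion_correct(view_map: List[List[int]],
                       rows: Sequence[int],
                       cols: Sequence[int]) -> float:
    """s(v): proportion of 1-cells in the subview given by rows x cols."""
    cells = [view_map[r][c] for r in rows for c in cols]
    return sum(cells) / len(cells)


def nonempty_subsets(n: int) -> Iterator[Tuple[int, ...]]:
    """Yield every nonempty subset of {0, ..., n-1} as a tuple of indices."""
    for size in range(1, n + 1):
        yield from combinations(range(n), size)


def homogeneity(view_map: List[List[int]]) -> float:
    """Mean squared difference between s(whole view) and s(each subview).

    Assumption: subviews are all (row subset, column subset) combinations,
    including the whole view itself. 0.0 means perfectly homogeneous.
    """
    n_rows, n_cols = len(view_map), len(view_map[0])
    s_whole = proportion_correct(view_map, range(n_rows), range(n_cols))
    squared_diffs = [
        (s_whole - proportion_correct(view_map, rows, cols)) ** 2
        for rows in nonempty_subsets(n_rows)
        for cols in nonempty_subsets(n_cols)
    ]
    return sum(squared_diffs) / len(squared_diffs)


if __name__ == "__main__":
    print(homogeneity([[1, 1], [1, 1]]))  # all cells correct: prints 0.0
    print(homogeneity([[1, 0]]))          # three subviews, value is 1/6
```

Under this reading, an array with m rows and n columns has (2^m - 1)(2^n - 1) subviews, so the enumeration is feasible only for very small arrays.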
Due to the large number of possible views, computation of these measures is often prohibitively expensive. The Gini index [2, 3] was proposed as a simple alternative to these homogeneity measures.
Consider a view $\bar{v}$ and a relation map $M$. We call the part of $M$ that corresponds to $\bar{v}$ a node (we use the terms node and view interchangeably). The Gini index of this node, denoted $G(\bar{v})$, is $2p(1 - p)$, where $p$ denotes the