COMP5318 Knowledge Discovery and Data Mining_2011 Semester 1

Published: 2021-06-06

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Data Mining
Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining
by Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004


Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!


Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2 in the five market-basket transactions

Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
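The two definitions can be checked directly on the five market-basket transactions. The following is a minimal sketch; the function names `support_count` and `support` are illustrative choices, not names from the slides.

```python
# The five market-basket transactions from the slides, one set per TID.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


sigma = support_count({"Milk", "Bread", "Diaper"}, transactions)
s = support({"Milk", "Bread", "Diaper"}, transactions)
print(sigma, s)
```

With a minsup threshold of, say, 0.4 (an assumed value), this itemset would count as frequent because its support of 2/5 meets the threshold.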


Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}, evaluated on the five market-basket transactions

\[ s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4 \]

\[ c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67 \]
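As a small illustration of the two metrics, the sketch below computes support and confidence for the rule {Milk, Diaper} → {Beer} on the market-basket transactions. The helper name `rule_metrics` is an assumption made for this example.

```python
# The five market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    X, Y = set(X), set(Y)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)  # sigma(X u Y)
    n_x = sum(1 for t in transactions if X <= t)         # sigma(X)
    s = n_xy / len(transactions)
    c = n_xy / n_x if n_x else 0.0  # confidence is undefined if X never occurs
    return s, c


s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))
```

Returning 0.0 when X never occurs is a convenience of this sketch; the slides do not define confidence for that case.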


Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!


Mining Association Rules

Example of Rules (from the five market-basket transactions):
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
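The observation that all binary partitions of one itemset share the same support can be reproduced with a short sketch. It enumerates every split of {Milk, Diaper, Beer} into a non-empty left-hand side and right-hand side; the names `count` and `rules` are illustrative.

```python
from itertools import combinations

# The five market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset):
    """Support count of `itemset` in the transactions above."""
    return sum(1 for t in transactions if itemset <= t)


itemset = frozenset({"Milk", "Diaper", "Beer"})
rules = {}
for r in range(1, len(itemset)):              # size of the left-hand side
    for lhs in combinations(sorted(itemset), r):
        X = frozenset(lhs)
        Y = itemset - X                       # the rest forms the right-hand side
        s = count(itemset) / len(transactions)  # same for every partition
        c = count(itemset) / count(X)           # depends on the left-hand side
        rules[(X, Y)] = (s, c)

for (X, Y), (s, c) in rules.items():
    print(sorted(X), "->", sorted(Y), f"s={s:.2f}", f"c={c:.2f}")
```

Because the support depends only on the full itemset, it can be computed once per itemset, and only the confidence has to be evaluated per rule.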


Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive


Frequent Itemset Generation

[Figure: itemset lattice over the items A, B, C, D, E, level by level]
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets


Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2^d !!!

[Figure: the N market-basket transactions, each of width up to w, are matched against a list of M candidates.]
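To make the cost of the brute-force approach concrete, the sketch below enumerates every non-empty itemset over the six items of the market-basket data and matches each one against each transaction. The minimum support count of 3 and the variable names are assumptions made for illustration; the empty set is left out, so there are 2^d − 1 candidates.

```python
from itertools import combinations

# The five market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

items = sorted(set().union(*transactions))  # d = 6 distinct items

# Every non-empty itemset in the lattice is a candidate: M = 2^d - 1.
candidates = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
]

minsup_count = 3   # assumed threshold, for illustration only
comparisons = 0    # number of (candidate, transaction) matches performed
frequent = {}
for c in candidates:
    n = 0
    for t in transactions:
        comparisons += 1
        if c <= t:
            n += 1
    if n >= minsup_count:
        frequent[c] = n

print(len(candidates), comparisons, len(frequent))
```

Even on this toy data, N × M comparisons are made to find a handful of frequent itemsets, and M doubles with every additional item.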


Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

\[ R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1 \]

If d=6, R = 602 rules
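The double sum and its closed form can be compared numerically. This is a minimal sketch; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def rule_count_sum(d):
    """R as the double sum: choose k items for the left-hand side,
    then a non-empty right-hand side from the remaining d - k items."""
    return sum(
        comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d)
    )


def rule_count_closed(d):
    """R in closed form: 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1


print(rule_count_sum(6), rule_count_closed(6))
```

For d = 6 the inner sums are 31, 15, 7, 3 and 1, giving 6·31 + 15·15 + 20·7 + 15·3 + 6·1 = 602, which agrees with 729 − 128 + 1.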


Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction


Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

Apriori principle holds due to the following property of the support measure:

\[ \forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y) \]

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
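The anti-monotone property can be checked exhaustively on a small data set. The sketch below does so for the market-basket transactions by comparing every pair of itemsets X ⊆ Y; it is an empirical check on one data set, not a proof, and the names used are illustrative.

```python
from itertools import combinations

# The five market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

items = sorted(set().union(*transactions))
itemsets = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
]

# Precompute the support of every non-empty itemset.
support = {
    x: sum(1 for t in transactions if x <= t) / len(transactions)
    for x in itemsets
}

# A violation would be a pair X subset-of Y with s(X) < s(Y).
violations = [
    (x, y)
    for x in itemsets
    for y in itemsets
    if x <= y and support[x] < support[y]
]
print(len(violations))
```

The reason no violation can occur is that every transaction containing Y also contains each subset X of Y, so X is counted at least as often as Y.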


Illustrating Apriori Principle

[Figure: the itemset lattice over items A to E (null; A ... E; AB ... DE; ABC ... CDE; ABCD ... BCDE; ABCDE). One itemset is marked "Found to be Infrequent" and all of its supersets are marked "Pruned supersets".]


Illustrating Apriori Principle

Minimum Support = 3

If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning, 6 + 6 + 1 = 13

Items (1-itemsets)

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

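Triplets (3-itemsets)

Itemset                 Count
{Bread, Milk, Diaper}   2

(Only transactions 4 and 5 contain all three items, consistent with the support count of 2 given for {Milk, Bread, Diaper} in the frequent-itemset definition.)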


Apriori Algorithm

Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
    Generate length (k+1) candidate itemsets from length k frequent itemsets
    Prune candidate itemsets containing subsets of length k that are infrequent
    Count the support of each candidate by scanning the DB
    Eliminate candidates that are infrequent, leaving only those that are frequent
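The method above can be sketched in a few lines of Python. This is a simplified illustration rather than an optimized implementation: candidates are generated by joining any two frequent k-itemsets whose union has k+1 items, and support is counted by a plain scan instead of a hash structure. The function name `apriori` and its parameters are assumptions for this example.

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Return {frequent itemset: support count} for the given transactions."""
    transactions = [frozenset(t) for t in transactions]

    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Generate length (k+1) candidates from length k frequent itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune candidates that contain an infrequent subset of length k.
            if all(frozenset(s) in frequent for s in combinations(union, k)):
                candidates.add(union)
        # Count the support of each surviving candidate by scanning the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Eliminate infrequent candidates.
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = apriori(transactions, minsup_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the five market-basket transactions with a minimum support count of 3, this sketch returns four frequent items (Bread, Milk, Diaper, Beer) and four frequent pairs.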


Reducing Number of Comparisons

Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure
    Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

[Figure: the N market-basket transactions are matched against a hash structure with k buckets.]
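One simple way to avoid comparing every transaction with every candidate is to keep the candidates in a hash table and look up only the subsets that a transaction actually contains. The sketch below uses a Python dict as that hash structure, which is a stand-in chosen for brevity and not necessarily the structure the lecture has in mind; `count_candidates` is a hypothetical helper name.

```python
from itertools import combinations


def count_candidates(transactions, candidates, k):
    """Count support of k-item candidates using a dict (hash table) lookup."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        # Enumerate only the k-subsets present in this transaction and
        # look each one up, instead of testing all candidates against it.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
pairs = [
    {"Bread", "Milk"}, {"Bread", "Beer"}, {"Bread", "Diaper"},
    {"Milk", "Beer"}, {"Milk", "Diaper"}, {"Beer", "Diaper"},
]
pair_counts = count_candidates(transactions, pairs, k=2)
for itemset, n in pair_counts.items():
    print(sorted(itemset), n)
```

The work per transaction now depends on the number of k-subsets of that transaction rather than on the total number of candidates, which is the trade-off the hash structure is meant to exploit.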


Introduction to Hash Functions

A Hash Function h is a mapping from a set X to a range of integers [0..k-1]. Thus each element of the set is mapped into one of k buckets. Each of the buckets will contain all the elements that are mapped by h into that bucket.


Example

A mod function is a good example of a hash function. For example, suppose we use h(x) = x mod 7. Then 0 to 6 are mapped to themselves, while 7 is mapped to 0 and 8 to 1. Thus the range of mod 7 is [0..6]; these are the buckets of mod 7.


Example

Suppose X is the set of integers 1..1000

Bucket  Elements
0       0, 7, 14, 21, ...
1       1, 8, 15, 22, ...
2       ...
3       ...
4       ...
5       ...
6       6, 13, 20, 27, ...
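The bucket assignment for h(x) = x mod 7 can be reproduced in a few lines. The sketch hashes the integers 0 to 27, a range chosen here only so that the output lines up with the elements listed in the example.

```python
def h(x, k=7):
    """Hash function h(x) = x mod k, mapping x into one of k buckets."""
    return x % k


# One list per bucket 0..6.
buckets = {b: [] for b in range(7)}
for x in range(0, 28):   # assumed range, for illustration
    buckets[h(x)].append(x)

for b, elements in buckets.items():
    print(b, elements)
```

Each bucket collects the integers that leave the same remainder on division by 7, so consecutive elements within a bucket differ by 7.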


Factors Affecting Complexity

Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets

Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may also increase

Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with number of transactions

Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)


Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have the same support as their supersets

TID |  A1  A2  A3  A4  A5  A6  A7  A8  A9 A10 |  B1  B2  B3  B4  B5  B6  B7  B8  B9 B10 |  C1  C2  C3  C4  C5  C6  C7  C8  C9 C10
  1 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0
  2 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0
  3 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0
  4 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0
  5 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0
  6 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0
  7 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0
  8 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0
  9 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0
 10 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1 |   0   0   0   0   0   0   0   0   0   0
 11 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1
 12 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1
 13 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1
 14 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1
 15 |   0   0   0   0   0   0   0   0   0   0 |   0   0   0   0   0   0   0   0   0   0 |   1   1   1   1   1   1   1   1   1   1

\[ \text{Number of frequent itemsets} = 3 \times \sum_{k=1}^{10} \binom{10}{k} \]

Need a compact representation
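The count of frequent itemsets for the 15-transaction table can be evaluated and cross-checked by enumeration. The sketch below rebuilds the table as three groups of five identical transactions and assumes a minimum support count of 5, a value that is not stated in the slides; itemsets that mix items from different groups never occur together and are therefore not enumerated.

```python
from itertools import combinations
from math import comb

# Items A1..A10, B1..B10, C1..C10; TIDs 1-5 contain all A items,
# TIDs 6-10 all B items, TIDs 11-15 all C items.
groups = {g: [f"{g}{i}" for i in range(1, 11)] for g in "ABC"}
transactions = []
for g in "ABC":
    transactions += [frozenset(groups[g])] * 5

minsup_count = 5  # assumed threshold, for illustration only

# Count from the formula: 3 * sum_{k=1}^{10} C(10, k).
formula = 3 * sum(comb(10, k) for k in range(1, 11))

# Cross-check by enumerating every non-empty subset within each group.
brute = 0
for names in groups.values():
    for k in range(1, 11):
        for c in combinations(names, k):
            if sum(1 for t in transactions if set(c) <= t) >= minsup_count:
                brute += 1

print(formula, brute)
```

Both counts come to 3 × (2^10 − 1) = 3069 itemsets, all with the same support as the 10-item superset of their group, which is why a compact representation is attractive.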
