COMP5318 Knowledge Discovery and Data Mining_2011 Semester 1
Published: 2021-06-06
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s)
  Fraction of transactions that contain both X and Y
– Confidence (c)
  Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
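The two metrics above can be checked directly on the five transactions. A minimal sketch in Python (variable and function names are mine, not from the slides):

```python
# Support and confidence for the rule {Milk, Diaper} -> {Beer}
# over the five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)       # 2/5 = 0.4
c = sigma(X | Y, transactions) / sigma(X, transactions)  # 2/3 ≈ 0.67
print(round(s, 2), round(c, 2))  # 0.4 0.67
```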
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
Mining Association Rules

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive
Frequent Itemset Generation

[Itemset lattice figure: null at the top; 1-itemsets A–E; 2-itemsets AB–DE; 3-itemsets ABC–CDE; 4-itemsets ABCD–BCDE; ABCDE at the bottom]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database of N transactions
– Match each transaction (of width w) against every one of the M candidates
– Complexity ~ O(NMw) => expensive, since M = 2^d !!!
Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1
If d=6, R = 602 rules
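As a sanity check, the rule count can be computed directly from the double sum; a small sketch (the closed form 3^d − 2^{d+1} + 1 follows from expanding the binomial sums):

```python
from math import comb

# Total number of possible association rules over d items:
# choose a non-empty antecedent of size k, then a non-empty
# consequent of size j from the remaining d-k items.
def num_rules(d):
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6))        # 602
print(3**6 - 2**7 + 1)     # 602, the closed form
```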
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

Apriori principle holds due to the following property of the support measure:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
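The anti-monotone property can be verified exhaustively on the slide's five transactions; a brute-force sketch (helper names are mine):

```python
from itertools import combinations

# Check s(X) >= s(Y) for every X that is a subset of Y (one item smaller),
# over all itemsets of up to 3 items from the slide's transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

for size in range(1, 4):
    for Y in combinations(items, size):
        for X in combinations(Y, size - 1):
            assert support(X) >= support(Y)  # subsets are at least as frequent
print("anti-monotone property holds")
```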
Illustrating Apriori Principle

[Itemset lattice figure: AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search space]
Illustrating Apriori Principle

Items (1-itemsets), minimum support count = 3:

Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets):
(No need to generate candidates involving Coke or Eggs)

Itemset        | Count
{Bread,Milk}   | 3
{Bread,Beer}   | 2
{Bread,Diaper} | 3
{Milk,Beer}    | 2
{Milk,Diaper}  | 3
{Beer,Diaper}  | 3

Triplets (3-itemsets):

Itemset             | Count
{Bread,Milk,Diaper} | 2

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
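The candidate counts can be verified with binomial coefficients:

```python
from math import comb

# Without pruning: all candidate itemsets up to size 3 over the 6 items.
print(comb(6, 1) + comb(6, 2) + comb(6, 3))  # 6 + 15 + 20 = 41
# With support-based pruning (per the slide): 6 items + 6 pairs + 1 triplet.
print(6 + 6 + 1)  # 13
```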
Apriori Algorithm

Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  – Prune candidate itemsets containing subsets of length k that are infrequent
  – Count the support of each candidate by scanning the DB
  – Eliminate candidates that are infrequent, leaving only those that are frequent
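The steps above can be sketched as a minimal Apriori implementation. This is an assumed illustration, not the textbook's code; function and variable names are mine:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Minimal Apriori sketch: candidate generation, subset pruning,
    support counting by database scan, per the method above."""
    items = sorted(set().union(*transactions))
    # Frequent 1-itemsets.
    Lk = [frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= minsup_count]
    frequent = list(Lk)
    k = 1
    while Lk:
        # Generate length-(k+1) candidates by joining frequent k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count support by scanning the DB; eliminate infrequent candidates.
        Lk = [c for c in candidates
              if sum(1 for t in transactions if c <= t) >= minsup_count]
        frequent += Lk
        k += 1
    return frequent

db = [{"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
print(sorted(map(sorted, apriori(db, 3))))
```

With a minimum support count of 3 on the slide's transactions, this returns the four frequent items and four frequent pairs; the single candidate triplet {Bread, Milk, Diaper} appears in only two transactions and is eliminated.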
Reducing Number of Comparisons

Candidate counting:
– Scan the database of N transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure with k buckets
– Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets
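A simplified stand-in for this idea (a dictionary of buckets rather than the textbook's hash tree; names and the bucket count are my assumptions):

```python
from itertools import combinations

# Bucket candidate 2-itemsets by a hash of their contents, so a transaction
# is only compared against candidates in the buckets its own pairs hash to.
def bucket_of(itemset, k=7):
    return hash(frozenset(itemset)) % k

candidates = [("Bread", "Milk"), ("Milk", "Diaper"), ("Beer", "Diaper")]
buckets = {}
for c in candidates:
    buckets.setdefault(bucket_of(c), []).append(frozenset(c))

def count_matches(transaction):
    """Number of candidate pairs contained in the transaction."""
    hits = 0
    for pair in combinations(sorted(transaction), 2):
        for cand in buckets.get(bucket_of(pair), []):  # only one bucket checked
            if cand == frozenset(pair):
                hits += 1
    return hits

print(count_matches({"Bread", "Milk", "Diaper", "Beer"}))  # 3
```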
Introduction to Hash Functions
A hash function h is a mapping from a set X to a range of integers [0..k-1]; each element of X is thus mapped into one of k buckets, and each bucket contains all the elements that h maps to it.
Example
The mod function is a good example of a hash function. For example, suppose we use h(x) = x mod 7. Then 0 to 6 are mapped to 0 to 6, but 7 is mapped to 0 and 8 to 1. Thus the range of mod 7 is [0..6]; these are its buckets.
Example

Suppose X is the set of integers 1..1000. Then h(x) = x mod 7 partitions X into 7 buckets:

Bucket 0: 7, 14, 21, 28, …
Bucket 1: 1, 8, 15, 22, …
…
Bucket 6: 6, 13, 20, 27, …
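The bucket construction can be sketched directly:

```python
# Partition the integers 1..1000 into 7 buckets using h(x) = x mod 7.
buckets = {b: [] for b in range(7)}
for x in range(1, 1001):
    buckets[x % 7].append(x)

print(buckets[0][:4])  # [7, 14, 21, 28]
print(buckets[1][:4])  # [1, 8, 15, 22]
```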
Factors Affecting Complexity

Choice of minimum support threshold
– Lowering the support threshold results in more frequent itemsets
– This may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
– More space is needed to store the support count of each item
– If the number of frequent items also increases, both computation and I/O costs may also increase

Size of database
– Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– Transaction width increases with denser data sets
– This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have identical support to their supersets.

[Table: 15 transactions over 30 items A1–A10, B1–B10, C1–C10. Transactions 1–5 contain exactly {A1,…,A10}, transactions 6–10 contain exactly {B1,…,B10}, and transactions 11–15 contain exactly {C1,…,C10}.]

Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1)
Need a compact representation
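For the example above, the number of frequent itemsets can be computed directly:

```python
from math import comb

# Three disjoint groups of 10 always-co-occurring items, each contributing
# all 2^10 - 1 non-empty subsets as frequent itemsets.
n = 3 * sum(comb(10, k) for k in range(1, 11))
print(n)  # 3069
```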