COMP5318 Knowledge Discovery and Data Mining_2011 Semester 1
时间:2025-03-10
时间:2025-03-10
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Miningby Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactionsTID Items
Example of Association Rules{Diaper} {Beer}, {Milk, Bread} {Eggs,Coke}, {Beer, Bread} {Milk},
1 2 3 4 5
Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke
Implication means co-occurrence, not causality!
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
#
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Definition: Frequent Itemset
Itemset– A collection of one or more items
Example: {Milk, Bread, Diaper}TID Items
– k-itemset
An itemset that contains k items
1 2 3 4 5
Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke
Support count ( )– Frequency of occurrence of an itemset – E.g. ({Milk, Bread,Diaper}) = 2
Support– Fraction of transactions that contain an itemset – E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset– An itemset whose support is greater than or equal to a minsup threshold
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
#
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Definition: Association Rule
Association Rule– An implication expression of the form X Y, where X and Y are itemsets
TID
Items
1 2 3 4 5
Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke
– Example: {Milk, Diaper} {Beer}
Rule Evaluation Metrics– Support (s)
Fraction of transactions that contain both X and Y
Example:
{Milk, Diaper} Beers
– Confidence (c)
Measures how often items in Y appear in transactions that contain X
(Milk , Diaper, Beer )|T|
2 0.4 5
(Milk, Diaper, Beer ) 2 c 0.67 (Milk , Diaper ) 34/18/2004 #
© Tan,Steinbach, Kumar
Introduction to Data Mining
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having– support ≥ minsup threshold – confidence ≥ minconf threshold
Brute-force approach:– List all possible association rules – Compute the support and confidence for each rule – Prune rules that fail the minsup and minconf thresholds Computationally prohibitive!
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
#
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Mining Association RulesTID Items
Example of Rules:{Milk,Diaper} {Beer} (s=0.4, c=0.67) {Milk,Beer} {Diaper} (s=0.4, c=1.0) {Diaper,Beer} {Milk} (s=0.4, c=0.67) {Beer} {Milk,Diaper} (s=0.4, c=0.67) {Diaper} {Milk,Beer} (s=0.4, c=0.5) {Milk} {Diaper,Beer} (s=0.4, c=0.5
)
1 2 3 4 5
Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke
Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 #
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Mining Association Rules
Two-step approach:1. Frequent Itemset Generation–
Generate all itemsets whose support minsup
2. Rule Generation–
Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
#
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Frequent Itemset Generationnull
A
B
C
D
E
AB
AC
AD
AE
BC
BD
BE
CD
CE
DE
ABC
ABD
ABE
ACD
ACE
ADE
BCD
BCE
BDE
CDE
ABCD
ABCE
ABDE
ACDE
BCDE
ABCDE
Given d items, there are 2d possible candidate itemsets4/18/2004 #
© Tan,Steinbach, Kumar
Introduction to Data Mining
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Frequent Itemset Generation
Brute-force approach:– Each itemset in the lattice is a candidate frequent itemset – Count the support of each candidate by scanning the databaseTransactionsTID 1 2 3 4 5 Items Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke
List of Candidates
N
M
w
– Match each transaction against every candidate – Complexity ~ O(NMw) => Expensive since M = 2d !!!© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 #
University of Sydney_COMP5318 Kno …… 此处隐藏:9635字,全部文档内容请下载后查看。喜欢就下载吧 ……