COMP5318 Knowledge Discovery and Data Mining_2011 Semester 1

时间:2025-03-10

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Miningby Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactionsTID Items

Example of Association Rules{Diaper} {Beer}, {Milk, Bread} {Eggs,Coke}, {Beer, Bread} {Milk},

1 2 3 4 5

Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke

Implication means co-occurrence, not causality!

© Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

#

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Definition: Frequent Itemset

Itemset– A collection of one or more items

Example: {Milk, Bread, Diaper}TID Items

– k-itemset

An itemset that contains k items

1 2 3 4 5

Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke

Support count ( )– Frequency of occurrence of an itemset – E.g. ({Milk, Bread,Diaper}) = 2

Support– Fraction of transactions that contain an itemset – E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset– An itemset whose support is greater than or equal to a minsup threshold

© Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

#

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Definition: Association Rule

Association Rule– An implication expression of the form X Y, where X and Y are itemsets

TID

Items

1 2 3 4 5

Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke

– Example: {Milk, Diaper} {Beer}

Rule Evaluation Metrics– Support (s)

Fraction of transactions that contain both X and Y

Example:

{Milk, Diaper} Beers

– Confidence (c)

Measures how often items in Y appear in transactions that contain X

(Milk , Diaper, Beer )|T|

2 0.4 5

(Milk, Diaper, Beer ) 2 c 0.67 (Milk , Diaper ) 34/18/2004 #

© Tan,Steinbach, Kumar

Introduction to Data Mining

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having– support ≥ minsup threshold – confidence ≥ minconf threshold

Brute-force approach:– List all possible association rules – Compute the support and confidence for each rule – Prune rules that fail the minsup and minconf thresholds Computationally prohibitive!

© Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

#

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Mining Association RulesTID Items

Example of Rules:{Milk,Diaper} {Beer} (s=0.4, c=0.67) {Milk,Beer} {Diaper} (s=0.4, c=1.0) {Diaper,Beer} {Milk} (s=0.4, c=0.67) {Beer} {Milk,Diaper} (s=0.4, c=0.67) {Diaper} {Milk,Beer} (s=0.4, c=0.5) {Milk} {Diaper,Beer} (s=0.4, c=0.5

)

1 2 3 4 5

Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke

Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence

Thus, we may decouple the support and confidence requirements© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 #

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Mining Association Rules

Two-step approach:1. Frequent Itemset Generation–

Generate all itemsets whose support minsup

2. Rule Generation–

Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

© Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

#

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Frequent Itemset Generationnull

A

B

C

D

E

AB

AC

AD

AE

BC

BD

BE

CD

CE

DE

ABC

ABD

ABE

ACD

ACE

ADE

BCD

BCE

BDE

CDE

ABCD

ABCE

ABDE

ACDE

BCDE

ABCDE

Given d items, there are 2d possible candidate itemsets4/18/2004 #

© Tan,Steinbach, Kumar

Introduction to Data Mining

University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011

Frequent Itemset Generation

Brute-force approach:– Each itemset in the lattice is a candidate frequent itemset – Count the support of each candidate by scanning the databaseTransactionsTID 1 2 3 4 5 Items Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke

List of Candidates

N

M

w

– Match each transaction against every candidate – Complexity ~ O(NMw) => Expensive since M = 2d !!!© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 #

University of Sydney_COMP5318 Kno …… 此处隐藏:9635字,全部文档内容请下载后查看。喜欢就下载吧 ……

COMP5318 Knowledge Discovery and Data Mining_2011 Semester 1.doc 将本文的Word文档下载到电脑

    精彩图片

    热门精选

    大家正在看

    Copyright © 2023-2024 学科文库 版权所有
    本网站尊重并保护知识产权,根据《信息网络传播权保护条例》,如果我们转载的作品侵犯了您的权利,请在一个月内通知我们,我们会及时删除。
    客服QQ:370150219 邮箱:370150219@qq.com
    苏ICP备16052595号-5

    × 游客快捷下载通道(下载后可以自由复制和排版)

    限时特价:7 元/份 原价:20元

    支付方式:

    开通VIP包月会员 特价:29元/月

    注:下载文档有可能“只有目录或者内容不全”等情况,请下载之前注意辨别,如果您已付费且无法下载或内容有问题,请联系我们协助你处理。
    微信:fanwen365 QQ:370150219