Association Rule Mining Study Guide
Overview of Association Analysis
Association analysis examines relationships between items in a
dataset to determine patterns of co-occurrence. In association rule
mining, these patterns are used to generate rules that
predict when certain items will appear together.
For example, in a retail setting, this analysis might reveal that “if
a customer buys bread and milk, they are likely to also buy eggs.”
Key Concepts
- Association Rule Mining:
  - Definition: Given a set of transactions, association rule mining
    seeks rules that describe which items tend to co-occur in the same
    transaction.
  - Applications: Common in recommendation systems (e.g., Amazon’s
    “Customers who bought this also bought”) and in healthcare data
    (e.g., high blood pressure often co-occurs with stress).
1. Association Rule Mining Recap
Association rule mining aims to uncover relationships between
items in transaction data. Each transaction is a collection of
items, and the objective is to determine whether certain items
frequently co-occur. These relationships are represented as
association rules of the form \(X \rightarrow Y\), read as
“transactions that contain \(X\) tend to also contain \(Y\).” The
strength of a rule is quantified by its support and confidence.
- Frequent Itemsets:
  - Definition: Collections of one or more items that appear together
    in at least a minimum number of transactions. For example,
    {bread, milk} could be a frequent itemset.
  - Support Count: The number of transactions that contain the
    itemset.
  - Support: The proportion of transactions that contain the itemset,
    calculated as:
    \[
    \text{Support} = \frac{\text{Support Count}}{\text{Total Transactions}}
    \]
2. Frequent Itemset Generation
Generating frequent itemsets involves identifying groups of items
that often appear together in transactions. The candidate
itemset lattice is used to evaluate itemset combinations
systematically, starting from single items and progressing to larger
groups. The key to reducing the computational load is the Apriori
principle: once an itemset fails the minimum support threshold, none
of its supersets can be frequent, so they can be skipped without
being counted.
Example: For items such as {milk, bread, diapers}, the frequent
itemsets are the combinations that appear in at least the minimum
number of transactions. Computing the support of each candidate,
level by level through the lattice, gives a structured way to explore
co-occurrence relationships.
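The level-by-level walk through the lattice can be sketched without any pruning, which makes a useful baseline. The snippet below is a minimal illustration, not a production implementation: it reuses the five toy transactions from this guide's Python example, and the `min_support` value of 0.4 is an assumed threshold chosen only for demonstration.

```python
from itertools import combinations

# Toy transactions (same data as the guide's Python example).
transactions = [
    {"milk", "bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers"},
    {"bread", "diapers", "beer"},
    {"milk", "bread"},
]
min_support = 0.4  # assumed threshold, for illustration only


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


items = sorted(set().union(*transactions))

# Brute force: visit every level of the lattice (singles, pairs, triples, ...)
# and keep the itemsets whose support meets the threshold.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo), transactions)
        if s >= min_support:
            frequent[frozenset(combo)] = s

for itemset, s in sorted(frequent.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because every combination is counted, the work grows exponentially with the number of items; that cost is what the Apriori principle is meant to cut down.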
- Association Rules:
  - Definition: Expressed as \(X \rightarrow Y\), where \(X\) and
    \(Y\) are disjoint itemsets, meaning “If \(X\) occurs, \(Y\) is
    likely to occur.”
  - Support for a Rule: The fraction of transactions that contain
    both \(X\) and \(Y\), that is, the support of \(X \cup Y\).
  - Confidence for a Rule: Indicates how often \(Y\) appears in
    transactions that contain \(X\), given by:
    \[
    \text{Confidence}(X \rightarrow Y) = \frac{\text{Support of } X \cup Y}{\text{Support of } X}
    \]
Rule Evaluation Metrics
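- Support: Measures how often an itemset (or the itemset behind a
rule) occurs in the dataset.
- Confidence: Measures the reliability of an inference; higher
confidence means \(Y\) appears in a larger share of the transactions
that contain \(X\).
- Lift (optional): Compares the confidence of \(X \rightarrow Y\)
with the overall support of \(Y\), i.e.
\(\text{Lift}(X \rightarrow Y) = \text{Confidence}(X \rightarrow Y) / \text{Support}(Y)\).
A lift above 1 means \(Y\) is more likely when \(X\) is present than
it is in the dataset as a whole; a lift near 1 suggests \(X\) and
\(Y\) are independent.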
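Lift can be computed from the same counts as support and confidence. The sketch below assumes the standard definition, support of X ∪ Y divided by the product of the supports of X and Y (equivalently, confidence divided by the support of Y). The helper names `support_count` and `lift` are illustrative, and the data is the toy transaction set used in this guide's Python example.

```python
# Toy transactions (same data as the guide's Python example).
transactions = [
    {"milk", "bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers"},
    {"bread", "diapers", "beer"},
    {"milk", "bread"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def lift(X, Y, transactions):
    """Lift of X -> Y; assumes X and Y each occur in at least one transaction."""
    n = len(transactions)
    count_xy = support_count(X | Y, transactions)
    count_x = support_count(X, transactions)
    count_y = support_count(Y, transactions)
    # support(X ∪ Y) / (support(X) * support(Y)), rewritten with raw counts
    return (count_xy * n) / (count_x * count_y)


rule_lift = lift({"milk", "diapers"}, {"beer"}, transactions)
print("Lift:", rule_lift)
```

On this toy data the lift of {milk, diapers} → {beer} comes out below 1, which would suggest that beer is slightly less common in baskets with milk and diapers than in the dataset overall.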
3. Rule Generation
Once frequent itemsets are identified, the next step is creating
association rules from these sets, ensuring each rule meets minimum
support and confidence requirements. These rules are derived from
itemsets by dividing them into antecedents (if-part) and consequents
(then-part), with metrics calculated for each rule.
Rule Metrics:
- Support: Measures how often \(X \cup Y\) appears in the dataset.
- Confidence: Measures the likelihood of \(Y\) appearing when \(X\)
is present, calculated as:
\[
\text{Confidence}(X \rightarrow Y) = \frac{\text{Support of } X \cup Y}{\text{Support of } X}
\]
- Lift: Evaluates the strength of an association rule relative to
what would be expected if \(X\) and \(Y\) were independent, which
indicates whether the rule is more than a coincidence of two common
itemsets.
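Splitting a frequent itemset into antecedents and consequents can be sketched as follows. This is a minimal illustration under stated assumptions: `generate_rules` is a hypothetical helper name, the 0.7 confidence threshold is arbitrary, the itemset passed in is assumed to occur in at least one transaction, and the data is the toy transaction set from this guide's Python example. Confidence is computed from raw counts rather than from support fractions to avoid floating-point rounding in the division.

```python
from itertools import combinations

# Toy transactions (same data as the guide's Python example).
transactions = [
    {"milk", "bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers"},
    {"bread", "diapers", "beer"},
    {"milk", "bread"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def generate_rules(itemset, transactions, min_confidence):
    """Return (antecedent, consequent, confidence) for every split of
    itemset whose confidence meets min_confidence."""
    itemset = frozenset(itemset)
    total = support_count(itemset, transactions)
    rules = []
    # Every non-empty proper subset can serve as the antecedent (if-part).
    for k in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), k):
            X = frozenset(antecedent)
            Y = itemset - X  # the consequent (then-part) is the remainder
            conf = total / support_count(X, transactions)
            if conf >= min_confidence:
                rules.append((X, Y, conf))
    return rules


rules = generate_rules({"milk", "bread"}, transactions, min_confidence=0.7)
for X, Y, conf in rules:
    print(sorted(X), "->", sorted(Y), "confidence:", conf)
```

All rules derived from one itemset share the same support, so only confidence needs to be recomputed for each split.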
Steps in Association Rule Mining
- Frequent Itemset Generation:
  - Generate all itemsets that meet a minimum support threshold, a
    process that is computationally intensive.
- Rule Generation:
  - Use the frequent itemsets to create association rules, ensuring
    they meet both minimum support and confidence thresholds.
- Optimizing Computation:
  - Candidate Lattice: Visualizes all potential itemsets for frequent
    set mining.
  - Apriori Principle: If an itemset is infrequent, all of its
    supersets are also infrequent, which reduces computation by
    pruning them.
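To give a sense of why frequent itemset generation is computationally intensive, the size of the search space can be written down directly. The counts below are standard combinatorial results rather than something stated in this guide: for \(d\) distinct items, the lattice contains every non-empty subset, and each rule splits an itemset into a non-empty antecedent and a non-empty consequent.

\[
\text{Candidate itemsets} = 2^d - 1, \qquad
\text{Possible rules} = 3^d - 2^{d+1} + 1
\]

With only \(d = 4\) items (for example milk, bread, diapers, and beer), that is already \(2^4 - 1 = 15\) itemsets and \(3^4 - 2^5 + 1 = 50\) possible rules, which is why pruning matters long before datasets become large.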
Example Calculations
Suppose a dataset has 5 transactions, and {milk, bread, diapers}
appears in 2 of them:
- Support Count for {milk, bread, diapers}: 2
- Support for {milk, bread, diapers}:
\[
\text{Support} = \frac{2}{5} = 0.4
\]
If {milk, diapers} appears in 3 transactions, the confidence of the
rule {milk, diapers} → {beer} is the fraction of those 3 transactions
that also contain beer:
\[
\text{Confidence} = \frac{\text{Support of } \{\text{milk}, \text{diapers}, \text{beer}\}}{\text{Support of } \{\text{milk}, \text{diapers}\}}
\]
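The guide does not give a count for {milk, diapers, beer}, so the ratio cannot be evaluated as stated. As a purely illustrative assumption, suppose that itemset appears in 2 of the 5 transactions. The confidence would then work out as:

\[
\text{Confidence} = \frac{2/5}{3/5} = \frac{2}{3} \approx 0.67
\]

The total number of transactions cancels, so confidence can equally be read as a ratio of support counts: 2 of the 3 transactions containing milk and diapers would also contain beer.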
Python Code for Support and Confidence Calculation
# Example transaction data
transactions = [
    {'milk', 'bread', 'beer'},
    {'milk', 'diapers', 'beer'},
    {'milk', 'bread', 'diapers'},
    {'bread', 'diapers', 'beer'},
    {'milk', 'bread'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    count = sum(1 for transaction in transactions if itemset.issubset(transaction))
    return count / len(transactions)

def confidence(itemset_X, itemset_Y, transactions):
    """Support of X | Y divided by support of X (X must occur at least once)."""
    return support(itemset_X | itemset_Y, transactions) / support(itemset_X, transactions)

# Calculate support and confidence for the rule {milk, diapers} -> {beer}
itemset_X = {'milk', 'diapers'}
itemset_Y = {'beer'}
rule_support = support(itemset_X | itemset_Y, transactions)
rule_confidence = confidence(itemset_X, itemset_Y, transactions)
print("Support:", rule_support)
print("Confidence:", rule_confidence)
Visualizing the Candidate Lattice
To visualize a candidate lattice and optimize computation:
1. List itemsets by level (e.g., single items, pairs, triples).
2. Use the Apriori Principle to prune infrequent itemsets.
import networkx as nx
import matplotlib.pyplot as plt
# Example of candidate lattice visualization
G = nx.DiGraph()
# Each edge links an itemset to a superset with one more item
G.add_edges_from([
    ("A", "AB"), ("A", "AC"), ("B", "AB"),
    ("B", "BC"), ("C", "AC"), ("C", "BC"),
    ("AB", "ABC"), ("AC", "ABC"), ("BC", "ABC"),
])
plt.figure(figsize=(8, 6))
nx.draw(G, with_labels=True, node_size=3000, node_color="lightblue", font_size=10, font_weight="bold")
plt.show()
Summary
- Association rule mining finds co-occurrence
patterns in transaction data.
- Support and confidence help
quantify how strongly items are associated.
- The Apriori Principle prunes the candidate lattice, reducing
computational load and keeping the analysis feasible on larger
datasets.
4. Efficient Computation with the Apriori Principle
The Apriori algorithm optimizes frequent itemset generation by
eliminating infrequent itemsets, and with them all of their
supersets, early in the process. This reduces the search space and
makes rule generation computationally feasible for large datasets.
The search is often visualized as a candidate lattice, which tracks
the progression from single items to larger item combinations.
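The level-wise search with pruning can be sketched as below. This is a simplified illustration of the Apriori idea rather than an optimized implementation: the function name `apriori`, the use of a raw support count (`min_count`) as the threshold, and the simple pairwise-union join step are all choices made for this example. The data is the toy transaction set from this guide's Python example.

```python
from itertools import combinations

# Toy transactions (same data as the guide's Python example).
transactions = [
    {"milk", "bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers"},
    {"bread", "diapers", "beer"},
    {"milk", "bread"},
]


def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets whose count
    is at least min_count, searching the lattice level by level."""
    items = sorted(set().union(*transactions))

    # Level 1: count single items.
    current = {}
    for item in items:
        count = sum(1 for t in transactions if item in t)
        if count >= min_count:
            current[frozenset([item])] = count
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): drop any candidate that has an
        # infrequent (k-1)-subset, without counting it.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count only the surviving candidates.
        current = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count >= min_count:
                current[c] = count
        frequent.update(current)
        k += 1
    return frequent


frequent = apriori(transactions, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The loop stops as soon as a level produces no frequent itemsets, since by the Apriori principle no larger itemset can be frequent either.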
5. Implementation Example in Python
To compute support and confidence for specific rules in Python:
# Example transactions
transactions = [
    {'milk', 'bread', 'beer'},
    {'milk', 'diapers', 'beer'},
    {'milk', 'bread', 'diapers'},
    {'bread', 'diapers', 'beer'},
    {'milk', 'bread'},
]

# Function to calculate support: fraction of transactions containing the itemset
def support(itemset, transactions):
    count = sum(1 for transaction in transactions if itemset.issubset(transaction))
    return count / len(transactions)

# Function to calculate confidence: support of X | Y divided by support of X
def confidence(itemset_X, itemset_Y, transactions):
    return support(itemset_X | itemset_Y, transactions) / support(itemset_X, transactions)

# Calculate support and confidence for a rule
itemset_X = {'milk', 'diapers'}
itemset_Y = {'beer'}
rule_support = support(itemset_X | itemset_Y, transactions)
rule_confidence = confidence(itemset_X, itemset_Y, transactions)
print("Support:", rule_support)
print("Confidence:", rule_confidence)
This code provides a structured way to calculate support and
confidence for a rule such as {milk, diapers} → beer, helping
identify meaningful associations in transaction data.
6. Visualizing Candidate Lattices
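The candidate lattice organizes itemsets by size, linking each
itemset to the supersets that add one more item. This makes the
Apriori principle easy to apply: pruning a node also prunes
everything reachable from it. Visualizing it with NetworkX: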
import networkx as nx
import matplotlib.pyplot as plt
G = nx.DiGraph()
G.add_edges_from([
    ("A", "AB"), ("A", "AC"), ("B", "AB"),
    ("B", "BC"), ("C", "AC"), ("C", "BC"),
    ("AB", "ABC"), ("AC", "ABC"), ("BC", "ABC"),
])
plt.figure(figsize=(8, 6))
nx.draw(G, with_labels=True, node_size=3000, font_size=10, font_weight="bold")
plt.show()
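For more than a handful of items, writing the edge list by hand becomes error-prone. The sketch below generates the same subset-to-superset edges programmatically. `lattice_edges` is a hypothetical helper name, and labels are formed by concatenating sorted item names, which assumes single-character items like the A, B, C used above.

```python
from itertools import combinations


def lattice_edges(items):
    """Return (itemset, superset) label pairs, linking every k-itemset
    to each (k+1)-itemset obtained by adding one item."""
    items = sorted(items)
    edges = []
    for k in range(1, len(items)):
        for combo in combinations(items, k):
            for extra in items:
                if extra not in combo:
                    parent = "".join(combo)
                    child = "".join(sorted(combo + (extra,)))
                    edges.append((parent, child))
    return edges


edges = lattice_edges(["A", "B", "C"])
for parent, child in edges:
    print(parent, "->", child)
```

The resulting list can be passed to `G.add_edges_from(edges)` in place of the hard-coded edges. With the graph built this way, `nx.descendants(G, "A")` should return every superset of A, which is the set of nodes the Apriori principle would prune if A were infrequent.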
Recap
Association rule mining, supported by efficient itemset generation
through the Apriori principle and candidate lattices, enables scalable
analysis of transactional data. The resulting patterns have practical
applications in business, healthcare, and other domains that rely on
co-occurrence analysis.
