Introduction
Association Rule Mining is a powerful technique used to uncover hidden patterns and relationships between items in transactional data. It is widely applied in market basket analysis to identify product bundles and understand customer purchasing behavior. This project applies the Apriori Algorithm on a Groceries dataset to discover frequent itemsets and generate strong association rules. These insights can inform marketing strategies, such as product recommendations and cross-selling.
Dataset Description
The Groceries dataset consists of transactional data collected from a retail store. Key characteristics of the dataset:
Transactions: Each transaction is the set of distinct items purchased by one customer, identified by Member_number.
Items: Products purchased by customers, including food and beverages like “whole milk,” “sausage,” “tropical fruit,” etc.
Objective: To identify frequently purchased combinations and derive actionable insights.
Data Quality Check
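library(arules)     # transactions class, apriori(), inspect()
library(arulesViz)  # plot() methods for rules
library(ggplot2)    # bar chart of top rules
data <- read.csv("C:/Users/abdul/Documents/data/Groceries_dataset.csv")
# Count NA cells across the whole data frame and drop any incomplete rows.
missing_values <- sum(is.na(data))
if (missing_values > 0) {
  cat("Missing values detected: ", missing_values, "\n")
  data <- na.omit(data)
  cat("Missing values removed.\n")
} else {
  cat("No missing values detected.\n")
}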
## Missing values detected: 28765
## Missing values removed.
transactions <- split(data$itemDescription, data$Member_number)
transactions <- lapply(transactions, unique) # Remove duplicates in transactions
transactions <- as(transactions, "transactions")
summary(transactions)
## transactions as itemMatrix in sparse format with
## 3512 rows (elements/itemsets/transactions) and
## 158 columns (items) and a density of 0.01721223
##
## most frequent items:
##          sausage       whole milk      frankfurter   tropical fruit
##              729              670              536              454
## other vegetables          (Other)
##              441             6721
##
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11
## 937 897 722 471 271 132  47  26   5   3   1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    2.72    4.00   11.00
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
##
## includes extended transaction information - examples:
## transactionID
## 1 1000
## 2 1001
## 3 1002
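To make the grouping step concrete, here is a minimal base-R sketch on a made-up purchase log for three members. The `toy` data frame and its values are hypothetical and are not taken from the Groceries file. It mirrors the split() and unique() calls above, then computes density as the share of filled cells in the member-by-item matrix, which is the quantity summary() reports.

```r
# Hypothetical long-format purchase log: one row per purchased item.
toy <- data.frame(
  Member_number   = c(1000, 1000, 1000, 1001, 1001, 1002),
  itemDescription = c("whole milk", "sausage", "whole milk",
                      "sausage", "soda", "whole milk"),
  stringsAsFactors = FALSE
)

# Group items by member, then drop repeat purchases of the same item.
baskets <- lapply(split(toy$itemDescription, toy$Member_number), unique)

# Density = filled cells / (number of transactions x number of distinct items).
n_items <- length(unique(unlist(baskets)))
density <- sum(lengths(baskets)) / (length(baskets) * n_items)

print(baskets)
print(density)
```

Applied to the reported summary, the same ratio is 9551 / (3512 x 158), or about 0.0172, where 9551 is the sum of the item counts shown (729 + 670 + 536 + 454 + 441 + 6721).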
Exploratory Data Analysis
itemFrequencyPlot(transactions, topN = 20, col = "steelblue", main = "Top 20 Purchased Items")
Applying the Apriori Algorithm
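The Apriori Algorithm is employed to mine frequent itemsets and generate association rules. Key measures and thresholds:
Support: Minimum proportion of transactions containing an itemset (0.002).
Confidence: Minimum estimated probability that the consequent items are purchased given the antecedent (0.4).
Lift: Ratio of the observed co-occurrence of antecedent and consequent to the co-occurrence expected if they were independent. It is not a mining threshold; rules are filtered to lift > 1 after mining.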
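For reference, the three measures can be written with their standard definitions. Here N is the number of transactions, and count(X) is notation introduced for this note to mean the number of transactions that contain every item of X.

```latex
\begin{align*}
\operatorname{supp}(X) &= \frac{\operatorname{count}(X)}{N} \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\end{align*}
```

With N = 3512, a support threshold of 0.002 corresponds to roughly 7 transactions (0.002 x 3512 is about 7.02). A lift above 1 means the antecedent and consequent appear together more often than independence would predict.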
Generating Rules
rules <- apriori(transactions,
                 parameter = list(supp = 0.002, conf = 0.4, minlen = 2, maxlen = 5))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.002      2
##  maxlen target  ext
##       5  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 7
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[158 item(s), 3512 transaction(s)] done [0.00s].
## sorting and recoding items ... [95 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
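The "checking subsets of size 1 2 3 4" line in the log reflects Apriori's level-wise search: frequent itemsets of size k are built only from frequent itemsets of size k - 1. The sketch below illustrates that idea in base R on hypothetical toy baskets. It is a teaching aid under stated assumptions (the `baskets` data, the `min_count` value, and the `count_of` helper are all invented here), not the implementation that arules uses internally.

```r
# Hypothetical toy baskets (not from the Groceries data).
baskets <- list(
  c("beer", "sausage", "vegetables"),
  c("beer", "sausage", "vegetables"),
  c("milk", "pastry"),
  c("beer", "sausage", "milk"),
  c("milk", "vegetables")
)
min_count <- 2  # assumed absolute minimum support count for the toy data

# Number of baskets that contain every item of `itemset`.
count_of <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items.
items <- sort(unique(unlist(baskets)))
level <- Filter(function(s) count_of(s) >= min_count, as.list(items))
frequent <- level

# Level k: join frequent (k-1)-itemsets, prune, then count survivors.
while (length(level) > 1) {
  k <- length(level[[1]]) + 1
  keys <- vapply(level, paste, character(1), collapse = "|")
  pairs <- combn(length(level), 2, simplify = FALSE)
  candidates <- unique(lapply(pairs, function(p) {
    sort(union(level[[p[1]]], level[[p[2]]]))
  }))
  candidates <- Filter(function(s) length(s) == k, candidates)
  # Downward closure: every (k-1)-subset of a candidate must be frequent.
  candidates <- Filter(function(s) {
    subsets <- combn(s, k - 1, simplify = FALSE)
    all(vapply(subsets, function(u) paste(u, collapse = "|") %in% keys,
               logical(1)))
  }, candidates)
  level <- Filter(function(s) count_of(s) >= min_count, candidates)
  frequent <- c(frequent, level)
}

print(vapply(frequent, paste, character(1), collapse = ", "))
```

The pruning step is what keeps the search tractable: a candidate is counted against the data only if all of its smaller subsets already passed the support threshold.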
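# sort() orders rules by decreasing lift by default.
rules <- sort(rules, by = "lift")
# Keep rules with lift above 1; the strict confidence test also drops any rule at exactly 0.4.
strong_rules <- subset(rules, lift > 1.0 & confidence > 0.4)
# Cap the result at the 100 highest-lift rules.
if (length(strong_rules) > 100) {
  strong_rules <- strong_rules[1:100]
}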
Inspecting Strong Rules
if (length(strong_rules) > 0) {
  inspect(strong_rules[1:min(10, length(strong_rules))])
} else {
  print("No strong association rules found.")
}
##     lhs                                 rhs              support confidence    coverage     lift count
## [1] {bottled beer, other vegetables} => {sausage}    0.002847380  0.5263158 0.005410023 2.535557    10
## [2] {pastry, tropical fruit}         => {whole milk} 0.003416856  0.4800000 0.007118451 2.516060    12
## [3] {root vegetables, soda}          => {sausage}    0.002277904  0.4210526 0.005410023 2.028446     8
## [4] {beef, rolls/buns}               => {sausage}    0.003132118  0.4074074 0.007687927 1.962709    11
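As a sanity check, the metrics for rule [1] can be recomputed from raw counts. The sketch below uses only figures reported in this document: 3512 transactions, a rule count of 10, and 729 transactions containing sausage. The antecedent count of 19 is an inference, obtained by multiplying the reported coverage (0.005410023) by 3512 and rounding.

```r
# Counts taken or derived from the reported output (see the note above).
n_transactions <- 3512  # from summary(transactions)
count_rule     <- 10    # count column of rule [1]
count_lhs      <- 19    # inferred: coverage 0.005410023 * 3512, rounded
count_rhs      <- 729   # frequency of "sausage" in the summary

support    <- count_rule / n_transactions
coverage   <- count_lhs / n_transactions
confidence <- count_rule / count_lhs
lift       <- confidence / (count_rhs / n_transactions)

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 6))
```

The recomputed values agree with the inspect() output to the precision shown, which confirms that lift here is confidence divided by the support of the consequent.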
Visualizations
if (length(rules) > 0) {
  plot(rules[1:min(20, length(rules))], method = "graph",
       control = list(layout = "circle", max = 20))
}
Scatterplot of Rules
if (length(rules) > 0) {
  plot(rules, method = "scatterplot", measure = c("support", "confidence"),
       shading = "lift", jitter = 0)
}
Grouped Matrix Visualization
if (length(rules) > 0) {
  plot(rules[1:min(20, length(rules))], method = "grouped")
}
Barplot of Top 10 Rules by Lift
if (length(strong_rules) > 0) {
  num_rules <- min(10, length(strong_rules))
  # Select the top rules once so that metrics and labels come from the same ordered object.
  top <- sort(strong_rules, by = "lift")[seq_len(num_rules)]
  top_rules <- as(top, "data.frame")
  # Combine LHS and RHS for rule labels
  top_rules$rules <- paste(labels(lhs(top)), "=>", labels(rhs(top)))
  ggplot(top_rules, aes(x = reorder(rules, lift), y = lift)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    geom_text(aes(label = round(lift, 2)), hjust = -0.2) +
    labs(title = paste("Top", num_rules, "Association Rules by Lift"),
         x = "Rules", y = "Lift")
}
Interpretation of Results
Key Insights
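Whole milk and tropical fruit are linked with other items in the strong rules, as in {pastry, tropical fruit} => {whole milk}. Sausage is the consequent of the remaining three rules.
{bottled beer, other vegetables} => {sausage} (confidence 0.53, lift 2.54) and {pastry, tropical fruit} => {whole milk} (confidence 0.48, lift 2.52) stand out with the highest lift, although they rest on only 10 and 12 transactions respectively.
Conclusion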
This analysis demonstrates the effectiveness of association rule mining for discovering meaningful patterns in transactional data. The insights derived can guide strategies for product bundling, targeted promotions, and inventory management. Future work can extend this analysis by exploring temporal trends or incorporating demographic data to further refine insights.