Bakeries and cafes often have primary products (such as coffee and bread) that sell very well, but struggle to drive sales of secondary products. This analysis uses data from the following Kaggle link: https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket and uses the Association Rule Mining method with an Apriori algorithm to reveal hidden customer purchasing patterns.
This report uses a step by step analysis:
The first step is to load the data set and filter the data for noise, such as “adjustment” transactions.
# Load required libraries
library(dplyr)
library(stringr)
library(arules)
library(arulesViz)
library(ggplot2)
library(DT)
# Import Data (Adjust the file path accordingly)
data <- read.csv("D:\\Documents\\bread basket.csv", sep = ",")
datatable(data)
Here we will compare the rule without context of time type and day type and use it. Therefore we will create 2 different data structures.
# 1. Prepare Static Data (Without Time)
data_static <- data %>%
mutate(Item = tolower(str_trim(Item))) %>%
filter(Item != "adjustment" & Item != "none") %>%
group_by(Transaction) %>%
summarise(list_item = paste(Item, collapse = ", "), .groups = "drop")
data_list <- strsplit(data_static$list_item, ", ")
names(data_list) <- data_static$Transaction
data_trans <- as(data_list, "transactions")
# 2. Prepare Dynamic Data (With Time)
data_time <- data %>%
mutate(Item = tolower(str_trim(Item))) %>%
filter(Item != "adjustment") %>%
group_by(Transaction) %>%
summarise(list_item = paste(c(unique(Item), unique(period_day), unique(weekday_weekend)), collapse = ", "), .groups = "drop")
data_time_list <- strsplit(data_time$list_item, ", ")
names(data_time_list) <- data_time$Transaction
data_time_trans <- as(data_time_list, "transactions")
Before deciding on association rules, it is important to identify the most frequently purchased items.
# Extract item frequencies
item_count <- itemFrequency(data_trans, type = "absolute")
df_freq <- data.frame(
Item = names(item_count),
Count = as.numeric(item_count)
)
# Extract Top 5 Items
df_top5 <- df_freq %>% arrange(desc(Count)) %>% head(5)
ggplot(df_top5, aes(x = reorder(Item, Count), y = Count, fill = Item)) +
geom_col(show.legend = FALSE, width = 0.6) +
geom_text(aes(label = Count), hjust = -0.2, size = 4) +
coord_flip() +
labs(
title = "Top 5 Most Frequent Items",
subtitle = "Coffee dominates the overall customer transactions",
x = "Menu Item",
y = "Total Sold"
) +
theme_minimal(base_size = 13) +
scale_fill_brewer(palette = "Set2") +
theme(panel.grid.major.y = element_blank())
We will run the Apriori algorithm in several stages to show how raw data mining results should be filtered to produce actionable business intelligence.
First, we look at pure item-to-item associations. We use a high confidence threshold (50%) to see the strongest absolute relationships.
ass_rule_static <- apriori(data_trans,
parameter = list(supp = 0.02, conf = 0.5, minlen = 2),
control = list(verbose = FALSE))
datatable(as(sort(ass_rule_static, by = "lift"), "data.frame"),
options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
caption = "Stage 1: Pure Item Associations")
Insight: Pay attention to all the rules stating that all types of food and drinks purchased will make consumers buy coffee. This seems normal, because coffee was initially sold at the top, namely 4,500, far from the bottom because coffee was the main product.
Next, we add period_day and
weekday_weekend. We higher the confidence to 70% to capture
secondary product relationships using time and day context. After
generating the rules, we prune the redundant ones.
# Run Apriori
ass_time_rule <- apriori(data_time_trans,
parameter = list(supp = 0.03, conf = 0.7, minlen = 2),
control = list(verbose = FALSE))
# Rule Pruning (Removing redundancies)
ass_time_sorted <- sort(ass_time_rule, by = "confidence", decreasing = TRUE)
subset_matrix <- is.subset(ass_time_sorted, ass_time_sorted, sparse = FALSE)
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] <- NA
redundant <- colSums(subset_matrix, na.rm = TRUE) >= 1
rules_pruned <- ass_time_sorted[!redundant]
datatable(as(sort(rules_pruned, by = "lift"), "data.frame"),
options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
caption = "Stage 2: Dynamic Rules")
Insight: Note any rules recommending “morning” or “afternoon” in the RHS. At a glance, we can conclude that selling soup in the afternoon or selling coffee and pastries in the morning is right. But we can’t say that because time is in RHS meaning the cashier cannot sell time, these rules are operationally useless.
To fix the issue above, we ban time contexts from appearing in the Right-Hand Side (RHS) but lowering the confidence to 50% because the highest confidence in the previous context is dominated by time.
# Run Apriori
ass_time_rule <- apriori(data_time_trans,
parameter = list(supp = 0.02, conf = 0.5, minlen = 2),
control = list(verbose = FALSE))
# Rule Pruning (Removing redundancies)
ass_time_sorted <- sort(ass_time_rule, by = "confidence", decreasing = TRUE)
subset_matrix <- is.subset(ass_time_sorted, ass_time_sorted, sparse = FALSE)
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] <- NA
redundant <- colSums(subset_matrix, na.rm = TRUE) >= 1
rules_pruned <- ass_time_sorted[!redundant]
# Filter out time from RHS
time_words <- c("morning", "afternoon", "evening", "night", "weekday", "weekend")
rules_no_time <- subset(rules_pruned, subset = !(rhs %in% time_words))
datatable(as(sort(rules_no_time, by = "lift"), "data.frame"),
options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
caption = "Stage 3: Time Filtered")
Insight: Now the RHS only contains physical products. However, almost every rule points to “Coffee”. While statistically accurate because everyone buys coffee, this prevents us from finding specific cross-selling opportunities.
To uncover the hidden patterns of the dataset, we execute the final filter by removing “Coffee” from the RHS. This reveals secondary associations that were previously overshadowed.
# Run Apriori
ass_time_rule <- apriori(data_time_trans,
parameter = list(supp = 0.01, conf = 0.15, minlen = 2),
control = list(verbose = FALSE))
# Rule Pruning (Removing redundancies)
ass_time_sorted <- sort(ass_time_rule, by = "confidence", decreasing = TRUE)
subset_matrix <- is.subset(ass_time_sorted, ass_time_sorted, sparse = FALSE)
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] <- NA
redundant <- colSums(subset_matrix, na.rm = TRUE) >= 1
rules_pruned <- ass_time_sorted[!redundant]
# Filter out time AND coffee from RHS
banned_words <- c("morning", "afternoon", "evening", "night", "weekday", "weekend", "coffee")
rules_final <- subset(rules_pruned, subset = !(rhs %in% banned_words))
datatable(as(sort(rules_final, by = "lift"), "data.frame"),
options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
caption = "Stage 4: Final Rules")
Insight: Even though we set the minimum confidence at 0.2, that is already good for secondary sales. For example, confidence 0.2 means that 1 in 5 buyers buy tea when buying a sandwich. For secondary sales, this is an action that needs to be tried to encourage sales other than coffee
The following interactive graphic visualizes the final association rules that have been well curated. Larger nodes indicate frequency of occurrence, while darker colors indicate stronger associative correlations.
# Network graph for the top final rules
plot(head(sort(rules_final, by = "lift", decreasing = TRUE), 15),
method = "graph",
engine = "htmlwidget")