Market Basket Analysis: Cafe Cross-Selling Optimization

1. Business Background

Bakeries and cafes often have primary products (such as coffee and bread) that sell very well, but struggle to drive sales of secondary products. This analysis uses data from the following Kaggle link: https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket and uses the Association Rule Mining method with an Apriori algorithm to reveal hidden customer purchasing patterns.

This report uses a step by step analysis:

Basic rules.
Rules by applying context to the type of time (morning, afternoon and evening) and type of day (weekday or weekend).
Applying logical filters to extract business insights that are truly actionable.

2. Data Preparation

The first step is to load the data set and filter the data for noise, such as “adjustment” transactions.

# Load required libraries
library(dplyr)
library(stringr)
library(arules)
library(arulesViz)
library(ggplot2)
library(DT)

# Import Data (Adjust the file path accordingly)
data <- read.csv("D:\\Documents\\bread basket.csv", sep = ",")
datatable(data)

Here we will compare the rule without context of time type and day type and use it. Therefore we will create 2 different data structures.

# 1. Prepare Static Data (Without Time)
data_static <- data %>% 
  mutate(Item = tolower(str_trim(Item))) %>%
  filter(Item != "adjustment" & Item != "none") %>%
  group_by(Transaction) %>%
  summarise(list_item = paste(Item, collapse = ", "), .groups = "drop")

data_list <- strsplit(data_static$list_item, ", ")
names(data_list) <- data_static$Transaction
data_trans <- as(data_list, "transactions")

# 2. Prepare Dynamic Data (With Time)
data_time <- data %>% 
  mutate(Item = tolower(str_trim(Item))) %>%
  filter(Item != "adjustment") %>%
  group_by(Transaction) %>%
  summarise(list_item = paste(c(unique(Item), unique(period_day), unique(weekday_weekend)), collapse = ", "), .groups = "drop")

data_time_list <- strsplit(data_time$list_item, ", ")
names(data_time_list) <- data_time$Transaction
data_time_trans <- as(data_time_list, "transactions")

3. Exploratory Data Analysis (EDA)

Before deciding on association rules, it is important to identify the most frequently purchased items.

# Extract item frequencies
item_count <- itemFrequency(data_trans, type = "absolute")
df_freq <- data.frame(
  Item = names(item_count),
  Count = as.numeric(item_count)
)

# Extract Top 5 Items
df_top5 <- df_freq %>% arrange(desc(Count)) %>% head(5)

ggplot(df_top5, aes(x = reorder(Item, Count), y = Count, fill = Item)) +
  geom_col(show.legend = FALSE, width = 0.6) + 
  geom_text(aes(label = Count), hjust = -0.2, size = 4) +
  coord_flip() + 
  labs(
    title = "Top 5 Most Frequent Items",
    subtitle = "Coffee dominates the overall customer transactions",
    x = "Menu Item",
    y = "Total Sold"
  ) +
  theme_minimal(base_size = 13) + 
  scale_fill_brewer(palette = "Set2") +
  theme(panel.grid.major.y = element_blank())

4. Progressive Association Rule Mining

We will run the Apriori algorithm in several stages to show how raw data mining results should be filtered to produce actionable business intelligence.

Stage 1: Static Rules (Without Time)

First, we look at pure item-to-item associations. We use a high confidence threshold (50%) to see the strongest absolute relationships.

ass_rule_static <- apriori(data_trans, 
                           parameter = list(supp = 0.02, conf = 0.5, minlen = 2),
                           control = list(verbose = FALSE))

datatable(as(sort(ass_rule_static, by = "lift"), "data.frame"), 
          options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
          caption = "Stage 1: Pure Item Associations")

Insight: Pay attention to all the rules stating that all types of food and drinks purchased will make consumers buy coffee. This seems normal, because coffee was initially sold at the top, namely 4,500, far from the bottom because coffee was the main product.

Stage 2: Dynamic Rules (Type of Time and Type of Day Context)

Next, we add period_day and weekday_weekend. We higher the confidence to 70% to capture secondary product relationships using time and day context. After generating the rules, we prune the redundant ones.

# Run Apriori
ass_time_rule <- apriori(data_time_trans, 
                         parameter = list(supp = 0.03, conf = 0.7, minlen = 2), 
                         control = list(verbose = FALSE))

# Rule Pruning (Removing redundancies)
ass_time_sorted <- sort(ass_time_rule, by = "confidence", decreasing = TRUE)
subset_matrix <- is.subset(ass_time_sorted, ass_time_sorted, sparse = FALSE)
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] <- NA
redundant <- colSums(subset_matrix, na.rm = TRUE) >= 1
rules_pruned <- ass_time_sorted[!redundant]

datatable(as(sort(rules_pruned, by = "lift"), "data.frame"), 
          options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
          caption = "Stage 2: Dynamic Rules")

Insight: Note any rules recommending “morning” or “afternoon” in the RHS. At a glance, we can conclude that selling soup in the afternoon or selling coffee and pastries in the morning is right. But we can’t say that because time is in RHS meaning the cashier cannot sell time, these rules are operationally useless.

Stage 3: The Coffee Dominance Effect (Time Filtered)

To fix the issue above, we ban time contexts from appearing in the Right-Hand Side (RHS) but lowering the confidence to 50% because the highest confidence in the previous context is dominated by time.

# Run Apriori
ass_time_rule <- apriori(data_time_trans, 
                         parameter = list(supp = 0.02, conf = 0.5, minlen = 2), 
                         control = list(verbose = FALSE))

# Rule Pruning (Removing redundancies)
ass_time_sorted <- sort(ass_time_rule, by = "confidence", decreasing = TRUE)
subset_matrix <- is.subset(ass_time_sorted, ass_time_sorted, sparse = FALSE)
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] <- NA
redundant <- colSums(subset_matrix, na.rm = TRUE) >= 1
rules_pruned <- ass_time_sorted[!redundant]

# Filter out time from RHS
time_words <- c("morning", "afternoon", "evening", "night", "weekday", "weekend")
rules_no_time <- subset(rules_pruned, subset = !(rhs %in% time_words))

datatable(as(sort(rules_no_time, by = "lift"), "data.frame"), 
          options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
          caption = "Stage 3: Time Filtered")

Insight: Now the RHS only contains physical products. However, almost every rule points to “Coffee”. While statistically accurate because everyone buys coffee, this prevents us from finding specific cross-selling opportunities.

Stage 4: Final Rules (Time & Coffee Filtered)

To uncover the hidden patterns of the dataset, we execute the final filter by removing “Coffee” from the RHS. This reveals secondary associations that were previously overshadowed.

# Run Apriori
ass_time_rule <- apriori(data_time_trans, 
                         parameter = list(supp = 0.01, conf = 0.15, minlen = 2), 
                         control = list(verbose = FALSE))

# Rule Pruning (Removing redundancies)
ass_time_sorted <- sort(ass_time_rule, by = "confidence", decreasing = TRUE)
subset_matrix <- is.subset(ass_time_sorted, ass_time_sorted, sparse = FALSE)
subset_matrix[lower.tri(subset_matrix, diag = TRUE)] <- NA
redundant <- colSums(subset_matrix, na.rm = TRUE) >= 1
rules_pruned <- ass_time_sorted[!redundant]

# Filter out time AND coffee from RHS
banned_words <- c("morning", "afternoon", "evening", "night", "weekday", "weekend", "coffee")
rules_final <- subset(rules_pruned, subset = !(rhs %in% banned_words))

datatable(as(sort(rules_final, by = "lift"), "data.frame"), 
          options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE,
          caption = "Stage 4: Final Rules")

Insight: Even though we set the minimum confidence at 0.2, that is already good for secondary sales. For example, confidence 0.2 means that 1 in 5 buyers buy tea when buying a sandwich. For secondary sales, this is an action that needs to be tried to encourage sales other than coffee

5. Network Visualization

The following interactive graphic visualizes the final association rules that have been well curated. Larger nodes indicate frequency of occurrence, while darker colors indicate stronger associative correlations.

# Network graph for the top final rules
plot(head(sort(rules_final, by = "lift", decreasing = TRUE), 15), 
     method = "graph", 
     engine = "htmlwidget")