1. Introduction

The business problem I address is that I believe Regork can invest in smarter promotions to increase revenue from customers who are already shopping there.

By reviewing hundreds of thousands of shopping trips from the Complete Journey transaction database, I identify which product categories shoppers consistently purchased together in single shopping trips. I then rank the strongest pairings by how often they appeared and compare that with the average spend of customers who bought the top pairs together vs. those who did not.

My findings give Regork a clear data-backed picture of how consumers are currently shopping and where to find the best opportunities for increasing revenue.

2. Required Packages

The following packages were used for this analysis

# load packages 
library(completejourney)   # Regork transaction data
library(tidyverse)         # for data wrangling (dplyr) & visualization (ggplot2)
library(scales)            # for number formatting in ggplot2
library(knitr)             # for rendering clean tables
library(kableExtra)        # for styling knitr tables

3. Data Preparation & Tidying

My analysis draws from two datasets within the Complete Journey database. First, the transactions dataset, containing every individual line-item purchased across all households and shopping trips. It includes the basket ID, quantity, and sales value.

Next, the products dataset, which contains a reference table linking each product ID to the category and type.

# Load data
transactions <- get_transactions()

# Join datasets
trans_products <- transactions %>%
  left_join(products, by = "product_id")

# Check for any missing rows
n_missing <- sum(is.na(trans_products$product_category))
cat("Rows missing product category:", n_missing,
    "(", round(100 * n_missing / nrow(trans_products), 1), "%)\n")
## Rows missing product category: 7045 ( 0.5 %)
# clean data 

trans_clean <- trans_products %>%
  filter(!is.na(product_category), quantity > 0) %>%
  select(basket_id, household_id, product_category, sales_value)

cat("Clean transaction lines:", nrow(trans_clean), "\n")
## Clean transaction lines: 1458246
cat("Unique baskets:         ", n_distinct(trans_clean$basket_id), "\n")
## Unique baskets:          155485
cat("Unique categories:      ", n_distinct(trans_clean$product_category), "\n")
## Unique categories:       302

After joining and cleaning data I filtered out any purchases with zero quantity. This leaves a clean dataset ready to be analyzed.

4. Exploratory Data Analysis

4.1 What are our shoppers buying?

First we want to understand the shopping landscape at Regork. Which product categories are the most common? We can measure this by looking at basket penetration, the percentage of all shopping trips that include at least one item from a given category.

total_baskets <- n_distinct(trans_clean$basket_id)

penetration_data <- trans_clean %>%
  group_by(product_category) %>%
  summarise(n_baskets = n_distinct(basket_id), total_spend = sum(sales_value)) %>%
  mutate(penetration = n_baskets / total_baskets) %>%
  arrange(desc(penetration))

penetration_data %>%
  slice_head(n = 15) %>%
  mutate(product_category = fct_reorder(product_category, penetration)) %>%
  ggplot(aes(x = penetration, y = product_category, fill = penetration)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_fill_gradient(low = "#a8d8ea", high = "#1b4f72") +
  labs(title = "Top 15 Categories by Basket Penetration", x = "Basket Penetration (%)", y = NULL) +
  theme_minimal(base_size = 12)

Findings

I find that a small number of categories dominate basket penetration, these primarily include your everyday staples. This is essential for our pairing analysis because high-penetration categories are the most natural triggers for paired promotions. When a shopper constantly buys from one of these categories, that is our queue to sell them on a complementary item.

4.2 Which Categories Are Shoppers Buying Together?

For every shopping basket in the dataset, I can identify the unique product categories included and then generated every possible pair. Then count how many baskets contain each pair across the entire dataset.

# collapses every basket to a list of unique categories purchased
basket_data <- trans_clean %>%
  group_by(basket_id) %>%
  summarise(categories = list(unique(product_category)), .groups = "drop")

# returns all unique pairs from a category vector
pairs <- function(cats) {
  if (length(cats) < 2) return(tibble(cat_a = character(), cat_b = character()))
  pairs <- combn(sort(cats), 2)
  tibble(cat_a = pairs[1, ], cat_b = pairs[2, ])
}

# Apply to each basket
all_pairs <- basket_data %>%
  mutate(pairs = map(categories, pairs)) %>%
  select(basket_id, pairs) %>%
  unnest(pairs)

# Count pairs and calculate the % of all baskets containing each pair
pair_count <- all_pairs %>%
  group_by(cat_a, cat_b) %>%
  summarise(n_baskets = n(), .groups = "drop") %>%
  mutate(support = n_baskets / total_baskets) %>%
  arrange(desc(n_baskets))


cat("Total unique category pairs found:", nrow(pair_count), "\n")
## Total unique category pairs found: 37439

Findings

Here I am able to produce a ranked list of every category pair found across all shopping trips, along with a support percentage that tells me how widespread each pairing is across the entire consumer base. Pairs at the top of this list are the ones that appear consistently and will form my recommendations.

4.3 Which Are Our Top Pairs?

This chart ranks the top 15 category pairs by how many baskets contained the pair. The total basket count is provided at the end of each bar.

pair_count %>%
  slice_head(n = 15) %>%
  mutate(pair_label = paste(cat_a, "+", cat_b),
         pair_label = fct_reorder(pair_label, n_baskets)) %>%
  ggplot(aes(x = n_baskets, y = pair_label, fill = n_baskets)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n_baskets)), hjust = -0.1, size = 3.5) +
  scale_x_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  scale_fill_gradient(low = "#a8d8ea", high = "#1b4f72") +
  labs(title = "Top 15 Co-purchased Category Pairs", x = "Number of Baskets", y = NULL) +
  theme_minimal(base_size = 13)

Findings

As you can see the results reflect a consistent shopping behavior that repeats across the customer base trip after trip. The combo of Bread + Milk is by far the most consistent pair. With combos of milk, soda, bread and cheese being notable as well. These pairs show us where promotional investments should be used first.

4.4 How Strong Are the Top Pairings?

To add context to our data, I’ve added a support percentage for each pair to the table below. The support percentage tells us what share of all shopping trips contained both categories. The top 5 combos are highlighted to show our strongest revenue opportunities.

pair_count %>%
  slice_head(n = 10) %>%
  mutate(support_pct = percent(support, accuracy = 0.1)) %>%
  select(`Category A` = cat_a, `Category B` = cat_b,
         `Baskets Together` = n_baskets, `Support` = support_pct) %>%
  kable(align = c("l","l","r","r"), format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(1:5, background = "#eaf4fb")
Category A Category B Baskets Together Support
BAKED BREAD/BUNS/ROLLS FLUID MILK PRODUCTS 17,531 11.3%
FLUID MILK PRODUCTS SOFT DRINKS 14,201 9.1%
CHEESE FLUID MILK PRODUCTS 13,999 9.0%
BAKED BREAD/BUNS/ROLLS SOFT DRINKS 13,674 8.8%
BAKED BREAD/BUNS/ROLLS CHEESE 13,602 8.7%
BAG SNACKS SOFT DRINKS 11,448 7.4%
BAG SNACKS BAKED BREAD/BUNS/ROLLS 11,200 7.2%
BAKED BREAD/BUNS/ROLLS BEEF 10,625 6.8%
CHEESE SOFT DRINKS 10,480 6.7%
BAG SNACKS FLUID MILK PRODUCTS 10,454 6.7%

Findings

The top 3 pairs are highlighted in blue, with bread and milk products again clearly standing out from the rest in both raw volume and support percentage. With additional combos of bread, milk, soft drinks, and cheese standing out as well. These are the pairs where customer behavior is most consistent and where a targeted promotion will resonate.

The remaining pairs are still meaningful but represent a second tier of opportunity.

4.5 Are “Paired-Buyers” Spending More?

We now know how common each pairing is, but the real question is, does it matters financially. I can answer this by comparing the average total basket value for customers who bought the top pair together vs those who purchased just one category or neither.

#pull top pairs
top_cat_a <- pair_count$cat_a[1]
top_cat_b <- pair_count$cat_b[1]

# Label the baskets by whether it contains category A, B, both, or neither

basket_value <- trans_clean %>%
  group_by(basket_id) %>%
  summarise(basket_value = sum(sales_value),
            has_a = any(product_category == top_cat_a),
            has_b = any(product_category == top_cat_b), .groups = "drop") %>%
  mutate(purchase_type = case_when(
    has_a & has_b  ~ "Both",
    has_a & !has_b ~ paste("Only", top_cat_a),
    !has_a & has_b ~ paste("Only", top_cat_b),
    TRUE           ~ "Neither"
  ))
#plot avg. basket value by purchase group
basket_value %>%
  group_by(purchase_type) %>%
  summarise(avg_basket = mean(basket_value), .groups = "drop") %>%
  mutate(purchase_type = fct_reorder(purchase_type, avg_basket)) %>%
  ggplot(aes(x = avg_basket, y = purchase_type, fill = avg_basket)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = dollar(avg_basket, accuracy = 0.01)), hjust = -0.1) +
  scale_x_continuous(labels = dollar_format(), expand = expansion(mult = c(0, 0.15))) +
  scale_fill_gradient(low = "#a8d8ea", high = "#1b4f72") +
  labs(title = "Avg Basket Value by Purchase Combination", x = "Avg Basket Value ($)", y = NULL) +
  theme_minimal(base_size = 13)

Findings

As we can see customers who purchase both categories in the top pair spend significantly more per trip than those who buy only one and significantly more than those who buy neither.

This tells us that paired buyers are higher value shoppers overall. This shows us that initiatives that increases the rate at which customers buy these pairs together will not just increase each categories revenue, but likely increase the total revenue per trip.

4.6 Heatmap: Category Relationships at a Glance

I used the heatmap below to zoom out and shows the full relationship landscape across Regork’s top 10 categories in one snap shot. The darker cells indicate stronger co-purchase relationships between any two categories.

#top 10 categories by basket penetration
top10_cats <- penetration_data %>% slice_head(n = 10) %>% pull(product_category)

#mirro pairs for heatmap
pair_count %>%
  filter(cat_a %in% top10_cats, cat_b %in% top10_cats) %>%
  bind_rows(pair_count %>%
    filter(cat_a %in% top10_cats, cat_b %in% top10_cats) %>%
    rename(cat_a = cat_b, cat_b = cat_a)) %>%
  ggplot(aes(x = cat_a, y = cat_b, fill = n_baskets)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "#eaf4fb", high = "#1b4f72", labels = comma) +
  labs(title = "Category Co-occurrence Heatmap", x = NULL, y = NULL, fill = "Baskets") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

Findings

The heatmap reveals the presence of category “clusters”. These “clusters” are groups of categories that all tend to travel together in the same basket. They hint at broader shopping occasions such as a dinner run consisting of beef, cheese, milk, etc. or a weekend entertaining run consisting of snacks and soda. Regork could design entire promotional bundles around these occasions rather than just a single paired promotion.

5. Summary & Recommendation

What I Found

Regork’s shoppers follow consistent, predictable patterns in which items they purchase together. Several of these category pairs appear together across a substantial and meaningful share of all shopping trips. These findings show strong correlations, and repeatable signs of customer behavior. Shoppers who buy the top pairs together are overall spending the most per trip, making them Regork’s “high-value” shoppers.

What I Recommend

I propose Regork launch a “Complete Your Basket” pilot program based around the top 5 category pairs identified in this analysis. This program will have three components:

Paired Promotions “Buy X, save on Y” offers for the top 5 pairs given through an existing loyalty program. Since our shoppers are already buying these items together, this discount is rewarding their natural behaviors. Keeping promotion costs low while driving basket value.

Shelf and Product location Work with store managers to position the top pairing categories in close physical proximity, particularly in highly trafficked areas. Shoppers who are already buying one of the pair are more likely to add the second when it is right there in front of them.

Digital Triggers In the Regork app and loyalty program, we promote a “You might also want” trigger when a customer scans or adds a “high-frequency” category to their online cart.

Limitations & Next Steps

This analysis is a strong start, however it does comes with a few limitations that should be acknowledged.

Correlation is not causation. While I have shown that these categories are frequently purchased together, I have not proven that a promotion will cause an increase in co-purchasing. A fair next step would implementing a randomized A/B test running the paired promotion for one customer segment while holding it back from another.
This can help establish true causal impact before implementing a full roll-out.

Category level analysis. While working at the category level kept my analysis clean, it masked any product level details. A deeper product level market basket analysis could uncover additional pairing opportunities. For instance, rather than simply promoting “Beverages and Snacks” we could find a specific brand of soda and pair it with a specific brand of chips.

Seasonality not explored. Purchase patterns certainly change with seasonality and holidays. Layering in a time element would allow Regork to build a promotional calendar around shopping patterns and create campaigns that keep up with what customers are shopping for at any specific time of the year.

Customer segmentation. My analysis treated all households equally. By breaking down the data by household size, income, or even region could reveal pairing patterns more specific to Regork’s “highest-value” customer groups.