The business problem I address is that I believe Regork can invest in smarter promotions to increase revenue from customers who are already shopping there.
By reviewing hundreds of thousands of shopping trips from the Complete Journey transaction database, I identify which product categories shoppers consistently purchased together in single shopping trips. I then rank the strongest pairings by how often they appeared and compare that with the average spend of customers who bought the top pairs together vs. those who did not.
My findings give Regork a clear data-backed picture of how consumers are currently shopping and where to find the best opportunities for increasing revenue.
The following packages were used for this analysis
My analysis draws from two datasets within the Complete Journey database. First, the transactions dataset, containing every individual line-item purchased across all households and shopping trips. It includes the basket ID, quantity, and sales value.
Next, the products dataset, which contains a reference table linking each product ID to the category and type.
# Load data
transactions <- get_transactions()
# Join datasets
trans_products <- transactions %>%
left_join(products, by = "product_id")
# Check for any missing rows
n_missing <- sum(is.na(trans_products$product_category))
cat("Rows missing product category:", n_missing,
"(", round(100 * n_missing / nrow(trans_products), 1), "%)\n")## Rows missing product category: 7045 ( 0.5 %)
# clean data
trans_clean <- trans_products %>%
filter(!is.na(product_category), quantity > 0) %>%
select(basket_id, household_id, product_category, sales_value)
cat("Clean transaction lines:", nrow(trans_clean), "\n")## Clean transaction lines: 1458246
## Unique baskets: 155485
## Unique categories: 302
After joining and cleaning data I filtered out any purchases with zero quantity. This leaves a clean dataset ready to be analyzed.
First we want to understand the shopping landscape at Regork. Which product categories are the most common? We can measure this by looking at basket penetration, the percentage of all shopping trips that include at least one item from a given category.
total_baskets <- n_distinct(trans_clean$basket_id)
penetration_data <- trans_clean %>%
group_by(product_category) %>%
summarise(n_baskets = n_distinct(basket_id), total_spend = sum(sales_value)) %>%
mutate(penetration = n_baskets / total_baskets) %>%
arrange(desc(penetration))
penetration_data %>%
slice_head(n = 15) %>%
mutate(product_category = fct_reorder(product_category, penetration)) %>%
ggplot(aes(x = penetration, y = product_category, fill = penetration)) +
geom_col(show.legend = FALSE) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
scale_fill_gradient(low = "#a8d8ea", high = "#1b4f72") +
labs(title = "Top 15 Categories by Basket Penetration", x = "Basket Penetration (%)", y = NULL) +
theme_minimal(base_size = 12)I find that a small number of categories dominate basket penetration, these primarily include your everyday staples. This is essential for our pairing analysis because high-penetration categories are the most natural triggers for paired promotions. When a shopper constantly buys from one of these categories, that is our queue to sell them on a complementary item.
For every shopping basket in the dataset, I can identify the unique product categories included and then generated every possible pair. Then count how many baskets contain each pair across the entire dataset.
# collapses every basket to a list of unique categories purchased
basket_data <- trans_clean %>%
group_by(basket_id) %>%
summarise(categories = list(unique(product_category)), .groups = "drop")
# returns all unique pairs from a category vector
pairs <- function(cats) {
if (length(cats) < 2) return(tibble(cat_a = character(), cat_b = character()))
pairs <- combn(sort(cats), 2)
tibble(cat_a = pairs[1, ], cat_b = pairs[2, ])
}
# Apply to each basket
all_pairs <- basket_data %>%
mutate(pairs = map(categories, pairs)) %>%
select(basket_id, pairs) %>%
unnest(pairs)
# Count pairs and calculate the % of all baskets containing each pair
pair_count <- all_pairs %>%
group_by(cat_a, cat_b) %>%
summarise(n_baskets = n(), .groups = "drop") %>%
mutate(support = n_baskets / total_baskets) %>%
arrange(desc(n_baskets))
cat("Total unique category pairs found:", nrow(pair_count), "\n")## Total unique category pairs found: 37439
Here I am able to produce a ranked list of every category pair found across all shopping trips, along with a support percentage that tells me how widespread each pairing is across the entire consumer base. Pairs at the top of this list are the ones that appear consistently and will form my recommendations.
This chart ranks the top 15 category pairs by how many baskets contained the pair. The total basket count is provided at the end of each bar.
pair_count %>%
slice_head(n = 15) %>%
mutate(pair_label = paste(cat_a, "+", cat_b),
pair_label = fct_reorder(pair_label, n_baskets)) %>%
ggplot(aes(x = n_baskets, y = pair_label, fill = n_baskets)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = comma(n_baskets)), hjust = -0.1, size = 3.5) +
scale_x_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
scale_fill_gradient(low = "#a8d8ea", high = "#1b4f72") +
labs(title = "Top 15 Co-purchased Category Pairs", x = "Number of Baskets", y = NULL) +
theme_minimal(base_size = 13)As you can see the results reflect a consistent shopping behavior that repeats across the customer base trip after trip. The combo of Bread + Milk is by far the most consistent pair. With combos of milk, soda, bread and cheese being notable as well. These pairs show us where promotional investments should be used first.
To add context to our data, I’ve added a support percentage for each pair to the table below. The support percentage tells us what share of all shopping trips contained both categories. The top 5 combos are highlighted to show our strongest revenue opportunities.
pair_count %>%
slice_head(n = 10) %>%
mutate(support_pct = percent(support, accuracy = 0.1)) %>%
select(`Category A` = cat_a, `Category B` = cat_b,
`Baskets Together` = n_baskets, `Support` = support_pct) %>%
kable(align = c("l","l","r","r"), format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(1:5, background = "#eaf4fb")| Category A | Category B | Baskets Together | Support |
|---|---|---|---|
| BAKED BREAD/BUNS/ROLLS | FLUID MILK PRODUCTS | 17,531 | 11.3% |
| FLUID MILK PRODUCTS | SOFT DRINKS | 14,201 | 9.1% |
| CHEESE | FLUID MILK PRODUCTS | 13,999 | 9.0% |
| BAKED BREAD/BUNS/ROLLS | SOFT DRINKS | 13,674 | 8.8% |
| BAKED BREAD/BUNS/ROLLS | CHEESE | 13,602 | 8.7% |
| BAG SNACKS | SOFT DRINKS | 11,448 | 7.4% |
| BAG SNACKS | BAKED BREAD/BUNS/ROLLS | 11,200 | 7.2% |
| BAKED BREAD/BUNS/ROLLS | BEEF | 10,625 | 6.8% |
| CHEESE | SOFT DRINKS | 10,480 | 6.7% |
| BAG SNACKS | FLUID MILK PRODUCTS | 10,454 | 6.7% |
The top 3 pairs are highlighted in blue, with bread and milk products again clearly standing out from the rest in both raw volume and support percentage. With additional combos of bread, milk, soft drinks, and cheese standing out as well. These are the pairs where customer behavior is most consistent and where a targeted promotion will resonate.
The remaining pairs are still meaningful but represent a second tier of opportunity.
We now know how common each pairing is, but the real question is, does it matters financially. I can answer this by comparing the average total basket value for customers who bought the top pair together vs those who purchased just one category or neither.
#pull top pairs
top_cat_a <- pair_count$cat_a[1]
top_cat_b <- pair_count$cat_b[1]
# Label the baskets by whether it contains category A, B, both, or neither
basket_value <- trans_clean %>%
group_by(basket_id) %>%
summarise(basket_value = sum(sales_value),
has_a = any(product_category == top_cat_a),
has_b = any(product_category == top_cat_b), .groups = "drop") %>%
mutate(purchase_type = case_when(
has_a & has_b ~ "Both",
has_a & !has_b ~ paste("Only", top_cat_a),
!has_a & has_b ~ paste("Only", top_cat_b),
TRUE ~ "Neither"
))
#plot avg. basket value by purchase group
basket_value %>%
group_by(purchase_type) %>%
summarise(avg_basket = mean(basket_value), .groups = "drop") %>%
mutate(purchase_type = fct_reorder(purchase_type, avg_basket)) %>%
ggplot(aes(x = avg_basket, y = purchase_type, fill = avg_basket)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = dollar(avg_basket, accuracy = 0.01)), hjust = -0.1) +
scale_x_continuous(labels = dollar_format(), expand = expansion(mult = c(0, 0.15))) +
scale_fill_gradient(low = "#a8d8ea", high = "#1b4f72") +
labs(title = "Avg Basket Value by Purchase Combination", x = "Avg Basket Value ($)", y = NULL) +
theme_minimal(base_size = 13)As we can see customers who purchase both categories in the top pair spend significantly more per trip than those who buy only one and significantly more than those who buy neither.
This tells us that paired buyers are higher value shoppers overall. This shows us that initiatives that increases the rate at which customers buy these pairs together will not just increase each categories revenue, but likely increase the total revenue per trip.
I used the heatmap below to zoom out and shows the full relationship landscape across Regork’s top 10 categories in one snap shot. The darker cells indicate stronger co-purchase relationships between any two categories.
#top 10 categories by basket penetration
top10_cats <- penetration_data %>% slice_head(n = 10) %>% pull(product_category)
#mirro pairs for heatmap
pair_count %>%
filter(cat_a %in% top10_cats, cat_b %in% top10_cats) %>%
bind_rows(pair_count %>%
filter(cat_a %in% top10_cats, cat_b %in% top10_cats) %>%
rename(cat_a = cat_b, cat_b = cat_a)) %>%
ggplot(aes(x = cat_a, y = cat_b, fill = n_baskets)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "#eaf4fb", high = "#1b4f72", labels = comma) +
labs(title = "Category Co-occurrence Heatmap", x = NULL, y = NULL, fill = "Baskets") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 35, hjust = 1))The heatmap reveals the presence of category “clusters”. These “clusters” are groups of categories that all tend to travel together in the same basket. They hint at broader shopping occasions such as a dinner run consisting of beef, cheese, milk, etc. or a weekend entertaining run consisting of snacks and soda. Regork could design entire promotional bundles around these occasions rather than just a single paired promotion.
Regork’s shoppers follow consistent, predictable patterns in which items they purchase together. Several of these category pairs appear together across a substantial and meaningful share of all shopping trips. These findings show strong correlations, and repeatable signs of customer behavior. Shoppers who buy the top pairs together are overall spending the most per trip, making them Regork’s “high-value” shoppers.
I propose Regork launch a “Complete Your Basket” pilot program based around the top 5 category pairs identified in this analysis. This program will have three components:
Paired Promotions “Buy X, save on Y” offers for the top 5 pairs given through an existing loyalty program. Since our shoppers are already buying these items together, this discount is rewarding their natural behaviors. Keeping promotion costs low while driving basket value.
Shelf and Product location Work with store managers to position the top pairing categories in close physical proximity, particularly in highly trafficked areas. Shoppers who are already buying one of the pair are more likely to add the second when it is right there in front of them.
Digital Triggers In the Regork app and loyalty program, we promote a “You might also want” trigger when a customer scans or adds a “high-frequency” category to their online cart.
This analysis is a strong start, however it does comes with a few limitations that should be acknowledged.
Correlation is not causation. While I have shown
that these categories are frequently purchased together, I have not
proven that a promotion will cause an increase in co-purchasing. A fair
next step would implementing a randomized A/B test running the paired
promotion for one customer segment while holding it back from
another.
This can help establish true causal impact before implementing a full
roll-out.
Category level analysis. While working at the category level kept my analysis clean, it masked any product level details. A deeper product level market basket analysis could uncover additional pairing opportunities. For instance, rather than simply promoting “Beverages and Snacks” we could find a specific brand of soda and pair it with a specific brand of chips.
Seasonality not explored. Purchase patterns certainly change with seasonality and holidays. Layering in a time element would allow Regork to build a promotional calendar around shopping patterns and create campaigns that keep up with what customers are shopping for at any specific time of the year.
Customer segmentation. My analysis treated all households equally. By breaking down the data by household size, income, or even region could reveal pairing patterns more specific to Regork’s “highest-value” customer groups.