We use the Complete Journey dataset to explore whether frozen pizza purchases are associated with higher beer purchase probability at Regork. The analysis relies on a small, reproducible toolkit: dplyr/tidyr for wrangling, stringr/janitor for tidy text and columns, lubridate for dates, ggplot2 for visuals, and gt for tables. To minimize false positives (e.g., “root beer” or pizza sauces), we tag beer and pizza from the product catalog itself (department → category → type), using a whitelist for pizza product types that explicitly contain “PIZZA,” and the observed beer taxonomy under GROCERY → BEERS/ALES. This catalog-driven tagging keeps results transparent and easy to audit.
We also run quick coverage checks to confirm that our beer rules capture nearly all beer sales in the relevant department(s) and that we aren’t accidentally tagging non-beer items. Where necessary, we refine the whitelist/blacklist and re-run the checks (documented in the Sensitivity section).
Used following packages: dplyr tidyr stringr lubridate janitor ggplot2 scales gt completejourney
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(janitor)
library(ggplot2)
library(scales)
library(gt)
library(completejourney)
transactions <- get_transactions() |> clean_names()
products <- completejourney::products |> clean_names()
demographics <- completejourney::demographics |> clean_names()
We use the catalog to avoid regex guesswork and prevent false positives (e.g., root beer, sauces).
UP <- function(x) toupper(coalesce(x, ""))
catalog_top <- products %>%
count(department, product_category, product_type, sort = TRUE)
head(catalog_top, 25)
## # A tibble: 25 × 4
## department product_category product_type n
## <chr> <chr> <chr> <int>
## 1 DRUG GM GREETING CARDS/WRAP/PARTY SPLY CARDS EVERYDAY 1005
## 2 GROCERY BEERS/ALES BEERALEMALT LIQUORS 770
## 3 GROCERY SPICES & EXTRACTS SPICES & SEASONINGS 606
## 4 DRUG GM GREETING CARDS/WRAP/PARTY SPLY GIFT-WRAP EVERYDAY 547
## 5 DRUG GM HAIR CARE PRODUCTS SHAMPOO 512
## 6 COSMETICS MAKEUP AND TREATMENT MAYBELLINE 509
## 7 GROCERY YOGURT YOGURT NOT MULTI-PACKS 505
## 8 COSMETICS MAKEUP AND TREATMENT COVERGIRL 501
## 9 GROCERY BAG SNACKS POTATO CHIPS 483
## 10 GROCERY SALD DRSNG/SNDWCH SPRD POURABLE SALAD DRESSINGS 458
## # ℹ 15 more rows
beer_types <- products %>%
filter(department == "GROCERY", product_category == "BEERS/ALES") %>%
distinct(product_type) %>%
pull()
beer_blacklist_regex <- "\\b(ROOT BEER|GINGER BEER|SARSAPARILLA|BEER BRAT|BEER BATTER|BEER BREAD)\\b"
pizza_tax <- products %>%
mutate(department_u = UP(department),
category_u = UP(product_category),
type_u = UP(product_type)) %>%
filter(str_detect(paste(category_u, type_u), "PIZZA")) %>%
count(department_u, category_u, type_u, sort = TRUE)
pizza_whitelist <- pizza_tax %>%
filter(
str_detect(type_u, "\\bPIZZA\\b"),
!str_detect(type_u, "SAUCE|MIX|QSR|PREP|DIP|SANDWICH|HANDHELD|BURRITO|SNACK|CORN DOG|KITE"),
!(department_u %in% c("DELI", "RESTAURANT QUICK SERVICE"))
) %>%
distinct(department_u, category_u, type_u) %>%
mutate(in_pizza_whitelist = TRUE)
head(pizza_whitelist, 15)
## # A tibble: 8 × 4
## department_u category_u type_u in_pizza_whitelist
## <chr> <chr> <chr> <lgl>
## 1 GROCERY FROZEN PIZZA PIZZA/TRADITIONAL TRUE
## 2 GROCERY FROZEN PIZZA PIZZA/PREMIUM TRUE
## 3 GROCERY FROZEN PIZZA PIZZA/ECONOMY TRUE
## 4 GROCERY FROZEN PIZZA PIZZA/SINGLE SERVE/MIC… TRUE
## 5 NUTRITION FROZEN FROZEN PIZZA TRUE
## 6 MEAT-PCKGD FROZEN MEAT PIZZA TRUE
## 7 GROCERY MISC. DAIRY REFRIG PIZZA&INGRED N… TRUE
## 8 GROCERY FRZN MEAT/MEAT DINNERS PIZZA/PREMIUM TRUE
products_tagged <- products %>%
mutate(
department_u = UP(department),
category_u = UP(product_category),
type_u = UP(product_type),
text_u = paste(category_u, type_u, UP(brand))
) %>%
left_join(pizza_whitelist, by = c("department_u","category_u","type_u")) %>%
mutate(is_pizza = coalesce(in_pizza_whitelist, FALSE)) %>%
select(-in_pizza_whitelist) %>%
mutate(
is_beer = department == "GROCERY" &
product_category == "BEERS/ALES" &
product_type %in% beer_types &
!str_detect(text_u, beer_blacklist_regex)
)
products_tagged %>%
summarise(beer_skus = sum(is_beer), pizza_skus = sum(is_pizza))
## # A tibble: 1 × 2
## beer_skus pizza_skus
## <int> <int>
## 1 793 464
Transactions are joined to the tagged catalog so each line item carries is_beer and is_pizza. We then aggregate to the basket level (the unit closest to a customer’s “trip”) and compute:
beer_flag: did the basket contain any beer?
pizza_flag: did the basket contain any (true) pizza?
basket_sales: total sales value for that basket
txn_date and month: calendar keys for trend analysis
This basket frame allows clean estimation of co-purchase frequency, conditional probabilities (e.g., P(beer | pizza)), and time-based patterns (e.g., monthly spikes) without double-counting line items. Guardrails in the code warn if either flag is never observed, which usually indicates a tagging or join issue.
tx <- transactions %>%
inner_join(products_tagged, by = "product_id")
stopifnot(all(c("is_beer","is_pizza","department_u") %in% names(tx)))
baskets <- tx %>%
group_by(basket_id, household_id) %>%
summarise(
transaction_timestamp = first(transaction_timestamp),
beer_flag = any(is_beer, na.rm = TRUE),
pizza_flag = any(is_pizza, na.rm = TRUE),
basket_sales = sum(sales_value, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
txn_date = as_date(transaction_timestamp),
month = floor_date(txn_date, "month")
)
if (sum(baskets$beer_flag) == 0) warning("No beer baskets identified; check beer tagging rules.")
if (sum(baskets$pizza_flag) == 0) warning("No pizza baskets identified; check pizza tagging rules.")
We summarize five headline KPIs to size the opportunity:
Overall P(beer): r scales::percent(kpi$overall_p_beer, 0.01) — baseline probability that a basket contains beer.
P(beer | pizza): r scales::percent(kpi$p_beer_given_pizza, 0.01) — probability of beer given the basket includes pizza.
P(beer | no pizza): r scales::percent(kpi$p_beer_given_no_pizza, 0.01) — counterfactual comparison group.
Lift: r round(kpi$lift_vs_overall, 2)× — the relative increase vs. baseline when pizza is present.
Co-purchase rate (beer ∧ pizza): r scales::percent(kpi$copurchase_rate, 0.01) — share of baskets that contain both.
kpi <- tibble(
overall_p_beer = mean(baskets$beer_flag),
p_beer_given_pizza = mean(baskets$beer_flag[baskets$pizza_flag]),
p_beer_given_no_pizza = mean(baskets$beer_flag[!baskets$pizza_flag]),
lift_vs_overall = p_beer_given_pizza / overall_p_beer,
copurchase_rate = mean(baskets$beer_flag & baskets$pizza_flag),
pizza_baskets = sum(baskets$pizza_flag),
beer_baskets = sum(baskets$beer_flag),
total_baskets = nrow(baskets)
)
kpi %>%
mutate(across(c(overall_p_beer, p_beer_given_pizza, p_beer_given_no_pizza, copurchase_rate),
scales::percent_format(accuracy = 0.01)),
lift_vs_overall = round(lift_vs_overall, 2)) %>%
gt() %>%
tab_header(title = "Beer–Pizza KPIs")
Beer–Pizza KPIs | |||||||
overall_p_beer | p_beer_given_pizza | p_beer_given_no_pizza | lift_vs_overall | copurchase_rate | pizza_baskets | beer_baskets | total_baskets |
---|---|---|---|---|---|---|---|
5.51% | 8.03% | 5.38% | 1.46 | 0.41% | 7871 | 8584 | 155691 |
beer_rates <- baskets %>%
group_by(pizza_flag) %>%
summarise(n = n(),
hits = sum(beer_flag),
p = hits / n,
se = sqrt(p*(1-p)/n),
lo = p - 1.96*se,
hi = p + 1.96*se,
.groups = "drop") %>%
mutate(group = if_else(pizza_flag, "With Pizza", "Without Pizza"))
ggplot(beer_rates, aes(group, p)) +
geom_col(width = 0.6) +
geom_errorbar(aes(ymin = pmax(lo,0), ymax = pmin(hi,1)), width = 0.15) +
geom_text(aes(label = percent(p, 0.01)), vjust = -0.5) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Probability of Buying Beer by Pizza Presence",
y = "P(Beer)", x = NULL) +
theme_minimal(base_size = 13)
combo <- baskets %>%
count(pizza_flag, beer_flag) %>%
mutate(pizza = if_else(pizza_flag, "Pizza", "No pizza"),
beer = if_else(beer_flag, "Beer", "No beer"))
ggplot(combo, aes(pizza, beer, fill = n)) +
geom_tile(color = "white") +
geom_text(aes(label = comma(n)), size = 4) +
scale_fill_continuous(name = "Baskets") +
labs(title = "Beer–Pizza Basket Combinations", x = NULL, y = NULL) +
theme_minimal(base_size = 13)
trend <- baskets %>%
group_by(month) %>%
summarise(
beer_rate = mean(beer_flag),
pizza_rate = mean(pizza_flag),
both_rate = mean(beer_flag & pizza_flag),
.groups = "drop"
) %>%
pivot_longer(beer_rate:both_rate, names_to="metric", values_to="rate")
ggplot(trend, aes(month, rate, color = metric)) +
geom_line(linewidth = 1.1) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Monthly Share of Baskets with Beer, Pizza, and Both",
x = NULL, y = "Share of baskets", color = NULL) +
theme_minimal(base_size = 12)
To estimate the upper-bound incremental beer revenue attributable to pizza baskets, we use a simple, transparent formula: Incremental dollas ≈ ΔP (beer ∣ pizza) × # pizza baskets × avg beer $ per beer basket Plugging in our estimates:
ΔP(beer) from pizza vs. no pizza: r scales::percent(kpi dollars p_beer_given_pizza - kpi p_beer_given_no_pizza, 0.01)
Number of pizza baskets: r scales::comma(pizza_n)
Avg beer dollars conditional on beer being present: r scales::dollar(beer_spend_per_basket)
Back-of-envelope result: r scales::dollar(incremental_rev_total) total incremental beer sales across the sample window.
Assumptions & caveats. This sizing assumes: (1) the change in probability is causal (in practice, run a controlled test), (2) no cannibalization or inventory constraints, (3) no legal constraints on alcohol promotions, and (4) average beer spend remains stable. In a pilot, we would measure net incremental profit after promo costs and redemption leakage, with a matched-store or time-based control.
beer_spend_per_basket <- tx %>%
group_by(basket_id) %>%
summarise(beer_sales = sum(if_else(is_beer, sales_value, 0), na.rm = TRUE), .groups = "drop") %>%
filter(beer_sales > 0) %>%
summarise(avg_beer_sales = mean(beer_sales)) %>%
pull(avg_beer_sales)
p_beer_pizza <- mean(baskets$beer_flag[baskets$pizza_flag])
p_beer_nopizza <- mean(baskets$beer_flag[!baskets$pizza_flag])
delta_p <- p_beer_pizza - p_beer_nopizza
pizza_n <- sum(baskets$pizza_flag)
incremental_rev_total <- delta_p * pizza_n * beer_spend_per_basket
tibble(
metric = c("Avg beer $ per beer basket", "ΔP(beer) from pizza", "# pizza baskets",
"Est. incremental beer $ (total)"),
value = c(beer_spend_per_basket, delta_p, pizza_n, incremental_rev_total)
) %>%
mutate(value_fmt = dplyr::case_when(
metric == "ΔP(beer) from pizza" ~ scales::percent(value, 0.01),
metric == "# pizza baskets" ~ scales::comma(value),
TRUE ~ scales::dollar(value)
)) %>%
select(metric, value_fmt) %>%
gt() %>% tab_header(title = "Estimated Incremental Revenue from Pizza → Beer")
Estimated Incremental Revenue from Pizza → Beer | |
metric | value_fmt |
---|---|
Avg beer $ per beer basket | $9.57 |
ΔP(beer) from pizza | 2.65% |
# pizza baskets | 7,871.0 |
Est. incremental beer $ (total) | $1,995.91 |
Coverage: Are we capturing all true beer SKUs?
beer_departments <- c("GROCERY")
coverage <- tx %>%
filter(department %in% beer_departments, product_category == "BEERS/ALES") %>%
summarise(
total_sales = sum(sales_value, na.rm = TRUE),
beer_sales = sum(if_else(is_beer, sales_value, 0), na.rm = TRUE),
coverage_pct = beer_sales / total_sales
)
coverage
## # A tibble: 1 × 3
## total_sales beer_sales coverage_pct
## <dbl> <dbl> <dbl>
## 1 82027. 82027. 1
Missed top sellers (inspect and add to whitelist if true beer):
tx %>%
filter(department == "GROCERY", product_category == "BEERS/ALES", !is_beer) %>%
group_by(product_id, product_type) %>%
summarise(sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(sales)) %>% head(25)
## # A tibble: 0 × 3
## # ℹ 3 variables: product_id <chr>, product_type <chr>, sales <dbl>
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] completejourney_1.1.0 gt_1.1.0 scales_1.4.0
## [4] ggplot2_3.5.2 janitor_2.2.1 lubridate_1.9.4
## [7] stringr_1.5.2 tidyr_1.3.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 sass_0.4.10 generics_0.1.4 xml2_1.4.0
## [5] stringi_1.8.7 hms_1.1.3 digest_0.6.37 magrittr_2.0.3
## [9] evaluate_1.0.5 grid_4.5.1 timechange_0.3.0 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 jsonlite_2.0.0 progress_1.2.3 purrr_1.1.0
## [17] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6
## [21] crayon_1.5.3 withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [25] tools_4.5.1 zeallot_0.2.0 curl_7.0.0 vctrs_0.6.5
## [29] R6_2.6.1 lifecycle_1.0.4 snakecase_0.11.1 fs_1.6.6
## [33] pkgconfig_2.0.3 pillar_1.11.0 bslib_0.9.0 gtable_0.3.6
## [37] glue_1.8.0 xfun_0.52 tibble_3.3.0 tidyselect_1.2.1
## [41] rstudioapi_0.17.1 knitr_1.50 farver_2.1.2 htmltools_0.5.8.1
## [45] labeling_0.4.3 rmarkdown_2.29 compiler_4.5.1 prettyunits_1.2.0