1 Executive Summary

  • Question: Do baskets with frozen pizza have a higher probability of also containing beer?
  • Answer (preview): Using catalog-driven tags for beer and frozen pizza, we estimate the relative lift in beer purchase probability when pizza is present and identify time/segment windows to target cross-promotions.
  • Recommendation: Pilot pizza↔︎beer bundles during high-lift periods (e.g., weekends, major sports dates), measure incremental revenue and repeat.

2 Data & Packages

We use the Complete Journey dataset to explore whether frozen pizza purchases are associated with higher beer purchase probability at Regork. The analysis relies on a small, reproducible toolkit: dplyr/tidyr for wrangling, stringr/janitor for tidy text and columns, lubridate for dates, ggplot2 for visuals, and gt for tables. To minimize false positives (e.g., “root beer” or pizza sauces), we tag beer and pizza from the product catalog itself (department → category → type), using a whitelist for pizza product types that explicitly contain “PIZZA,” and the observed beer taxonomy under GROCERY → BEERS/ALES. This catalog-driven tagging keeps results transparent and easy to audit.

We also run quick coverage checks to confirm that our beer rules capture nearly all beer sales in the relevant department(s) and that we aren’t accidentally tagging non-beer items. Where necessary, we refine the whitelist/blacklist and re-run the checks (documented in the Sensitivity section).

Used following packages: dplyr tidyr stringr lubridate janitor ggplot2 scales gt completejourney

library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(janitor)
library(ggplot2)
library(scales)
library(gt)
library(completejourney)
transactions <- get_transactions() |> clean_names()
products     <- completejourney::products |> clean_names()
demographics <- completejourney::demographics |> clean_names()

3 Tagging Logic (Beer & Pizza)

We use the catalog to avoid regex guesswork and prevent false positives (e.g., root beer, sauces).

UP <- function(x) toupper(coalesce(x, ""))
catalog_top <- products %>%
  count(department, product_category, product_type, sort = TRUE)
head(catalog_top, 25)
## # A tibble: 25 × 4
##    department product_category               product_type                 n
##    <chr>      <chr>                          <chr>                    <int>
##  1 DRUG GM    GREETING CARDS/WRAP/PARTY SPLY CARDS EVERYDAY            1005
##  2 GROCERY    BEERS/ALES                     BEERALEMALT LIQUORS        770
##  3 GROCERY    SPICES & EXTRACTS              SPICES & SEASONINGS        606
##  4 DRUG GM    GREETING CARDS/WRAP/PARTY SPLY GIFT-WRAP EVERYDAY         547
##  5 DRUG GM    HAIR CARE PRODUCTS             SHAMPOO                    512
##  6 COSMETICS  MAKEUP AND TREATMENT           MAYBELLINE                 509
##  7 GROCERY    YOGURT                         YOGURT NOT MULTI-PACKS     505
##  8 COSMETICS  MAKEUP AND TREATMENT           COVERGIRL                  501
##  9 GROCERY    BAG SNACKS                     POTATO CHIPS               483
## 10 GROCERY    SALD DRSNG/SNDWCH SPRD         POURABLE SALAD DRESSINGS   458
## # ℹ 15 more rows
beer_types <- products %>%
  filter(department == "GROCERY", product_category == "BEERS/ALES") %>%
  distinct(product_type) %>%
  pull()

beer_blacklist_regex <- "\\b(ROOT BEER|GINGER BEER|SARSAPARILLA|BEER BRAT|BEER BATTER|BEER BREAD)\\b"
pizza_tax <- products %>%
  mutate(department_u = UP(department),
         category_u   = UP(product_category),
         type_u       = UP(product_type)) %>%
  filter(str_detect(paste(category_u, type_u), "PIZZA")) %>%
  count(department_u, category_u, type_u, sort = TRUE)

pizza_whitelist <- pizza_tax %>%
  filter(
    str_detect(type_u, "\\bPIZZA\\b"),
    !str_detect(type_u, "SAUCE|MIX|QSR|PREP|DIP|SANDWICH|HANDHELD|BURRITO|SNACK|CORN DOG|KITE"),
    !(department_u %in% c("DELI", "RESTAURANT QUICK SERVICE"))
  ) %>%
  distinct(department_u, category_u, type_u) %>%
  mutate(in_pizza_whitelist = TRUE)

head(pizza_whitelist, 15)
## # A tibble: 8 × 4
##   department_u category_u             type_u                  in_pizza_whitelist
##   <chr>        <chr>                  <chr>                   <lgl>             
## 1 GROCERY      FROZEN PIZZA           PIZZA/TRADITIONAL       TRUE              
## 2 GROCERY      FROZEN PIZZA           PIZZA/PREMIUM           TRUE              
## 3 GROCERY      FROZEN PIZZA           PIZZA/ECONOMY           TRUE              
## 4 GROCERY      FROZEN PIZZA           PIZZA/SINGLE SERVE/MIC… TRUE              
## 5 NUTRITION    FROZEN                 FROZEN PIZZA            TRUE              
## 6 MEAT-PCKGD   FROZEN MEAT            PIZZA                   TRUE              
## 7 GROCERY      MISC. DAIRY            REFRIG PIZZA&INGRED  N… TRUE              
## 8 GROCERY      FRZN MEAT/MEAT DINNERS PIZZA/PREMIUM           TRUE
products_tagged <- products %>%
  mutate(
    department_u = UP(department),
    category_u   = UP(product_category),
    type_u       = UP(product_type),
    text_u       = paste(category_u, type_u, UP(brand))
  ) %>%
  
  left_join(pizza_whitelist, by = c("department_u","category_u","type_u")) %>%
  mutate(is_pizza = coalesce(in_pizza_whitelist, FALSE)) %>%
  select(-in_pizza_whitelist) %>%
  
  mutate(
    is_beer = department == "GROCERY" &
              product_category == "BEERS/ALES" &
              product_type %in% beer_types &
              !str_detect(text_u, beer_blacklist_regex)
  )

products_tagged %>%
  summarise(beer_skus = sum(is_beer), pizza_skus = sum(is_pizza))
## # A tibble: 1 × 2
##   beer_skus pizza_skus
##       <int>      <int>
## 1       793        464

4 Build Basket Frame

Transactions are joined to the tagged catalog so each line item carries is_beer and is_pizza. We then aggregate to the basket level (the unit closest to a customer’s “trip”) and compute:

beer_flag: did the basket contain any beer?

pizza_flag: did the basket contain any (true) pizza?

basket_sales: total sales value for that basket

txn_date and month: calendar keys for trend analysis

This basket frame allows clean estimation of co-purchase frequency, conditional probabilities (e.g., P(beer | pizza)), and time-based patterns (e.g., monthly spikes) without double-counting line items. Guardrails in the code warn if either flag is never observed, which usually indicates a tagging or join issue.

tx <- transactions %>%
  inner_join(products_tagged, by = "product_id")

stopifnot(all(c("is_beer","is_pizza","department_u") %in% names(tx)))

baskets <- tx %>%
  group_by(basket_id, household_id) %>%
  summarise(
    transaction_timestamp = first(transaction_timestamp),
    beer_flag  = any(is_beer,  na.rm = TRUE),
    pizza_flag = any(is_pizza, na.rm = TRUE),
    basket_sales = sum(sales_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    txn_date = as_date(transaction_timestamp),
    month    = floor_date(txn_date, "month")
  )


if (sum(baskets$beer_flag) == 0) warning("No beer baskets identified; check beer tagging rules.")
if (sum(baskets$pizza_flag) == 0) warning("No pizza baskets identified; check pizza tagging rules.")

5 KPIs

We summarize five headline KPIs to size the opportunity:

Overall P(beer): r scales::percent(kpi$overall_p_beer, 0.01) — baseline probability that a basket contains beer.

P(beer | pizza): r scales::percent(kpi$p_beer_given_pizza, 0.01) — probability of beer given the basket includes pizza.

P(beer | no pizza): r scales::percent(kpi$p_beer_given_no_pizza, 0.01) — counterfactual comparison group.

Lift: r round(kpi$lift_vs_overall, 2)× — the relative increase vs. baseline when pizza is present.

Co-purchase rate (beer ∧ pizza): r scales::percent(kpi$copurchase_rate, 0.01) — share of baskets that contain both.

kpi <- tibble(
  overall_p_beer        = mean(baskets$beer_flag),
  p_beer_given_pizza    = mean(baskets$beer_flag[baskets$pizza_flag]),
  p_beer_given_no_pizza = mean(baskets$beer_flag[!baskets$pizza_flag]),
  lift_vs_overall       = p_beer_given_pizza / overall_p_beer,
  copurchase_rate       = mean(baskets$beer_flag & baskets$pizza_flag),
  pizza_baskets         = sum(baskets$pizza_flag),
  beer_baskets          = sum(baskets$beer_flag),
  total_baskets         = nrow(baskets)
)

kpi %>%
  mutate(across(c(overall_p_beer, p_beer_given_pizza, p_beer_given_no_pizza, copurchase_rate),
                scales::percent_format(accuracy = 0.01)),
         lift_vs_overall = round(lift_vs_overall, 2)) %>%
  gt() %>%
  tab_header(title = "Beer–Pizza KPIs")
Beer–Pizza KPIs
overall_p_beer p_beer_given_pizza p_beer_given_no_pizza lift_vs_overall copurchase_rate pizza_baskets beer_baskets total_baskets
5.51% 8.03% 5.38% 1.46 0.41% 7871 8584 155691

6 Visualization Story

6.1 1. Probability of Buying Beer (with vs without Pizza)

beer_rates <- baskets %>%
  group_by(pizza_flag) %>%
  summarise(n = n(),
            hits = sum(beer_flag),
            p = hits / n,
            se = sqrt(p*(1-p)/n),
            lo = p - 1.96*se,
            hi = p + 1.96*se,
            .groups = "drop") %>%
  mutate(group = if_else(pizza_flag, "With Pizza", "Without Pizza"))

ggplot(beer_rates, aes(group, p)) +
  geom_col(width = 0.6) +
  geom_errorbar(aes(ymin = pmax(lo,0), ymax = pmin(hi,1)), width = 0.15) +
  geom_text(aes(label = percent(p, 0.01)), vjust = -0.5) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Probability of Buying Beer by Pizza Presence",
       y = "P(Beer)", x = NULL) +
  theme_minimal(base_size = 13)

6.2 2. Basket Combination Heatmap (Scale of Opportunity)

combo <- baskets %>%
  count(pizza_flag, beer_flag) %>%
  mutate(pizza = if_else(pizza_flag, "Pizza", "No pizza"),
         beer  = if_else(beer_flag,  "Beer",  "No beer"))

ggplot(combo, aes(pizza, beer, fill = n)) +
  geom_tile(color = "white") +
  geom_text(aes(label = comma(n)), size = 4) +
  scale_fill_continuous(name = "Baskets") +
  labs(title = "Beer–Pizza Basket Combinations", x = NULL, y = NULL) +
  theme_minimal(base_size = 13)

6.3 3. Monthly Trend (Beer, Pizza, Both)

trend <- baskets %>%
  group_by(month) %>%
  summarise(
    beer_rate  = mean(beer_flag),
    pizza_rate = mean(pizza_flag),
    both_rate  = mean(beer_flag & pizza_flag),
    .groups = "drop"
  ) %>%
  pivot_longer(beer_rate:both_rate, names_to="metric", values_to="rate")

ggplot(trend, aes(month, rate, color = metric)) +
  geom_line(linewidth = 1.1) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Monthly Share of Baskets with Beer, Pizza, and Both",
       x = NULL, y = "Share of baskets", color = NULL) +
  theme_minimal(base_size = 12)

7 Incremental Revenue Sizing (Back-of-Envelope)

To estimate the upper-bound incremental beer revenue attributable to pizza baskets, we use a simple, transparent formula: Incremental dollas ≈ ΔP (beer ∣ pizza) × # pizza baskets × avg beer $ per beer basket Plugging in our estimates:

ΔP(beer) from pizza vs. no pizza: r scales::percent(kpi dollars p_beer_given_pizza - kpi p_beer_given_no_pizza, 0.01)

Number of pizza baskets: r scales::comma(pizza_n)

Avg beer dollars conditional on beer being present: r scales::dollar(beer_spend_per_basket)

Back-of-envelope result: r scales::dollar(incremental_rev_total) total incremental beer sales across the sample window.

Assumptions & caveats. This sizing assumes: (1) the change in probability is causal (in practice, run a controlled test), (2) no cannibalization or inventory constraints, (3) no legal constraints on alcohol promotions, and (4) average beer spend remains stable. In a pilot, we would measure net incremental profit after promo costs and redemption leakage, with a matched-store or time-based control.

beer_spend_per_basket <- tx %>%
  group_by(basket_id) %>%
  summarise(beer_sales = sum(if_else(is_beer, sales_value, 0), na.rm = TRUE), .groups = "drop") %>%
  filter(beer_sales > 0) %>%
  summarise(avg_beer_sales = mean(beer_sales)) %>%
  pull(avg_beer_sales)

p_beer_pizza   <- mean(baskets$beer_flag[baskets$pizza_flag])
p_beer_nopizza <- mean(baskets$beer_flag[!baskets$pizza_flag])
delta_p        <- p_beer_pizza - p_beer_nopizza
pizza_n        <- sum(baskets$pizza_flag)
incremental_rev_total <- delta_p * pizza_n * beer_spend_per_basket

tibble(
  metric = c("Avg beer $ per beer basket", "ΔP(beer) from pizza", "# pizza baskets",
             "Est. incremental beer $ (total)"),
  value  = c(beer_spend_per_basket, delta_p, pizza_n, incremental_rev_total)
) %>%
  mutate(value_fmt = dplyr::case_when(
    metric == "ΔP(beer) from pizza" ~ scales::percent(value, 0.01),
    metric == "# pizza baskets"     ~ scales::comma(value),
    TRUE                            ~ scales::dollar(value)
  )) %>%
  select(metric, value_fmt) %>%
  gt() %>% tab_header(title = "Estimated Incremental Revenue from Pizza → Beer")
Estimated Incremental Revenue from Pizza → Beer
metric value_fmt
Avg beer $ per beer basket $9.57
ΔP(beer) from pizza 2.65%
# pizza baskets 7,871.0
Est. incremental beer $ (total) $1,995.91

8 Sensitivity & Data Quality Checks

  • Coverage: Are we capturing all true beer SKUs?

    beer_departments <- c("GROCERY")
    coverage <- tx %>%
      filter(department %in% beer_departments, product_category == "BEERS/ALES") %>%
      summarise(
        total_sales = sum(sales_value, na.rm = TRUE),
        beer_sales  = sum(if_else(is_beer, sales_value, 0), na.rm = TRUE),
        coverage_pct = beer_sales / total_sales
      )
    coverage
    ## # A tibble: 1 × 3
    ##   total_sales beer_sales coverage_pct
    ##         <dbl>      <dbl>        <dbl>
    ## 1      82027.     82027.            1
  • Missed top sellers (inspect and add to whitelist if true beer):

    tx %>%
      filter(department == "GROCERY", product_category == "BEERS/ALES", !is_beer) %>%
      group_by(product_id, product_type) %>%
      summarise(sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
      arrange(desc(sales)) %>% head(25)
    ## # A tibble: 0 × 3
    ## # ℹ 3 variables: product_id <chr>, product_type <chr>, sales <dbl>

9 Recommendation

  • Run a 4–6 week pilot: bundle a top frozen pizza SKU with a mainstream beer style.
  • Launch during high-lift weeks identified in the trend chart; feature endcaps + digital coupon.
  • Measure: ΔP(beer) in pizza baskets, incremental beer $/basket, redemption rate, and net lift vs control stores.

10 Appendix

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] completejourney_1.1.0 gt_1.1.0              scales_1.4.0         
## [4] ggplot2_3.5.2         janitor_2.2.1         lubridate_1.9.4      
## [7] stringr_1.5.2         tidyr_1.3.1           dplyr_1.1.4          
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.6         sass_0.4.10        generics_0.1.4     xml2_1.4.0        
##  [5] stringi_1.8.7      hms_1.1.3          digest_0.6.37      magrittr_2.0.3    
##  [9] evaluate_1.0.5     grid_4.5.1         timechange_0.3.0   RColorBrewer_1.1-3
## [13] fastmap_1.2.0      jsonlite_2.0.0     progress_1.2.3     purrr_1.1.0       
## [17] codetools_0.2-20   jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6       
## [21] crayon_1.5.3       withr_3.0.2        cachem_1.1.0       yaml_2.3.10       
## [25] tools_4.5.1        zeallot_0.2.0      curl_7.0.0         vctrs_0.6.5       
## [29] R6_2.6.1           lifecycle_1.0.4    snakecase_0.11.1   fs_1.6.6          
## [33] pkgconfig_2.0.3    pillar_1.11.0      bslib_0.9.0        gtable_0.3.6      
## [37] glue_1.8.0         xfun_0.52          tibble_3.3.0       tidyselect_1.2.1  
## [41] rstudioapi_0.17.1  knitr_1.50         farver_2.1.2       htmltools_0.5.8.1 
## [45] labeling_0.4.3     rmarkdown_2.29     compiler_4.5.1     prettyunits_1.2.0