Regork (a national grocery chain) invests heavily in promotions: in-store displays, mailers, and coupons. Promotions are intended to increase sales, but discounts can also erode revenue and profit if they do not generate enough incremental volume.
This report answers one focused business question:
Which customer segments respond profitably to promotions (campaigns/coupons), and where are we over-discounting with little incremental lift?
Promotions are a major, controllable lever for growth. If discounting is not paying back in incremental revenue (and especially if it is not paying back in margin), then:
Finding which segments (and discount tools) are most efficient lets Regork reallocate promotional spend toward higher-return activity.
Data used (2 required tables):
transactions: customer purchases (household, basket,
product, quantities, net sales, discount components, week)promotions: product-store-week display/mailer
placementsKey engineering:
sales_value + total_discount (since list price isn’t
directly provided).promo_any, promo_display,
promo_mailer).Incremental lift approach (interpretable and robust):
Important: The dataset does not include item-level margins/COGS, so “profitability” is assessed using a revenue-based ROI proxy. I highlight limitations and how to improve the measurement.
This analysis produces:
This report uses:
Messages/warnings from package loading are suppressed for readability and reproducibility.
inventory <- readr::read_csv(file.path(tbl_dir, "data_inventory.csv"), show_col_types = FALSE)
render_table(inventory, caption = "Data inventory (row/column counts from the analysis pipeline)")
| table | rows | cols |
|---|---|---|
| transactions | 1460438 | 19 |
| promotions | 20927744 | 6 |
| products_from_get_data | 0 | 0 |
| demographics_from_get_data | 0 | 0 |
| tx_joined | 1460438 | 23 |
Notes on available tables.
In this course distribution of Complete Journey, only
transactions and promotions were available (no
product hierarchy table, no demographics table). As a result:
weekly <- readr::read_csv(file.path(tbl_dir, "weekly_summary.csv"), show_col_types = FALSE)
weekly_summary <- weekly %>%
summarise(
weeks = n(),
avg_weekly_net_sales = mean(net_sales),
avg_discount_rate = mean(discount_rate),
min_discount_rate = min(discount_rate),
max_discount_rate = max(discount_rate),
avg_promo_share_net = mean(promo_share_net)
)
render_table(
weekly_summary %>%
mutate(
avg_weekly_net_sales = dollar(avg_weekly_net_sales),
avg_discount_rate = percent(avg_discount_rate, accuracy = 0.1),
min_discount_rate = percent(min_discount_rate, accuracy = 0.1),
max_discount_rate = percent(max_discount_rate, accuracy = 0.1),
avg_promo_share_net = percent(avg_promo_share_net, accuracy = 0.1)
),
caption = "High-level weekly summary across the full time window"
)
| weeks | avg_weekly_net_sales | avg_discount_rate | min_discount_rate | max_discount_rate | avg_promo_share_net |
|---|---|---|---|---|---|
| 53 | $86,717.49 | 15.1% | 13.3% | 18.7% | 19.6% |
knitr::include_graphics(c(
file.path(fig_dir, "weekly_net_sales.png"),
file.path(fig_dir, "weekly_discount_rate.png")
))
What stands out:
kpi_value <- readr::read_csv(file.path(tbl_dir, "segment_kpis_value_segment.csv"), show_col_types = FALSE) %>%
mutate(
net_sales = dollar(net_sales),
gross_sales = dollar(gross_sales),
discount = dollar(discount),
discount_rate = percent(discount_rate, accuracy = 0.1),
promo_net_sales = dollar(promo_net_sales),
promo_share_net = percent(promo_share_net, accuracy = 0.1),
coupon_discount = dollar(coupon_discount)
)
render_table(kpi_value, caption = "Segment KPIs by Value Segment (quartiles of household net sales)")
| value_segment | households | trips | net_sales | gross_sales | discount | discount_rate | promo_net_sales | promo_share_net | coupon_discount |
|---|---|---|---|---|---|---|---|---|---|
| Value Q4 (Top) | 617 | 80978 | $2,846,046 | $3,328,250 | $482,204 | 14.5% | $530,990 | 18.7% | $16,622.95 |
| Value Q3 | 617 | 43415 | $1,110,066 | $1,316,928 | $206,861 | 15.7% | $229,408 | 20.7% | $5,492.73 |
| Value Q2 | 617 | 22480 | $501,210 | $600,736 | $99,526 | 16.6% | $113,511 | 22.6% | $1,693.64 |
| Value Q1 (Low) | 618 | 8645 | $138,705 | $166,121 | $27,416 | 16.5% | $30,609 | 22.1% | $548.93 |
Interpretation:
Because margins are not available, I use a revenue-based ROI proxy:
\[ \text{ROI proxy} = \frac{\text{Gross lift} - \text{Discount lift}}{\text{Discount lift}} \]
promo_lift <- readr::read_csv(file.path(tbl_dir, "promo_lift_by_value_segment.csv"), show_col_types = FALSE) %>%
mutate(
net_lift_promo_any = round(net_lift_promo_any, 4),
discount_lift_promo_any = round(discount_lift_promo_any, 4),
roi_proxy = round(roi_proxy, 4)
)
render_table(promo_lift, caption = "Estimated promo lift and ROI proxy (promotion vs non-promotion) by Value Segment")
| value_segment | net_lift_promo_any | discount_lift_promo_any | roi_proxy |
|---|---|---|---|
| Value Q1 (Low) | -0.2432 | 0.7873 | -0.3088 |
| Value Q2 | -0.1716 | 0.7751 | -0.2213 |
| Value Q3 | -0.2179 | 0.7650 | -0.2848 |
| Value Q4 (Top) | -0.1621 | 0.8734 | -0.1855 |
knitr::include_graphics(file.path(fig_dir, "promo_roi_proxy_value_segment.png"))
Finding: ROI proxy is negative for every value segment.
CEO implication (revenue proxy): current broad promotions look like over-discounting, especially for low-value households. If margins are thin (typical grocery), the profit impact is likely even worse.
Coupon “efficiency” is evaluated at the basket level as:
coupon_eff <- readr::read_csv(file.path(tbl_dir, "coupon_efficiency_by_segment.csv"), show_col_types = FALSE) %>%
mutate(
coef_gross_per_1_coupon = round(coef_gross_per_1_coupon, 3),
net_return_per_1_coupon = round(net_return_per_1_coupon, 3)
)
render_table(coupon_eff, caption = "Coupon efficiency (revenue proxy) by segment")
| segment_var | segment_level | coef_gross_per_1_coupon | net_return_per_1_coupon |
|---|---|---|---|
| value_segment | Value Q3 | 9.371 | 8.371 |
| value_segment | Value Q1 (Low) | 4.791 | 3.791 |
| value_segment | Value Q4 (Top) | 11.206 | 10.206 |
| value_segment | Value Q2 | 11.728 | 10.728 |
| promo_prone_segment | Promo Med-Low | 11.198 | 10.198 |
| promo_prone_segment | Promo Low | 12.461 | 11.461 |
| promo_prone_segment | Promo High | 9.004 | 8.004 |
| promo_prone_segment | Promo Med-High | 11.131 | 10.131 |
coupon_highlights <- coupon_eff %>%
group_by(segment_var) %>%
arrange(desc(net_return_per_1_coupon)) %>%
slice_head(n = 1) %>%
ungroup()
render_table(coupon_highlights, caption = "Highest coupon net-return segments (within each segmentation)")
| segment_var | segment_level | coef_gross_per_1_coupon | net_return_per_1_coupon |
|---|---|---|---|
| promo_prone_segment | Promo Low | 12.461 | 11.461 |
| value_segment | Value Q2 | 11.728 | 10.728 |
Interpretation (cautious):
Caution: Coupon usage is not randomly assigned. This result should be treated as a prioritization signal for testing, not a definitive causal estimate.
promo_flags <- readr::read_csv(file.path(tbl_dir, "promo_overdiscount_flags.csv"), show_col_types = FALSE) %>%
mutate(roi_proxy = round(roi_proxy, 4))
coupon_flags <- readr::read_csv(file.path(tbl_dir, "coupon_overdiscount_flags.csv"), show_col_types = FALSE) %>%
mutate(
coef_gross_per_1_coupon = round(coef_gross_per_1_coupon, 3),
net_return_per_1_coupon = round(net_return_per_1_coupon, 3)
)
render_table(promo_flags, caption = "Promotion over-discount flags (ROI proxy ≤ 0)")
| value_segment | net_lift_promo_any | discount_lift_promo_any | roi_proxy | over_discount_flag |
|---|---|---|---|---|
| Value Q1 (Low) | -0.2431593 | 0.7873239 | -0.3088 | TRUE |
| Value Q2 | -0.1715567 | 0.7751448 | -0.2213 | TRUE |
| Value Q3 | -0.2179016 | 0.7650251 | -0.2848 | TRUE |
| Value Q4 (Top) | -0.1620533 | 0.8733747 | -0.1855 | TRUE |
render_table(coupon_flags, caption = "Coupon over-discount flags (net return < 0)")
| segment_var | segment_level | coef_gross_per_1_coupon | net_return_per_1_coupon | over_discount_flag |
|---|---|---|---|---|
| value_segment | Value Q3 | 9.371 | 8.371 | FALSE |
| value_segment | Value Q1 (Low) | 4.791 | 3.791 | FALSE |
| value_segment | Value Q4 (Top) | 11.206 | 10.206 | FALSE |
| value_segment | Value Q2 | 11.728 | 10.728 | FALSE |
| promo_prone_segment | Promo Med-Low | 11.198 | 10.198 | FALSE |
| promo_prone_segment | Promo Low | 12.461 | 11.461 | FALSE |
| promo_prone_segment | Promo High | 9.004 | 8.004 | FALSE |
| promo_prone_segment | Promo Med-High | 11.131 | 10.131 | FALSE |
Bottom line:
Based on these results, the most defensible strategy is not “more promotions,” but “smarter promotions.” Specifically:
To move from “correlation with fixed effects” to decision-grade evidence:
product_id rather
than department/category.The report is designed to knit quickly by loading precomputed outputs
in outputs/.
To reproduce outputs from raw data, run the full analysis script first
(the one that: loads get_transactions() and
get_promotions(), aggregates promotions, creates segments,
runs fixed-effect models, and writes the tables/figures).
# ============================
# Full analysis script used to generate the outputs/ folder
# (shown for reproducibility; not executed when knitting)
# ============================
options(stringsAsFactors = FALSE)
set.seed(123)
pkgs <- c(
"completejourney",
"data.table",
"janitor",
"lubridate",
"fixest",
"ggplot2",
"dplyr",
"readr",
"scales"
)
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages({
library(completejourney)
library(data.table)
library(janitor)
library(lubridate)
library(fixest)
library(ggplot2)
library(dplyr)
library(readr)
library(scales)
})
dir.create("outputs", showWarnings = FALSE)
dir.create("outputs/tables", showWarnings = FALSE)
dir.create("outputs/figures", showWarnings = FALSE)
dir.create("outputs/models", showWarnings = FALSE)
# 1) Load data
transactions_raw <- completejourney::get_transactions()
promotions_raw <- completejourney::get_promotions()
# 2) Clean transactions
tx <- transactions_raw %>%
janitor::clean_names() %>%
mutate(
week = as.integer(week),
transaction_timestamp = suppressWarnings(lubridate::ymd_hms(transaction_timestamp, tz = "UTC")),
quantity = as.numeric(quantity),
sales_value = as.numeric(sales_value),
retail_disc = as.numeric(retail_disc),
coupon_disc = as.numeric(coupon_disc),
coupon_match_disc = as.numeric(coupon_match_disc)
) %>%
filter(!is.na(household_id), !is.na(product_id), !is.na(store_id), !is.na(week)) %>%
filter(!is.na(quantity), quantity > 0) %>%
mutate(
retail_disc_amt = abs(coalesce(retail_disc, 0)),
coupon_disc_amt = abs(coalesce(coupon_disc, 0)),
coupon_match_disc_amt = abs(coalesce(coupon_match_disc, 0)),
coupon_total_amt = coupon_disc_amt + coupon_match_disc_amt,
total_disc_amt = retail_disc_amt + coupon_total_amt,
gross_sales_est = sales_value + total_disc_amt,
used_coupon = as.integer(coupon_total_amt > 0),
any_discount = as.integer(total_disc_amt > 0)
)
# 3) Promotions aggregation with data.table (fast)
prom_dt <- as.data.table(promotions_raw)
setnames(prom_dt, names(prom_dt), janitor::make_clean_names(names(prom_dt)))
need_cols <- c("product_id", "store_id", "week", "display_location", "mailer_location")
prom_dt <- prom_dt[, ..need_cols]
prom_dt[, week := as.integer(week)]
prom_dt[, display_location := as.character(display_location)]
prom_dt[, mailer_location := as.character(mailer_location)]
prom_dt[, promo_display := fifelse(is.na(display_location) | display_location %in% c("0","NONE","none","","NA"), 0L, 1L)]
prom_dt[, promo_mailer := fifelse(is.na(mailer_location) | mailer_location %in% c("0","NONE","none","","NA"), 0L, 1L)]
promotions <- prom_dt[
,
.(
promo_display = as.integer(max(promo_display)),
promo_mailer = as.integer(max(promo_mailer)),
promo_any = as.integer(max(promo_display | promo_mailer))
),
by = .(product_id, store_id, week)
]
rm(promotions_raw, prom_dt); gc()
# Join promotions -> transactions
tx <- as.data.table(tx)
setDT(promotions)
setkey(tx, product_id, store_id, week)
setkey(promotions, product_id, store_id, week)
tx <- promotions[tx] # left join
tx[is.na(promo_display), promo_display := 0L]
tx[is.na(promo_mailer), promo_mailer := 0L]
tx[is.na(promo_any), promo_any := 0L]
# 4) Household segments
hh <- tx[
,
.(
total_net_sales = sum(sales_value, na.rm = TRUE),
total_gross_sales = sum(gross_sales_est, na.rm = TRUE),
total_discount = sum(total_disc_amt, na.rm = TRUE),
coupon_discount = sum(coupon_total_amt, na.rm = TRUE),
promo_net_sales = sum(fifelse(promo_any == 1L, sales_value, 0), na.rm = TRUE),
trips = uniqueN(basket_id),
weeks_active = uniqueN(week),
units = sum(quantity, na.rm = TRUE)
),
by = .(household_id)
]
hh[, discount_rate := fifelse(total_gross_sales > 0, total_discount / total_gross_sales, NA_real_)]
hh[, promo_share_net := fifelse(total_net_sales > 0, promo_net_sales / total_net_sales, NA_real_)]
hh[, value_quartile := dplyr::ntile(total_net_sales, 4)]
hh[, promo_quartile := dplyr::ntile(fifelse(is.na(promo_share_net), 0, promo_share_net), 4)]
hh[, value_segment := factor(value_quartile,
levels = 1:4,
labels = c("Value Q1 (Low)", "Value Q2", "Value Q3", "Value Q4 (Top)"))]
hh[, promo_prone_segment := factor(promo_quartile,
levels = 1:4,
labels = c("Promo Low", "Promo Med-Low", "Promo Med-High", "Promo High"))]
fwrite(hh, "outputs/tables/household_segments.csv")
# Attach segments
setkey(hh, household_id)
setkey(tx, household_id)
tx <- hh[tx]
# 5) Segment KPI tables (fixed)
seg_kpis <- function(segcol) {
stopifnot(segcol %in% names(tx))
tx[
,
.(
households = uniqueN(household_id),
trips = uniqueN(basket_id),
net_sales = sum(sales_value, na.rm = TRUE),
gross_sales = sum(gross_sales_est, na.rm = TRUE),
discount = sum(total_disc_amt, na.rm = TRUE),
discount_rate = ifelse(sum(gross_sales_est, na.rm = TRUE) > 0,
sum(total_disc_amt, na.rm = TRUE) / sum(gross_sales_est, na.rm = TRUE),
NA_real_),
promo_net_sales = sum(fifelse(promo_any == 1L, sales_value, 0), na.rm = TRUE),
promo_share_net = ifelse(sum(sales_value, na.rm = TRUE) > 0,
sum(fifelse(promo_any == 1L, sales_value, 0), na.rm = TRUE) / sum(sales_value, na.rm = TRUE),
NA_real_),
coupon_discount = sum(coupon_total_amt, na.rm = TRUE)
),
by = segcol
][order(-net_sales)]
}
k1 <- seg_kpis("value_segment"); fwrite(k1, "outputs/tables/segment_kpis_value_segment.csv")
k2 <- seg_kpis("promo_prone_segment"); fwrite(k2, "outputs/tables/segment_kpis_promo_prone_segment.csv")
# 6) Promo lift by value segment (product-store FE + week FE)
psw <- tx[
,
.(
gross_sales = sum(gross_sales_est, na.rm = TRUE),
discount = sum(total_disc_amt, na.rm = TRUE),
promo_any = as.integer(max(promo_any))
),
by = .(value_segment, product_id, store_id, week)
]
promo_lift <- list()
for (seg in levels(psw$value_segment)) {
d <- psw[value_segment == seg]
m_g_any <- feols(gross_sales ~ promo_any | product_id^store_id + week, data = d, cluster = ~store_id)
m_d_any <- feols(discount ~ promo_any | product_id^store_id + week, data = d, cluster = ~store_id)
net_lift <- as.numeric(coef(m_g_any)["promo_any"]) - as.numeric(coef(m_d_any)["promo_any"])
disc_lift <- as.numeric(coef(m_d_any)["promo_any"])
roi <- ifelse(abs(disc_lift) > 1e-9, net_lift / disc_lift, NA_real_)
promo_lift[[seg]] <- data.frame(
value_segment = seg,
net_lift_promo_any = net_lift,
discount_lift_promo_any = disc_lift,
roi_proxy = roi
)
saveRDS(m_g_any, paste0("outputs/models/promo_", make.names(seg), "_gross_any.rds"))
saveRDS(m_d_any, paste0("outputs/models/promo_", make.names(seg), "_disc_any.rds"))
}
promo_lift_df <- do.call(rbind, promo_lift)
write_csv(promo_lift_df, "outputs/tables/promo_lift_by_value_segment.csv")
p_roi <- ggplot(promo_lift_df, aes(x = value_segment, y = roi_proxy)) +
geom_col() +
labs(
title = "Promo ROI Proxy by Value Segment",
subtitle = "ROI proxy = (gross lift - discount lift) / discount lift (revenue proxy, not margin)",
x = NULL, y = "ROI proxy"
)
ggsave("outputs/figures/promo_roi_proxy_value_segment.png", p_roi, width = 9, height = 4.5, dpi = 160)
# 7) Coupon efficiency by segment (safe column names)
basket <- tx[
,
.(
gross_sales = sum(gross_sales_est, na.rm = TRUE),
coupon_disc = sum(coupon_total_amt, na.rm = TRUE)
),
by = .(household_id, basket_id, store_id, week, value_segment, promo_prone_segment)
]
coupon_eff <- function(segcol) {
out <- list()
seg_levels <- unique(basket[[segcol]])
for (s in seg_levels) {
d <- basket[get(segcol) == s]
m <- feols(gross_sales ~ coupon_disc | household_id + week + store_id,
data = d, cluster = ~household_id)
b <- as.numeric(coef(m)["coupon_disc"])
out[[as.character(s)]] <- data.frame(
segment_var = segcol,
segment_level = as.character(s),
coef_gross_per_1_coupon = b,
net_return_per_1_coupon = b - 1
)
}
rbindlist(out, fill = TRUE)
}
coupon_df <- rbind(
coupon_eff("value_segment"),
coupon_eff("promo_prone_segment")
)
fwrite(coupon_df, "outputs/tables/coupon_efficiency_by_segment.csv")
# 8) Weekly summary + plots (label_number_si2 replacement)
label_number_si2 <- function(...) {
scales::label_number(scale_cut = scales::cut_si(""), ...)
}
weekly <- tx[
,
.(
net_sales = sum(sales_value, na.rm = TRUE),
gross_sales = sum(gross_sales_est, na.rm = TRUE),
discount = sum(total_disc_amt, na.rm = TRUE),
discount_rate = sum(total_disc_amt, na.rm = TRUE) / sum(gross_sales_est, na.rm = TRUE),
promo_net_sales = sum(fifelse(promo_any == 1L, sales_value, 0), na.rm = TRUE),
promo_share_net = sum(fifelse(promo_any == 1L, sales_value, 0), na.rm = TRUE) / sum(sales_value, na.rm = TRUE)
),
by = .(week)
][order(week)]
fwrite(weekly, "outputs/tables/weekly_summary.csv")
p_week_sales <- ggplot(as.data.frame(weekly), aes(x = week, y = net_sales)) +
geom_line() +
labs(title = "Weekly Net Sales", x = "Week", y = "Net sales ($)") +
scale_y_continuous(labels = label_number_si2())
ggsave("outputs/figures/weekly_net_sales.png", p_week_sales, width = 9, height = 4, dpi = 160)
p_week_disc <- ggplot(as.data.frame(weekly), aes(x = week, y = discount_rate)) +
geom_line() +
labs(title = "Weekly Discount Rate (Discount / Gross Sales)", x = "Week", y = "Discount rate") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))
ggsave("outputs/figures/weekly_discount_rate.png", p_week_disc, width = 9, height = 4, dpi = 160)
# Over-discount flags
promo_flags <- promo_lift_df %>%
mutate(over_discount_flag = (discount_lift_promo_any > 0 & roi_proxy <= 0))
write_csv(promo_flags, "outputs/tables/promo_overdiscount_flags.csv")
coupon_flags <- as.data.table(coupon_df)
coupon_flags[, over_discount_flag := (net_return_per_1_coupon < 0)]
fwrite(coupon_flags, "outputs/tables/coupon_overdiscount_flags.csv")