Final Project - Group 6

Introduction

Business Problem

Promotions are one of Regork’s largest controllable levers for driving short-term sales, but they also carry real costs: limited prime display space, ad “real estate,” and the opportunity cost of promoting one product over another. The CEO should care because not all promotion placements are equally effective some placements may meaningfully lift demand, while others may produce little benefit or even correlate with lower sales.

This analysis addresses a practical question that directly impacts revenue strategy and merchandising execution:

For Regork’s highest-selling products, which in-store display locations and mailer placements are associated with the largest lift in weekly sales (revenue and units), compared to weeks with no display/no ad controlling for product and store?

How we addressed it (data + methodology)

We used two CompleteJourney datasets:

Transactions: what customers bought, including weekly sales and units
Promotions: where products were promoted (display + mailer placement codes)

Methodologically, we:

Focused on the most recent 13 weeks to analyze recent promotion performance
Selected the Top 20 products by revenue to prioritize high-impact items
Aggregated sales to product × store × week (matching how promotions are defined)
Collapsed multiple promotion records into a single “most prominent placement” per product-store-week using placement priority rules
Estimated lift by comparing promoted weeks to a baseline of the same product in the same store during weeks with no display and not on ad (controls for product and store)

How this helps??

This analysis helps leadership decide where to invest premium promotional real estate. The proposed solution is to:

Prioritize placements that show consistent positive lift (especially those with enough evidence/observations)
Use these placements for the Top 20 products to maximize sales lift
Avoid or investigate placements that show negative or weak association with sales

Packages Required

The following R packages are required in order to run the code in this R project:

library(completejourney)  # dataset tables (transactions/products/promotions)
library(tidyverse)        # data wrangling + plotting toolkit
library(dplyr)            # data manipulation + joins
library(stringr)          # string/text cleaning
library(ggplot2)          # charts/visualizations
library(scales)           # axis/label formatting (percent, dollar, comma)
library(knitr)            # R Markdown + kable tables
library(kableExtra)       # styled/pretty tables

Data Preparation

Step 1 — Define analysis window + top products

We narrowed the analysis to the last 13 weeks and the Top 20 products by revenue. This ensures:

Results reflect recent promotion behavior
Findings focus on products with meaningful business impact

Output: top_products_tbl confirms which products drive the most revenue in the chosen window.

transactions <- get_transactions()
promotions  <- get_promotions()

# BUSINESS PARAMETERS (dynamic)
n_weeks <- 13          
top_n_products <- 20   

# Baseline rule: at least 30% of window weeks or minimum 4
baseline_min_weeks <- max(4, floor(n_weeks * 0.3))

# Identify the most recent 13 unique weeks in the data
weeks_keep <- transactions %>%
  distinct(week) %>%
  arrange(desc(week)) %>%
  slice_head(n = n_weeks)
# Filter transactions to only those selected weeks
tx_13 <- transactions %>%
  semi_join(weeks_keep, by = "week")

top_products <- tx_13 %>%
  group_by(product_id) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_sales)) %>%
  slice_head(n = 20) %>%
  select(product_id)

Step 2 — Create a product-store-week sales table

We aggregated transactions to product × store × week, producing:

weekly_sales (revenue)
weekly_units (volume)

This grain is critical because promotions are defined at the same level.

Output: tx_psw becomes the foundation for lift measurement.

tx_psw <- tx_13 %>%
  semi_join(top_products, by = "product_id") %>% # keep only top 20 products
  group_by(product_id, store_id, week) %>%      # grain: product-store-week
  summarise(
    weekly_sales = sum(sales_value, na.rm = TRUE),
    weekly_units = sum(quantity, na.rm = TRUE),
    .groups = "drop"
  )

Step 3 — Join promotions correctly (avoid row explosion)

The promotions table can have multiple rows per product-store-week. To avoid duplicated joins and inflated metrics, we:

filtered promotions to only keys appearing in tx_psw
collapsed promotions into one record per product-store-week
selected the most prominent placement using priority rules

We also created interpretable flags and intensity measures:

has_display, has_mailer, promo_any
n_display_rows, n_mailer_rows

Output: promo_psw is joined to create tx_joined.

#Keys = All product-store-week combination in our aggregated transactions
keys <- tx_psw %>% distinct(product_id, store_id, week)

promo_filtered <- promotions %>% # only those promotions that match those keys
  semi_join(keys, by = c("product_id", "store_id", "week"))

display_priority <- c("0","A","7","9","2","1","6","5","4","3")
mailer_priority  <- c("0","C","A","P","J","F","L","X","Z","H","D")

display_labels <- c(
  "0"="No display",
  "1"="Store front",
  "2"="Store rear",
  "3"="Front end cap",
  "4"="Mid-aisle end cap",
  "5"="Rear end cap",
  "6"="Side aisle end cap",
  "7"="In-aisle",
  "9"="Secondary location display",
  "A"="In-shelf"
)

mailer_labels <- c(
  "0"="Not on ad",
  "A"="Interior page feature",
  "C"="Interior page line item",
  "D"="Front page feature",
  "F"="Back page feature",
  "H"="Wrap front feature",
  "J"="Wrap interior coupon",
  "L"="Wrap back feature",
  "P"="Interior page coupon",
  "X"="Free on interior page",
  "Z"="Free on front/back/wrap"
)

promo_psw <- promo_filtered %>%
  mutate(
    display_location = as.character(display_location),
    mailer_location  = as.character(mailer_location),
    display_location = if_else(is.na(display_location) | display_location == "", "0", display_location),
    mailer_location  = if_else(is.na(mailer_location)  | mailer_location  == "", "0", mailer_location)
  ) %>%
  group_by(product_id, store_id, week) %>%
  summarise(
    # pick most prominent by priority ordering
    display_loc = display_location[which.max(match(display_location, display_priority, nomatch = 0))],
    mailer_loc  = mailer_location[ which.max(match(mailer_location,  mailer_priority,  nomatch = 0))],

    # flags
    has_display = as.integer(any(display_location != "0")),
    has_mailer  = as.integer(any(mailer_location  != "0")),
    promo_any   = as.integer(any(display_location != "0" | mailer_location != "0")),

    # intensity
    n_display_rows = sum(display_location != "0"),
    n_mailer_rows  = sum(mailer_location  != "0"),

    .groups = "drop"
  ) %>%
  mutate(
    display_label = display_labels[display_loc],
    mailer_label  = mailer_labels[mailer_loc]
  )

tx_joined <- tx_psw %>%
  left_join(promo_psw, by = c("product_id", "store_id", "week")) %>%
  mutate(
    display_loc   = coalesce(display_loc, "0"),
    mailer_loc    = coalesce(mailer_loc,  "0"),
    display_label = coalesce(display_label, "No display"),
    mailer_label  = coalesce(mailer_label,  "Not on ad"),
    has_display   = coalesce(has_display, 0L),
    has_mailer    = coalesce(has_mailer,  0L),
    promo_any     = coalesce(promo_any,   0L),
    n_display_rows = coalesce(n_display_rows, 0L),
    n_mailer_rows  = coalesce(n_mailer_rows,  0L)
  )

Exploratory Data Analysis

EDA goal

This EDA is not simply “plot the raw data.” We engineered analytical features by combining transactions and promotions to uncover patterns that are not visible in either dataset alone.

Step 4 — Lift definition (key insight mechanism)

To control for product and store, we computed a baseline within each product-store combination.

Baseline weeks satisfy:

No display (display_loc == “0”)
Not on ad (mailer_loc == “0”)

We required at least 4 baseline weeks for stability, preventing lift estimates from being driven by very small baseline samples.

baseline_min_weeks <- max(4, floor(n_weeks * 0.3))

tx_lift <- tx_joined %>%
  group_by(product_id, store_id) %>%
  mutate(
    baseline_weeks = sum(display_loc == "0" & mailer_loc == "0", na.rm = TRUE),
    baseline_sales = mean(weekly_sales[display_loc == "0" & mailer_loc == "0"], na.rm = TRUE),
    baseline_units = mean(weekly_units[display_loc == "0" & mailer_loc == "0"], na.rm = TRUE)
  ) %>%
  ungroup() %>%
  filter(
    baseline_weeks >= baseline_min_weeks,
    !is.na(baseline_sales), baseline_sales > 0,
    !is.na(baseline_units), baseline_units > 0
  ) %>%
  mutate(
    lift_sales_pct = (weekly_sales - baseline_sales) / baseline_sales,
    lift_units_pct = (weekly_units - baseline_units) / baseline_units
  )

Step 5 — Core EDA outputs (tables + plots)

We produced:

Ranked table: Display location → avg sales lift, avg unit lift, n_obs
Ranked table: Mailer placement → avg sales lift, avg unit lift, n_obs
Two plots visualizing average lift by placement

These outputs reveal “placement effectiveness” patterns that are not self-evident in the raw data.

Key patterns observed (with sample-size caveat)

Display placements

Some display locations show extremely high lift but have very low n_obs (ex: mid-aisle end cap with n = 1). These should be treated as hypotheses, not major recommendations.
Front end cap shows a smaller positive lift with much stronger support (ex: n = 65), making it more believable.
Secondary location display shows negative lift in the observed window, suggesting it may be less effective or used in weaker demand conditions.

display_impact <- tx_lift %>%
  group_by(display_loc, display_label) %>%
  summarise(
    avg_lift_sales_pct = mean(lift_sales_pct, na.rm = TRUE),
    avg_lift_units_pct = mean(lift_units_pct, na.rm = TRUE),
    n_obs = n(),
    .groups = "drop"
  )  %>% slice_max(order_by = avg_lift_sales_pct, n = 6)

display_impact

## # A tibble: 6 × 5
##   display_loc display_label      avg_lift_sales_pct avg_lift_units_pct n_obs
##   <chr>       <chr>                           <dbl>              <dbl> <int>
## 1 4           Mid-aisle end cap              2.82                5         1
## 2 6           Side aisle end cap             1.27                1.92      3
## 3 2           Store rear                     0.677               1.18     10
## 4 1           Store front                    0.116               0.121     9
## 5 7           In-aisle                       0.0934              0.536     9
## 6 3           Front end cap                  0.0914              0.224    65

Mailer placements

Wrap front feature (H) and interior page feature (A) show large lift with moderate sample sizes (~50 observations).
Front page feature (D) shows smaller lift but has a very large sample size (ex: n = 824), making it a reliable “workhorse” placement.
“Free on interior page (X)” shows negative lift, which may reflect cannibalization or context effects.

This is strong EDA because it reports both signal and evidence strength (n_obs).

mailer_impact <- tx_lift %>%
  group_by(mailer_loc, mailer_label) %>%
  summarise(
    avg_lift_sales_pct = mean(lift_sales_pct, na.rm = TRUE),
    avg_lift_units_pct = mean(lift_units_pct, na.rm = TRUE),
    n_obs = n(),
    .groups = "drop"
  ) %>%
  slice_max(order_by = avg_lift_sales_pct, n = 6)

mailer_impact

## # A tibble: 6 × 5
##   mailer_loc mailer_label          avg_lift_sales_pct avg_lift_units_pct n_obs
##   <chr>      <chr>                              <dbl>              <dbl> <int>
## 1 H          Wrap front feature              0.849              1.95        54
## 2 A          Interior page feature           0.652              1.12        50
## 3 F          Back page feature               0.266              0.438       42
## 4 D          Front page feature              0.227              0.706      824
## 5 0          Not on ad                       0.000523           0.000361  5059
## 6 X          Free on interior page          -0.134             -0.158      123

subtitle_text <- paste0(

  "Lift vs baseline (no display + no mailer), last ", n_weeks,

  " weeks, top ", top_n_products, " products"

)

display_plot <- display_impact %>%

  mutate(

    lift_pct = avg_lift_sales_pct,

    label_txt = paste0(percent(lift_pct, accuracy = 0.1), "  |  n=", comma(n_obs))

  )



ggplot(display_plot,

       aes(x = reorder(display_label, lift_pct), y = lift_pct, fill = lift_pct)) +

  geom_col(width = 0.75, alpha = 0.95, color = "white", linewidth = 0.6) +

  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.9, color = "gray35") +

  geom_text(

    aes(label = label_txt),

    hjust = ifelse(display_plot$lift_pct >= 0, -0.10, 1.10),

    size = 4.1,

    fontface = "bold",

    color = "#1f1f1f"

  ) +

  coord_flip() +

  scale_y_continuous(

    labels = percent_format(accuracy = 1),

    expand = expansion(mult = c(0.06, 0.28))

  ) +

  scale_fill_gradient2(

    low = "#FF4D6D",   # vibrant negative

    mid = "#F4F4F4",   # neutral around 0

    high = "#00B4D8",  # vibrant positive

    midpoint = 0,

    labels = percent_format(accuracy = 1),

    name = "Lift"

  ) +

  labs(

    title = "Average Sales Lift by Display Location (Top 5)",

    subtitle = subtitle_text,

    x = NULL,

    y = "Avg Sales Lift"

  ) +

  theme_minimal(base_size = 14) +

  theme(

    plot.title = element_text(face = "bold", size = 18, color = "#243665"),

    plot.subtitle = element_text(size = 12, color = "gray30"),

    axis.text.y = element_text(face = "bold"),

    legend.position = "right",

    panel.grid.major.y = element_blank()

  )

mailer_plot <- mailer_impact %>%

  mutate(

    lift_pct = avg_lift_sales_pct,

    label_txt = paste0(percent(lift_pct, accuracy = 0.1), "  |  n=", comma(n_obs))

  )



ggplot(mailer_plot,

       aes(x = reorder(mailer_label, lift_pct), y = lift_pct, fill = lift_pct)) +

  geom_col(width = 0.75, alpha = 0.95, color = "white", linewidth = 0.6) +

  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.9, color = "gray35") +

  geom_text(

    aes(label = label_txt),

    hjust = ifelse(mailer_plot$lift_pct >= 0, -0.10, 1.10),

    size = 4.1,

    fontface = "bold",

    color = "#1f1f1f"

  ) +

  coord_flip() +

  scale_y_continuous(

    labels = percent_format(accuracy = 1),

    expand = expansion(mult = c(0.06, 0.28))

  ) +

  scale_fill_gradient2(

    low = "#FF4D6D",

    mid = "#F4F4F4",

    high = "#00B4D8",

    midpoint = 0,

    labels = percent_format(accuracy = 1),

    name = "Lift"

  ) +

  labs(

    title = "Average Sales Lift by Mailer Placement (Top 5)",

    subtitle = subtitle_text,

    x = NULL,

    y = "Avg Sales Lift"

  ) +

  theme_minimal(base_size = 14) +

  theme(

    plot.title = element_text(face = "bold", size = 18, color = "#243665"),

    plot.subtitle = element_text(size = 12, color = "gray30"),

    axis.text.y = element_text(face = "bold"),

    legend.position = "right",

    panel.grid.major.y = element_blank()

  )

### Glimpse of Sales of top Products

top_products_tbl <- tx_13 %>%
  semi_join(top_products, by = "product_id") %>%
  group_by(product_id) %>%
  summarise(total_sales_13w = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_sales_13w))

top_products_tbl

## # A tibble: 20 × 2
##    product_id total_sales_13w
##    <chr>                <dbl>
##  1 6534178             67676.
##  2 6533889              6958.
##  3 1029743              6668.
##  4 6533765              5016.
##  5 6534166              4610.
##  6 1106523              4571.
##  7 1082185              4180 
##  8 916122               4045.
##  9 844179               3096.
## 10 995242               3000.
## 11 1044078              2418.
## 12 5569230              2400.
## 13 1126899              2351.
## 14 1070820              2105.
## 15 899624               2022.
## 16 1005186              1903.
## 17 1127831              1882.
## 18 866211               1803.
## 19 854852               1706.
## 20 948670               1685.

Summary

Problem recap

We evaluated which promotional placements (display and mailer) are most associated with sales lift for Regork’s highest-selling products.

Method recap

Using CompleteJourney transactions and promotions:

restricted to the last 13 weeks
selected Top 20 products by revenue
aggregated to product-store-week
collapsed promotions to “most prominent placement”
computed lift relative to within product-store baseline weeks with no promotion exposure

Interesting insights

High-visibility mailer placements (wrap/front features) are associated with higher lift
Front page placements show more modest lift but have strong evidence (large n_obs)
Some placements show negative associations and may be poor candidates for revenue-driven campaigns

Recommendations (CEO-facing)

Use Wrap front (H) and Interior feature (A) for Top 20 products when maximizing lift is the goal
Keep Front page feature (D) as a scalable default placement because evidence is strongest
Prioritize Front end caps for displays due to positive lift with meaningful support
Treat results with very low n_obs as hypotheses—collect more weeks or expand products before policy decisions
Investigate placements with negative association (ex: “Free on interior page (X)”)

Limitations + improvements

This analysis is associational, not causal (placements are not randomized)
Some placements have low sample size in the 13-week window
Improvements:
- extend to more weeks and/or more products
- run a regression with week fixed effects (controls for seasonality)
- incorporate price/discount variables to separate placement effect from price effect