Introduction

Promotional campaigns are a cornerstone of retail marketing, designed to attract customers, boost sales, and strengthen brand loyalty. Yet, variations in how customer segments respond to these promotions can lead to inefficient marketing spend and missed revenue opportunities. Regork’s leadership must understand which demographic groups show the strongest response to promotional activities to optimize resource allocation and maximize ROI.

This analysis addresses the business problem by examining transactional, promotional, and demographic data from the Complete Journey dataset. Through data wrangling and exploratory data analysis (EDA), key trends were identified using visualizations such as bar charts, heatmaps, and correlation matrices. The investigation revealed that middle-income households exhibit the highest coupon savings, while younger and certain older households demonstrate a stronger response to promotional displays.

Based on these insights, the proposed solution is to reallocate and tailor Regork’s promotional strategies. Specifically, promotional budgets should be concentrated on middle-income segments through coupon-based initiatives and customized marketing messages for younger and older demographic groups. This targeted approach promises to enhance customer engagement, drive sales growth, and improve overall marketing efficiency.

Packages Required

suppressWarnings(suppressMessages(library(completejourney)))
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(lubridate)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(corrplot)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(naniar)))

Below is a list of the packages loaded in this session with a brief description of each:

Data Preparation

In this section, I prepared the datasets for analysis by cleaning, transforming, and merging key data sources. The main goal was to ensure that transactions are properly linked to promotions and demographics so I can analyze how different customer segments respond to promotional campaigns.

Steps Taken in Data Preparation:

1. Loaded Data

I imported datasets from the Complete Journey package, including: transactions: Contains all grocery store transactions. promotions: Includes details about in-store promotions and mailer campaigns. demographics: Provides household-level characteristics (age, income, etc.). products: Contains product categories (optional for later use).

2. Merged Promotions Data with Transactions

A left join was performed to match transactions with promotions using product_id, store_id, and week. I assumed that if display_location or mailer_location was “0”, the product was not promoted. A new promotion flag (is_promoted) was created: 1 means The transaction involved a promoted product. 0 means The transaction involved a non-promoted product.

3. Merged Demographics Data with Transactions

A left join was used to attach household demographic information (age, income, marital_status, etc.) to each transaction. This ensures that we retain transactions even if some households have missing demographic data.

4. Created Date Variables for Time-Based Analysis

Extracted the transaction date from the timestamp. Created transaction_month and transaction_quarter to allow for time-series analysis.

Final Outcome

At the end of this process, I obtained a cleaned and structured dataset (transactions_full) that is ready for Exploratory Data Analysis (EDA).

# DATA PREPARATION


# 1. Load the CompleteJourney datasets 
transactions <- get_transactions()
promotions   <- get_promotions()
demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
# 2. Merge transactions and promotions
transactions_promo <- transactions %>%
  left_join(promotions, by = c("product_id", "store_id", "week"), relationship = "many-to-many") %>%
  mutate(
    is_promoted = if_else(
      (!is.na(display_location) & display_location != "0") |
      (!is.na(mailer_location) & mailer_location != "0"),
      1, 0
    )
  )


# 3. Merge in demographic data using left join
transactions_full <- transactions_promo %>%
  left_join(demographics, by = "household_id") %>%
# 4. Create time-related variables 
  mutate(
    transaction_date    = as.Date(transaction_timestamp),
    transaction_month   = floor_date(transaction_date, "month"),
    transaction_quarter = quarter(transaction_date, type = "year.quarter")
  )
transactions_full
## # A tibble: 1,470,002 × 24
##    household_id store_id basket_id   product_id quantity sales_value retail_disc
##    <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
##  1 900          330      31198570044 1095275           1        0.5         0   
##  2 900          330      31198570047 9878513           1        0.99        0.1 
##  3 1228         406      31198655051 1041453           1        1.43        0.15
##  4 906          319      31198705046 1020156           1        1.5         0.29
##  5 906          319      31198705046 1053875           2        2.78        0.8 
##  6 906          319      31198705046 1060312           1        5.49        0.5 
##  7 906          319      31198705046 1075313           1        1.5         0.29
##  8 1058         381      31198676055 985893            1        1.88        0.21
##  9 1058         381      31198676055 988791            1        1.5         1.29
## 10 1058         381      31198676055 9297106           1        2.69        0   
## # ℹ 1,469,992 more rows
## # ℹ 17 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>, display_location <fct>,
## #   mailer_location <fct>, is_promoted <dbl>, age <ord>, income <ord>,
## #   home_ownership <ord>, marital_status <ord>, household_size <ord>,
## #   household_comp <ord>, kids_count <ord>, transaction_date <date>,
## #   transaction_month <date>, transaction_quarter <dbl>

Exploratory Data Analysis

# Summarize transaction counts by age and promotion flag
promo_by_age <- transactions_full %>%
  group_by(age, is_promoted) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(is_promoted = factor(is_promoted, levels = c(0, 1),
                                labels = c("Not Promoted", "Promoted")))

# Plot grouped bar chart by age
ggplot(promo_by_age, aes(x = age, y = count, fill = is_promoted)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Promotion Participation by Age Group",
       x = "Age Group",
       y = "Number of Transactions",
       fill = "Promotion Status") +
  theme_minimal()

Analysis of Promotion Participation by Age Group

This bar chart visualizes the number of transactions across age groups, distinguishing between promoted and non-promoted purchases. The 45-54 age group has the highest transaction volume, followed by 35-44 and 25-34, suggesting they are the most active shoppers. Promotions appear to have limited engagement across all age groups, as non-promoted transactions significantly outweigh promoted ones.

Notably, the 65+ age group has fewer transactions overall, which may indicate lower shopping frequency or less interest in promotions. Additionally, a large NA category suggests missing age data, potentially limiting the accuracy of demographic-based insights.

This graph helps identify which age groups respond best to promotions. Since middle-aged shoppers (35-54) show the highest transaction volume, tailoring promotions to this group could maximize effectiveness. Meanwhile, older shoppers may require different promotional strategies.

# 1. Filter and Transform Data
#    - Remove any transactions with non-positive sales_value (if any).
#    - Create a new variable log_sales.

transactions_log <- transactions_full %>%
  filter(!is.na(sales_value), sales_value >= 0) %>%
  mutate(log_sales = log(sales_value + 1))


# 2. Boxplot of Log-Transformed Sales by Promotion Status

ggplot(transactions_log, aes(
  x = factor(is_promoted, levels = c(0, 1), labels = c("Not Promoted", "Promoted")),
  y = log_sales
)) +
  geom_boxplot() +
  labs(
    title = "Log-Transformed Sales Value Distribution by Promotion Status",
    x = "Promotion Status",
    y = "Log(Sales Value + 1)"
  ) +
  theme_minimal()

Analysis of Log-Transformed Sales Value Distribution by Promotion Status

This boxplot (with log-transformed sales values) presents a clearer view of the sales distribution for promoted and non-promoted transactions.

The graph shows that sales value distributions are nearly identical for promoted and non-promoted transactions. The median and spread are very similar, suggesting that promotions do not significantly impact individual transaction spending. This implies that promotions may not encourage customers to spend more per transaction.

transactions_full <- transactions_full %>%
  filter(year(transaction_date) == 2017)

monthly_sales <- transactions_full %>%
  group_by(transaction_month, is_promoted) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
  mutate(is_promoted = factor(is_promoted, levels = c(0, 1),
                                labels = c("Not Promoted", "Promoted")))

ggplot(monthly_sales, aes(x = transaction_month, y = total_sales, color = is_promoted)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(title = "Monthly Sales Trends by Promotion Status",
       x = "Month",
       y = "Total Sales Value",
       color = "Promotion Status") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  theme_minimal()

Analysis of Monthly Sales Trends by Promotion Status

This line chart tracks total sales for promoted and non-promoted transactions across 2017. The data shows that non-promoted sales consistently outperform promoted sales, maintaining a stable trend with minor fluctuations. Promoted sales remain relatively flat throughout the year, suggesting that promotions do not significantly shift overall spending patterns.

This graph reinforces that promotions do not drastically increase total sales, though their impact on specific customer segments or purchase behavior (e.g., quantity purchased per transaction) still needs further investigation. Next, we analyze whether promotions influence the number of items purchased per transaction to see if customers respond to promotions by buying more.

ggplot(transactions_full, aes(x = sales_value, y = quantity,
                              color = factor(is_promoted, levels = c(0, 1),
                                             labels = c("Not Promoted", "Promoted")))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ age) +
  labs(title = "Promotion Impact on Quantity Purchased by Age Group",
       x = "Sales Value",
       y = "Quantity Purchased",
       color = "Promotion Status") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Analysis of Promotion Impact on Quantity Purchased by Age Group

This scatter plot with trend lines examines the relationship between sales value and quantity purchased, segmented by age group and promotion status. The visualization reveals extreme outliers, where some transactions report unrealistically high quantities (e.g., over 50,000 units). These extreme values skew the trend lines, making the data difficult to interpret.

Despite the noise, the trend lines suggest a strong positive correlation between sales value and quantity purchased, meaning that higher spending generally corresponds to larger quantities. However, no clear distinction is observed between promoted and non-promoted transactions, indicating that promotions do not significantly alter this relationship across age groups.

avg_trans <- transactions_full %>%
  group_by(income, is_promoted) %>%
  summarise(avg_sales = mean(sales_value, na.rm = TRUE),
            sd_sales  = sd(sales_value, na.rm = TRUE),
            n = n(), .groups = "drop") %>%
  mutate(se = sd_sales / sqrt(n),
         is_promoted = factor(is_promoted, levels = c(0, 1),
                              labels = c("Not Promoted", "Promoted")))

ggplot(avg_trans, aes(x = income, y = avg_sales, fill = is_promoted)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = avg_sales - se, ymax = avg_sales + se),
                position = position_dodge(width = 0.9), width = 0.25) +
  labs(title = "Average Transaction Value by Income and Promotion Status",
       x = "Income Group",
       y = "Average Sales Value",
       fill = "Promotion Status") +
  theme_minimal()

Analysis of Average Transaction Value by Income and Promotion Status

This bar chart with error bars compares the average transaction value across different income groups, distinguishing between promoted and non-promoted transactions.

Across all income groups, non-promoted transactions consistently have a higher average sales value than promoted transactions. The gap is more pronounced in higher-income groups ($125K+), suggesting that wealthier customers tend to spend more per transaction, even without promotions. Lower-income groups exhibit smaller differences, implying that promotions may be slightly more effective in influencing their spending.

This finding supports the idea that promotions may not necessarily increase the amount spent per transaction, especially for higher-income groups.

promo_rate <- transactions_full %>%
  group_by(age, household_size) %>%
  summarise(promo_rate = mean(is_promoted, na.rm = TRUE), .groups = "drop")

ggplot(promo_rate, aes(x = age, y = factor(household_size), fill = promo_rate)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Heatmap of Promotional Responsiveness",
       x = "Age Group",
       y = "Household Size",
       fill = "Promotion Rate") +
  theme_minimal()

Analysis of Promotional Responsiveness Across Age Groups and Household Sizes

This heatmap visualizes the promotion rate (the proportion of transactions that were promoted) across different age groups and household sizes. Darker blue shades indicate higher promotional responsiveness.

Key insights:

This analysis helps identify key target groups for future promotions. Younger families (19-24, household size 4) and older adult households (55-64, larger sizes) seem to be more promotion-responsive, suggesting opportunities for tailored marketing strategies.

promo_marital <- transactions_full %>%
  group_by(marital_status, is_promoted) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(is_promoted = factor(is_promoted, levels = c(0, 1),
                                labels = c("Not Promoted", "Promoted")))

ggplot(promo_marital, aes(x = marital_status, y = count, fill = is_promoted)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Promotion Distribution by Marital Status",
       x = "Marital Status",
       y = "Transaction Count",
       fill = "Promotion Status") +
  theme_minimal()

Analysis of Promotion Distribution by Marital Status

This bar chart compares the number of promoted vs. non-promoted transactions across different marital status categories. The NA category is unusually large, indicating a significant portion of transactions lack marital status data.

Both married and unmarried shoppers engage less with promotions, with non-promoted transactions dominating in both groups. The small difference in promotion engagement between married and unmarried customers suggests that marital status alone may not be a strong predictor of promotional responsiveness.

coupon_savings_plot <- transactions_full %>%
  filter(is_promoted == 1) %>%
  mutate(coupon_savings = coupon_disc + coupon_match_disc) %>%
  group_by(income) %>%
  summarise(avg_coupon_savings = mean(coupon_savings, na.rm = TRUE), .groups = "drop")

ggplot(coupon_savings_plot, aes(x = income, y = avg_coupon_savings)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  labs(title = "Average Coupon Savings in Promoted Transactions by Income",
       x = "Income Group",
       y = "Average Coupon Savings") +
  theme_minimal()

Analysis of Average Coupon Savings in Promoted Transactions by Income Group

This bar chart displays the average coupon savings for promoted transactions across different income groups. The values show how much customers save through coupon-based promotions. The 75K-99K income group has the highest average coupon savings, suggesting that middle-income earners engage more with coupon promotions. Lower-income groups (<50K) have relatively lower coupon savings, indicating that either they use fewer coupons or that the coupons available to them result in smaller savings. The highest-income group (250K+) shows the lowest coupon savings, which suggests that wealthier customers are less likely to use coupons or that their purchases are less influenced by discounts.

This visualization supports the idea that coupon promotions are most effective for middle-income groups (75K-99K) but may not be as impactful for lower- or higher-income customers. This suggests an opportunity to optimize coupon distribution by targeting groups that show high responsiveness.

# Select key numeric variables and compute coupon savings
numeric_vars <- transactions_full %>%
  select(sales_value, quantity, coupon_disc, coupon_match_disc) %>%
  mutate(coupon_savings = coupon_disc + coupon_match_disc)

# Compute the correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")

# Plot the correlation matrix using corrplot
corrplot::corrplot(cor_matrix, method = "color", type = "upper",
                   tl.col = "black", addCoef.col = "black", number.cex = 0.7)

Analysis of Correlation Matrix of Key Variables

This correlation heatmap displays relationships between key numerical variables, including sales value, quantity purchased, and coupon-related discounts. The color scale represents the strength and direction of correlations, with darker blue indicating strong positive correlations and lighter shades representing weaker or near-zero relationships.

Key Insights:

This heatmap reinforces earlier insights: promotions and coupon discounts do not strongly impact spending behavior, and their effectiveness should be assessed based on purchase frequency or customer engagement rather than transaction size.

Summary

Problem Statement

The problem statement was “Which demographic groups are most responsive to promotions, and how can Regork tailor marketing efforts to boost sales among under-engaged segments?” The goal was to identify where promotional strategies (in-store displays, coupons, mailers) are most effective and to highlight any potential opportunities for optimization.

Approach and Methodology

I combined the Complete Journey datasets—including transactions, promotions, and demographics—into a single analytic view, ensuring that each transaction was labeled with relevant demographic attributes (e.g., age, income) and promotion status (whether it was associated with a display, mailer, or coupon). I then performed Exploratory Data Analysis (EDA), using descriptive statistics, boxplots, heatmaps, bar charts, and correlation plots to compare transaction behaviors across different segments (e.g., age, household size, marital status, income).

Key Insights

Promotional Engagement Is Consistently Lower Than Non-Promoted

Implications and Recommendations

Limitations and Future Improvements

Overall, the findings suggest that promotions alone do not dramatically increase per-transaction spending, but they do resonate with certain demographics—particularly middle-income or specific household-size groups. By refining promotional targeting and exploring new tactics for high-income segments, Regork can more efficiently allocate marketing resources for sustained revenue growth.