Introduction

Business problem: The coupon policy

In this project, I aim to investigate the trend of sales & coupon usage rate and determine whether the coupon usage rate plays a significant role in boosting sales value across major departments and different types of households. The coupon usage rate is defined as the percentage of transactions that used at least one coupon during a specific time period. If higher coupon usage rate indeed lead to higher sales value, then Regork should consider adopting a more aggressive coupon strategy to further enhance sales among specific customer groups. On the other hand, if the coupon policy does not yield a positive impact on sales (i.e., the coupon usage rate is not positively correlated with sales value), Regork should reconsider or suspend the policy to prevent potential profit losses.

Method

First, I present a pie chart to illustrate which product categories constitute Regork’s core business and examine their monthly sales patterns.
Next, I classify households based on their demographic characteristics and identify (using Z-score analysis for each characteristic) which customer segments serve as the major consumers within each feature group.
Focusing on the key product categories — Grocery and Drug GM — I further analyze the monthly consumption trends of these major consumer groups. Finally, for each previously defined customer segment, I conduct unit root tests to confirm the stationarity of both sales and coupon usage rate data. Scatter plots and time series regressions are then used to evaluate whether these two variables are significantly correlated within the major product categories (Grocery and Drug GM). This analysis provides insights for decision-makers on whether differentiated coupon strategies should be implemented for different customer segments (especially for major customers).

1. Setup & Data Preparation

# Libraries
library(completejourney)  # Retail transaction dataset (households, products, transactions)
library(ggplot2)          # Data visualization using Grammar of Graphics
library(dplyr)            # Data manipulation (filter, select, mutate, summarise, joins)
library(tidyr)            # Data tidying and reshaping (nest, unnest)
library(lubridate)        # Date-time manipulation (month, quarter, week extraction)
library(forcats)          # Factor (categorical variable) handling (reorder, recode NA)
library(scales)           # Scale functions for formatting axis labels (percent, number)
library(rlang)            # Non-standard evaluation for tidyverse programming (enquo, !!)
library(purrr)            # Functional programming tools (map, map2, iteration)
library(tseries)          # Time series analysis (PP test, ADF test for stationarity)
library(broom)            # Tidy statistical model outputs (tidy, glance for regression)

# Load and prepare data
transactions <- get_transactions() %>%
  mutate(household_id = as.character(household_id), 
         product_id = as.character(product_id))

products <- products %>%
  mutate(product_id = as.character(product_id)) 

demographics <- demographics %>%
  mutate(
    household_id = as.character(household_id),
    home_ownership = fct_explicit_na(home_ownership, "Unknown"), #Replace NA values in categorical variables with "Unknown" for complete analysis
    marital_status = fct_explicit_na(marital_status, "Unknown")
  )

# Create main dataset: combine all data sources into single comprehensive dataset
df_demo <- transactions %>%
  inner_join(products, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  mutate(month = month(transaction_timestamp, label = TRUE))

The above packages are utilized in this final project. The transactions dataset is used to examine weekly sales performance, while the products dataset connects each transaction to its corresponding product department. The demographics dataset provides individual household characteristics that match the households appearing in the transactions data. After merging these datasets, the resulting df_demo serves as the primary dataset used for all subsequent analyses in this project.

2. Department Analysis: Identify top-performing departments and visualize their contribution and seasonality to total sales

Question: Which product departments drive the most revenue?

# Aggregate total sales by department, identify top 8 (top 8 departments represent ~80% of total sales)
dept_top <- transactions %>%
  inner_join(products, by = "product_id") %>%
  count(department, wt = sales_value, name = "total_sales") %>%
  slice_max(total_sales, n = 8) %>%
  mutate(
    pct = total_sales / sum(total_sales),
    dept_group = fct_reorder(department, total_sales, .desc = TRUE),
    legend_lab = paste0(dept_group, " ", percent(pct, accuracy = 0.1))
  )

# Select Top 2 Departments for later Detailed Analysis (for computational efficiency)
dept2 <- dept_top$dept_group[1:2]

For analytical clarity, I focus on the top eight product categories that together account for more than 80% of Regork’s total sales revenue.
Among these categories, Grocery and Drug GM alone contribute roughly two-thirds of the sales within this group and over half of Regork’s overall revenue. Therefore, I recommend that the CEO place particular emphasis on analyzing how to sustain — and potentially increase — the sales performance of these two key product categories.

Visual Comparison of Relative Department Importance

# Pie chart
ggplot(dept_top, aes(x = "", y = pct, fill = dept_group)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  scale_fill_discrete(
    breaks = levels(dept_top$dept_group),
    labels = dept_top$legend_lab
  ) +
  labs(title = "Share of Total Sales by Department (Top 8)", fill = NULL) +
  theme_void() +
  theme(legend.position = "right", legend.text = element_text(size = 12))

As shown in the pie chart, Grocery and Drug GM are the two largest product categories. Other categories such as Fuel, Produce, and Meat also each account for more than 7% of the total share among the top eight products. However, for the sake of analytical focus and clarity, this project concentrates specifically on the Grocery and Drug GM categories for deeper investigation.

Monthly Sales Trends by Department: Identify Seasonal Patterns (e.g., Holiday Spikes)

# Monthly sales by department
df_demo %>%
  summarise(total_sales = sum(sales_value), .by = c(month, department)) %>%
  ggplot(aes(month, total_sales, fill = department)) +
  geom_col() +
  facet_wrap(~department, scales = "free_y") +
  labs(title = "Monthly Sales by Product Department",
       x = "Month", y = "Total Sales") +
  theme_minimal() +
  theme(legend.position = "none")

Before proceeding with the detailed analysis, I first examine the monthly sales trends across all product categories. Overall, the results suggest a clear “year-end effect,” as total sales tend to rise noticeably toward the end of the year. This pattern may be attributed to the Christmas and winter holiday seasons, during which consumers are likely to increase their spending for celebrations. It could also be influenced by year-end promotional campaigns, both of which contribute to the overall increase in Q4 sales compared to the other three quarters.

3. Demographic Analysis Functions: Create reusable functions for demographic analysis and sales trend analysis on a monthly basis

Question: How do different customer segments behave?

FUNCTION 1: Standardized Sales by Demographics: To compare sales performance across demographic groups within departments by using Z-Score

# INPUT: Demographic variable (age, income, household_size, kids_count)
# OUTPUT: Bar chart showing standardized sales by department and demographic group
plot_std_sales <- function(var, data = df_demo, top_n = 2) {
  var_quo <- enquo(var)
  var_name <- quo_name(var_quo)
  
  # Get top N departments
  top_dept <- data %>%
    summarise(total_sales = sum(sales_value), .by = department) %>%
    slice_max(total_sales, n = top_n) %>%
    pull(department)
  
  # Calculate z-scores: **Positive z-score = above-average sales for that demographic **Negative z-score = below-average sales
  data %>%
    filter(department %in% top_dept) %>%
    mutate(.var = ifelse(is.na(!!var_quo), "Unknown", as.character(!!var_quo))) %>%
    summarise(total_sales = sum(sales_value), .by = c(department, .var)) %>%
    group_by(department) %>%
    mutate(z = as.numeric(scale(total_sales))) %>%
    ungroup() %>%
    ggplot(aes(department, z, fill = .var)) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
    geom_col(position = position_dodge(width = 0.8), width = 0.75) +
    labs(
      title = paste0("Standardized Sales by Department and ", var_name),
      x = "Department",
      y = "Standardized Sales (z-score)",
      fill = var_name
    ) +
    scale_fill_brewer(palette = "Spectral", direction = -1) +
    theme_minimal(base_size = 12) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

This function is designed to identify the major consumer groups across different household characteristics within the top n product categories by sales value. The calculation is based on the Z-score approach, applied separately to each product category to distinguish between major and minor consumer segments for a given household feature. For example, in the Grocery category, households are grouped by age. After calculating the mean and standard deviation of sales within that category, Z-scores are used to standardize each age group’s average spending. This allows identification of which household age groups are the primary consumers for Grocery products.

In this project, groups with positive Z-scores are defined as “Major Groups”, representing households with above-average spending in a given category, while those with negative Z-scores are defined as “Minor Groups.” This classification is consistently applied throughout the subsequent time-series and correlation analyses.

Z-score charts comparing demographic groups within each department to identify over/under-performing customer segments as major and minor group, respectively

# Standardized sales plots
plot_std_sales(age) #Age

plot_std_sales(income) #Income

plot_std_sales(household_size) #Household size

plot_std_sales(kids_count) #Kids number in each household

As mentioned earlier, this project focuses on the Grocery and Drug GM product categories. From the Z-score plots, it can be observed that, from the perspective of household age, the primary consumers across the top eight product categories are households aged 35–44 and 45–54. Therefore, these two groups are defined as the Major Groups for the “Age” characteristic, while all other age ranges are classified as Minor Groups. Similarly, the following table summarizes the defined Major and Minor Groups for each household feature:

Age: Major – 35–44, 45–54; Minor – Others
Income: Major – 35–49K, 50–74K, 75–99K; Minor – Others
Household Size: Major – 1, 2; Minor – 3, 4, 5+
Kids Count: Major – 0; Minor – 1, 2, 3+

These group definitions are subsequently applied to analyze temporal spending trends and coupon usage behavior across different consumer segments.

FUNCTION 2: Monthly Sales Trend by Demographics: To track how different demographic groups’ spending changes over time

# INPUT:
#   - variable: demographic dimension (age, income, household_size, kids_count)
#   - groups: specific levels to analyze (e.g., c("35-44", "45-54"))
#   - title_prefix: descriptive text for chart title
#   - color_palette: optional custom colors for groups
# OUTPUT: Faceted line chart (one panel per department, one line per demographic group)
plot_monthly_trend <- function(variable, groups, title_prefix = "", 
                               color_palette = NULL, data = df_demo) {
  trend_data <- data %>%
    filter(.data[[variable]] %in% groups, department %in% dept2) %>%
    summarise(total_sales = sum(sales_value), 
              .by = c(department, month, .data[[variable]]))
  
  p <- ggplot(trend_data, aes(month, total_sales, 
                              color = .data[[variable]], 
                              group = .data[[variable]])) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2) +
    facet_wrap(~department, scales = "free_y") +
    labs(
      title = paste0(title_prefix, paste(groups, collapse = ", ")),
      x = "Month",
      y = "Total Sales ($)",
      color = variable
    ) +
    theme_minimal(base_size = 14) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "top"
    )
  
  if (!is.null(color_palette)) {
    p <- p + scale_color_manual(values = color_palette)
  }
  p
}

Building on the previously defined group classifications, this function analyzes the monthly sales fluctuations within each consumer segment.

Define Customer Segments by Z-Score for Trend Analysis

Question: When do different groups shop most?

# Define demographic groups
age_major <- c("35-44", "45-54") #over-performing customer segment by age
age_minor <- c("19-24", "55-64", "65+") #under-performing customer segment by age
income_major <- c("35-49K", "50-74K", "75-99K") # over-performing customer segment by income
income_minor <- c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+") #under-performing customer segment by income
size_major <- c("1", "2") #over-performing customer segment by household size
size_minor <- c("3", "4", "5") #under-performing customer segment by household size
kids_major <- c("0") #over-performing customer segment by kids number
kids_minor <- c("1", "2", "3+") #under-performing customer segment by kids number

# Monthly trends
plot_monthly_trend("age", age_major, "Age Groups(Major): ",
                   color_palette = c("pink", "#f4a261"))

plot_monthly_trend("age", age_minor, "Age Groups (Minor): ")

plot_monthly_trend("income", income_major, "Income (Major): ",
                   color_palette = c("pink", "#f4a261", "#e76f51"))

plot_monthly_trend("income", income_minor, "Income Groups (Minor): ")

plot_monthly_trend("household_size", size_major, "Household Size (Major): ",
                   color_palette = c("pink", "#f4a261"))

plot_monthly_trend("household_size", size_minor, "Household Size (Minor): ")

plot_monthly_trend("kids_count", kids_major, "Kids Count (Major):",
                   color_palette = c("pink"))

plot_monthly_trend("kids_count", kids_minor, "Kids Count (Minor): ")

For the Age characteristic, both Drug GM and Grocery show a clear year-end effect among the major consumer groups, particularly within the 45–54 age segment. Meanwhile, the minor groups also display slightly higher spending in the latter half of the year, likely influenced by seasonal holidays such as Halloween, Thanksgiving, and Christmas, with the 55–64 age group exhibiting this pattern most prominently.

For Income, the major groups also show a mild year-end effect, while the minor groups maintain relatively stable spending levels throughout the year.

Regarding Household Size, the year-end effect is much stronger among the major groups than the minor ones — in both Drug GM and Grocery, sales in the second half of the year significantly exceed those in the first half.

Finally, the Kids Count variable shows the most pronounced seasonal pattern, even stronger than that of household size.

Taken together, these results suggest that the higher spending observed in the latter half of the year may be influenced by various holiday seasons. I therefore hypothesize that this trend could be related to Regork’s promotional strategies, particularly its coupon policy.
To test this, I next examine whether coupon usage rate and sales value exhibit a time-series correlation.

4. Coupon Analysis Functions

# Main analysis function
analyze_group <- function(df_demo, transactions, group_var, levels_vec,
                          departments = dept2, min_mean_tx = 100, do_pp = TRUE) {
  
  # Filter households by demographic group
  hh_ids <- df_demo %>%
    filter(department %in% departments, 
           .data[[group_var]] %in% levels_vec) %>%
    distinct(household_id) %>%
    pull()
  
  # Prepare transaction details with coupon info
  tx_details <- transactions %>%
    filter(household_id %in% hh_ids) %>%
    inner_join(products, by = "product_id") %>%
    filter(department %in% departments) %>%
    mutate(
      week = isoweek(transaction_timestamp),
      coupon_used = as.integer((coupon_disc + coupon_match_disc) > 0),
      season = factor(paste0("Q", quarter(transaction_timestamp)), 
                     levels = c("Q1", "Q2", "Q3", "Q4"))
    )
  
  ## The definition of "Coupon Usage Rate": Coupon Usage Rate is the proportion of transactions in which at least one coupon or discount was applied, calculated as a ratio over a specified time period and product category.
  
  # Weekly aggregation
  weekly_tbl <- tx_details %>%
    summarise(
      total_sales = sum(sales_value),
      total_tx = n(),
      coupon_tx = sum(coupon_used),
      coupon_use_rate = mean(coupon_used),
      .by = c(department, week)
    ) %>%
    group_by(department) %>%
    filter(mean(total_tx) >= min_mean_tx) %>%
    ungroup()
  
  # Seasonal aggregation
  season_tbl <- tx_details %>%
    summarise(coupon_use_rate = mean(coupon_used),
              .by = c(department, season))
  
  # Regression analysis with PP unit-root test: to avoid spurious time-series regression
  reg_tbl <- weekly_tbl %>%
    nest_by(department) %>%
    mutate(
      fit = list(safely(lm)(total_sales ~ coupon_use_rate, data = data)$result),
      has_fit = !is.null(fit),
      coef = if_else(has_fit, coef(fit)[2], NA_real_),
      p_value = if_else(has_fit, glance(fit)$p.value, NA_real_),
      r2 = if_else(has_fit, glance(fit)$r.squared, NA_real_),
      r = if_else(has_fit && nrow(data) >= 3,
                  cor(data$coupon_use_rate, data$total_sales, 
                      use = "complete.obs"),
                  NA_real_),
      pp_sales_p = if (do_pp) {
        tryCatch(pp.test(data$total_sales)$p.value, error = \(e) NA_real_)
      } else NA_real_,
      pp_coupon_p = if (do_pp) {
        tryCatch(pp.test(data$coupon_use_rate)$p.value, error = \(e) NA_real_)
      } else NA_real_
    ) %>%
    ungroup() %>%
    select(department, pp_sales_p, pp_coupon_p, coef, p_value, r2, r)
  
  # Generate plots
  p1 <- ggplot(season_tbl, aes(season, coupon_use_rate * 100, fill = season)) +
    geom_col(width = 0.7) +
    geom_text(aes(label = percent(coupon_use_rate, accuracy = 0.1)),
              vjust = -0.2, size = 3.5) +
    facet_wrap(~department, ncol = 4) +
    scale_y_continuous("Coupon Usage Rate (%)") +
    scale_fill_brewer(palette = "Set2", guide = "none") +
    labs(title = "Coupon Usage Rate by Department and Season",
         x = "Season") +
    theme_minimal(base_size = 13)
  
  # Dual axis plot
  max_sales <- max(weekly_tbl$total_sales, na.rm = TRUE)
  max_rate <- max(weekly_tbl$coupon_use_rate, na.rm = TRUE)
  sf <- if_else(max_rate > 0, max_sales / max_rate, 1)
  
  p2 <- weekly_tbl %>%
    mutate(rate_scaled = coupon_use_rate * sf) %>%
    ggplot(aes(week)) +
    geom_line(aes(y = total_sales, color = "Total Sales"), linewidth = 0.8) +
    geom_line(aes(y = rate_scaled, color = "Coupon Use Rate"), linewidth = 0.9) +
    facet_wrap(~department, scales = "free_y") +
    scale_y_continuous(
      name = "Total Sales ($)",
      sec.axis = sec_axis(~. / sf, name = "Coupon Use Rate",
                          labels = percent_format(accuracy = 1))
    ) +
    scale_color_manual(
      NULL,
      values = c("Coupon Use Rate" = "#2b8cbe", "Total Sales" = "#f4a261")
    ) +
    labs(title = "Weekly Sales and Coupon Usage by Department",
         x = "Week Number") +
    theme_minimal(base_size = 13) +
    theme(legend.position = "top")
  
  # Scatter plot with correlation
  p3 <- ggplot(weekly_tbl, aes(coupon_use_rate, total_sales)) +
    geom_point(alpha = 0.6, size = 1.8, color = "#2b8cbe") +
    geom_smooth(method = "lm", se = FALSE, color = "#f4a261", linewidth = 0.8) +
    facet_wrap(~department, scales = "free_y") +
    scale_x_continuous("Coupon Use Rate", labels = percent) +
    scale_y_continuous("Total Sales ($)") +
    labs(
      title = "Relationship between Coupon Use Rate and Total Sales",
      subtitle = "Weekly data with linear trend and correlation coefficient"
    ) +
    geom_text(
      data = reg_tbl %>% 
        mutate(label = paste0("r = ", 
                             ifelse(is.na(r), "NA", number(r, accuracy = 0.01)))),
      aes(x = -Inf, y = Inf, label = label),
      hjust = -0.1, vjust = 1.1, size = 4, fontface = "bold",
      inherit.aes = FALSE
    ) +
    theme_minimal(base_size = 13)
  
  list(
    weekly_tbl = weekly_tbl,
    season_tbl = season_tbl,
    reg_table = reg_tbl,
    plot_season = p1,
    plot_weekly = p2,
    plot_scatter = p3
  )
}

# Visualization wrapper
peek_one <- function(group_var, level, title = NULL, show_table = TRUE) {
  res <- analyze_group(df_demo, transactions, group_var, level)
  
  nm <- title %||% paste(group_var, level, sep = " : ")
  
  print(res$plot_season + ggtitle(paste0(nm, " — Quarterly Trend")))
  print(res$plot_weekly + ggtitle(paste0(nm, " — Weekly Trend")))
  print(res$plot_scatter + ggtitle(paste0(nm, " — Scatter (Weekly Data)")))
  
  if (show_table && !is.null(res$reg_table)) {
    print(knitr::kable(res$reg_table, digits = 3, 
                       caption = paste("Regression Results:", nm)))
  }
  
  invisible(res)
}

The above main function of this project analyzes the relationship between coupon usage and sales for a specific customer segment (e.g., age 35-44) in selected departments.

The coupon usage rate is defined as “the percentage of transactions that used at least one coupon during a specific time period”, the mathematical formula would be: (Number of transactions with coupons) / (Total transactions). For weekly level, I calculate coupon_used = I((coupon_disc + coupon_match_disc) > 0) first, and then aggregate the coupon_use_rate: coupon_use_rate = Σ(coupon_used) / count(*) = mean(coupon_used) on a weekly basis.

The default value of “min_mean_tx” is 100, which means departments with insufficient data (<100) in each period are excluded for statistical robustness. The “do_pp” shows the Phillips-Perron stationarity test to ensure stationarity of the weekly coupon_use_rate and sales value series. Once the stationarity is confirmed, the times-series regression of coupon_use_rate and sales value can be examined (there will be no spurious regression issue).

To maintain clarity, one representative example is provided for each household characteristic. In all cases, both the weekly coupon usage rate and sales value exhibit stationarity. The detailed results for each example are described below.

5. Run Analysis (I only choose one for each major/minor group which shows a strong statistical evidence due to page limit)

Age Groups: Show how different age groups respond to coupons by visualizing the trend of coupon usage rate (quarterly) and correlation plots between sales value and coupon usage rate (weekly)

Question: Should we target coupons to specific age brackets?

# Major group (35-44)
peek_one("age", "35-44")

## 
## 
## Table: Regression Results: age : 35-44
## 
## |department | pp_sales_p| pp_coupon_p|     coef| p_value|    r2|     r|
## |:----------|----------:|-----------:|--------:|-------:|-----:|-----:|
## |DRUG GM    |       0.01|        0.01| 3811.285|   0.157| 0.040| 0.199|
## |GROCERY    |       0.01|        0.01| 6349.300|   0.588| 0.006| 0.077|

# Minor group (65+)
peek_one("age", "65+")

## 
## 
## Table: Regression Results: age : 65+
## 
## |department | pp_sales_p| pp_coupon_p|     coef| p_value|    r2|      r|
## |:----------|----------:|-----------:|--------:|-------:|-----:|------:|
## |DRUG GM    |       0.01|        0.01| -540.569|   0.706| 0.003| -0.054|
## |GROCERY    |       0.01|        0.01| 5520.622|   0.154| 0.040|  0.201|

Starting with the Age Major Group (35–44), the quarterly coupon usage rate also exhibits a year-end effect, consistent with the sales value pattern observed earlier — particularly evident in the Drug GM category. In the table, pp_sales and pp_coupon represent the p-values from the Phillips–Perron (PP) unit root tests for the sales and coupon usage rate weekly series, respectively. A p-value below the chosen significance level indicates that the series is stationary at that level. coef and r² denote the estimated regression coefficient and the model’s coefficient of determination (R-squared) from the time-series regression. r represents the correlation coefficient between the two series as shown in the scatter plot, where each point corresponds to one weekly observation.

The weekly trend plots generally align with the Phillips–Perron (PP) test results, indicating that the series is stationary. While the scatter plot shows a slight positive association between the coupon usage rate and sales for Drug GM products, the time-series regression reveals that the estimated coefficient is not statistically significant.

This insignificance could be due to autocorrelation within the series, heteroskedasticity, or simply the absence of a strong relationship between the two variables. A similar conclusion applies to the Minor Group (65+), where the statistical evidence remains insufficient to assert that a higher coupon usage rate leads to greater sales value. Therefore, without incorporating additional data or explanatory variables, there is not enough empirical support to justify adopting a more aggressive coupon policy for these age segments.

Income Groups: Show how different income groups respond to coupons by visualizing the trend of coupon usage rate (quarterly) and correlation plots between sales value and coupon usage rate (weekly)

Question: Should we target coupons to specific income brackets?

# Major group (75-99K)
peek_one("income", "75-99K")

## 
## 
## Table: Regression Results: income : 75-99K
## 
## |department | pp_sales_p| pp_coupon_p|     coef| p_value|    r2|     r|
## |:----------|----------:|-----------:|--------:|-------:|-----:|-----:|
## |DRUG GM    |       0.01|        0.01|    9.174|   0.994| 0.000| 0.001|
## |GROCERY    |       0.01|        0.01| 1389.697|   0.748| 0.002| 0.046|

# Minor group (125-149K)
peek_one("income", "125-149K")

## 
## 
## Table: Regression Results: income : 125-149K
## 
## |department | pp_sales_p| pp_coupon_p|     coef| p_value|    r2|     r|
## |:----------|----------:|-----------:|--------:|-------:|-----:|-----:|
## |DRUG GM    |       0.01|        0.01|  227.419|   0.744| 0.002| 0.046|
## |GROCERY    |       0.01|        0.01| 2013.026|   0.516| 0.008| 0.092|

For the Income characteristic, both the major and minor group examples exhibit an even weaker pattern compared to the age-based segmentation. Although the series are confirmed to be stationary, neither the scatter plots nor the time-series regression results show any meaningful relationship — the correlation coefficients and regression estimates are close to zero or statistically insignificant. This indicates that, when classifying consumers by income, coupon usage rate does not effectively differentiate spending behavior or predict higher sales growth. Therefore, it can be clearly concluded that income is not a suitable criterion for implementing targeted coupon strategies.

Household Size: Show how different size groups respond to coupons by visualizing the trend of coupon usage rate (quarterly) and correlation plots between sales value and coupon usage rate (weekly)

Question: Should we target coupons to specific household size brackets?

# Major group (2 people)
peek_one("household_size", "2")

## 
## 
## Table: Regression Results: household_size : 2
## 
## |department | pp_sales_p| pp_coupon_p|     coef| p_value|   r2|     r|
## |:----------|----------:|-----------:|--------:|-------:|----:|-----:|
## |DRUG GM    |       0.01|        0.01|  5288.85|   0.313| 0.02| 0.143|
## |GROCERY    |       0.01|        0.01| 26586.52|   0.058| 0.07| 0.265|

# Minor group (4 people)
peek_one("household_size", "4")

## 
## 
## Table: Regression Results: household_size : 4
## 
## |department | pp_sales_p| pp_coupon_p|      coef| p_value|    r2|     r|
## |:----------|----------:|-----------:|---------:|-------:|-----:|-----:|
## |DRUG GM    |       0.01|        0.01|   733.076|   0.405| 0.014| 0.118|
## |GROCERY    |       0.01|        0.01| 13331.507|   0.005| 0.146| 0.382|

Next, I analyze the Household Size characteristic. In this case, both the Major and Minor groups exhibit stationary series. For the Major group — households with a size of 2 — the scatter plots show a slight positive relationship between coupon usage rate and sales for both Drug GM and Grocery. The time-series regression results indicate that this relationship is not significant for Drug GM, but for Grocery it is close to significance at the 5% level and becomes significant at the 10% level. If further evidence (such as additional data or larger sample size) supports this pattern, the CEO may consider strengthening coupon promotions targeted at two-person households, particularly within the Grocery category, where this group represents a major consumer segment.

On the other hand, the Minor group — households with a size of 4 — shows a similar trend, but the correlation is even stronger.
For Grocery products, the relationship remains significant at the 1% level, and the scatter plot reveals a moderate correlation coefficient of 0.38 between coupon usage rate and total sales. Given that this type of household is not a primary Grocery consumer group, if this causal relationship holds, expanding coupon promotions toward such consumers could effectively stimulate spending and increase Regork’s overall revenue.

Kids Count: Show how different kids_count groups respond to coupons by visualizing the trend of coupon usage rate (quarterly) and correlation plots between sales value and coupon usage rate (weekly)

Question: Should we target coupons to specific kids_count brackets?

# Major group (0 kids)
peek_one("kids_count", "0")

## 
## 
## Table: Regression Results: kids_count : 0
## 
## |department | pp_sales_p| pp_coupon_p|     coef| p_value|    r2|     r|
## |:----------|----------:|-----------:|--------:|-------:|-----:|-----:|
## |DRUG GM    |       0.01|        0.01|  3890.34|   0.677| 0.003| 0.059|
## |GROCERY    |       0.01|        0.01| 19678.35|   0.408| 0.014| 0.117|

# Minor group (2 kids)
peek_one("kids_count", "2")

## 
## 
## Table: Regression Results: kids_count : 2
## 
## |department | pp_sales_p| pp_coupon_p|      coef| p_value|    r2|     r|
## |:----------|----------:|-----------:|---------:|-------:|-----:|-----:|
## |DRUG GM    |       0.01|        0.01|   690.376|   0.527| 0.008| 0.090|
## |GROCERY    |       0.01|        0.01| 18056.638|   0.012| 0.120| 0.346|

Finally, regarding the Kids Count characteristic, the year-end effect once again appears in both the Major and Minor groups.
The statistical results show that all series are stationary. For the Major group — households without children — the relationship between coupon usage rate and sales value is not statistically significant. However, for the Minor group — households with two children — the results are striking: the correlation between coupon usage rate and sales value in the Grocery category reaches a moderate level of significance, and the time-series regression confirms this relationship at the 5% significance level.

Based on the findings from both Kids Count and Household Size, I propose the following hypothesis: Households with more dependents tend to be more sensitive to coupon incentives. Therefore, strengthening coupon promotions targeted toward larger families could be an effective strategy to stimulate their spending — a potential direction for the CEO to consider when refining Regork’s coupon policy.

Summary

How I Answered the Business Problem

This project focuses not only on analyzing the temporal variations in sales value but also on examining whether there exists a statistically significant relationship between coupon usage rate and sales value across different customer groups and product categories. To simplify the analysis and maintain clarity, I first identify the product categories with the highest sales contributions. The results show that Grocery and Drug GM together account for more than half of Regork’s total sales revenue. Therefore, the analysis concentrates on these two key product categories, which have the greatest impact on overall sales performance.

I then visualize the monthly sales trends for these categories and observe a clear year-end effect. This phenomenon likely stems from multiple holidays concentrated in Q4, which may have led Regork to intensify its coupon promotion efforts and, in turn, encouraged higher consumer spending during this period.

Next, households are classified according to their demographic characteristics — including age, income, household size, and kids count. Within each feature, groups are defined using Z-score analysis to distinguish between households with above-average spending (Major Groups) and those with below-average spending (Minor Groups).

Additionally, both Grocery and Drug GM categories are analyzed in terms of their coupon usage rate (defined as the number of coupon transactions divided by total transactions) and weekly sales value. The Phillips–Perron (PP) test is applied to ensure the stationarity of the time-series data, thereby avoiding the issue of spurious regression in subsequent analyses. Under this framework, only groups with a minimum of 100 weekly transaction samples are retained to ensure the stability and reliability of the results.

Ultimately, the analysis produces weekly averages of coupon usage rate and sales value for each group, along with corresponding scatter plots and time-series regression outputs. These results allow for the preliminary identification of customer segments and product categories where coupon usage rate and sales value exhibit statistically significant relationships.

Insights and my suggestion

The project first reveals that many customer groups across various product categories exhibit a clear year-end tendency to increase their spending. Furthermore, the statistical results indicate that income is not an effective criterion for differentiating consumers when evaluating the effectiveness of the coupon policy in boosting spending on Grocery and Drug GM products. This may be due to the fact that these two categories consist mainly of essential goods, for which consumers tend to be less sensitive to price discounts. Therefore, the CEO should avoid using income level as the primary basis for designing or targeting coupon strategies.

In contrast, variables related to household dependents — specifically Kids Count and Household Size — show significant relationships with sales in the Grocery category. This suggests that households with more dependents are more responsive to coupon incentives and more likely to increase their spending when coupons are offered. From a managerial perspective, this insight provides a valuable criterion for refining Regork’s coupon policy: given similar profit margins, intensifying coupon promotions targeted toward larger families could potentially lead to additional sales growth.

Limitations of This Analysis and improvement

First, this analysis is constrained by the limited size of the dataset, which covers only one year of observations. As a result, it is not possible to determine whether similar seasonal or behavioral patterns persist across different years. The findings may also vary under different phases of the economic cycle. Moreover, since there are only 52 weekly data points, the sample size for time-series regressions is relatively small. This may reduce the robustness of certain statistical properties, such as stationarity and the significance of regression coefficients. In addition, each weekly observation represents the average of households that fit a given demographic feature, which further reduces variability. For robustness, some product categories (e.g., Fuel) had to be excluded from the analysis, and others could not be studied at all due to the absence of coupon redemption records.

Second, the overall coupon redemption rate is quite low—only in the single-digit percentage range. Incorporating other forms of promotional incentives, such as credit card cashback programs or loyalty point rewards, could make the analysis more comprehensive and statistically meaningful. A combined examination of multiple promotional strategies may provide a more accurate understanding of how Regork’s customers respond to discounts and rewards.

Despite these limitations, the analysis provides preliminary evidence for refining coupon targeting strategies and guiding future promotional planning.

Final Project: Does coupon usage rate play a crucial role on sales value by descriminating customers?

Group18: Yumin (Kenneth) Wang, M#17344635

2025-10-07