1 Introduction

In today’s retail industry, direct mail marketing remains an important channel for driving customer engagement and sales. However, not all customers respond to promotional offers equally, and identifying who is most likely to respond can help companies allocate resources more efficiently.

This project uses the Clothing Store Dataset—a real dataset from a New England retail chain—to develop a predictive model for customer response behavior. The primary goal is to determine which customer characteristics best predict whether an individual will respond (“yes”) to a direct mail promotion.

To address this, we conduct exploratory data analysis (EDA) to examine relationships among purchasing, engagement, and demographic variables, then build a multiple logistic regression model to quantify how these factors influence response likelihood. By identifying the most meaningful predictors, this analysis provides actionable insight into which types of customers are most receptive to future marketing efforts.

2 Description of Dataset

These are the 20 explanatory variables in this dataset:

  1. HHKEY (categorical) → Unique encrypted customer ID
  2. ZIP_CODE (categorical) → Customer’s ZIP code
  3. FRE (numerical) → Number of purchase visits
  4. MON (numerical) → Total net sales
  5. AVRG (numerical) → Average spend per visit
  6. AMSPEND, PSSPEND, CSSPEND, AXSPEND (categorical) → Spend across four different franchise brands
  7. OMONSPEND, TMONSPEND, SMOSPEND (categorical) → Spend over past 1, 3, and 6 months
  8. PREVPD (categorical) → Spend in the same period last year
  9. GMP (numerical) → Gross margin percentage
  10. PROMOS (numerical) → Number of marketing promotions on file
  11. DAYS (numerical) → Number of days customer has been on file
  12. FREDAYS, LTFREDAY (categorical) → Time between purchases (recent & lifetime average)
  13. CLASSES (numerical) → Number of different product classes purchased
  14. STYLES (numerical) → Number of individual items purchased
  15. STORES (numerical) → Number of stores shopped at
  16. MARKDOWN, COUPONS, MAILED, RESPONDE, RESPONSERATE (categorical) → Promotion and discount engagement
  17. HI (numerical) → Product uniformity (lower = more diverse purchases)
  18. CLUSTYPE (categorical) → Lifestyle cluster type (encrypted)
  19. PERCRET (numerical) → Percent of returns
  20. CC_CARD, VALPHON, WEB (categorical) → Flags for credit card, valid phone number, and web shopper status

Our response variable is:

RESP → whether a customer responded to a promotion (yes/no)

We have 2847 observations, or customers, in this dataset. We will now remove all observations with missing values.

# Packages used throughout the analysis: dplyr, tidyr, car, pander, knitr

# Drop all observations with any missing values
CSDS_clean <- na.omit(CSDS)

# Code the response as a factor with "no" as the reference level
response <- factor(CSDS_clean$RESP, levels = c("no","yes"))

After removing missing values, 2619 customers remain for modeling.
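
As a quick sanity check (a minimal sketch, assuming CSDS and CSDS_clean are defined as above), we can confirm the observation counts before and after removal:

# Rows before and after dropping incomplete observations
nrow(CSDS)        # 2847 customers in the raw data
nrow(CSDS_clean)  # 2619 customers after removing missing values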

3 Exploratory Data Analysis

To begin, we will make a correlation matrix with all of the numerical variables to see if any variables are highly correlated with each other.

# 1) Manually list the numeric variables you want
vars <- c("FRE","MON", "AVRG","GMP","PROMOS","DAYS","CLASSES","STYLES","STORES","HI","PERCRET")

# 2) Keep only the ones that actually exist in your data
vars <- intersect(vars, names(CSDS_clean))

# 3) Correlation matrix (errors if any are non-numeric)
cor_mat <- cor(CSDS_clean[ , vars, drop = FALSE],
               use = "pairwise.complete.obs",
               method = "pearson")
cor_mat
               FRE          MON         AVRG          GMP      PROMOS
FRE      1.0000000  0.674209463 -0.152217058 -0.114920260  0.46902853
MON      0.6742095  1.000000000  0.271555910  0.038859950  0.38456147
AVRG    -0.1522171  0.271555910  1.000000000  0.287400964  0.04592048
GMP     -0.1149203  0.038859950  0.287400964  1.000000000  0.05949538
PROMOS   0.4690285  0.384561473  0.045920482  0.059495381  1.00000000
DAYS     0.3910812  0.261051031 -0.122708281 -0.021326578  0.60409827
CLASSES  0.7948576  0.675174039  0.087937020 -0.105308252  0.52413518
STYLES   0.7931651  0.935759837  0.135948828 -0.076199344  0.40263287
STORES   0.5909190  0.430123659 -0.088488953 -0.103484318  0.37306720
HI      -0.4323462 -0.309591043 -0.002372335  0.150835861 -0.33013811
PERCRET  0.1559934  0.009115226 -0.180623484 -0.007074185  0.04475647
               DAYS     CLASSES      STYLES      STORES           HI
FRE      0.39108118  0.79485761  0.79316506  0.59091903 -0.432346183
MON      0.26105103  0.67517404  0.93575984  0.43012366 -0.309591043
AVRG    -0.12270828  0.08793702  0.13594883 -0.08848895 -0.002372335
GMP     -0.02132658 -0.10530825 -0.07619934 -0.10348432  0.150835861
PROMOS   0.60409827  0.52413518  0.40263287  0.37306720 -0.330138107
DAYS     1.00000000  0.40606251  0.29344949  0.32782601 -0.277643174
CLASSES  0.40606251  1.00000000  0.75321727  0.60280152 -0.619839913
STYLES   0.29344949  0.75321727  1.00000000  0.51226171 -0.360716264
STORES   0.32782601  0.60280152  0.51226171  1.00000000 -0.414095001
HI      -0.27764317 -0.61983991 -0.36071626 -0.41409500  1.000000000
PERCRET  0.03418407  0.10271670  0.06785920  0.09372079 -0.252625218
             PERCRET
FRE      0.155993396
MON      0.009115226
AVRG    -0.180623484
GMP     -0.007074185
PROMOS   0.044756473
DAYS     0.034184072
CLASSES  0.102716699
STYLES   0.067859197
STORES   0.093720788
HI      -0.252625218
PERCRET  1.000000000

Pairwise correlations reveal strong relationships among purchasing activity variables. For example, FRE (visits) correlates highly with STYLES (items purchased) and CLASSES (product classes), and MON (net sales) correlates very strongly with STYLES. These patterns are intuitive—customers who shop more often tend to buy more items across more classes, leading to higher sales. We retain these variables initially and address redundancy later during model refinement.
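
To make this redundancy explicit, we can list the predictor pairs whose absolute correlation exceeds a cutoff (a small helper sketch; the 0.7 threshold is our own illustrative choice):

# Flag pairs of predictors with |r| > 0.7
high_cor <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high_cor[, 1]],
           var2 = colnames(cor_mat)[high_cor[, 2]],
           r    = round(cor_mat[high_cor], 3))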

We will now examine the frequency distribution (histogram) of each numerical variable to identify any patterns and skewness.

par(mfrow=c(1,2))

hist(CSDS_clean$FRE, xlab="# of Purchase Visits", main = "")

hist(CSDS_clean$MON, xlab="Total Net Sales", main = "")

hist(CSDS_clean$AVRG, xlab="Average Spend Per Visit", main = "")

hist(CSDS_clean$GMP, xlab="Gross Margin %", main = "")

hist(CSDS_clean$PROMOS, xlab="# of Marketing Promos", main = "")

hist(CSDS_clean$DAYS, xlab="Days on File", main = "")

hist(CSDS_clean$CLASSES, xlab="# of Different Products Purchased", main = "")

hist(CSDS_clean$STYLES, xlab="# of Individual Items Purchased", main = "")

hist(CSDS_clean$STORES, xlab="# of Stores Shopped At", main = "")

hist(CSDS_clean$HI, xlab="Product Uniformity", main = "")

hist(CSDS_clean$PERCRET, xlab="% of Returns", main = "")

None of our variables are distributed normally. Number of purchase visits, Total Net Sales, Average Spend Per Visit, Number of Marketing Promos, Number of Different Products Purchased, Number of Individual Items Purchased, Number of Stores Shopped At, Product Uniformity, and Percent of Returns are all skewed to the right. Gross Margin Percentage and Days on File are skewed left. We will apply appropriate transformations prior to modeling.
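
To quantify the skewness rather than judge it only from the histograms, a sample skewness coefficient can be computed for each variable (a sketch using a hand-rolled formula so no additional package is needed; positive values indicate right skew, negative values left skew):

# Sample skewness for each numerical variable
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
sapply(CSDS_clean[, vars], skew)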

4 Multiple Logistic Regression Model

Before building our multiple logistic regression model, we need to address the high levels of skewness observed in many of our numerical predictors. Variables such as Total Net Sales (MON), Number of Purchase Visits (FRE), Number of Individual Items Purchased (STYLES), and Percent of Returns (PERCRET) were found to be strongly right-skewed, while Gross Margin Percentage (GMP) showed moderate left skew. To ensure more stable coefficient estimates and improve linearity with the log-odds of the response variable, we apply several transformations. Right-skewed, strictly positive variables are transformed using the log1p() function, which safely handles zeros and reduces the influence of large outliers. The percentage variable PERCRET is first converted to a proportion and then transformed using the logit (qlogis()) function, allowing it to vary smoothly on the real number line. We also drop Average Spend per Visit (AVRG) due to redundancy with MON and FRE, which helps minimize multicollinearity. The resulting dataset provides a more balanced distribution of predictors and creates a stronger foundation for modeling customer response probabilities.

dat <- CSDS_clean %>%
  # ensure RESP is a proper factor first
  dplyr::mutate(
    RESP = case_when(
      RESP %in% c("Yes","yes",1) ~ "yes",
      RESP %in% c("No","no",0)   ~ "no",
      TRUE ~ NA_character_
    ),
    RESP = factor(RESP, levels = c("no","yes"))
  ) %>%
  # transforms
  dplyr::mutate(
    PERCRET_p     = pmin(pmax(PERCRET/100, 1e-6), 1 - 1e-6),
    PERCRET_logit = qlogis(PERCRET_p),
    dplyr::across(
      tidyselect::any_of(c("MON","FRE","STYLES","CLASSES","STORES","HI")),
      ~ log1p(.),
      .names = "{.col}_log"
    )
  ) %>%
  # select ONLY with fully-qualified dplyr
  dplyr::select(dplyr::all_of(
    c("RESP","MON_log","FRE_log","STYLES_log","CLASSES_log","STORES_log",
      "HI_log","PERCRET_logit","PROMOS","DAYS","GMP")
  )) %>%
  tidyr::drop_na()

Here we fit the multiple logistic regression model predicting RESP (yes/no) using the transformed predictors constructed above. The specification includes log-transformed behavioral measures (MON_log, FRE_log, STYLES_log, CLASSES_log, STORES_log, HI_log), the logit-transformed return rate (PERCRET_logit), and untransformed controls (PROMOS, DAYS, GMP). This model estimates how each feature is associated with the log-odds of responding while mitigating skew and multicollinearity. We report coefficient summaries and odds ratios to aid interpretation.

# Ensure RESP exists and is a factor with correct levels
dat$RESP <- factor(dat$RESP, levels = c("no", "yes"))

# Check that it worked
table(dat$RESP)

  no  yes 
2231  388 
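
About 14.8% of the customers in the modeling data responded (388 of 2619), which is worth keeping in mind when interpreting the strongly negative intercepts below. The proportion can be checked with one line:

# Share of responders among the 2619 modeling observations
prop.table(table(dat$RESP))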

4.1 Larger Model

We first fit a larger model, including MON_log, FRE_log, STYLES_log, CLASSES_log, STORES_log, HI_log, PERCRET_logit, PROMOS, DAYS, and GMP.

fit <- glm(RESP ~ MON_log + FRE_log + STYLES_log + CLASSES_log + STORES_log +
             HI_log + PERCRET_logit + PROMOS + DAYS + GMP,
           data = dat, family = binomial)

summary(fit)

Call:
glm(formula = RESP ~ MON_log + FRE_log + STYLES_log + CLASSES_log + 
    STORES_log + HI_log + PERCRET_logit + PROMOS + DAYS + GMP, 
    family = binomial, data = dat)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.6743606  0.8398896  -4.375 1.22e-05 ***
MON_log        0.1250662  0.2160837   0.579   0.5627    
FRE_log        1.2066864  0.2003772   6.022 1.72e-09 ***
STYLES_log     0.4930412  0.2729314   1.806   0.0708 .  
CLASSES_log   -0.5104403  0.3201110  -1.595   0.1108    
STORES_log     0.1785903  0.1983969   0.900   0.3680    
HI_log        -0.2125468  0.1905018  -1.116   0.2645    
PERCRET_logit -0.0176444  0.0237950  -0.742   0.4584    
PROMOS        -0.0250798  0.0124755  -2.010   0.0444 *  
DAYS          -0.0004144  0.0004243  -0.977   0.3287    
GMP           -0.6865513  0.4597422  -1.493   0.1353    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2197.3  on 2618  degrees of freedom
Residual deviance: 1791.3  on 2608  degrees of freedom
AIC: 1813.3

Number of Fisher Scoring iterations: 5

The logistic regression model predicts the likelihood that a customer responds (RESP = yes) to a marketing promotion based on purchasing and engagement variables. The negative intercept (–3.67, p < .001) suggests a low baseline probability of response.

The number of purchase visits (FRE_log) is the strongest and most significant predictor (β = 1.21, p < .001), meaning customers who shop more frequently are much more likely to respond. A one-unit increase in log-transformed purchase visits multiplies the odds of responding by about 3.35 times.

Number of items purchased (STYLES_log) shows a weak positive effect (p = .07), while number of promotions (PROMOS) has a small negative effect (β = –0.025, p = .04), suggesting that overexposure to promotions slightly reduces responsiveness.

None of the other variables were statistically significant, indicating that shopping frequency is the most reliable predictor of promotional response.
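
The coefficients above can be translated into odds ratios directly (a sketch using Wald-type intervals; confint.default() gives the normal-approximation intervals):

# Odds ratios and 95% Wald confidence intervals for the larger model
round(exp(cbind(OR = coef(fit), confint.default(fit))), 3)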

Before reducing the model, let’s check for multicollinearity among the predictor variables by computing variance inflation factors (VIFs) with car::vif().

vif(fit)
      MON_log       FRE_log    STYLES_log   CLASSES_log    STORES_log 
    12.263902      5.681323     16.487958     10.226876      1.557732 
       HI_log PERCRET_logit        PROMOS          DAYS           GMP 
     3.822498      2.126597      2.151240      1.752150      1.656215 

Variance inflation factors (VIF) indicate substantial multicollinearity among purchase-activity variables: MON_log, STYLES_log, and CLASSES_log are inflated (VIF > 10), and FRE_log is moderately inflated. Because MON_log and CLASSES_log are also not statistically significant, we remove them. We keep STYLES_log temporarily to reassess its signal after reducing multicollinearity.

4.2 Reduced Model

Here is our new, reduced fitted model:

# Fit the reduced model
fit2 <- glm(RESP ~ FRE_log + PROMOS + STYLES_log,
            data = dat, family = binomial)

# Inspect results
summary(fit2)

Call:
glm(formula = RESP ~ FRE_log + PROMOS + STYLES_log, family = binomial, 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.51609    0.20694 -21.823  < 2e-16 ***
FRE_log      1.18762    0.17543   6.770 1.29e-11 ***
PROMOS      -0.03314    0.01078  -3.074  0.00211 ** 
STYLES_log   0.43305    0.14092   3.073  0.00212 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2197.3  on 2618  degrees of freedom
Residual deviance: 1797.9  on 2615  degrees of freedom
AIC: 1805.9

Number of Fisher Scoring iterations: 5

All predictors are statistically significant at the 0.01 level, indicating that shopping frequency, purchase variety, and promotion exposure meaningfully influence response likelihood.

FRE_log (β = 1.19, p < .001) → Customers who visit the store more frequently are much more likely to respond to a promotion. Each one-unit increase in log-transformed purchase frequency increases the odds of response by about 3.3× (e¹·¹⁹).

PROMOS (β = –0.033, p = .002) → Each additional promotion on file slightly reduces the odds of responding, suggesting that over-exposure to promotions may lead to lower engagement.

STYLES_log (β = 0.43, p = .002) → Customers who buy a greater number of individual items are also more likely to respond. A one-unit increase in log-transformed styles purchased increases the odds of response by about 1.5× (e⁰·⁴³).
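
To see what the reduced model implies in practice, we can compare predicted response probabilities for two hypothetical customer profiles (the profile values below are illustrative choices of ours, passed through the same log1p transform used earlier):

# Hypothetical profiles: an infrequent shopper vs. a frequent, high-variety shopper
new_customers <- data.frame(
  FRE_log    = log1p(c(2, 15)),   # 2 vs. 15 purchase visits
  STYLES_log = log1p(c(3, 40)),   # 3 vs. 40 items purchased
  PROMOS     = c(20, 20)          # same promotion exposure for both
)
predict(fit2, newdata = new_customers, type = "response")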

Let’s take a look at the VIFs of our new fitted model to see whether any multicollinearity remains.

vif(fit2)
   FRE_log     PROMOS STYLES_log 
  4.384577   1.604865   4.441630 

All three VIF values are below 5, meaning multicollinearity is no longer a serious issue.

4.3 Performance Comparison

We will now use deviance and AIC to assess the performance of our two models. Deviance measures the discrepancy between the fitted model and the observed responses, while AIC balances goodness-of-fit against model complexity; the model with the lower AIC strikes the better balance.

## Other global goodness-of-fit measures
global.measure <- function(s.logit) {
  dev.resid   <- s.logit$deviance        # residual deviance
  dev.0.resid <- s.logit$null.deviance   # null deviance
  aic         <- s.logit$aic             # Akaike information criterion
  cbind(Deviance.residual = dev.resid,
        Null.Deviance.Residual = dev.0.resid,
        AIC = aic)
}

goodness <- rbind(fit  = global.measure(fit),
                  fit2 = global.measure(fit2))
row.names(goodness) <- c("larger.model", "reduced.model")
pander(goodness, caption = "Comparison of global goodness-of-fit statistics")
Comparison of global goodness-of-fit statistics

                Deviance.residual   Null.Deviance.Residual    AIC
larger.model                 1791                     2197   1813
reduced.model                1798                     2197   1806

We compare model fit using AIC and deviance. The reduced model achieves a lower AIC (≈ 1806) than the larger model (≈ 1813), indicating a better balance of fit and parsimony. While the reduced model’s residual deviance is slightly higher (worse raw fit), AIC penalizes unnecessary complexity—thus, overall, the reduced model is preferred.
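
The same AIC comparison can be pulled directly from the fitted objects:

# AIC for the larger and reduced models
AIC(fit, fit2)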

4.4 Likelihood Ratio Chi-Squared Test

We test whether the larger model improves fit beyond the reduced model. The deviance difference is:

ΔG² = Deviance(reduced) − Deviance(larger) = 1798 − 1791 = 7

with df = 10 − 3 = 7 (difference in parameters). This yields a non-significant LRT (p ≈ 0.42), so the larger model does not provide a statistically meaningful improvement. We therefore retain the reduced model.
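
The test can be reproduced directly from the two fitted models (a sketch; anova() with test = "Chisq" gives the equivalent built-in comparison):

# Likelihood ratio (deviance) test: reduced vs. larger model
dev_diff <- deviance(fit2) - deviance(fit)          # difference in residual deviance
df_diff  <- df.residual(fit2) - df.residual(fit)    # 2615 - 2608 = 7
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)  # non-significant (p well above 0.05)

# Equivalent built-in comparison
anova(fit2, fit, test = "Chisq")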

5 Final Model

Above, we explored two potential models for predicting whether a customer responds to a mail marketing promotion. Our first model, the larger model, included all of the transformed numerical predictors. Our second model, the reduced model, kept only the statistically significant, low-VIF predictors: number of purchase visits (FRE_log), number of marketing promotions on file (PROMOS), and number of individual items purchased (STYLES_log). Here are the summary statistics for our final model, this time including odds ratios to make the effect sizes easier to interpret.

model.coef.stats <- summary(fit2)$coef              # Estimate, SE, z, p
odds.ratio       <- exp(coef(fit2))                 # exponentiate coefficients

out.stats <- cbind(model.coef.stats, `Odds Ratio` = odds.ratio)
knitr::kable(out.stats, caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios

                Estimate   Std. Error      z value    Pr(>|z|)   Odds Ratio
(Intercept)   -4.5160854    0.2069388   -21.823293   0.0000000    0.0109317
FRE_log        1.1876206    0.1754269     6.769888   0.0000000    3.2792693
PROMOS        -0.0331400    0.0107794    -3.074371   0.0021095    0.9674031
STYLES_log     0.4330499    0.1409193     3.073034   0.0021189    1.5419531

(Intercept) = –4.52, OR = 0.01, p < .001 → The baseline odds of responding are very low when all predictors are at zero (i.e., for an infrequent shopper with no promotions and few purchases).

FRE_log = 1.19, OR = 3.28, p < .001 → Each one-unit increase in log-transformed purchase frequency multiplies the odds of responding by about 3.3×, making it the strongest predictor. Frequent shoppers are much more likely to respond.

PROMOS = –0.033, OR = 0.97, p = .002 → Each additional promotion slightly reduces the odds of responding by about 3%, suggesting that too many promotions may lead to fatigue or diminished engagement.

STYLES_log = 0.43, OR = 1.54, p = .002 → Each one-unit increase in log-transformed number of items purchased increases the odds of responding by about 54%, indicating that customers who buy a wider variety of items tend to be more responsive.
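
Finally, here is a brief sketch of how the final model might be used for targeting: score each customer with a predicted response probability and compare the observed response rate in the top-scoring group with the overall rate (the 90th-percentile cutoff is an illustrative choice of ours):

# Score customers by predicted response probability
dat$score <- predict(fit2, type = "response")

# Observed response rate in the top 10% of scores vs. overall
cutoff <- quantile(dat$score, 0.90)
mean(dat$RESP[dat$score >= cutoff] == "yes")
mean(dat$RESP == "yes")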

6 Conclusion

We set out to identify which behaviors best predict response to a direct mail promotion. After cleaning, transforming, and addressing multicollinearity, a parsimonious logistic model with FRE_log, STYLES_log, and PROMOS outperformed the larger specification on AIC and was not significantly worse by an LRT. Results show that shopping frequency and breadth of items purchased are strong, positive drivers of response, whereas promotion saturation slightly reduces engagement. These findings suggest a practical strategy: prioritize outreach to high-frequency, high-variety shoppers while moderating promotion volume to avoid fatigue, thereby improving targeting efficiency and campaign lift.
