1 Introduction
This project builds a data-driven framework to identify which
customers are most likely to respond to a direct-mail promotion for a
regional clothing retailer. This project uses the Clothing Store
Dataset—a real dataset from a New England retail chain. I first explore
purchasing, engagement, and account-history patterns to understand data
quality, distributional shape, and relationships among key variables.
Based on those findings, I prepare the data by recoding the response,
addressing skewed predictors with appropriate transformations, and
checking for redundancy among highly correlated shopping-activity
measures.
With the data curated, I develop a series of logistic regression
models to link customer characteristics to response propensity. I
compare a comprehensive “larger” specification to a more parsimonious
“reduced” alternative, using variance inflation factors to monitor
multicollinearity and cross-validation to select among candidates on
predictive grounds. The analysis plan includes a train/test split for
honest out-of-sample evaluation, threshold-based classification metrics
(e.g., accuracy/PE), and clear interpretation of effect sizes through
odds ratios. The goal is not just to fit a model, but to produce a
transparent, defensible approach that a marketing team could use to
prioritize outreach and improve campaign efficiency.
Research Question: Can customer purchasing and
engagement behaviors be used to predict whether a customer will respond
to a direct-mail marketing promotion?
2 Description of Dataset
These are 20 explanatory variables in this data set: 1. HHKEY
(categorical) → Unique encrypted customer ID 2. ZIP_CODE (categorical) →
Customer’s ZIP code 3. FRE (numerical) → Number of purchase visits 4.
MON (numerical) → Total net sales 5. AVRG (numerical) → Average spend
per visit 6. AMSPEND, PSSPEND, CSSPEND, AXSPEND (categorical) → Spend
across four different franchise brands 7. OMONSPEND, TMONSPEND, SMOSPEND
(categorical) → Spend over past 1, 3, and 6 months 8. PREVPD
(categorical) → Spend in the same period last year 9. GMP (numerical) →
Gross margin percentage 10. PROMOS (numerical) → Number of marketing
promotions on file 11. DAYS (numerical) → Number of days customer has
been on file 12. FREDAYS, LTFREDAY (categorical) → Time between
purchases (recent & lifetime average) 13. CLASSES (numerical) →
Number of different product classes purchased 14. STYLES (numerical) →
Number of individual items purchased 15. STORES (numerical) → Number of
stores shopped at 16. MARKDOWN, COUPONS, MAILED, RESPONDE, RESPONSERATE
(categorical) → Promotion and discount engagement 17. HI (numerical) →
Product uniformity (lower = more diverse purchases) 18. CLUSTYPE
(categorical) → Lifestyle cluster type (encrypted) 19. PERCRET
(numerical) → Percent of returns 20. CC_CARD, VALPHON, WEB (categorical)
→ Flags for credit card, valid phone number, and web shopper status
Our response variable is:
RESP → whether a customer responded to a promotion (yes/no)
The dataset originates from the customer relationship management
system of a regional New England clothing retailer. It includes
transactional, demographic, and behavioral information used for targeted
marketing analysis. Each record represents an individual customer and
their corresponding promotional response. We have 2847 observations, or
customers, in this dataset. We will now remove all observations with
missing values.
CSDS_clean <- na.omit(CSDS)
response <- factor(CSDS_clean$RESP, levels = c("no","yes"))
After removing missing values, 2619 customers remain for
modeling.
3 Exploratory Data Analysis
We will make a correlation matrix with all of the numerical variables
to see if any variables are highly correlated with each other. We will
be dropping the categorical variables because they are all either
redundant to what we have with the numerical variables, or are
irrelevant to the question.
vars <- c("FRE","MON", "AVRG","GMP","PROMOS","DAYS","CLASSES","STYLES","STORES","HI","PERCRET")
vars <- intersect(vars, names(CSDS_clean))
cor_mat <- cor(CSDS_clean[ , vars, drop = FALSE],
use = "pairwise.complete.obs",
method = "pearson")
cor_mat
FRE MON AVRG GMP PROMOS
FRE 1.0000000 0.674209463 -0.152217058 -0.114920260 0.46902853
MON 0.6742095 1.000000000 0.271555910 0.038859950 0.38456147
AVRG -0.1522171 0.271555910 1.000000000 0.287400964 0.04592048
GMP -0.1149203 0.038859950 0.287400964 1.000000000 0.05949538
PROMOS 0.4690285 0.384561473 0.045920482 0.059495381 1.00000000
DAYS 0.3910812 0.261051031 -0.122708281 -0.021326578 0.60409827
CLASSES 0.7948576 0.675174039 0.087937020 -0.105308252 0.52413518
STYLES 0.7931651 0.935759837 0.135948828 -0.076199344 0.40263287
STORES 0.5909190 0.430123659 -0.088488953 -0.103484318 0.37306720
HI -0.4323462 -0.309591043 -0.002372335 0.150835861 -0.33013811
PERCRET 0.1559934 0.009115226 -0.180623484 -0.007074185 0.04475647
DAYS CLASSES STYLES STORES HI
FRE 0.39108118 0.79485761 0.79316506 0.59091903 -0.432346183
MON 0.26105103 0.67517404 0.93575984 0.43012366 -0.309591043
AVRG -0.12270828 0.08793702 0.13594883 -0.08848895 -0.002372335
GMP -0.02132658 -0.10530825 -0.07619934 -0.10348432 0.150835861
PROMOS 0.60409827 0.52413518 0.40263287 0.37306720 -0.330138107
DAYS 1.00000000 0.40606251 0.29344949 0.32782601 -0.277643174
CLASSES 0.40606251 1.00000000 0.75321727 0.60280152 -0.619839913
STYLES 0.29344949 0.75321727 1.00000000 0.51226171 -0.360716264
STORES 0.32782601 0.60280152 0.51226171 1.00000000 -0.414095001
HI -0.27764317 -0.61983991 -0.36071626 -0.41409500 1.000000000
PERCRET 0.03418407 0.10271670 0.06785920 0.09372079 -0.252625218
PERCRET
FRE 0.155993396
MON 0.009115226
AVRG -0.180623484
GMP -0.007074185
PROMOS 0.044756473
DAYS 0.034184072
CLASSES 0.102716699
STYLES 0.067859197
STORES 0.093720788
HI -0.252625218
PERCRET 1.000000000
Pairwise correlations reveal strong relationships among purchasing
activity variables. For example, FRE (visits) correlates highly with
STYLES (items purchased) and CLASSES (product classes), and MON (net
sales) correlates very strongly with STYLES. These patterns are
intuitive; customers who shop more often tend to buy more items across
more classes, leading to higher sales. We retain these variables
initially and address redundancy later during model refinement.
We will now conduct frequency distributions for all of the numerical
variables to identify any patterns and skewness.
par(mfrow=c(1,2))
hist(CSDS_clean$FRE, xlab="# of Purchase Visits", main = "")
hist(CSDS_clean$MON, xlab="Total Net Sales", main = "")

hist(CSDS_clean$AVRG, xlab="Average Spend Per Visit", main = "")
hist(CSDS_clean$GMP, xlab="Gross Margin %", main = "")

hist(CSDS_clean$PROMOS, xlab="# of Marketing Promos", main = "")
hist(CSDS_clean$DAYS, xlab="Days on File", main = "")

hist(CSDS_clean$CLASSES, xlab="# of Different Products Purchased", main = "")
hist(CSDS_clean$STYLES, xlab="# of Individual Items Purchased", main = "")

hist(CSDS_clean$STORES, xlab="# of Stores Shopped At", main = "")
hist(CSDS_clean$HI, xlab="Product Uniformity", main = "")

hist(CSDS_clean$PERCRET, xlab="% of Returns", main = "")

None of our variables are distributed normally. Number of purchase
visits, Total Net Sales, Average Spend Per Visit, Number of Marketing
Promos, Number of Different Products Purchased, Number of Individual
Items Purchased, Number of Stores Shopped At, Product Uniformity, and
Percent of Returns are all skewed to the right. Gross Margin Percentage
and Days on File are skewed left. We will apply appropriate
transformations (log1p() and qlogis()) prior to modeling. But, we will
not be discretizing the variables because they are all significantly
skewed. We will also not be standardizing the variables because log1p()
and qlogis() handle the issue of skewness and nonlinear
relationships.
4 Data Split - Training and Testing Data
We first split the data into 70% training and 30% testing. All
transformations, feature selection, and model tuning are performed only
on the training set to avoid data leakage. The test set remains
untouched and is used once at the end to estimate out-of-sample
performance.
## splitting data: 70% training and 30% testing
n <- dim(CSDS_clean)[1]
train.n <- round(0.7*n)
train.id <- sample(1:n, train.n, replace = FALSE)
## training and testing data sets
train <- CSDS_clean[train.id, ]
test <- CSDS_clean[-train.id, ]
5 Data Transformation and Candidate Models
In this section, we transform the training data to address skewness
and then develop candidate logistic regression models, comparing them on
fit and parsimony before settling on a final specification.
5.2 Larger Model
We begin with a larger model containing all major transformed
numerical predictors to capture the full set of potential drivers of
response.
larger <- glm( RESP ~ MON_log + FRE_log + STYLES_log + CLASSES_log + STORES_log + HI_log + PERCRET_logit + PROMOS + DAYS + GMP, data = train_dat, family = binomial )
summary(larger)
Call:
glm(formula = RESP ~ MON_log + FRE_log + STYLES_log + CLASSES_log +
STORES_log + HI_log + PERCRET_logit + PROMOS + DAYS + GMP,
family = binomial, data = train_dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.1999993 0.9678648 -3.306 0.000946 ***
MON_log -0.0065322 0.2574462 -0.025 0.979757
FRE_log 1.4237810 0.2445699 5.822 5.83e-09 ***
STYLES_log 0.5900700 0.3271112 1.804 0.071250 .
CLASSES_log -0.4153607 0.3824896 -1.086 0.277506
STORES_log 0.0390271 0.2368922 0.165 0.869144
HI_log -0.2155549 0.2210891 -0.975 0.329576
PERCRET_logit -0.0187766 0.0281758 -0.666 0.505150
PROMOS -0.0374719 0.0152068 -2.464 0.013734 *
DAYS -0.0008740 0.0005152 -1.697 0.089783 .
GMP -0.6675410 0.5612068 -1.189 0.234253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1535.9 on 1832 degrees of freedom
Residual deviance: 1229.4 on 1822 degrees of freedom
AIC: 1251.4
Number of Fisher Scoring iterations: 5
The initial logistic regression model predicts the likelihood that a
customer responds to a marketing promotion based on a comprehensive set
of purchasing, engagement, and demographic variables. This model
includes ten predictors: MON_log, FRE_log, STYLES_log, CLASSES_log,
STORES_log, HI_log, PERCRET_logit, PROMOS, DAYS, and GMP.
FRE_log (β = 1.42, p < .001): Each one-unit increase
multiplies the odds of response by about 4.1 times.
PROMOS (β = –0.037, p = .014): Each additional promotion slightly
reduces response odds (≈ –4%).
STYLES_log (β = 0.59, p = .07): Positive but marginally
significant.
All other predictors are not statistically significant (p >
.10).
The model’s residual deviance (1229.4) and AIC (1251.4) reflect a
strong overall fit relative to the null deviance (1535.9). However,
multicollinearity diagnostics revealed inflated VIF values among several
purchasing activity variables, such as MON_log, CLASSES_log, and
STYLES_log. To address redundancy and improve interpretability, a
reduced model was developed containing only the key predictors —
FRE_log, PROMOS, and STYLES_log — which produced a more parsimonious and
stable model without loss of explanatory power. Before shortening our
model, let’s see if there is multicollinearity within our predictor
variables. We will run a VIF matrix.
MON_log FRE_log STYLES_log CLASSES_log STORES_log
12.044901 5.518902 15.808105 9.951538 1.519444
HI_log PERCRET_logit PROMOS DAYS GMP
3.555838 2.066177 2.270283 1.862572 1.704003
Variance inflation factors (VIF) reveal substantial multicollinearity
among purchase-activity variables—especially STYLES_log (≈15.8), MON_log
(≈12.0), and CLASSES_log (≈10.0)—which mirrors the very high
correlations observed earlier (e.g., MON–STYLES ≈ 0.94, FRE–STYLES ≈
0.79). Because several of these variables are not statistically
significant once frequency is included, we remove redundant predictors
and proceed to a reduced model that improves interpretability and
numerical stability.
5.3 Reduced Model
Guided by significance and multicollinearity diagnostics, the reduced
model focuses on the key predictors FRE_log, PROMOS, and STYLES_log.
reduced <- glm(
RESP ~ FRE_log + PROMOS + STYLES_log,
data = train_dat, family = binomial
)
summary(reduced)
Call:
glm(formula = RESP ~ FRE_log + PROMOS + STYLES_log, family = binomial,
data = train_dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.57821 0.24982 -18.326 < 2e-16 ***
FRE_log 1.35110 0.21459 6.296 3.05e-10 ***
PROMOS -0.05475 0.01302 -4.205 2.61e-05 ***
STYLES_log 0.45571 0.16893 2.698 0.00698 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1535.9 on 1832 degrees of freedom
Residual deviance: 1236.3 on 1829 degrees of freedom
AIC: 1244.3
Number of Fisher Scoring iterations: 5
The final logistic regression model predicts the likelihood that a
customer responds to a marketing promotion based on purchase frequency
(FRE_log), number of promotions (PROMOS), and number of unique items
purchased (STYLES_log).
FRE_log (β = 1.35, p < .001): A one-unit increase in
log-transformed purchase frequency multiplies the odds of response by
approximately 3.9× (e¹·³⁵ ≈ 3.86).
PROMOS (β = –0.055, p < .001): Each additional promotion on
file slightly reduces the odds of response by about 5% (e⁻⁰·⁰⁵⁵ ≈
0.95).
STYLES_log (β = 0.46, p = .007): A one-unit increase in
log-transformed styles purchased increases the odds of response by
roughly 1.6× (e⁰·⁴⁶ ≈ 1.58).
The model’s residual deviance (1236.3) and AIC (1244.3) represent a
strong improvement over the null deviance (1535.9), confirming good
model fit. Overall, this reduced model is parsimonious, statistically
robust, and interpretable, highlighting that shopping frequency and
purchase diversity are strong positive predictors of promotional
response, while excessive promotion frequency slightly dampens it.
FRE_log PROMOS STYLES_log
4.286025 1.656890 4.286738
All three VIF values are below 5, meaning multicollinearity is no
longer a serious issue.
6 Cross-Validation for Best Model Identification
Here, we use 5-fold cross validation to make sure our data set has
enough observations, or customers.
## 5-fold cross-validation on TRAINING DATA (train_dat) ## Requires: library(MASS); library(pander) set.seed(123)
k <- 5
fold.size <- floor(nrow(train_dat) / k)
# Keep columns so '.' in formula means exactly these predictors
train_cols <- c("RESP","MON_log","FRE_log","STYLES_log","CLASSES_log", "STORES_log","HI_log","PERCRET_logit","PROMOS","DAYS","GMP")
PE1 <- numeric(k) # full
PE2 <- numeric(k) # stepAIC (forward)
PE3 <- numeric(k) # reduced
for (i in 1:k) {
# Validation and training folds (simple contiguous folds)
valid.id <- (fold.size*(i-1)+1):(fold.size*i)
valid <- train_dat[valid.id, train_cols, drop = FALSE]
train_cv <- train_dat[-valid.id, train_cols, drop = FALSE]
## full model (matches your larger spec)
larger <- glm( RESP ~ MON_log + FRE_log + STYLES_log + CLASSES_log + STORES_log + HI_log + PERCRET_logit + PROMOS + DAYS + GMP, data = train_cv, family = binomial )
## reduced model (your trio)
reduced <- glm( RESP ~ FRE_log + PROMOS + STYLES_log,data = train_cv, family = binomial )
## forward step between reduced (lower) and full (upper)
model2 <- MASS::stepAIC( larger, scope = list(lower = formula(reduced), upper = formula(larger)), direction = "forward", trace = 0 )
## predicted probabilities on validation fold
pred01 <- predict(larger, newdata = valid, type = "response")
pred02 <- predict(model2, newdata = valid, type = "response")
pred03 <- predict(reduced, newdata = valid, type = "response")
## classify with threshold 0.4 (to match your template)
yhat01 <- ifelse(pred01 > 0.4, "yes", "no")
yhat02 <- ifelse(pred02 > 0.4, "yes", "no")
yhat03 <- ifelse(pred03 > 0.4, "yes", "no")
## misclassification error vs true RESP
PE1[i] <- mean(yhat01 != as.character(valid$RESP))
PE2[i] <- mean(yhat02 != as.character(valid$RESP))
PE3[i] <- mean(yhat03 != as.character(valid$RESP)) }
avg.pe <- cbind(PE.1_Full = mean(PE1),
PE.2_Step = mean(PE2),
PE.3_Reduced = mean(PE3))
pander::pander(avg.pe, caption = "5-fold CV: Average prediction errors (lower is better)")
5-fold CV: Average prediction errors (lower is
better)
| 0.1454 |
0.1454 |
0.1448 |
In 5-fold CV on the training data, prediction errors were 0.1454
(full), 0.1454 (stepwise), and 0.1448 (reduced), indicating comparable
performance; we select the reduced model for its slightly lower error
and improved interpretability.
7 Conclusion
The purpose of this project was to determine which customer
characteristics most effectively predict response to a direct-mail
marketing promotion. Through exploratory analysis, we found that
purchasing behavior variables—such as purchase frequency, number of
styles purchased, and total sales—were highly correlated and strongly
skewed. To improve model stability, these predictors were
log-transformed, and redundant variables were removed based on both
multicollinearity (VIF) and statistical insignificance.
Two logistic regression models were evaluated: a larger model
including all transformed numeric predictors, and a reduced model
focusing on the most meaningful predictors (FRE_log, STYLES_log, and
PROMOS). Both models achieved similar predictive accuracy in 5-fold
cross-validation, with the reduced model producing the lowest average
prediction error (0.1448). Because it provided nearly identical
performance with fewer parameters and no multicollinearity concerns, the
reduced model was selected as the final specification.
The final model reveals that customers who shop more frequently and
purchase a wider variety of items are significantly more likely to
respond to a marketing promotion, while those who receive more
promotions tend to respond less, possibly due to saturation or fatigue.
Overall, this analysis successfully answers the research question: yes,
certain behavioral characteristics—especially shopping frequency and
product variety—can reliably predict whether a customer will respond to
a direct-mail promotion. These insights can guide future marketing
strategies by helping retailers better target high-engagement customers
and refine their promotional frequency for optimal impact.
---
title: "Modeling Customer Engagement in Retail Promotions Through Logistic Regression"
author: "Luke Volm"
date: "2025-10-10"
output:
  html_document:           # output document format
    toc: yes               # add table contents
    toc_float: yes         # toc_property: floating
    toc_depth: 4           # depth of TOC headings
    fig_width: 6           # global figure width
    fig_height: 4          # global figure height
    fig_caption: yes       # add figure caption
    number_sections: no   # numbering section headings
    toc_collapsed: yes     # TOC subheading collapsing
    code_folding: hide     # folding/showing code 
    code_download: yes     # allow to download complete RMarkdown source code
    smooth_scroll: yes     # scrolling text of the document
    theme: lumen           # visual theme for HTML document only
    highlight: tango       # code syntax highlighting styles
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
  word_document:
    toc: yes
    toc_depth: '4'
---

```{css, echo = FALSE}
div#TOC lwe{     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
library(readxl)
library(boot)
library(dplyr)
library(knitr)
library(psych)
library(MASS)
library(tidyr)
library(ggplot2)
library(car)
library(pander)

# Set seed for reproducibility
set.seed(123)

# Read in data (drop first column if it's just an index/ID)
setwd("C:/Users/volm1/OneDrive/Desktop/STA321 new")
CSDS <- read.csv("Clothing Store Data Set.csv")
# Global chunk options
knitr::opts_chunk$set(
  echo = TRUE,      # show code
  warning = FALSE,  # suppress warnings
  message = FALSE,  # suppress messages
  results = TRUE,   # show results
  comment = NA      # cleaner output (no "##" prefix)
)
```

# 1 Introduction

This project builds a data-driven framework to identify which customers are most likely to respond to a direct-mail promotion for a regional clothing retailer. This project uses the Clothing Store Dataset—a real dataset from a New England retail chain. I first explore purchasing, engagement, and account-history patterns to understand data quality, distributional shape, and relationships among key variables. Based on those findings, I prepare the data by recoding the response, addressing skewed predictors with appropriate transformations, and checking for redundancy among highly correlated shopping-activity measures.

With the data curated, I develop a series of logistic regression models to link customer characteristics to response propensity. I compare a comprehensive “larger” specification to a more parsimonious “reduced” alternative, using variance inflation factors to monitor multicollinearity and cross-validation to select among candidates on predictive grounds. The analysis plan includes a train/test split for honest out-of-sample evaluation, threshold-based classification metrics (e.g., accuracy/PE), and clear interpretation of effect sizes through odds ratios. The goal is not just to fit a model, but to produce a transparent, defensible approach that a marketing team could use to prioritize outreach and improve campaign efficiency.

**Research Question**: Can customer purchasing and engagement behaviors be used to predict whether a customer will respond to a direct-mail marketing promotion?

# 2 Description of Dataset

These are 20 explanatory variables in this data set: 
1. HHKEY (categorical) → Unique encrypted customer ID
2. ZIP_CODE (categorical) → Customer’s ZIP code
3. FRE (numerical) → Number of purchase visits
4. MON (numerical) → Total net sales
5. AVRG (numerical) → Average spend per visit
6. AMSPEND, PSSPEND, CSSPEND, AXSPEND (categorical) → Spend across four different franchise brands
7. OMONSPEND, TMONSPEND, SMOSPEND (categorical) → Spend over past 1, 3, and 6 months
8. PREVPD (categorical) → Spend in the same period last year
9. GMP (numerical) → Gross margin percentage
10. PROMOS (numerical) → Number of marketing promotions on file
11. DAYS (numerical) → Number of days customer has been on file
12. FREDAYS, LTFREDAY (categorical) → Time between purchases (recent & lifetime average)
13. CLASSES (numerical) → Number of different product classes purchased
14. STYLES (numerical) → Number of individual items purchased
15. STORES (numerical) → Number of stores shopped at
16. MARKDOWN, COUPONS, MAILED, RESPONDE, RESPONSERATE (categorical) → Promotion and discount engagement
17. HI (numerical) → Product uniformity (lower = more diverse purchases)
18. CLUSTYPE (categorical) → Lifestyle cluster type (encrypted)
19. PERCRET (numerical) → Percent of returns
20. CC_CARD, VALPHON, WEB (categorical) → Flags for credit card, valid phone number, and web shopper status

Our response variable is:

RESP → whether a customer responded to a promotion (yes/no)

The dataset originates from the customer relationship management system of a regional New England clothing retailer. It includes transactional, demographic, and behavioral information used for targeted marketing analysis. Each record represents an individual customer and their corresponding promotional response. We have 2847 observations, or customers, in this dataset. We will now remove all observations with missing values.

```{r}

CSDS_clean <- na.omit(CSDS)
response <- factor(CSDS_clean$RESP, levels = c("no","yes"))
```

After removing missing values, 2619 customers remain for modeling.

# 3 Exploratory Data Analysis

We will make a correlation matrix with all of the numerical variables to see if any variables are highly correlated with each other. We will be dropping the categorical variables because they are all either redundant to what we have with the numerical variables, or are irrelevant to the question.

```{r}

vars <- c("FRE","MON", "AVRG","GMP","PROMOS","DAYS","CLASSES","STYLES","STORES","HI","PERCRET")
vars <- intersect(vars, names(CSDS_clean))
cor_mat <- cor(CSDS_clean[ , vars, drop = FALSE],
use = "pairwise.complete.obs",
method = "pearson")
cor_mat

```

Pairwise correlations reveal strong relationships among purchasing activity variables. For example, FRE (visits) correlates highly with STYLES (items purchased) and CLASSES (product classes), and MON (net sales) correlates very strongly with STYLES. These patterns are intuitive; customers who shop more often tend to buy more items across more classes, leading to higher sales. We retain these variables initially and address redundancy later during model refinement.

We will now conduct frequency distributions for all of the numerical variables to identify any patterns and skewness.

```{r}

par(mfrow=c(1,2))
hist(CSDS_clean$FRE, xlab="# of Purchase Visits", main = "")
hist(CSDS_clean$MON, xlab="Total Net Sales", main = "")
hist(CSDS_clean$AVRG, xlab="Average Spend Per Visit", main = "")
hist(CSDS_clean$GMP, xlab="Gross Margin %", main = "")
hist(CSDS_clean$PROMOS, xlab="# of Marketing Promos", main = "")
hist(CSDS_clean$DAYS, xlab="Days on File", main = "")
hist(CSDS_clean$CLASSES, xlab="# of Different Products Purchased", main = "")
hist(CSDS_clean$STYLES, xlab="# of Individual Items Purchased", main = "")
hist(CSDS_clean$STORES, xlab="# of Stores Shopped At", main = "")
hist(CSDS_clean$HI, xlab="Product Uniformity", main = "")
hist(CSDS_clean$PERCRET, xlab="% of Returns", main = "")

```

None of our variables are distributed normally. Number of purchase visits, Total Net Sales, Average Spend Per Visit, Number of Marketing Promos, Number of Different Products Purchased, Number of Individual Items Purchased, Number of Stores Shopped At, Product Uniformity, and Percent of Returns are all skewed to the right. Gross Margin Percentage and Days on File are skewed left. We will apply appropriate transformations (log1p() and qlogis()) prior to modeling. But, we will not be discretizing the variables because they are all significantly skewed. We will also not be standardizing the variables because log1p() and qlogis() handle the issue of skewness and nonlinear relationships.

# 4 Data Split - Training and Testing Data

We first split the data into 70% training and 30% testing. All transformations, feature selection, and model tuning are performed only on the training set to avoid data leakage. The test set remains untouched and is used once at the end to estimate out-of-sample performance.

```{r}

## splitting data: 70% training and 30% testing 
n <- dim(CSDS_clean)[1] 
train.n <- round(0.7*n) 
train.id <- sample(1:n, train.n, replace = FALSE) 
## training and testing data sets 
train <- CSDS_clean[train.id, ] 
test <- CSDS_clean[-train.id, ]

```

# 5 Data Transformation and Candidate Models

In this section, we transform the training data to address skewness and then develop candidate logistic regression models, comparing them on fit and parsimony before settling on a final specification.

## 5.1 Data Transformation

Several predictors (spending and purchasing activity) are highly skewed. To improve the approximate linearity between predictors and the log-odds of response, we apply log1p to right-skewed activity variables and a logit transform to the bounded return rate. These transformations reduce the influence of extreme values and stabilize variance while preserving continuous information. Importantly, all transformations are fit and applied on the training set only; the test set is transformed later with the same rules for a fair evaluation.

```{r}

train_dat <- train %>%
dplyr::mutate(
RESP = dplyr::case_when(
RESP %in% c("Yes","yes",1) ~ "yes",
RESP %in% c("No","no",0)   ~ "no",
TRUE ~ NA_character_
),
RESP = factor(RESP, levels = c("no","yes")),
PERCRET_p     = pmin(pmax(PERCRET / 100, 1e-6), 1 - 1e-6),
PERCRET_logit = qlogis(PERCRET_p),
dplyr::across(
tidyselect::any_of(c("MON","FRE","STYLES","CLASSES","STORES","HI")),
~ log1p(.),
.names = "{.col}_log"
)
) %>%
dplyr::select(tidyselect::all_of(c(
"RESP","MON_log","FRE_log","STYLES_log","CLASSES_log","STORES_log",
"HI_log","PERCRET_logit","PROMOS","DAYS","GMP"
))) %>%
tidyr::drop_na()
```

Summary of variable handling:

* **Transformed (log1p)**: MON, FRE, STYLES, CLASSES, STORES, HI

* **Transformed (logit)**: PERCRET

* **Standardized**: None

* **Discretized**: None

## 5.2 Larger Model

We begin with a larger model containing all major transformed numerical predictors to capture the full set of potential drivers of response.

```{r}

larger <- glm( RESP ~ MON_log + FRE_log + STYLES_log + CLASSES_log + STORES_log + HI_log + PERCRET_logit + PROMOS + DAYS + GMP, data = train_dat, family = binomial ) 

summary(larger)
```

The initial logistic regression model predicts the likelihood that a customer responds to a marketing promotion based on a comprehensive set of purchasing, engagement, and demographic variables. This model includes ten predictors: MON_log, FRE_log, STYLES_log, CLASSES_log, STORES_log, HI_log, PERCRET_logit, PROMOS, DAYS, and GMP.

* FRE_log (β = 1.42, p < .001): Each one-unit increase multiplies the odds of response by about 4.1 times.

* PROMOS (β = –0.037, p = .014): Each additional promotion slightly reduces response odds (≈ –4%).

* STYLES_log (β = 0.59, p = .07): Positive but marginally significant.

All other predictors are not statistically significant (p > .10).

The model’s residual deviance (1229.4) and AIC (1251.4) reflect a strong overall fit relative to the null deviance (1535.9). However, multicollinearity diagnostics revealed inflated VIF values among several purchasing activity variables, such as MON_log, CLASSES_log, and STYLES_log. To address redundancy and improve interpretability, a reduced model was developed containing only the key predictors — FRE_log, PROMOS, and STYLES_log — which produced a more parsimonious and stable model without loss of explanatory power. Before shortening our model, let's see if there is multicollinearity within our predictor variables. We will run a VIF matrix.

```{r}

vif(larger)

```

Variance inflation factors (VIF) reveal substantial multicollinearity among purchase-activity variables—especially STYLES_log (≈15.8), MON_log (≈12.0), and CLASSES_log (≈10.0)—which mirrors the very high correlations observed earlier (e.g., MON–STYLES ≈ 0.94, FRE–STYLES ≈ 0.79). Because several of these variables are not statistically significant once frequency is included, we remove redundant predictors and proceed to a reduced model that improves interpretability and numerical stability.

## 5.3 Reduced Model

Guided by significance and multicollinearity diagnostics, the reduced model focuses on the key predictors FRE_log, PROMOS, and STYLES_log.

```{r}

reduced <- glm(
RESP ~ FRE_log + PROMOS + STYLES_log,
data = train_dat, family = binomial
)
summary(reduced)

```

The final logistic regression model predicts the likelihood that a customer responds to a marketing promotion based on purchase frequency (FRE_log), number of promotions (PROMOS), and number of unique items purchased (STYLES_log).

* FRE_log (β = 1.35, p < .001): A one-unit increase in log-transformed purchase frequency multiplies the odds of response by approximately 3.9× (e¹·³⁵ ≈ 3.86).

* PROMOS (β = –0.055, p < .001): Each additional promotion on file slightly reduces the odds of response by about 5% (e⁻⁰·⁰⁵⁵ ≈ 0.95).

* STYLES_log (β = 0.46, p = .007): A one-unit increase in log-transformed styles purchased increases the odds of response by roughly 1.6× (e⁰·⁴⁶ ≈ 1.58).

The model’s residual deviance (1236.3) and AIC (1244.3) represent a strong improvement over the null deviance (1535.9), confirming good model fit. Overall, this reduced model is parsimonious, statistically robust, and interpretable, highlighting that shopping frequency and purchase diversity are strong positive predictors of promotional response, while excessive promotion frequency slightly dampens it.

```{r}

vif(reduced)

```

All three VIF values are below 5, meaning multicollinearity is no longer a serious issue.

# 6 Cross-Validation for Best Model Identification

Here, we use 5-fold cross validation to make sure our data set has enough observations, or customers.

```{r}

## 5-fold cross-validation on TRAINING DATA (train_dat) ## Requires: library(MASS); library(pander) set.seed(123) 

k <- 5 
fold.size <- floor(nrow(train_dat) / k) 

# Keep columns so '.' in formula means exactly these predictors

train_cols <- c("RESP","MON_log","FRE_log","STYLES_log","CLASSES_log", "STORES_log","HI_log","PERCRET_logit","PROMOS","DAYS","GMP") 
PE1 <- numeric(k) # full 

PE2 <- numeric(k) # stepAIC (forward) 

PE3 <- numeric(k) # reduced

for (i in 1:k) { 
  
# Validation and training folds (simple contiguous folds) 
  
valid.id <- (fold.size*(i-1)+1):(fold.size*i) 
valid <- train_dat[valid.id, train_cols, drop = FALSE] 
train_cv <- train_dat[-valid.id, train_cols, drop = FALSE] 

## full model (matches your larger spec) 

larger <- glm( RESP ~ MON_log + FRE_log + STYLES_log + CLASSES_log + STORES_log + HI_log + PERCRET_logit + PROMOS + DAYS + GMP, data = train_cv, family = binomial ) 

## reduced model (your trio) 

reduced <- glm( RESP ~ FRE_log + PROMOS + STYLES_log,data = train_cv, family = binomial ) 

## forward step between reduced (lower) and full (upper) 

model2 <- MASS::stepAIC( larger, scope = list(lower = formula(reduced), upper = formula(larger)), direction = "forward", trace = 0 )

## predicted probabilities on validation fold 

pred01 <- predict(larger, newdata = valid, type = "response") 

pred02 <- predict(model2, newdata = valid, type = "response") 

pred03 <- predict(reduced, newdata = valid, type = "response") 

## classify with threshold 0.4 (to match your template) 

yhat01 <- ifelse(pred01 > 0.4, "yes", "no") 
yhat02 <- ifelse(pred02 > 0.4, "yes", "no") 
yhat03 <- ifelse(pred03 > 0.4, "yes", "no") 
## misclassification error vs true RESP 
PE1[i] <- mean(yhat01 != as.character(valid$RESP)) 
PE2[i] <- mean(yhat02 != as.character(valid$RESP)) 
PE3[i] <- mean(yhat03 != as.character(valid$RESP)) } 

avg.pe <- cbind(PE.1_Full = mean(PE1), 
  PE.2_Step = mean(PE2), 
  PE.3_Reduced = mean(PE3)) 
pander::pander(avg.pe, caption = "5-fold CV: Average prediction errors (lower is better)")

```

In 5-fold CV on the training data, prediction errors were 0.1454 (full), 0.1454 (stepwise), and 0.1448 (reduced), indicating comparable performance; we select the reduced model for its slightly lower error and improved interpretability.

# 7 Conclusion

The purpose of this project was to determine which customer characteristics most effectively predict response to a direct-mail marketing promotion. Through exploratory analysis, we found that purchasing behavior variables—such as purchase frequency, number of styles purchased, and total sales—were highly correlated and strongly skewed. To improve model stability, these predictors were log-transformed, and redundant variables were removed based on both multicollinearity (VIF) and statistical insignificance.

Two logistic regression models were evaluated: a larger model including all transformed numeric predictors, and a reduced model focusing on the most meaningful predictors (FRE_log, STYLES_log, and PROMOS). Both models achieved similar predictive accuracy in 5-fold cross-validation, with the reduced model producing the lowest average prediction error (0.1448). Because it provided nearly identical performance with fewer parameters and no multicollinearity concerns, the reduced model was selected as the final specification.

The final model reveals that customers who shop more frequently and purchase a wider variety of items are significantly more likely to respond to a marketing promotion, while those who receive more promotions tend to respond less, possibly due to saturation or fatigue. Overall, this analysis successfully answers the research question: yes, certain behavioral characteristics—especially shopping frequency and product variety—can reliably predict whether a customer will respond to a direct-mail promotion. These insights can guide future marketing strategies by helping retailers better target high-engagement customers and refine their promotional frequency for optimal impact.