This analysis explores factors influencing customer responses to a marketing campaign and builds a predictive model for campaign acceptance.
Research goal: Identify which customer characteristics predict campaign response and provide actionable recommendations for targeting strategy.
library(readr)
library(lubridate)
library(ggplot2)
library(magrittr)
library(dplyr)
# install.packages("coin")
library(coin)
# install.packages("rpart.plot")
library(rpart.plot)
# install.packages("pROC")
library(pROC)
data <- read_csv("~/Documents/r files/ml_project1_data.csv")
Does day of week affect campaign response?
Hypothesis: Customers who enrolled on weekends are more likely to respond to the campaign.
data$week = wday(data$Dt_Customer, label = TRUE) # adding 'day of week' variable.
data$Response = as.factor(data$Response)
data2 = data %>% filter(Response == 1) %>%
group_by(week) %>%
summarise(count = n())
ch = chisq.test(data$Response, data$week)
ch
##
## Pearson's Chi-squared test
##
## data: data$Response and data$week
## X-squared = 10.987, df = 6, p-value = 0.08878
ggplot(data = data2) +
geom_bar(aes(x = week, y = count),
stat = "identity", fill = "#104E8B") +
geom_text(aes(x = week, y = count,
label = count), vjust = -0.5, size = 3.5) +
labs(title = "Campaign responses by day of week",
subtitle = "No statistically significant relationship detected",
x = "Day of week",
y = "Number of responses") +
theme_classic()
Results: The hypothesis is not confirmed because the p-value (0.09) is above the 0.01 significance threshold. The chi-square test finds no statistically significant relationship between the day of the week on which a customer enrolled and whether that customer responds to the campaign. Therefore, day of the week can be excluded from the predictive model.
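The hypothesis concerns weekends specifically, while the test above compares all seven days. A minimal sketch of a more direct check is to collapse the day of week into a weekend/weekday flag before running the chi-square test. The `toy` data frame and the `is_weekend` column are hypothetical stand-ins, not part of the original analysis, and the sketch uses base R only.

```r
# Hypothetical toy data standing in for the campaign dataset (not the real file).
set.seed(1)
toy <- data.frame(
  Dt_Customer = as.Date("2013-01-01") + sample(0:364, 500, replace = TRUE),
  Response    = factor(rbinom(500, 1, 0.15), levels = c(0, 1))
)

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, 6 = Saturday.
day_num <- as.POSIXlt(toy$Dt_Customer)$wday
toy$is_weekend <- factor(ifelse(day_num %in% c(0, 6), "weekend", "weekday"))

# 2 x 2 contingency table and chi-square test of independence.
tab <- table(toy$is_weekend, toy$Response)
weekend_test <- chisq.test(tab)
weekend_test$p.value
```

With only two groups the test has one degree of freedom instead of six, so it targets the stated hypothesis rather than any difference among the seven days.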
Does income level predict campaign response?
Hypothesis: Higher-income customers are more likely to respond to campaigns.
data3 = data %>% filter(!is.na(Income))
mean_income = mean(data3$Income)
data3 = mutate(data3,
income_group = factor(case_when(Income < mean_income ~ "below average", TRUE ~ "above average")))
ch2 = chisq.test(data3$Response, data3$income_group)
ch2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data3$Response and data3$income_group
## X-squared = 24.483, df = 1, p-value = 7.498e-07
data3 = data3 %>% filter(Response == 1) %>%
group_by(income_group) %>%
summarise(count = n())
ggplot(data = data3) +
geom_bar(aes(x = income_group, y = count),
stat = "identity", fill = "#104E8B") +
geom_text(aes(x = income_group, y = count,
label = count), vjust = -0.5, size = 3.5) +
labs(title = "Campaign responses by income level",
subtitle = paste0("Mean income threshold: $",
round(mean_income, 0)),
x = "Income group",
y = "Number of responses") + theme_classic()
Results: The hypothesis is confirmed because the p-value is below 0.01 (p-value = 7.5e-07). The chi-square test shows that customers with above-average income are significantly more likely to respond.
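The bar chart above shows raw counts of responders, which also depend on how many customers fall into each income group. A sketch of how the response rate per group could be computed is given below. The `toy` data frame is hypothetical and simulated, so its numbers say nothing about the real dataset.

```r
# Hypothetical toy data (not the real dataset): income and a 0/1 response flag.
set.seed(2)
toy <- data.frame(Income = round(rlnorm(400, meanlog = 10.8, sdlog = 0.5)))
toy$Response <- factor(
  rbinom(400, 1, ifelse(toy$Income > mean(toy$Income), 0.25, 0.10)),
  levels = c(0, 1)
)

# Same split rule as in the analysis: below the mean vs. at or above the mean.
toy$income_group <- factor(ifelse(toy$Income < mean(toy$Income),
                                  "below average", "above average"))

# Row-wise proportions: share of responders (column "1") within each group.
counts <- table(toy$income_group, toy$Response)
rates  <- prop.table(counts, margin = 1)
round(rates[, "1"], 3)
```

Comparing the `"1"` column across rows gives the within-group response rate, which is the quantity the hypothesis is about.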
Does the purchase channel (website or in store) affect whether customers respond to the campaign?
Hypothesis: Customers who make more purchases on the website than directly in the store are more likely to respond to the campaign.
t.test(data$NumWebPurchases ~ data$Response)
##
## Welch Two Sample t-test
##
## data: data$NumWebPurchases by data$Response
## t = -7.5416, df = 481.43, p-value = 2.33e-13
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.4622265 -0.8577715
## sample estimates:
## mean in group 0 mean in group 1
## 3.911857 5.071856
ggplot() +
geom_boxplot(data = data, aes(x = Response, y = NumWebPurchases), fill = "#BFEFFF") +
ylab("Number of purchases made on website") +
ggtitle("Campaign responses by number of online purchases") +
theme_classic()
t.test(data$NumStorePurchases ~ data$Response)
##
## Welch Two Sample t-test
##
## data: data$NumStorePurchases by data$Response
## t = -1.9458, df = 474.81, p-value = 0.05226
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.721904281 0.003529907
## sample estimates:
## mean in group 0 mean in group 1
## 5.736621 6.095808
ggplot() +
geom_boxplot(data = data, aes(x = Response, y = NumStorePurchases), fill = "#BFEFFF") +
ylab("Number of purchases made in the store") +
ggtitle("Campaign responses by number of offline purchases") +
theme_classic()
Results: The hypothesis is partly confirmed. The t-test shows a statistically significant difference in mean web purchases between responders and non-responders (p-value < 0.01): responders average 5.07 web purchases versus 3.91 for non-responders. No significant difference was found for in-store purchases (p-value = 0.052), suggesting that digital engagement, rather than overall purchase frequency, is the key behavioural signal for campaign responsiveness.
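The two t-tests above look at each channel separately, whereas the hypothesis compares web purchases against store purchases for the same customer. One possible way to test that directly, sketched here under assumptions, is to compute each customer's share of purchases made online and compare the share between responders and non-responders. The `toy` data and the `web_share` column are hypothetical and not part of the original analysis.

```r
# Hypothetical toy data (not the real dataset).
set.seed(3)
n <- 600
toy <- data.frame(
  Response          = factor(rbinom(n, 1, 0.2), levels = c(0, 1)),
  NumWebPurchases   = rpois(n, 4),
  NumStorePurchases = rpois(n, 6)
)

# Share of purchases made online; NA when a customer has no purchases at all.
total <- toy$NumWebPurchases + toy$NumStorePurchases
toy$web_share <- ifelse(total > 0, toy$NumWebPurchases / total, NA)

# Welch t-test on the share; rows with NA are dropped by the default na.action.
share_test <- t.test(web_share ~ Response, data = toy)
share_test$estimate
```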
Does the number of days since the last purchase affect the campaign response?
Hypothesis: Customers who purchased more recently are more likely to respond to the campaign.
t.test(data$Recency ~ data$Response)
##
## Welch Two Sample t-test
##
## data: data$Recency by data$Response
## t = 9.786, df = 465.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 12.89220 19.37072
## sample estimates:
## mean in group 0 mean in group 1
## 51.51469 35.38323
ggplot() +
geom_boxplot(data = data, aes(x = Response, y = Recency), fill = "#BFEFFF") +
ylab("Days since the last purchase") +
ggtitle("Campaign responses by recency of the purchase") +
theme_classic()
Results: The hypothesis is confirmed because the t-test shows a statistically significant difference in recency between responders and non-responders (p-value < 0.01). Responders have a lower mean recency (35.4 days versus 51.5 days for non-responders), which indicates more recent engagement.
set.seed(888)
train_index = sample(1:nrow(data), size = 0.8 * nrow(data))
train = data[train_index, ]
test = data[-train_index, ]
# Class 1 (responders) gets 5 times more weight because negative cases dominate the dataset.
tree = rpart(
Response ~ Income + NumWebPurchases + Recency,
data = train,
method = "class",
weights = ifelse(train$Response == 1, 5, 1))
rpart.plot(tree)
pred_class = predict(
tree,
newdata = test,
type = "class")
pred_prob = predict(
tree,
newdata = test,
type = "prob")
mean(pred_class == test$Response)
## [1] 0.7388393
table(pred_class, test$Response)
##
## pred_class 0 1
## 0 291 31
## 1 86 40
roc_obj = roc(test$Response, pred_prob[,2])
auc(roc_obj)
## Area under the curve: 0.6744
plot(roc_obj)
Model accuracy: 0.74. Area under the curve: 0.67. Given the class imbalance in the dataset, overall accuracy is less informative than AUC. The AUC of 0.67 indicates moderate predictive ability above a random baseline of 0.5. While the model is not highly precise at individual-level prediction, it provides meaningful segmentation and ranking capability for marketing targeting purposes.
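To make the point about accuracy and class imbalance concrete, the per-class metrics can be derived from the confusion matrix printed above. The sketch below hard-codes those four counts; the variable names are illustrative and not part of the original code.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = actual).
cm <- matrix(c(291, 86, 31, 40), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)         # share of correct predictions
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # responders correctly flagged
specificity <- cm["0", "0"] / sum(cm[, "0"])   # non-responders correctly flagged
precision   <- cm["1", "1"] / sum(cm["1", ])   # flagged customers who respond
baseline    <- max(colSums(cm)) / sum(cm)      # accuracy of always predicting the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        baseline = baseline), 3)
# accuracy 0.739, sensitivity 0.563, specificity 0.772, precision 0.317, baseline 0.842
```

Always predicting "no response" would score about 0.84 on this test set, higher than the tree's 0.74, which is why accuracy alone is a weak yardstick here. The class weights trade some accuracy for sensitivity: the tree finds 40 of the 71 actual responders.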
The decision tree reveals that Income is the strongest predictor of campaign response, acting as the primary segmentation factor in customer targeting. Customers with higher income levels are more likely to respond, indicating that purchasing power is a key driver of marketing engagement.
Among lower-income customers, Recency becomes the most important behavioral factor. Customers who have interacted with the company more recently show higher response rates, especially when combined with online purchasing activity.
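One way to back up a statement about which predictor matters most is to inspect the `variable.importance` component that `rpart` stores on a fitted tree. The sketch below fits a tree on simulated data in which, by construction, only Income carries signal. The `toy` data and `toy_tree` object are hypothetical, so the printed shares do not describe the real model.

```r
library(rpart)  # recommended package shipped with standard R installations

# Hypothetical toy data (not the real dataset): only Income drives the response.
set.seed(5)
n <- 1000
toy <- data.frame(
  Income          = runif(n, 10000, 110000),
  NumWebPurchases = rpois(n, 4),
  Recency         = sample(0:99, n, replace = TRUE)
)
toy$Response <- factor(rbinom(n, 1, ifelse(toy$Income > 60000, 0.6, 0.1)),
                       levels = c(0, 1))

toy_tree <- rpart(Response ~ Income + NumWebPurchases + Recency,
                  data = toy, method = "class")

# Named numeric vector of importance scores, sorted explicitly for display.
imp <- sort(toy_tree$variable.importance, decreasing = TRUE)
round(imp / sum(imp), 3)  # each variable's share of total importance
```

Applied to the fitted `tree` from the analysis, the same two lines would show whether Income does rank first and how far ahead of Recency and NumWebPurchases it is.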