This case study analyzes Bookbinders Book Club's (BBBC) direct-mail campaign, using predictive modeling to determine whether customers will purchase The Art History of Florence. By leveraging customer demographics, purchase history, and preferences, we evaluate the performance of multiple models. Key findings show that book genre interest, past purchase behavior, and recency of engagement strongly influence purchase likelihood.
Among the models applied (linear regression, logistic regression, and SVM), SVM had the best performance with an accuracy of 89.9%. The logistic model provided interpretability, while linear regression was deemed inappropriate because its assumptions do not hold for a binary outcome. The analysis supports targeting customers more effectively through data-driven approaches, reducing marketing costs and improving ROI.
This is a binary classification problem. The dependent variable is Choice (purchase: yes/no), and the predictors include demographics, purchase behaviors, and interest indicators.
BBBC aims to:
- Determine which customers to target
- Identify cost-effective marketing strategies
- Understand influential predictors of purchase behavior
Previous studies of classification problems in marketing, such as telecom churn prediction, have used logistic regression, SVMs, and linear regression. Logistic and linear models estimate a decision boundary or a predicted value directly, whereas SVMs maximize the margin between classes. Their strengths differ: logistic regression is interpretable, SVMs handle complex boundaries well, and linear regression is unsuitable for binary outcomes.
We implemented and compared the following models:
- Linear Regression: included as a baseline, but because the dependent variable is categorical it gave poor results.
- Logistic Regression: interpretable and well suited to binary outcomes, though sensitive to non-linearity. Stepwise variable selection used:
  - AIC when the goal is predictive accuracy.
  - BIC when aiming for simplicity and avoiding overfitting.
- SVM: high accuracy through margin optimization, but limited interpretability.
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ROCR)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
library(rpart.plot)
## Loading required package: rpart
## Warning: package 'rpart' was built under R version 4.3.3
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
The data are split into training (1,600 obs) and testing (2,300 obs) sets across 12 variables. Categorical variables were converted to factors, numeric variables were scaled, and only the Observation column (a row index) was removed.
Variables include:
- Demographic: Gender
- Purchase behavior: Amount_purchased, Frequency, Last_purchase, First_purchase
- Interest levels: P_Child, P_Youth, P_Cook, P_DIY, P_Art
Skewness was found in the distributions, and variable encoding and scaling were applied accordingly. Class imbalance was noted and considered during evaluation.
# Loads the training and testing datasets
BBBC_test <- read_excel('BBBC-Test.xlsx')
BBBC_train <- read_excel('BBBC-Train.xlsx')
# Check for imbalances in the variables
summary(BBBC_test)
## Observation Choice Gender Amount_purchased
## Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. : 15.0
## 1st Qu.: 575.8 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:119.0
## Median :1150.5 Median :0.0000 Median :1.0000 Median :198.0
## Mean :1150.5 Mean :0.0887 Mean :0.6865 Mean :195.3
## 3rd Qu.:1725.2 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:268.0
## Max. :2300.0 Max. :1.0000 Max. :1.0000 Max. :461.0
## Frequency Last_purchase First_purchase P_Child
## Min. : 2.0 Min. : 1.000 Min. : 2.00 Min. :0.0000
## 1st Qu.: 8.0 1st Qu.: 1.000 1st Qu.:12.00 1st Qu.:0.0000
## Median :12.0 Median : 2.000 Median :18.00 Median :0.0000
## Mean :13.3 Mean : 3.059 Mean :22.85 Mean :0.7287
## 3rd Qu.:16.0 3rd Qu.: 4.000 3rd Qu.:32.00 3rd Qu.:1.0000
## Max. :36.0 Max. :12.000 Max. :96.00 Max. :7.0000
## P_Youth P_Cook P_DIY P_Art
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00
## Mean :0.3426 Mean :0.7857 Mean :0.4061 Mean :0.33
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00
## Max. :5.0000 Max. :6.0000 Max. :4.0000 Max. :4.00
summary(BBBC_train)
## Observation Choice Gender Amount_purchased
## Min. : 1.0 Min. :0.00 Min. :0.0000 Min. : 15.0
## 1st Qu.: 400.8 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:126.8
## Median : 800.5 Median :0.00 Median :1.0000 Median :203.0
## Mean : 800.5 Mean :0.25 Mean :0.6587 Mean :200.9
## 3rd Qu.:1200.2 3rd Qu.:0.25 3rd Qu.:1.0000 3rd Qu.:273.0
## Max. :1600.0 Max. :1.00 Max. :1.0000 Max. :474.0
## Frequency Last_purchase First_purchase P_Child
## Min. : 2.00 Min. : 1.000 Min. : 2.00 Min. :0.0000
## 1st Qu.: 6.00 1st Qu.: 1.000 1st Qu.:12.00 1st Qu.:0.0000
## Median :12.00 Median : 2.000 Median :18.00 Median :0.0000
## Mean :12.31 Mean : 3.199 Mean :22.58 Mean :0.7394
## 3rd Qu.:16.00 3rd Qu.: 4.000 3rd Qu.:30.00 3rd Qu.:1.0000
## Max. :36.00 Max. :12.000 Max. :96.00 Max. :8.0000
## P_Youth P_Cook P_DIY P_Art
## Min. :0.0000 Min. :0.00 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.00 Median :0.0000 Median :0.000
## Mean :0.3375 Mean :0.76 Mean :0.3912 Mean :0.425
## 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :4.0000 Max. :6.00 Max. :4.0000 Max. :5.000
# Remove the observation variable as it just counts the number of rows
BBBC_test = dplyr::select(BBBC_test, -c(Observation))
BBBC_train = dplyr::select(BBBC_train, -c(Observation))
# Turns Gender into a factor variable
BBBC_train$Gender = as.factor(BBBC_train$Gender)
BBBC_test$Gender = as.factor(BBBC_test$Gender)
str(BBBC_train)
## tibble [1,600 Ă— 11] (S3: tbl_df/tbl/data.frame)
## $ Choice : num [1:1600] 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 2 2 ...
## $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
## $ Frequency : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
## $ Last_purchase : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
## $ First_purchase : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
## $ P_Child : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
## $ P_Youth : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
## $ P_Cook : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
## $ P_DIY : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
## $ P_Art : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...
#Check for missings in the training dataset
sum(is.na(BBBC_train))
## [1] 0
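Because skewness and class imbalance were noted above, a quick numeric check is useful before modeling. A minimal sketch, assuming the datasets loaded above and the attached e1071 package for skewness():
# Class balance of the response in each split
prop.table(table(BBBC_train$Choice))
prop.table(table(BBBC_test$Choice))
# Skewness of the numeric predictors (values well above 0 indicate right skew)
sapply(dplyr::select_if(BBBC_train, is.numeric), e1071::skewness)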
ggplot(BBBC_train, aes(x = Amount_purchased, fill = factor(Choice))) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
  labs(title = "Distribution of Amount Purchased by Choice", fill = "Choice")
# Boxplot of 'Last_purchase' by Choice (Choice converted to a factor for grouping)
ggplot(BBBC_train, aes(x = factor(Choice), y = Last_purchase, fill = factor(Choice))) +
  geom_boxplot() +
  labs(title = "Months Since Last Purchase by Choice", x = "Choice", fill = "Choice")
# Bar plot for Gender
ggplot(BBBC_train, aes(x = Gender, fill = factor(Choice))) +
  geom_bar(position = "dodge") +
  labs(title = "Gender Distribution by Choice", fill = "Choice")
# Create a correlation plot to check for multicollinearity
BBBC_train_num = select_if(BBBC_train, is.numeric)
M = cor(BBBC_train_num)
corrplot(M, method = "number")
# Fit linear regression model
linear_model <- lm(Choice ~ ., data = BBBC_train)
# Summarize the model
summary(linear_model)
##
## Call:
## lm(formula = Choice ~ ., data = BBBC_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9603 -0.2462 -0.1161 0.1622 1.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3642284 0.0307411 11.848 < 2e-16 ***
## Gender1 -0.1309205 0.0200303 -6.536 8.48e-11 ***
## Amount_purchased 0.0002736 0.0001110 2.464 0.0138 *
## Frequency -0.0090868 0.0021791 -4.170 3.21e-05 ***
## Last_purchase 0.0970286 0.0135589 7.156 1.26e-12 ***
## First_purchase -0.0020024 0.0018160 -1.103 0.2704
## P_Child -0.1262584 0.0164011 -7.698 2.41e-14 ***
## P_Youth -0.0963563 0.0201097 -4.792 1.81e-06 ***
## P_Cook -0.1414907 0.0166064 -8.520 < 2e-16 ***
## P_DIY -0.1352313 0.0197873 -6.834 1.17e-11 ***
## P_Art 0.1178494 0.0194427 6.061 1.68e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared: 0.2401, Adjusted R-squared: 0.2353
## F-statistic: 50.2 on 10 and 1589 DF, p-value: < 2.2e-16
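Multicollinearity suggested by the correlation plot can also be quantified with variance inflation factors. A quick check using vif() from the already-loaded car package on the fitted linear model:
# Variance inflation factors; values well above 5-10 suggest problematic collinearity
vif(linear_model)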
# Stepwise selection to remove irrelevant variables
step_model <- stepAIC(linear_model, direction = "both")
## Start: AIC=-3095.59
## Choice ~ Gender + Amount_purchased + Frequency + Last_purchase +
## First_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art
##
## Df Sum of Sq RSS AIC
## - First_purchase 1 0.1744 228.16 -3096.4
## <none> 227.98 -3095.6
## - Amount_purchased 1 0.8711 228.85 -3091.5
## - Frequency 1 2.4949 230.48 -3080.2
## - P_Youth 1 3.2940 231.28 -3074.6
## - P_Art 1 5.2713 233.25 -3061.0
## - Gender 1 6.1294 234.11 -3055.1
## - P_DIY 1 6.7013 234.68 -3051.2
## - Last_purchase 1 7.3473 235.33 -3046.8
## - P_Child 1 8.5026 236.49 -3039.0
## - P_Cook 1 10.4155 238.40 -3026.1
##
## Step: AIC=-3096.36
## Choice ~ Gender + Amount_purchased + Frequency + Last_purchase +
## P_Child + P_Youth + P_Cook + P_DIY + P_Art
##
## Df Sum of Sq RSS AIC
## <none> 228.16 -3096.4
## + First_purchase 1 0.1744 227.98 -3095.6
## - Amount_purchased 1 0.8752 229.03 -3092.2
## - P_Youth 1 3.3702 231.53 -3074.9
## - P_Art 1 5.1098 233.27 -3062.9
## - Gender 1 6.2042 234.36 -3055.4
## - P_DIY 1 6.8588 235.01 -3051.0
## - Last_purchase 1 8.4161 236.57 -3040.4
## - P_Child 1 8.7321 236.89 -3038.3
## - P_Cook 1 10.8023 238.96 -3024.3
## - Frequency 1 11.9827 240.14 -3016.5
summary(step_model)
##
## Call:
## lm(formula = Choice ~ Gender + Amount_purchased + Frequency +
## Last_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art,
## data = BBBC_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9802 -0.2452 -0.1157 0.1655 1.0595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3727367 0.0297590 12.525 < 2e-16 ***
## Gender1 -0.1316464 0.0200208 -6.575 6.56e-11 ***
## Amount_purchased 0.0002742 0.0001110 2.470 0.0136 *
## Frequency -0.0110830 0.0012128 -9.138 < 2e-16 ***
## Last_purchase 0.0894288 0.0116772 7.658 3.25e-14 ***
## P_Child -0.1275991 0.0163571 -7.801 1.11e-14 ***
## P_Youth -0.0973642 0.0200903 -4.846 1.38e-06 ***
## P_Cook -0.1433497 0.0165218 -8.676 < 2e-16 ***
## P_DIY -0.1365578 0.0197520 -6.914 6.82e-12 ***
## P_Art 0.1150034 0.0192719 5.967 2.97e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1590 degrees of freedom
## Multiple R-squared: 0.2395, Adjusted R-squared: 0.2352
## F-statistic: 55.63 on 9 and 1590 DF, p-value: < 2.2e-16
# Generate predictions (probabilities between 0 and 1)
predicted_probs <- predict(step_model, newdata = BBBC_test)
# Convert probabilities to binary predictions (0/1) using a threshold (e.g., 0.5)
predicted_class <- ifelse(predicted_probs >= 0.5, 1, 0)
# Compare with actual outcomes
confusionMatrix(as.factor(predicted_class), as.factor(BBBC_test$Choice), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2007 144
## 1 89 60
##
## Accuracy : 0.8987
## 95% CI : (0.8856, 0.9107)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9833734
##
## Kappa : 0.2865
##
## Mcnemar's Test P-Value : 0.0004037
##
## Sensitivity : 0.29412
## Specificity : 0.95754
## Pos Pred Value : 0.40268
## Neg Pred Value : 0.93305
## Prevalence : 0.08870
## Detection Rate : 0.02609
## Detection Prevalence : 0.06478
## Balanced Accuracy : 0.62583
##
## 'Positive' Class : 1
##
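One symptom of forcing a linear model onto a binary outcome is that its fitted values are not true probabilities. A quick check of the prediction range (using predicted_probs from the step_model predictions above) illustrates this:
# Linear regression can produce "probabilities" outside [0, 1]
range(predicted_probs)
mean(predicted_probs < 0 | predicted_probs > 1)  # share of out-of-range predictions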
Logistic regression is used for binary classification. It estimates the probability that a given input belongs to a particular class. It’s ideal when the relationship between features and the outcome is linear, and it provides interpretable coefficients.
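Concretely, the model maps a linear combination of the predictors to a probability through the logistic (sigmoid) function, P(Choice = 1) = 1 / (1 + exp(-eta)). A tiny illustration with made-up numbers:
# Hypothetical linear predictor eta = b0 + b1*x1 + ... (values chosen only for illustration)
eta <- -0.35 + 0.66 * 1
1 / (1 + exp(-eta))  # manual logistic transform
plogis(eta)          # built-in equivalent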
# Change the Choice variable into a factor variable for the logistic model
BBBC_train$Choice = as.factor(BBBC_train$Choice)
# Fit full logistic regression model
logit_model <- glm(Choice ~ ., data = BBBC_train, family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = Choice ~ ., family = "binomial", data = BBBC_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3515281 0.2143839 -1.640 0.1011
## Gender1 -0.8632319 0.1374499 -6.280 3.38e-10 ***
## Amount_purchased 0.0018641 0.0007918 2.354 0.0186 *
## Frequency -0.0755142 0.0165937 -4.551 5.35e-06 ***
## Last_purchase 0.6117713 0.0938127 6.521 6.97e-11 ***
## First_purchase -0.0147792 0.0128027 -1.154 0.2483
## P_Child -0.8112489 0.1167067 -6.951 3.62e-12 ***
## P_Youth -0.6370422 0.1433778 -4.443 8.87e-06 ***
## P_Cook -0.9230066 0.1194814 -7.725 1.12e-14 ***
## P_DIY -0.9058697 0.1437025 -6.304 2.90e-10 ***
## P_Art 0.6861124 0.1270176 5.402 6.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1392.2 on 1589 degrees of freedom
## AIC: 1414.2
##
## Number of Fisher Scoring iterations: 5
# Perform stepwise variable selection (AIC)
set.seed(1)
final_model <- stepAIC(logit_model, direction = "both", trace = FALSE)
summary(final_model)
##
## Call:
## glm(formula = Choice ~ Gender + Amount_purchased + Frequency +
## Last_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art,
## family = "binomial", data = BBBC_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2833949 0.2062721 -1.374 0.1695
## Gender1 -0.8660575 0.1373268 -6.307 2.85e-10 ***
## Amount_purchased 0.0018357 0.0007908 2.321 0.0203 *
## Frequency -0.0903261 0.0106304 -8.497 < 2e-16 ***
## Last_purchase 0.5536689 0.0784519 7.057 1.70e-12 ***
## P_Child -0.8181807 0.1163377 -7.033 2.02e-12 ***
## P_Youth -0.6424923 0.1432548 -4.485 7.29e-06 ***
## P_Cook -0.9330131 0.1190073 -7.840 4.51e-15 ***
## P_DIY -0.9101106 0.1433591 -6.348 2.17e-10 ***
## P_Art 0.6643371 0.1255243 5.292 1.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1393.5 on 1590 degrees of freedom
## AIC: 1413.5
##
## Number of Fisher Scoring iterations: 5
AIC = 2k - 2ln(L), where k = the number of estimated parameters and L = the maximized likelihood of the model.
# AIC-based stepwise selection
glm.aic <- step(logit_model, scope = list(upper = logit_model),
direction = "both", test = "Chisq", trace = F)
summary(glm.aic)
##
## Call:
## glm(formula = Choice ~ Gender + Amount_purchased + Frequency +
## Last_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art,
## family = "binomial", data = BBBC_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2833949 0.2062721 -1.374 0.1695
## Gender1 -0.8660575 0.1373268 -6.307 2.85e-10 ***
## Amount_purchased 0.0018357 0.0007908 2.321 0.0203 *
## Frequency -0.0903261 0.0106304 -8.497 < 2e-16 ***
## Last_purchase 0.5536689 0.0784519 7.057 1.70e-12 ***
## P_Child -0.8181807 0.1163377 -7.033 2.02e-12 ***
## P_Youth -0.6424923 0.1432548 -4.485 7.29e-06 ***
## P_Cook -0.9330131 0.1190073 -7.840 4.51e-15 ***
## P_DIY -0.9101106 0.1433591 -6.348 2.17e-10 ***
## P_Art 0.6643371 0.1255243 5.292 1.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1393.5 on 1590 degrees of freedom
## AIC: 1413.5
##
## Number of Fisher Scoring iterations: 5
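The AIC reported above can be reproduced directly from the fitted model's log-likelihood using the formula stated earlier. A quick verification, assuming glm.aic from the step above:
# AIC = 2k - 2ln(L), recomputed by hand and compared with R's built-in AIC()
k <- attr(logLik(glm.aic), "df")    # number of estimated parameters
ll <- as.numeric(logLik(glm.aic))   # maximized log-likelihood
2 * k - 2 * ll
AIC(glm.aic)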
pred.aic = prediction(predict(glm.aic,BBBC_train,type='response'),BBBC_train$Choice)
#AIC sensitivity and specificity
plot(unlist(performance(pred.aic,'sens')@x.values),unlist(performance(pred.aic,'sens')@y.values), type='l', lwd=2, ylab = "", xlab = 'Cutoff')
mtext('Sensitivity',side=2)
mtext('Sensitivity vs. Specificity Plot for AIC Model', side=3)
# AIC second specificity in same plot
par(new=TRUE)
plot(unlist(performance(pred.aic,'spec')@x.values),unlist(performance(pred.aic,'spec')@y.values), type='l', lwd=2,col='red', ylab = "", xlab = 'Cutoff')
axis(4,at=seq(0,1,0.2))
mtext('Specificity',side=4, col='red')
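One way to read a balanced cutoff off these curves is to find where the sensitivity and specificity curves intersect. A minimal sketch using the ROCR objects built above:
# Cutoff at which sensitivity and specificity are (approximately) equal
sens <- performance(pred.aic, 'sens')
spec <- performance(pred.aic, 'spec')
cuts <- unlist(sens@x.values)
cuts[which.min(abs(unlist(sens@y.values) - unlist(spec@y.values)))]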
BIC = ln(n)k - 2ln(L), where n = the sample size, k = the number of estimated parameters, and L = the maximized likelihood of the model.
# BIC Model Selection
glm.bic <- step(logit_model, scope = list(upper = logit_model),
direction = "both", test = "Chisq", trace = F, k = log(nrow(BBBC_train)))
summary(glm.bic)
##
## Call:
## glm(formula = Choice ~ Gender + Frequency + Last_purchase + P_Child +
## P_Youth + P_Cook + P_DIY + P_Art, family = "binomial", data = BBBC_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.00390 0.16493 0.024 0.981
## Gender1 -0.86499 0.13703 -6.312 2.75e-10 ***
## Frequency -0.08974 0.01064 -8.436 < 2e-16 ***
## Last_purchase 0.57397 0.07822 7.338 2.16e-13 ***
## P_Child -0.81070 0.11618 -6.978 3.00e-12 ***
## P_Youth -0.64318 0.14338 -4.486 7.27e-06 ***
## P_Cook -0.92741 0.11890 -7.800 6.19e-15 ***
## P_DIY -0.90981 0.14307 -6.359 2.03e-10 ***
## P_Art 0.67881 0.12538 5.414 6.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1398.9 on 1591 degrees of freedom
## AIC: 1416.9
##
## Number of Fisher Scoring iterations: 5
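Note that the summary still reports AIC; the BIC of this model can be computed from the formula above or with R's BIC(). A quick check, assuming glm.bic from the step above:
# BIC = ln(n)k - 2ln(L), recomputed by hand and compared with BIC()
n <- nrow(BBBC_train)
k <- attr(logLik(glm.bic), "df")
ll <- as.numeric(logLik(glm.bic))
log(n) * k - 2 * ll
BIC(glm.bic)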
# Predict probabilities using final BIC model
predicted_probs <- predict(glm.bic, newdata = BBBC_test, type = "response")
# Convert probabilities to predicted class labels
predicted_class <- ifelse(predicted_probs >= 0.5, "1", "0")
# Ensure both vectors are factors with same levels for confusionMatrix
predicted_class <- factor(predicted_class, levels = c("0", "1"))
actual_class <- factor(BBBC_test$Choice, levels = c("0", "1"))
# Confusion matrix
confusionMatrix(predicted_class, actual_class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1975 127
## 1 121 77
##
## Accuracy : 0.8922
## 95% CI : (0.8788, 0.9046)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9993
##
## Kappa : 0.324
##
## Mcnemar's Test P-Value : 0.7509
##
## Sensitivity : 0.37745
## Specificity : 0.94227
## Pos Pred Value : 0.38889
## Neg Pred Value : 0.93958
## Prevalence : 0.08870
## Detection Rate : 0.03348
## Detection Prevalence : 0.08609
## Balanced Accuracy : 0.65986
##
## 'Positive' Class : 1
##
# Ensure the predictions are a factor with the same "0"/"1" levels as the actual data
predclass_log_test <- factor(predicted_class, levels = c("0", "1"))
# Verify levels now match
levels(predicted_class)
## [1] "0" "1"
BBBC_test$Choice <- factor(
BBBC_test$Choice,
levels = c("0", "1") # Ensure levels match predictions
)
levels(BBBC_test$Choice)
## [1] "0" "1"
levels(predclass_log_test)
## [1] "0" "1"
# Predict probabilities on the TEST data
set.seed(1)
predprob_test <- predict(final_model, newdata = BBBC_test, type = "response")
# Convert probabilities to classes (0/1) using a lower 0.37 cutoff, trading some accuracy for higher sensitivity on the imbalanced data
predclass_log_test <- ifelse(predprob_test >= 0.37, "1", "0")
# Convert to factor with levels matching BBBC_test$Choice
predclass_log_test <- factor(
predclass_log_test,
levels = c("0", "1"),
labels = c("0", "1")
)
# Convert BBBC_test$Choice to factor (if not already)
BBBC_test$Choice <- factor(
BBBC_test$Choice,
levels = c("0", "1"),
labels = c("0", "1")
)
# Verify levels
levels(predclass_log_test) # Should be "0", "1"
## [1] "0" "1"
levels(BBBC_test$Choice) # Should be "0", "1"
## [1] "0" "1"
confusionMatrix(
predclass_log_test,
BBBC_test$Choice, # Now both have matching levels
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1830 94
## 1 266 110
##
## Accuracy : 0.8435
## 95% CI : (0.828, 0.8581)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2987
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.53922
## Specificity : 0.87309
## Pos Pred Value : 0.29255
## Neg Pred Value : 0.95114
## Prevalence : 0.08870
## Detection Rate : 0.04783
## Detection Prevalence : 0.16348
## Balanced Accuracy : 0.70615
##
## 'Positive' Class : 1
##
SVM finds the hyperplane that best separates classes in high-dimensional space. It’s effective in complex, non-linear problems and performs well when there is a clear margin of separation.
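In the linear soft-margin formulation tuned below, the cost parameter C controls the trade-off between a wide margin and training errors: the SVM minimizes (1/2)||w||^2 + C * sum(ξ_i) subject to y_i (w · x_i + b) >= 1 - ξ_i and ξ_i >= 0, so a small cost favors a wider margin while a large cost penalizes misclassified training points more heavily.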
# Scale numeric variables (factor columns such as Gender and Choice are left unchanged)
preProc <- preProcess(BBBC_train, method = c("center", "scale"))
BBBC_train_scaled <- predict(preProc, BBBC_train)
BBBC_test_scaled <- predict(preProc, BBBC_test)
# Tune SVM with cross-validation (linear kernel)
set.seed(1)
tune_svm <- tune(svm,
Choice ~ .,
data = BBBC_train_scaled,
kernel = "linear",
ranges = list(cost = c(0.01, 0.1, 1, 10)),
tunecontrol = tune.control(sampling = "cross", cross = 5)
)
best_svm <- tune_svm$best.model
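Before generating test predictions, the cross-validation results can be inspected to confirm which cost value was selected. A brief sketch, assuming tune_svm from above:
# Cross-validation error for each cost value and the chosen parameter
summary(tune_svm)
tune_svm$best.parameters
tune_svm$best.performance  # CV error of the selected model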
# Predictions on test data
svm_test_pred <- predict(best_svm, BBBC_test_scaled)
levels(svm_test_pred) # Check levels of predictions
## [1] "0" "1"
levels(BBBC_test_scaled$Choice)
## [1] "0" "1"
BBBC_test_scaled$Choice <- factor(
BBBC_test_scaled$Choice,
levels = c("0", "1") # Ensure levels match predictions
)
svm_test_pred <- predict(best_svm, BBBC_test_scaled)
svm_test_pred <- factor(
svm_test_pred,
levels = levels(BBBC_test_scaled$Choice) # Match reference levels
)
# Confusion matrix and metrics
confusionMatrix(svm_test_pred, BBBC_test_scaled$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2011 147
## 1 85 57
##
## Accuracy : 0.8991
## 95% CI : (0.8861, 0.9111)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9802
##
## Kappa : 0.2768
##
## Mcnemar's Test P-Value : 6.206e-05
##
## Sensitivity : 0.27941
## Specificity : 0.95945
## Pos Pred Value : 0.40141
## Neg Pred Value : 0.93188
## Prevalence : 0.08870
## Detection Rate : 0.02478
## Detection Prevalence : 0.06174
## Balanced Accuracy : 0.61943
##
## 'Positive' Class : 1
##
After evaluating all models:
- SVM yielded the best accuracy (89.9%) for predicting customer purchases.
- Logistic regression offered valuable insight into the influence of individual predictors, with an accuracy of 84.4% at the 0.37 cutoff.
Targeting the top 30% of customers predicted most likely to buy can result in significant cost savings and higher marketing ROI. We recommend using SVM for final prediction, supported by insights from logistic regression for strategic understanding.
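As a rough illustration of this targeting rule, customers can be ranked by predicted purchase probability and the mailing restricted to the top 30%. A sketch using the logistic probabilities computed above (the same idea applies to SVM scores if the model is refit with probability = TRUE):
# Rank test customers by predicted probability and keep roughly the top 30% for the mailing
cutoff_70 <- quantile(predprob_test, 0.70)      # probability threshold for the top 30%
target <- BBBC_test[predprob_test >= cutoff_70, ]
nrow(target) / nrow(BBBC_test)                  # roughly 0.30 of the test set
mean(target$Choice == "1")                      # purchase rate among the targeted customers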