Executive Summary

This case study applies predictive modeling to Bookbinders Book Club’s (BBBC) direct-mail campaign to predict whether customers will purchase The Art History of Florence. Using customer demographics, purchase history, and genre preferences, we evaluate the performance of several models. Key findings show that book genre interest, past purchase behavior, and recency of engagement strongly influence purchase likelihood.

Among the models applied (logistic regression, SVM, and linear regression), the SVM achieved the highest test accuracy at 89.9%. The logistic model provided interpretable coefficients, while linear regression was deemed inappropriate because its assumptions do not hold for a binary outcome. The analysis supports targeting customers more effectively with data-driven approaches to reduce marketing costs and improve ROI.

Problem

This is a binary classification problem. The dependent variable is Choice (purchase: yes/no), and the predictors include demographics, purchase behaviors, and interest indicators. BBBC aims to:

  • Determine which customers to target
  • Identify cost-effective marketing strategies
  • Understand which predictors most influence purchase behavior

Literature Review

Previous classification studies in marketing, such as telecom churn prediction, have used logistic regression, SVMs, and linear regression. Logistic and linear models estimate a decision boundary or predicted value directly, whereas SVMs maximize the margin between classes. Their strengths differ: logistic regression is interpretable, SVMs handle complex boundaries well, and linear regression is unsuitable for binary outcomes.

Methodology

We implemented and compared the following models:

  • Linear Regression: fit for comparison, but because the dependent variable is categorical its assumptions are violated and it performed poorly.
  • Logistic Regression: interpretable and well suited to binary outcomes, though sensitive to non-linearity. Stepwise selection was run under two criteria:
    • AIC, when the goal is predictive accuracy.
    • BIC, when aiming for simplicity and avoiding overfitting.
  • SVM: high accuracy and margin optimization, but limited interpretability.

Full Model Implementation

library(tidyverse)   # data wrangling and ggplot2 plotting
library(readxl)      # read the BBBC Excel files
library(car)         # regression diagnostics
library(ROCR)        # prediction()/performance() for cutoff curves
library(MASS)        # stepAIC(); note: masks dplyr::select()
library(ggplot2)
library(corrplot)    # correlation matrix plots
library(e1071)       # svm() and tune()
library(rpart.plot)
library(caret)       # preProcess() and confusionMatrix()

Data Overview

The data are split into a training set (1,600 observations) and a test set (2,300 observations), each with 12 variables. The Observation column, which simply indexes the rows, was removed; categorical variables were converted to factors, and numeric variables were scaled for the SVM. The predictors include:

  • Demographic: Gender
  • Purchase behavior: Amount_purchased, Frequency, Last_purchase, First_purchase
  • Interest levels: P_Child, P_Youth, P_Cook, P_DIY, P_Art

Several distributions are right-skewed, so encoding and scaling were applied accordingly. The response is imbalanced (buyers are a minority in both sets, especially the test set), which was taken into account during evaluation.

# Loads the training and testing datasets
BBBC_test <- read_excel('BBBC-Test.xlsx')
BBBC_train <- read_excel('BBBC-Train.xlsx')

# Check for imbalances in the variables 
summary(BBBC_test)
##   Observation         Choice           Gender       Amount_purchased
##  Min.   :   1.0   Min.   :0.0000   Min.   :0.0000   Min.   : 15.0   
##  1st Qu.: 575.8   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:119.0   
##  Median :1150.5   Median :0.0000   Median :1.0000   Median :198.0   
##  Mean   :1150.5   Mean   :0.0887   Mean   :0.6865   Mean   :195.3   
##  3rd Qu.:1725.2   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:268.0   
##  Max.   :2300.0   Max.   :1.0000   Max.   :1.0000   Max.   :461.0   
##    Frequency    Last_purchase    First_purchase     P_Child      
##  Min.   : 2.0   Min.   : 1.000   Min.   : 2.00   Min.   :0.0000  
##  1st Qu.: 8.0   1st Qu.: 1.000   1st Qu.:12.00   1st Qu.:0.0000  
##  Median :12.0   Median : 2.000   Median :18.00   Median :0.0000  
##  Mean   :13.3   Mean   : 3.059   Mean   :22.85   Mean   :0.7287  
##  3rd Qu.:16.0   3rd Qu.: 4.000   3rd Qu.:32.00   3rd Qu.:1.0000  
##  Max.   :36.0   Max.   :12.000   Max.   :96.00   Max.   :7.0000  
##     P_Youth           P_Cook           P_DIY            P_Art     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00  
##  Mean   :0.3426   Mean   :0.7857   Mean   :0.4061   Mean   :0.33  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.00  
##  Max.   :5.0000   Max.   :6.0000   Max.   :4.0000   Max.   :4.00
summary(BBBC_train)
##   Observation         Choice         Gender       Amount_purchased
##  Min.   :   1.0   Min.   :0.00   Min.   :0.0000   Min.   : 15.0   
##  1st Qu.: 400.8   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:126.8   
##  Median : 800.5   Median :0.00   Median :1.0000   Median :203.0   
##  Mean   : 800.5   Mean   :0.25   Mean   :0.6587   Mean   :200.9   
##  3rd Qu.:1200.2   3rd Qu.:0.25   3rd Qu.:1.0000   3rd Qu.:273.0   
##  Max.   :1600.0   Max.   :1.00   Max.   :1.0000   Max.   :474.0   
##    Frequency     Last_purchase    First_purchase     P_Child      
##  Min.   : 2.00   Min.   : 1.000   Min.   : 2.00   Min.   :0.0000  
##  1st Qu.: 6.00   1st Qu.: 1.000   1st Qu.:12.00   1st Qu.:0.0000  
##  Median :12.00   Median : 2.000   Median :18.00   Median :0.0000  
##  Mean   :12.31   Mean   : 3.199   Mean   :22.58   Mean   :0.7394  
##  3rd Qu.:16.00   3rd Qu.: 4.000   3rd Qu.:30.00   3rd Qu.:1.0000  
##  Max.   :36.00   Max.   :12.000   Max.   :96.00   Max.   :8.0000  
##     P_Youth           P_Cook         P_DIY            P_Art      
##  Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :0.00   Median :0.0000   Median :0.000  
##  Mean   :0.3375   Mean   :0.76   Mean   :0.3912   Mean   :0.425  
##  3rd Qu.:1.0000   3rd Qu.:1.00   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :4.0000   Max.   :6.00   Max.   :4.0000   Max.   :5.000
# Remove the observation variable as it just counts the number of rows
BBBC_test = dplyr::select(BBBC_test, -c(Observation))
BBBC_train = dplyr::select(BBBC_train, -c(Observation))

# Turns Gender into a factor variable
BBBC_train$Gender = as.factor(BBBC_train$Gender)
BBBC_test$Gender = as.factor(BBBC_test$Gender)
str(BBBC_train)
## tibble [1,600 Ă— 11] (S3: tbl_df/tbl/data.frame)
##  $ Choice          : num [1:1600] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 2 2 ...
##  $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
##  $ Frequency       : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
##  $ Last_purchase   : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
##  $ First_purchase  : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
##  $ P_Child         : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
##  $ P_Youth         : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
##  $ P_Cook          : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
##  $ P_DIY           : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
##  $ P_Art           : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...
# Check for missing values in the training dataset
sum(is.na(BBBC_train))
## [1] 0
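
As a quick check of the skewness and class imbalance noted in the Data Overview, a short sketch using the datasets loaded above (prop.table for the class split, e1071::skewness for the numeric predictors):

prop.table(table(BBBC_train$Choice))   # roughly 25% buyers in the training set
prop.table(table(BBBC_test$Choice))    # roughly 9% buyers in the test set

# Skewness of the numeric predictors; values well above 0 indicate right skew
sapply(dplyr::select(BBBC_train, -Choice, -Gender), e1071::skewness)
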
# Histogram of Amount_purchased by Choice
ggplot(BBBC_train, aes(x = Amount_purchased, fill = factor(Choice))) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
  labs(title = "Distribution of Amount Purchased by Choice", fill = "Choice")

# Boxplot of Last_purchase by Choice
ggplot(BBBC_train, aes(x = factor(Choice), y = Last_purchase, fill = factor(Choice))) +
  geom_boxplot() +
  labs(title = "Months Since Last Purchase by Choice", x = "Choice", fill = "Choice")

# Bar plot for Gender
ggplot(BBBC_train, aes(x = Gender, fill = factor(Choice))) +
  geom_bar(position = "dodge") +
  labs(title = "Gender Distribution by Choice", fill = "Choice")

# Create a correlation plot to check for multicollinearity
BBBC_train_num = select_if(BBBC_train, is.numeric)
M = cor(BBBC_train_num)
corrplot(M, method = "number")

# Fit linear regression model
linear_model <- lm(Choice ~ ., data = BBBC_train)

# Summarize the model
summary(linear_model)
## 
## Call:
## lm(formula = Choice ~ ., data = BBBC_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9603 -0.2462 -0.1161  0.1622  1.0588 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.3642284  0.0307411  11.848  < 2e-16 ***
## Gender1          -0.1309205  0.0200303  -6.536 8.48e-11 ***
## Amount_purchased  0.0002736  0.0001110   2.464   0.0138 *  
## Frequency        -0.0090868  0.0021791  -4.170 3.21e-05 ***
## Last_purchase     0.0970286  0.0135589   7.156 1.26e-12 ***
## First_purchase   -0.0020024  0.0018160  -1.103   0.2704    
## P_Child          -0.1262584  0.0164011  -7.698 2.41e-14 ***
## P_Youth          -0.0963563  0.0201097  -4.792 1.81e-06 ***
## P_Cook           -0.1414907  0.0166064  -8.520  < 2e-16 ***
## P_DIY            -0.1352313  0.0197873  -6.834 1.17e-11 ***
## P_Art             0.1178494  0.0194427   6.061 1.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared:  0.2401, Adjusted R-squared:  0.2353 
## F-statistic:  50.2 on 10 and 1589 DF,  p-value: < 2.2e-16
# Stepwise selection to remove irrelevant variables
step_model <- stepAIC(linear_model, direction = "both")
## Start:  AIC=-3095.59
## Choice ~ Gender + Amount_purchased + Frequency + Last_purchase + 
##     First_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art
## 
##                    Df Sum of Sq    RSS     AIC
## - First_purchase    1    0.1744 228.16 -3096.4
## <none>                          227.98 -3095.6
## - Amount_purchased  1    0.8711 228.85 -3091.5
## - Frequency         1    2.4949 230.48 -3080.2
## - P_Youth           1    3.2940 231.28 -3074.6
## - P_Art             1    5.2713 233.25 -3061.0
## - Gender            1    6.1294 234.11 -3055.1
## - P_DIY             1    6.7013 234.68 -3051.2
## - Last_purchase     1    7.3473 235.33 -3046.8
## - P_Child           1    8.5026 236.49 -3039.0
## - P_Cook            1   10.4155 238.40 -3026.1
## 
## Step:  AIC=-3096.36
## Choice ~ Gender + Amount_purchased + Frequency + Last_purchase + 
##     P_Child + P_Youth + P_Cook + P_DIY + P_Art
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          228.16 -3096.4
## + First_purchase    1    0.1744 227.98 -3095.6
## - Amount_purchased  1    0.8752 229.03 -3092.2
## - P_Youth           1    3.3702 231.53 -3074.9
## - P_Art             1    5.1098 233.27 -3062.9
## - Gender            1    6.2042 234.36 -3055.4
## - P_DIY             1    6.8588 235.01 -3051.0
## - Last_purchase     1    8.4161 236.57 -3040.4
## - P_Child           1    8.7321 236.89 -3038.3
## - P_Cook            1   10.8023 238.96 -3024.3
## - Frequency         1   11.9827 240.14 -3016.5
summary(step_model)
## 
## Call:
## lm(formula = Choice ~ Gender + Amount_purchased + Frequency + 
##     Last_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art, 
##     data = BBBC_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9802 -0.2452 -0.1157  0.1655  1.0595 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.3727367  0.0297590  12.525  < 2e-16 ***
## Gender1          -0.1316464  0.0200208  -6.575 6.56e-11 ***
## Amount_purchased  0.0002742  0.0001110   2.470   0.0136 *  
## Frequency        -0.0110830  0.0012128  -9.138  < 2e-16 ***
## Last_purchase     0.0894288  0.0116772   7.658 3.25e-14 ***
## P_Child          -0.1275991  0.0163571  -7.801 1.11e-14 ***
## P_Youth          -0.0973642  0.0200903  -4.846 1.38e-06 ***
## P_Cook           -0.1433497  0.0165218  -8.676  < 2e-16 ***
## P_DIY            -0.1365578  0.0197520  -6.914 6.82e-12 ***
## P_Art             0.1150034  0.0192719   5.967 2.97e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3788 on 1590 degrees of freedom
## Multiple R-squared:  0.2395, Adjusted R-squared:  0.2352 
## F-statistic: 55.63 on 9 and 1590 DF,  p-value: < 2.2e-16
# Generate predictions (probabilities between 0 and 1)
predicted_probs <- predict(step_model, newdata = BBBC_test)

# Convert probabilities to binary predictions (0/1) using a threshold (e.g., 0.5)
predicted_class <- ifelse(predicted_probs >= 0.5, 1, 0)

# Compare with actual outcomes
confusionMatrix(as.factor(predicted_class), as.factor(BBBC_test$Choice), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2007  144
##          1   89   60
##                                           
##                Accuracy : 0.8987          
##                  95% CI : (0.8856, 0.9107)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9833734       
##                                           
##                   Kappa : 0.2865          
##                                           
##  Mcnemar's Test P-Value : 0.0004037       
##                                           
##             Sensitivity : 0.29412         
##             Specificity : 0.95754         
##          Pos Pred Value : 0.40268         
##          Neg Pred Value : 0.93305         
##              Prevalence : 0.08870         
##          Detection Rate : 0.02609         
##    Detection Prevalence : 0.06478         
##       Balanced Accuracy : 0.62583         
##                                           
##        'Positive' Class : 1               
## 

Logistic Regression

Logistic regression is used for binary classification. It estimates the probability that a given input belongs to a particular class. It is well suited when the relationship between the features and the log-odds of the outcome is linear, and it provides interpretable coefficients.
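
A minimal sketch of how a fitted logistic model is read. The coefficient values below are illustrative only (chosen to be close in magnitude to the stepwise model fitted later in this section), not the actual estimates:

# Illustrative coefficients, not the fitted BBBC estimates
b0    <- -0.28   # intercept
b_art <-  0.66   # effect of each prior art-book purchase (P_Art)

log_odds <- b0 + b_art * 2        # a customer with two prior art purchases
1 / (1 + exp(-log_odds))          # inverse logit: predicted purchase probability
exp(b_art)                        # odds ratio: odds of buying multiply by ~1.9 per art book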

# Change the Choice variable into a factor for the logistic model
BBBC_train$Choice = as.factor(BBBC_train$Choice)

# Fit full logistic regression model
logit_model <- glm(Choice ~ ., data = BBBC_train, family = "binomial")
summary(logit_model)
## 
## Call:
## glm(formula = Choice ~ ., family = "binomial", data = BBBC_train)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.3515281  0.2143839  -1.640   0.1011    
## Gender1          -0.8632319  0.1374499  -6.280 3.38e-10 ***
## Amount_purchased  0.0018641  0.0007918   2.354   0.0186 *  
## Frequency        -0.0755142  0.0165937  -4.551 5.35e-06 ***
## Last_purchase     0.6117713  0.0938127   6.521 6.97e-11 ***
## First_purchase   -0.0147792  0.0128027  -1.154   0.2483    
## P_Child          -0.8112489  0.1167067  -6.951 3.62e-12 ***
## P_Youth          -0.6370422  0.1433778  -4.443 8.87e-06 ***
## P_Cook           -0.9230066  0.1194814  -7.725 1.12e-14 ***
## P_DIY            -0.9058697  0.1437025  -6.304 2.90e-10 ***
## P_Art             0.6861124  0.1270176   5.402 6.60e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1799.5  on 1599  degrees of freedom
## Residual deviance: 1392.2  on 1589  degrees of freedom
## AIC: 1414.2
## 
## Number of Fisher Scoring iterations: 5
# Perform stepwise variable selection (AIC)
set.seed(1)
final_model <- stepAIC(logit_model, direction = "both", trace = FALSE)
summary(final_model)
## 
## Call:
## glm(formula = Choice ~ Gender + Amount_purchased + Frequency + 
##     Last_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art, 
##     family = "binomial", data = BBBC_train)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.2833949  0.2062721  -1.374   0.1695    
## Gender1          -0.8660575  0.1373268  -6.307 2.85e-10 ***
## Amount_purchased  0.0018357  0.0007908   2.321   0.0203 *  
## Frequency        -0.0903261  0.0106304  -8.497  < 2e-16 ***
## Last_purchase     0.5536689  0.0784519   7.057 1.70e-12 ***
## P_Child          -0.8181807  0.1163377  -7.033 2.02e-12 ***
## P_Youth          -0.6424923  0.1432548  -4.485 7.29e-06 ***
## P_Cook           -0.9330131  0.1190073  -7.840 4.51e-15 ***
## P_DIY            -0.9101106  0.1433591  -6.348 2.17e-10 ***
## P_Art             0.6643371  0.1255243   5.292 1.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1799.5  on 1599  degrees of freedom
## Residual deviance: 1393.5  on 1590  degrees of freedom
## AIC: 1413.5
## 
## Number of Fisher Scoring iterations: 5

AIC (Akaike Information Criterion)

  • Purpose: Helps identify the model that best balances goodness-of-fit and simplicity.
  • Formula: AIC = 2k - 2ln(L), where:
    • k = number of parameters
    • L = maximum likelihood of the model
  • Interpretation: A lower AIC value indicates a better model (when comparing multiple models).
  • Tendency: AIC slightly favors more complex models if they offer a better fit.
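As a quick sanity check, AIC can be recomputed by hand from the log-likelihood of the full logistic model fitted above and compared with R's built-in AIC():

k <- length(coef(logit_model))                     # number of estimated parameters
manual_aic <- 2 * k - 2 * as.numeric(logLik(logit_model))
c(manual = manual_aic, builtin = AIC(logit_model))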
# AIC model selection via stepwise search
glm.aic <- step(logit_model, scope = list(upper = logit_model),
                direction = "both", test = "Chisq", trace = F)
summary(glm.aic)
## 
## Call:
## glm(formula = Choice ~ Gender + Amount_purchased + Frequency + 
##     Last_purchase + P_Child + P_Youth + P_Cook + P_DIY + P_Art, 
##     family = "binomial", data = BBBC_train)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.2833949  0.2062721  -1.374   0.1695    
## Gender1          -0.8660575  0.1373268  -6.307 2.85e-10 ***
## Amount_purchased  0.0018357  0.0007908   2.321   0.0203 *  
## Frequency        -0.0903261  0.0106304  -8.497  < 2e-16 ***
## Last_purchase     0.5536689  0.0784519   7.057 1.70e-12 ***
## P_Child          -0.8181807  0.1163377  -7.033 2.02e-12 ***
## P_Youth          -0.6424923  0.1432548  -4.485 7.29e-06 ***
## P_Cook           -0.9330131  0.1190073  -7.840 4.51e-15 ***
## P_DIY            -0.9101106  0.1433591  -6.348 2.17e-10 ***
## P_Art             0.6643371  0.1255243   5.292 1.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1799.5  on 1599  degrees of freedom
## Residual deviance: 1393.5  on 1590  degrees of freedom
## AIC: 1413.5
## 
## Number of Fisher Scoring iterations: 5
pred.aic = prediction(predict(glm.aic,BBBC_train,type='response'),BBBC_train$Choice)

# Sensitivity and specificity versus cutoff for the AIC model
plot(unlist(performance(pred.aic,'sens')@x.values),unlist(performance(pred.aic,'sens')@y.values), type='l', lwd=2, ylab = "", xlab = 'Cutoff')
mtext('Sensitivity',side=2)
mtext('Sensitivity vs. Specificity Plot for AIC Model', side=3)

# Overlay specificity on the same plot (right-hand axis)
par(new=TRUE)
plot(unlist(performance(pred.aic,'spec')@x.values),unlist(performance(pred.aic,'spec')@y.values), type='l', lwd=2,col='red', ylab = "", xlab = 'Cutoff')
axis(4,at=seq(0,1,0.2)) 
mtext('Specificity',side=4, col='red')

BIC (Bayesian Information Criterion)

  • Purpose: Similar to AIC but imposes a stricter penalty for model complexity.
  • Formula: BIC = ln(n)k - 2ln(L), where:
    • n = sample size
    • k = number of parameters
    • L = maximum likelihood
  • Interpretation: A lower BIC value suggests a more parsimonious and likely better model.
  • Tendency: BIC prefers simpler models, especially with larger sample sizes.
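A matching hand calculation for BIC on the same fitted model; note how log(n) replaces the constant 2 in the complexity penalty, so larger samples penalize extra terms more heavily:

n <- nrow(BBBC_train)
k <- length(coef(logit_model))
manual_bic <- log(n) * k - 2 * as.numeric(logLik(logit_model))
c(manual = manual_bic, builtin = BIC(logit_model))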
# BIC model selection (step() with k = log(n) applies the heavier BIC penalty)
glm.bic <- step(logit_model, scope = list(upper = logit_model), 
                direction = "both", test = "Chisq", trace = F, k = log(nrow(BBBC_train))) 
summary(glm.bic)
## 
## Call:
## glm(formula = Choice ~ Gender + Frequency + Last_purchase + P_Child + 
##     P_Youth + P_Cook + P_DIY + P_Art, family = "binomial", data = BBBC_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.00390    0.16493   0.024    0.981    
## Gender1       -0.86499    0.13703  -6.312 2.75e-10 ***
## Frequency     -0.08974    0.01064  -8.436  < 2e-16 ***
## Last_purchase  0.57397    0.07822   7.338 2.16e-13 ***
## P_Child       -0.81070    0.11618  -6.978 3.00e-12 ***
## P_Youth       -0.64318    0.14338  -4.486 7.27e-06 ***
## P_Cook        -0.92741    0.11890  -7.800 6.19e-15 ***
## P_DIY         -0.90981    0.14307  -6.359 2.03e-10 ***
## P_Art          0.67881    0.12538   5.414 6.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1799.5  on 1599  degrees of freedom
## Residual deviance: 1398.9  on 1591  degrees of freedom
## AIC: 1416.9
## 
## Number of Fisher Scoring iterations: 5
# Predict probabilities using final BIC model
predicted_probs <- predict(glm.bic, newdata = BBBC_test, type = "response")

# Convert probabilities to predicted class labels
predicted_class <- ifelse(predicted_probs >= 0.5, "1", "0")

# Ensure both vectors are factors with same levels for confusionMatrix
predicted_class <- factor(predicted_class, levels = c("0", "1"))
actual_class <- factor(BBBC_test$Choice, levels = c("0", "1"))

# Confusion matrix
confusionMatrix(predicted_class, actual_class, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1975  127
##          1  121   77
##                                           
##                Accuracy : 0.8922          
##                  95% CI : (0.8788, 0.9046)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9993          
##                                           
##                   Kappa : 0.324           
##                                           
##  Mcnemar's Test P-Value : 0.7509          
##                                           
##             Sensitivity : 0.37745         
##             Specificity : 0.94227         
##          Pos Pred Value : 0.38889         
##          Neg Pred Value : 0.93958         
##              Prevalence : 0.08870         
##          Detection Rate : 0.03348         
##    Detection Prevalence : 0.08609         
##       Balanced Accuracy : 0.65986         
##                                           
##        'Positive' Class : 1               
## 
# The BIC-model predictions already use the "0"/"1" coding of the actual data, so no
# relabelling is needed; just confirm the prediction and reference factors share levels
levels(predicted_class)
## [1] "0" "1"
BBBC_test$Choice <- factor(
  BBBC_test$Choice, 
  levels = c("0", "1")  # Ensure levels match predictions
)
levels(BBBC_test$Choice)
## [1] "0" "1"
# Predict probabilities on the TEST data
set.seed(1)
predprob_test <- predict(final_model, newdata = BBBC_test, type = "response")

# Convert probabilities to classes (0/1); a 0.37 cutoff (rather than 0.50) trades some
# specificity for higher sensitivity, given how few buyers there are in the test set
predclass_log_test <- ifelse(predprob_test >= 0.37, "1", "0")

# Convert to factor with levels matching BBBC_test$Choice
predclass_log_test <- factor(
  predclass_log_test, 
  levels = c("0", "1"), 
  labels = c("0", "1")
)
# Convert BBBC_test$Choice to factor (if not already)
BBBC_test$Choice <- factor(
  BBBC_test$Choice, 
  levels = c("0", "1"), 
  labels = c("0", "1")
)

# Verify levels
levels(predclass_log_test) # Should be "0", "1"
## [1] "0" "1"
levels(BBBC_test$Choice)   # Should be "0", "1"
## [1] "0" "1"
confusionMatrix(
  predclass_log_test, 
  BBBC_test$Choice,   # Now both have matching levels
  positive = "1"
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1830   94
##          1  266  110
##                                          
##                Accuracy : 0.8435         
##                  95% CI : (0.828, 0.8581)
##     No Information Rate : 0.9113         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.2987         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.53922        
##             Specificity : 0.87309        
##          Pos Pred Value : 0.29255        
##          Neg Pred Value : 0.95114        
##              Prevalence : 0.08870        
##          Detection Rate : 0.04783        
##    Detection Prevalence : 0.16348        
##       Balanced Accuracy : 0.70615        
##                                          
##        'Positive' Class : 1              
## 

Support Vector Machine (SVM)

SVM finds the hyperplane that best separates classes in high-dimensional space. It’s effective in complex, non-linear problems and performs well when there is a clear margin of separation.
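
A conceptual sketch on toy data (not the BBBC set) showing what margin maximization produces: for a linear kernel, the separating hyperplane w'x + b = 0 can be recovered from the fitted support vectors (using e1071, already loaded above):

set.seed(1)
toy <- data.frame(x1 = c(rnorm(20, 0), rnorm(20, 3)),
                  x2 = c(rnorm(20, 0), rnorm(20, 3)),
                  y  = factor(rep(c("0", "1"), each = 20)))
toy_svm <- svm(y ~ x1 + x2, data = toy, kernel = "linear", cost = 1, scale = FALSE)
w <- t(toy_svm$coefs) %*% as.matrix(toy_svm$SV)   # hyperplane weight vector
b <- -toy_svm$rho                                 # intercept, so the boundary is w'x + b = 0
c(w1 = w[1], w2 = w[2], b = b)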

# Center and scale the numeric variables (factor columns such as Gender and Choice are left unchanged)
preProc <- preProcess(BBBC_train, method = c("center", "scale"))
BBBC_train_scaled <- predict(preProc, BBBC_train)
BBBC_test_scaled <- predict(preProc, BBBC_test)

# Tune SVM with cross-validation (linear kernel)
set.seed(1)
tune_svm <- tune(svm, 
                 Choice ~ ., 
                 data = BBBC_train_scaled, 
                 kernel = "linear",
                 ranges = list(cost = c(0.01, 0.1, 1, 10)),
                 tunecontrol = tune.control(sampling = "cross", cross = 5)
)
best_svm <- tune_svm$best.model

# Predictions on test data
svm_test_pred <- predict(best_svm, BBBC_test_scaled)

levels(svm_test_pred)          # Check levels of predictions
## [1] "0" "1"
levels(BBBC_test_scaled$Choice)
## [1] "0" "1"
BBBC_test_scaled$Choice <- factor(
  BBBC_test_scaled$Choice, 
  levels = c("0", "1")  # Ensure levels match predictions
)

svm_test_pred <- predict(best_svm, BBBC_test_scaled)
svm_test_pred <- factor(
  svm_test_pred, 
  levels = levels(BBBC_test_scaled$Choice)  # Match reference levels
)

# Confusion matrix and metrics
confusionMatrix(svm_test_pred, BBBC_test_scaled$Choice, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2011  147
##          1   85   57
##                                           
##                Accuracy : 0.8991          
##                  95% CI : (0.8861, 0.9111)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9802          
##                                           
##                   Kappa : 0.2768          
##                                           
##  Mcnemar's Test P-Value : 6.206e-05       
##                                           
##             Sensitivity : 0.27941         
##             Specificity : 0.95945         
##          Pos Pred Value : 0.40141         
##          Neg Pred Value : 0.93188         
##              Prevalence : 0.08870         
##          Detection Rate : 0.02478         
##    Detection Prevalence : 0.06174         
##       Balanced Accuracy : 0.61943         
##                                           
##        'Positive' Class : 1               
## 

Conclusion

After evaluating all models:

  • SVM yielded the highest test accuracy (89.9%) for predicting customer purchases.
  • Logistic regression (with the 0.37 cutoff) identified a much larger share of actual buyers (sensitivity of 53.9%) at an accuracy of 84.4%, while providing interpretable coefficients.

Targeting the top 30% of customers predicted most likely to buy can deliver significant cost savings and higher marketing ROI (see the sketch below). We recommend using the SVM for final prediction, supported by insights from the logistic regression coefficients for strategic understanding.
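
A minimal sketch of that targeting rule, assuming the logistic probabilities predprob_test computed earlier; the 30% mailing depth matches the figure above, and BBBC's actual mailing cost and per-book margin would be needed to turn this into a full ROI estimate:

# Rank test customers by predicted purchase probability and keep the top 30%
ranked <- order(predprob_test, decreasing = TRUE)
top30  <- ranked[seq_len(ceiling(0.30 * length(ranked)))]

# Share of all actual buyers captured when mailing only this top 30% of the list
sum(BBBC_test$Choice[top30] == "1") / sum(BBBC_test$Choice == "1")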