I. Executive Summary

To get the most out of marketing campaigns and ensure that the customers most likely to buy are being targeted, businesses are using analytics to identify potential customers. A targeted marketing campaign minimizes the spending wasted on mailings sent to customers who are unlikely to purchase the product being marketed.

This case study will focus on finding an optimal model for a targeted mailing campaign by the Bookbinders Book Club (BBBC). We will use predictive modeling approaches to assess the efficacy of a BBBC direct mailing campaign, working with a subset of the BBBC database consisting of 400 customers who purchased the book and 1,200 who did not. The dependent variable for our models will be Choice, representing whether or not the customer purchased the book.

A logistic regression model resulted in the highest accuracy and was best able to predict whether or not a customer on the mailing list would purchase the book. The independent variables in the model are Amount_purchased, Frequency, First_Purchase, P_Child, P_Youth, P_Cook, P_DIY, and Gender.

II. The Problem

a. Introduction/Background

Bookbinders Book Club was established in 1986 as a book distributor and kept a database of its 500,000 readers for marketing purposes. This case study explores part of that data using three different modeling methods to identify the model that will bring BBBC the most profit.

b. Purpose of study/importance of study/statement of problem

The three models considered in this case are logistic regression, linear discriminant analysis (LDA), and support vector machines (SVM). The advantages and disadvantages of each method will be analyzed and compared to find the best model to help BBBC minimize its spending on marketing campaigns while gaining as much profit as possible.

c. Questions to be answered/conceptual statement of hypotheses.

  • Which model provides the best prediction for a Purchase response?
  • Can we use a linear regression model?
  • Does BBBC make more profit when using a classification model? If so, which model?
  • How much profit does BBBC make?
  • Was the hypothesis accurate?

d. Outline of remainder of report

As previously stated, this case is about finding the most cost-efficient predictive model to maximize profit from direct marketing campaigns. It will show that using statistical analysis within a company can reduce wasted spending while helping profits grow each year.

IV. Methodology

a. Identification, classification, and operationalization of variables.

Independent variables:

  • Amount_purchased (numeric): total money spent on BBBC books
  • Frequency (numeric): total number of purchases in the chosen period
  • First_purchase (numeric): months since first purchase
  • Last_purchase (numeric): months since last purchase
  • P_Child (numeric): number of children’s books purchased
  • P_Youth (numeric): number of youth books purchased
  • P_Cook (numeric): number of cookbooks purchased
  • P_DIY (numeric): number of do-it-yourself books purchased
  • P_Art (numeric): number of art books purchased
  • Gender (numeric): coded 0 and 1; we will convert it into a categorical variable with levels Female and Male later

Dependent variable:

  • Choice (numeric): coded 0 and 1; we will convert it into a categorical variable with levels Purchase and Nonpurchase later
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following objects are masked from 'package:openintro':
## 
##     housing, mammals
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.92 loaded
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:openintro':
## 
##     densityPlot
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(readr)
library(e1071)
library(readxl)


### Read in data
CS2Test <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Test.xlsx")
str(CS2Test)
## tibble [2,300 × 12] (S3: tbl_df/tbl/data.frame)
##  $ Observation     : num [1:2300] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Choice          : num [1:2300] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : num [1:2300] 0 1 1 0 1 0 1 1 1 1 ...
##  $ Amount_purchased: num [1:2300] 287 215 261 24 120 66 42 233 66 199 ...
##  $ Frequency       : num [1:2300] 12 4 2 4 8 2 12 8 12 22 ...
##  $ Last_purchase   : num [1:2300] 4 1 1 1 1 4 1 2 1 1 ...
##  $ First_purchase  : num [1:2300] 24 4 2 4 8 16 12 12 12 22 ...
##  $ P_Child         : num [1:2300] 0 0 0 1 0 0 0 0 0 0 ...
##  $ P_Youth         : num [1:2300] 3 0 0 0 0 0 0 0 0 0 ...
##  $ P_Cook          : num [1:2300] 0 0 0 0 0 1 1 0 0 0 ...
##  $ P_DIY           : num [1:2300] 0 0 0 0 0 1 0 0 0 0 ...
##  $ P_Art           : num [1:2300] 1 1 1 0 1 1 0 0 0 1 ...
CS2Train <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Train.xlsx")
str(CS2Train)
## tibble [1,600 × 12] (S3: tbl_df/tbl/data.frame)
##  $ Observation     : num [1:1600] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Choice          : num [1:1600] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : num [1:1600] 1 1 1 1 0 1 1 0 1 1 ...
##  $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
##  $ Frequency       : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
##  $ Last_purchase   : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
##  $ First_purchase  : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
##  $ P_Child         : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
##  $ P_Youth         : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
##  $ P_Cook          : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
##  $ P_DIY           : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
##  $ P_Art           : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...

b. Statements of hypotheses being tested and/or models being developed.

The null hypothesis is that all coefficients in the model are equal to zero, meaning that none of the predictor variables has a statistically significant relationship with the response variable, Choice. The alternative hypothesis is that at least one coefficient is not equal to zero, i.e., some predictor variables have a significant relationship with Choice.
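Once the logistic regression model is fit (m1.log in the sections below), this overall hypothesis can be checked with a likelihood-ratio test that compares the null and residual deviances reported by summary(); a minimal sketch, assuming the fitted m1.log object:

# Sketch: overall likelihood-ratio test of H0 that all coefficients are zero,
# using the null and residual deviances of the fitted model m1.log
# (for the full training data these are 1799.5 and 1536.6 on 8 degrees of freedom)
lr_stat <- m1.log$null.deviance - m1.log$deviance
pchisq(lr_stat, df = m1.log$df.null - m1.log$df.residual, lower.tail = FALSE)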

c. Sampling techniques, if full data is not being used.

At first, we build the models on the CS2Train data set and test them on the CS2Test data set. However, because the number of Nonpurchase observations is much larger than the number of Purchase observations, our model has a modest accuracy of 66%, with a sensitivity of 75% and a specificity of 65.12%.

Therefore, we decide to create a balanced subset of the training data (df_new in the code below) with equal numbers of Purchase and Nonpurchase observations. From the balanced data we use an 80/20 split to create new training and testing datasets.

d. Data collection process, including data sources, data size, etc. Primary/secondary?

The data was taken from a recent BBBC mailing campaign sent to 20,000 customers in Pennsylvania, New York, and Ohio. From that dataset, a subset was created consisting of 400 customers who purchased the book and 1200 who did not.

e. Modeling analysis/techniques used

We develop logistic regression, linear discriminant analysis, and support vector machine models to test which variables influence whether a customer buys the book. After reviewing the accuracy, sensitivity, and specificity of each model, we can conclude which one fits the data best.

f. Methodological assumptions and limitations.

Logistic regression has the following assumptions:

  • The dependent variable must be categorical and binary, which it is in this case
  • The observations must be independent of one another
  • Multicollinearity should be investigated and eliminated; independent variables should not be correlated with one another
  • There must be a large sample size

LDA has these assumptions:

  • Sample measurements are independent of each other
  • The distributions are normal
  • The covariance of the measurements is identical across the different classes

SVM model assumptions:

  • The margin should be as large as possible
  • The data are independent and identically distributed
  • Normality of the data is not assumed
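As a rough illustration (not a formal test of every assumption), the sketch below screens the numeric predictors for multicollinearity and checks per-class normality of one predictor, using the CS2Train data loaded above; the choice of Amount_purchased is just an example:

# Quick assumption checks on the training data (illustrative only)
num_preds <- dplyr::select(CS2Train, Amount_purchased, Frequency, Last_purchase,
                           First_purchase, P_Child, P_Youth, P_Cook, P_DIY, P_Art)
round(cor(num_preds), 2)                                       # multicollinearity screen
by(CS2Train$Amount_purchased, CS2Train$Choice, shapiro.test)   # per-class normality (LDA)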

g. Please also summarize the advantages and disadvantages of the three models, as you experienced in the modeling exercise. Should the company develop expertise in either (or all) of these methods to develop in-house capability to evaluate its direct mail campaigns. (extra questions from this case study)

The main advantage of the logistic regression model in this case study is that the cut point can be changed to increase or decrease overall model accuracy, sensitivity, or specificity as needed depending on which measure is most beneficial in analyzing the data. This allowed us to find a nice balance between the three measures. The disadvantage of the LDA model and SVM model in this case was that they were less accurate than the logistic regression even before we changed the cut point. The flexibility of the logistic regression model made it more appropriate for analyzing this dataset.
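As a rough illustration of this flexibility, the sketch below scans a few candidate cut points and reports the resulting accuracy, sensitivity, and specificity; it assumes the fitted logistic model m1.log and the balanced test set CS2Test_new created later in the report:

# Sketch: compare several cut points for the logistic model (assumes m1.log and CS2Test_new)
probs <- predict(m1.log, newdata = CS2Test_new, type = "response")
for (cut in c(0.35, 0.48, 0.60)) {
  pred <- factor(ifelse(probs >= cut, "Purchase", "Nonpurchase"),
                 levels = c("Nonpurchase", "Purchase"))
  cm <- caret::confusionMatrix(pred, CS2Test_new$Choice_new, positive = "Purchase")
  print(round(c(cutoff = cut, cm$overall["Accuracy"],
                cm$byClass[c("Sensitivity", "Specificity")]), 4))
}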

BBBC should develop expertise in multiple different methods. For this specific dataset, it is important to have some knowledge of logistic regression to analyze and understand the data. Future datasets will not be identical to this one, so it would also be beneficial to understand other methods to help analyze and achieve the best results possible for the company.

V. Data

a. Data cleaning

Using the str() function, we see that two variables (Gender and Choice) are stored as numeric even though they are categorical. We therefore create two new columns:

  • Gender_new that is a categorical variable with 2 factors: Male and Female
  • Choice_new that is also a categorical variable with 2 factors: Purchase and Nonpurchase
# Create new column for gender and make it a factor 

CS2Train$Gender_new <- ifelse(CS2Train$Gender == 0, "Female", "Male") 
CS2Train$Gender_new <- as.factor(CS2Train$Gender_new) 

CS2Test$Gender_new <- ifelse(CS2Test$Gender == 0, "Female", "Male") 
CS2Test$Gender_new <- as.factor(CS2Test$Gender_new)

# Create new column for choice and make it a factor 

CS2Train$Choice_new <- ifelse(CS2Train$Choice == 0, "Nonpurchase", "Purchase") 
CS2Train$Choice_new <- as.factor(CS2Train$Choice_new) 

CS2Test$Choice_new <- ifelse(CS2Test$Choice == 0, "Nonpurchase", "Purchase") 
CS2Test$Choice_new <- as.factor(CS2Test$Choice_new) 

Data distribution: By creating a histogram of each numeric variable, we can see that only Amount_purchased has a roughly normal distribution while the other variables are right-skewed, which could drastically affect the results of a linear regression model, if we decided to use one.

CS2Train$Last_purchase <- as.numeric(CS2Train$Last_purchase)
par(mfrow=c(3,3))
hist(CS2Train$Amount_purchased, xlab = "Amount Purchased")
hist(CS2Train$Frequency, xlab = "Total Num. Purchases in Study")
hist(CS2Train$Last_purchase, xlab = "Months Since Last Purchase")
hist(CS2Train$First_purchase, xlab = "Months Since First Purchase")
hist(CS2Train$P_Child, xlab = "No. Children's Books Purchased")
hist(CS2Train$P_Youth, xlab = "No. Youth Books Purchased")
hist(CS2Train$P_Cook, xlab = "No. Cook Books Purchased")
hist(CS2Train$P_DIY, xlab = "No. DIY Books Purchased")
hist(CS2Train$P_Art, xlab = "No. Art Books Purchased")

b. Data preprocessing

A correlation plot and VIF check show that Last_purchase is highly collinear with the other predictors (a VIF of about 15 on the training data and 14 on the test data), so we remove that variable from the dataset. We also remove P_Art because it is highly correlated with the dependent variable, as well as Gender, Choice, and Observation, since they are no longer needed after creating the new factor columns.

# Remove Gender and Choice (replaced by the factor columns) and Observation (row ID only)
# Remove P_Art because it correlates with the dependent variable
# Remove Last_purchase because of its high VIF (about 15 for training, 14 for testing)

CS2Train = dplyr::select(CS2Train, -c(Gender,Choice,Observation,P_Art, Last_purchase)) 
CS2Test = dplyr::select(CS2Test, -c(Gender,Choice,Observation,P_Art, Last_purchase)) 

### Create correlation plot 
# Create correlation plot (must have numerical values) 

train_num = select_if(CS2Train, is.numeric) 
M1 = cor(train_num) 
corrplot(M1, method = "number") # before removal, Last_purchase and First_purchase had a correlation of about 0.81, and both correlated with P_Child and P_Cook
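For reference, a minimal sketch of the kind of VIF check that flagged Last_purchase; it assumes a copy of the training data saved before the select() step above (CS2Train_full is a placeholder name for that copy):

# Hypothetical VIF check on the full predictor set (CS2Train_full is a placeholder
# for a copy of CS2Train taken before the columns above were dropped)
full_fit <- glm(Choice_new ~ Amount_purchased + Frequency + Last_purchase + First_purchase +
                  P_Child + P_Youth + P_Cook + P_DIY + P_Art + Gender_new,
                data = CS2Train_full, family = binomial)
car::vif(full_fit)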

c. Data limitations

Because both the training and testing datasets have many more Nonpurchase observations than Purchase observations, the logistic regression model performs reasonably but the LDA and SVM models have poor sensitivity for predicting a Purchase response. We therefore create a new subset from the training data with an equal number of Purchase and Nonpurchase observations.

### Logistic Regression model with original data
# Fit a logistic regression model for CS2Train 
m1.log = glm(Choice_new ~ ., data = CS2Train, family = binomial) 
summary(m1.log) 
## 
## Call:
## glm(formula = Choice_new ~ ., family = binomial, data = CS2Train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.07505  -0.74193  -0.51191   0.02636   2.69320  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.0804606  0.2036865   0.395 0.692827    
## Amount_purchased  0.0030341  0.0007417   4.091 4.30e-05 ***
## Frequency        -0.1743342  0.0141715 -12.302  < 2e-16 ***
## First_purchase    0.0859290  0.0090296   9.516  < 2e-16 ***
## P_Child          -0.5508914  0.0864517  -6.372 1.86e-10 ***
## P_Youth          -0.3854852  0.1164069  -3.312 0.000928 ***
## P_Cook           -0.7124793  0.0910308  -7.827 5.00e-15 ***
## P_DIY            -0.5875044  0.1168082  -5.030 4.91e-07 ***
## Gender_newMale   -0.7896791  0.1295524  -6.095 1.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1799.5  on 1599  degrees of freedom
## Residual deviance: 1536.6  on 1591  degrees of freedom
## AIC: 1554.6
## 
## Number of Fisher Scoring iterations: 5
vif(m1.log) 
## Amount_purchased        Frequency   First_purchase          P_Child 
##         1.199103         1.957906         5.617250         1.950201 
##          P_Youth           P_Cook            P_DIY       Gender_new 
##         1.292000         2.055848         1.454268         1.018119
# Make prediction for logistic regression model for CS2Test 
predprob = predict.glm(m1.log, newdata = CS2Test, type = "response") 
predclass_log = ifelse(predprob >= 0.228,"Purchase", "Nonpurchase") 
caret::confusionMatrix(as.factor(predclass_log), as.factor(CS2Test$Choice_new), positive = "Purchase")  
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Nonpurchase Purchase
##   Nonpurchase        1365       51
##   Purchase            731      153
##                                           
##                Accuracy : 0.66            
##                  95% CI : (0.6402, 0.6794)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1602          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.75000         
##             Specificity : 0.65124         
##          Pos Pred Value : 0.17308         
##          Neg Pred Value : 0.96398         
##              Prevalence : 0.08870         
##          Detection Rate : 0.06652         
##    Detection Prevalence : 0.38435         
##       Balanced Accuracy : 0.70062         
##                                           
##        'Positive' Class : Purchase        
## 
# testing LDA model with original data:
ori.lda = lda(Choice_new ~ ., data = CS2Train)
origpred_lda = predict(ori.lda, newdata = CS2Test)
caret::confusionMatrix(as.factor(origpred_lda$class), as.factor(CS2Test$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Nonpurchase Purchase
##   Nonpurchase        2003      162
##   Purchase             93       42
##                                           
##                Accuracy : 0.8891          
##                  95% CI : (0.8756, 0.9017)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.1906          
##                                           
##  Mcnemar's Test P-Value : 2.06e-05        
##                                           
##             Sensitivity : 0.20588         
##             Specificity : 0.95563         
##          Pos Pred Value : 0.31111         
##          Neg Pred Value : 0.92517         
##              Prevalence : 0.08870         
##          Detection Rate : 0.01826         
##    Detection Prevalence : 0.05870         
##       Balanced Accuracy : 0.58076         
##                                           
##        'Positive' Class : Purchase        
## 

Here we start to create a new subset of data:

# load data again so we have columns that we deleted earlier 

CS2Test2 <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Test.xlsx")
CS2Train2 <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Train.xlsx")


CS2Train2$Gender_new <- ifelse(CS2Train2$Gender == 0, "Female", "Male")  
CS2Train2$Gender_new <- as.factor(CS2Train2$Gender_new) 

CS2Test2$Gender_new <- ifelse(CS2Test2$Gender == 0, "Female", "Male") 
CS2Test2$Gender_new <- as.factor(CS2Test2$Gender_new) 

CS2Train2$Choice_new <- ifelse(CS2Train2$Choice == 0, "Nonpurchase", "Purchase") 
CS2Train2$Choice_new <- as.factor(CS2Train2$Choice_new) 

CS2Test2$Choice_new <- ifelse(CS2Test2$Choice == 0, "Nonpurchase", "Purchase") 
CS2Test2$Choice_new <- as.factor(CS2Test2$Choice_new) 

# Create a subset of the dataset with only purchase and nonpurchase categories 
set.seed(1) 
df_subset <- filter(CS2Train2, Choice_new == "Purchase" | Choice_new == "Nonpurchase")  

# Determine the number of observations for each category 

num_purchase <- sum(df_subset$Choice_new == "Purchase")  
num_nonpurchase <- sum(df_subset$Choice_new == "Nonpurchase") 

# Sample the larger category (if any) to have the same number of observations as the smaller category 

if (num_purchase > num_nonpurchase) { 
  df_purchase <- filter(df_subset, Choice_new == "Purchase") 
  df_nonpurchase <- filter(df_subset, Choice_new == "Nonpurchase") 
  df_purchase <- sample_n(df_purchase, num_nonpurchase) 
} else { 
  df_purchase <- filter(df_subset, Choice_new == "Purchase") 
  df_nonpurchase <- filter(df_subset, Choice_new == "Nonpurchase") 
  df_nonpurchase <- sample_n(df_nonpurchase, num_purchase) 
} 

# Combine the two subsets into a new dataset using rbind() 

df_new <- rbind(df_purchase, df_nonpurchase) 

set.seed(1)
tr_ind = sample(nrow(df_new), 0.8*nrow(df_new), replace = F)
CS2Train_new = df_new[tr_ind,]
CS2Test_new = df_new[-tr_ind,]

# Check that df_purchase and df_nonpurchase have the same number of rows 

nrow(df_purchase) 
## [1] 400
nrow(df_nonpurchase) 
## [1] 400
df_new = dplyr::select(df_new, -c(Gender,Choice,Observation,P_Art, Last_purchase)) 
CS2Test_new = dplyr::select(CS2Test_new, -c(Gender,Choice,Observation,P_Art, Last_purchase)) 
CS2Train_new = dplyr::select(CS2Train_new, -c(Gender,Choice,Observation,P_Art, Last_purchase))

VI. Findings (Results)

a. Interpret the results of the models. In particular, for each model, identify the influential covariates and their coefficients and provide insights (extra question from this case study; results are based on the balanced subset)

i. Logistic Regression model

# Logistic Regression model 

# Fit a logistic regression model on the balanced training data 
m1.log = glm(Choice_new ~ ., data = CS2Train_new, family = binomial) 
summary(m1.log) 
## 
## Call:
## glm(formula = Choice_new ~ ., family = binomial, data = CS2Train_new)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.38526  -0.98555  -0.00276   0.96207   2.46864  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.297872   0.282061   4.601 4.20e-06 ***
## Amount_purchased  0.002672   0.001043   2.562 0.010420 *  
## Frequency        -0.181851   0.019409  -9.369  < 2e-16 ***
## First_purchase    0.081910   0.012755   6.422 1.35e-10 ***
## P_Child          -0.502231   0.116713  -4.303 1.68e-05 ***
## P_Youth          -0.240124   0.158149  -1.518 0.128929    
## P_Cook           -0.785590   0.132580  -5.925 3.12e-09 ***
## P_DIY            -0.559802   0.167079  -3.351 0.000807 ***
## Gender_newMale   -0.728652   0.184407  -3.951 7.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 887.23  on 639  degrees of freedom
## Residual deviance: 740.93  on 631  degrees of freedom
## AIC: 758.93
## 
## Number of Fisher Scoring iterations: 4
# Make prediction for logistic regression model for CS2Test
predprob = predict.glm(m1.log, newdata = CS2Test_new, type = "response") 
predclass_log = ifelse(predprob >= 0.48,"Purchase", "Nonpurchase") 
caret::confusionMatrix(as.factor(predclass_log), as.factor(CS2Test_new$Choice_new), positive = "Purchase") 
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Nonpurchase Purchase
##   Nonpurchase          53       29
##   Purchase             27       51
##                                           
##                Accuracy : 0.65            
##                  95% CI : (0.5707, 0.7236)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 9.155e-05       
##                                           
##                   Kappa : 0.3             
##                                           
##  Mcnemar's Test P-Value : 0.8937          
##                                           
##             Sensitivity : 0.6375          
##             Specificity : 0.6625          
##          Pos Pred Value : 0.6538          
##          Neg Pred Value : 0.6463          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3187          
##    Detection Prevalence : 0.4875          
##       Balanced Accuracy : 0.6500          
##                                           
##        'Positive' Class : Purchase        
## 

Based on these results, the fitted logistic regression model is:

log(odds of Purchase) = 1.298 + 0.003 Amount_purchased - 0.182 Frequency + 0.082 First_purchase - 0.502 P_Child - 0.240 P_Youth - 0.786 P_Cook - 0.560 P_DIY - 0.729 Gender_newMale

The model suggests that as total money spent (Amount_purchased) and months since first purchase (First_purchase) increase, customers are more likely to buy The Art History of Florence (AHF), while higher numbers of purchases in other categories such as children's, youth, cook, and DIY books decrease the chance of buying AHF.

The logistic model has a moderate accuracy of 65%, with a sensitivity of 63.75% and a specificity of 66.25% for predicting the purchase of the AHF book.
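To make the fitted equation concrete, a predicted purchase probability can be computed for a single customer; the profile below is purely hypothetical and the values are chosen for illustration only:

# Illustrative only: predicted purchase probability for a hypothetical customer profile
new_cust <- data.frame(Amount_purchased = 200, Frequency = 10, First_purchase = 24,
                       P_Child = 1, P_Youth = 0, P_Cook = 0, P_DIY = 0,
                       Gender_new = factor("Female", levels = c("Female", "Male")))
predict(m1.log, newdata = new_cust, type = "response")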

ii. LDA Model

m1.lda = lda(Choice_new ~ ., data = CS2Train_new)
m1.lda
## Call:
## lda(Choice_new ~ ., data = CS2Train_new)
## 
## Prior probabilities of groups:
## Nonpurchase    Purchase 
##         0.5         0.5 
## 
## Group means:
##             Amount_purchased Frequency First_purchase  P_Child  P_Youth  P_Cook
## Nonpurchase          194.875  13.70625       23.06875 0.753125 0.309375 0.78125
## Purchase             218.650   8.90625       22.69375 0.775000 0.371875 0.67500
##               P_DIY Gender_newMale
## Nonpurchase 0.36250       0.659375
## Purchase    0.38125       0.540625
## 
## Coefficients of linear discriminants:
##                           LD1
## Amount_purchased  0.002840555
## Frequency        -0.174529019
## First_purchase    0.078326151
## P_Child          -0.498301513
## P_Youth          -0.239746786
## P_Cook           -0.747554348
## P_DIY            -0.558689379
## Gender_newMale   -0.735903929
predclass_lda = predict(m1.lda, newdata = CS2Test_new)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(CS2Test_new$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Nonpurchase Purchase
##   Nonpurchase          53       30
##   Purchase             27       50
##                                           
##                Accuracy : 0.6438          
##                  95% CI : (0.5643, 0.7178)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.0001719       
##                                           
##                   Kappa : 0.2875          
##                                           
##  Mcnemar's Test P-Value : 0.7910815       
##                                           
##             Sensitivity : 0.6250          
##             Specificity : 0.6625          
##          Pos Pred Value : 0.6494          
##          Neg Pred Value : 0.6386          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3125          
##    Detection Prevalence : 0.4813          
##       Balanced Accuracy : 0.6438          
##                                           
##        'Positive' Class : Purchase        
## 

The LDA model also shows a positive relationship between Amount_purchased and First_purchase and the likelihood of buying the AHF book.

Compared with the logistic regression model, the LDA model performs slightly worse: its accuracy is 64.38%, its sensitivity is slightly lower at 62.50%, and its specificity is the same at 66.25%.

iii. SVM Model

form1 = Choice_new ~.
tuned = tune.svm(form1, data=CS2Train_new, gamma = seq(0.1, .1, by= 0.01), cost = seq(.1,1, by = .1))

mysvm = svm(formula = form1, data = CS2Train_new, gamma = tuned$best.parameters$gamma, cost = tuned$best.parameters$cost)
summary(mysvm)
## 
## Call:
## svm(formula = form1, data = CS2Train_new, gamma = tuned$best.parameters$gamma, 
##     cost = tuned$best.parameters$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  497
## 
##  ( 246 251 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Nonpurchase Purchase
mysvm
## 
## Call:
## svm(formula = form1, data = CS2Train_new, gamma = tuned$best.parameters$gamma, 
##     cost = tuned$best.parameters$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  497
svmpredict = predict(mysvm, newdata = CS2Test_new, type = 'response')
caret::confusionMatrix(as.factor(svmpredict), as.factor(CS2Test_new$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Nonpurchase Purchase
##   Nonpurchase          55       33
##   Purchase             25       47
##                                           
##                Accuracy : 0.6375          
##                  95% CI : (0.5579, 0.7119)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.0003147       
##                                           
##                   Kappa : 0.275           
##                                           
##  Mcnemar's Test P-Value : 0.3580197       
##                                           
##             Sensitivity : 0.5875          
##             Specificity : 0.6875          
##          Pos Pred Value : 0.6528          
##          Neg Pred Value : 0.6250          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2938          
##    Detection Prevalence : 0.4500          
##       Balanced Accuracy : 0.6375          
##                                           
##        'Positive' Class : Purchase        
## 

The SVM model produces an accuracy of 63.75%, which is lower than both the logistic regression and LDA models. Its sensitivity (predicting a Purchase) is 58.75% and its specificity (predicting a Nonpurchase) is 68.75%.

iv. Comparing the models

Overall, the logistic regression model is the best of the three, with slightly higher accuracy and sensitivity; the SVM model has the highest specificity but the worst sensitivity, which is below 60%.
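For a quick side-by-side view, the metrics from the three confusion matrices above can be collected into a small table:

# Side-by-side summary of the three models on the balanced test set
# (values copied from the confusion matrices above)
data.frame(Model       = c("Logistic regression", "LDA", "SVM"),
           Accuracy    = c(0.6500, 0.6438, 0.6375),
           Sensitivity = c(0.6375, 0.6250, 0.5875),
           Specificity = c(0.6625, 0.6625, 0.6875))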

b. BBBC is considering a similar mail campaign in the Midwest where it has data for 50,000 customers. Such mailings typically promote several books. The allocated cost of the mailing is $0.65/addressee (including postage) for the art book, and the book costs $15 to purchase and mail. The company allocates overhead to each book at 45% of cost. The selling price of the book is $31.95. Based on the model, which customers should Bookbinders target? How much more profit would you expect the company to generate using these models as compared to sending the mail offer to the entire list? (extra question from this case study)

First, we calculate the profit on each book sold, which comes out to $10.20 per book.

# Profit of each book 
mailing = 0.65
manu_cost = 15
overhead_cost = 0.45*15

cost_book = manu_cost + overhead_cost
print(paste("Cost per book is: ", cost_book))
## [1] "Cost per book is:  21.75"
selling_price = 31.95
profit_book = selling_price - cost_book
print(paste("Profit per book is: ", profit_book))
## [1] "Profit per book is:  10.2"

Based on the past mailing campaign, only 9.03% of customers buy the book if we mail to the entire list. Using the same response rate for this year's campaign, the profit from mailing to all 50,000 customers without any prediction model comes out to only $13,553.

# rate for this year is:
mail_cost_nomodel = mailing*50000
mail_cost_nomodel
## [1] 32500
response_no_model = 0.0903*50000
response_no_model
## [1] 4515
profit_no_model = response_no_model*profit_book - mail_cost_nomodel
print(paste("The profit if we use no model is: ", profit_no_model))
## [1] "The profit if we use no model is:  13553"

However, when we use the models we built, the profit increases significantly:

  • Logistic regression model: profit of $146,718.75
  • LDA model: profit of $143,734.38
  • SVM model: profit of $135,187.50

# Profit from the logistic regression model: mail only customers predicted to purchase
log_num_mail = ((27+51)/160)*50000
log_buy = (51/160)*50000
log_profit = log_buy*profit_book - log_num_mail*mailing   # mailing cost is $0.65 per addressee
print(paste("The profit with logistic regression model is: ", log_profit))
## [1] "The profit with logistic regression model is:  146718.75"
# Profit from the LDA model 
lda_num_mail = ((27+50)/160)*50000
lda_buy = (50/160)*50000
lda_profit = lda_buy*profit_book - lda_num_mail*mailing
print(paste("The profit with LDA model is: ", lda_profit))
## [1] "The profit with LDA model is:  143734.375"
# Profit from the SVM model 
svm_num_mail = ((25+47)/160)*50000
svm_buy = (47/160)*50000
svm_profit = svm_buy*profit_book - svm_num_mail*mailing
print(paste("The profit with SVM model is: ", svm_profit))
## [1] "The profit with SVM model is:  135187.5"

Overall, using these three models, BBBC can earn well over $100,000 more in profit than when not using any model. From the logistic regression model, we know that total money spent (Amount_purchased) and months since first purchase (First_purchase) are positively related to a Purchase response. Based on the coefficient for Gender, women are more likely to order the book than men. Therefore, BBBC should target customers who are women, who have spent the most money on BBBC books, and who have the longest history with the club but relatively few purchases in the period (reflecting the positive First_purchase and negative Frequency coefficients).
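Operationally, the fitted logistic model could be used to score the Midwest list and mail only the customers whose predicted purchase probability exceeds the chosen cut point; a sketch, where midwest_customers is a hypothetical data frame for the 50,000-customer file with the same predictor columns as CS2Train_new:

# Hypothetical sketch: score a new mailing list with the fitted logistic model
# (midwest_customers is a placeholder for the Midwest customer file)
midwest_scores <- predict(m1.log, newdata = midwest_customers, type = "response")
target_list <- midwest_customers[midwest_scores >= 0.48, ]   # same cut point as Section VI
nrow(target_list)                                            # number of customers to mail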

VII. Conclusions and Recommendations

a. Discuss alternative methodologies

A linear regression model is not appropriate because the dependent variable is categorical, so it would produce poor results. Since the two classes of the dependent variable do not have the same number of observations, we needed to create a subset of the data in which Purchase and Nonpurchase have the same number of observations. Logistic regression performs slightly better than the other two models, LDA and SVM. More broadly, if we know how to apply these models in the real business world, as in the case of the BBBC mailing campaign, we can generate much higher profit while minimizing cost.