Executive Summary

Bookbinders Book Club is considering three types of predictive models to identify which customers the company should target for a new direct marketing campaign, a logistic regression model, support vector machines, and linear regression. Using data collected from Bookbinders’ previous campaign, our goal is to predict whether a customer will purchase a book. In our analysis, we will focus on which model yields the highest sensitivity score, the proportion of customer purchases that were correctly identified, as this will have the greatest financial implications for Bookbinders. The results of our study show that the logistic regression model was able to most accurately predict the proportion of customers who purchased a book at 71.6%. Comparatively, the maximum sensitivity of the support vector classifier was 30.9%. The logistic regression model also provides insights into the most significant attributes that influence a customer’s purchase. Some of these characteristics include the customer’s gender, the amount of time since their last purchase, and the type of book purchased.

The Problem

The BBB Club, Bookbinders Book Club, was established in 1986 for the purpose of selling specialty books through direct marketing. In anticipation of using database marketing, Bookbinders made a strategic decision from the beginning to build and maintain a detailed database about its members containing all relevant information about them. Readers fill out an insert and return it to BBBC, where it is then entered into their database. The company currently has a database of 500,000 readers and sends out a mailing about once every month.

Bookbinders is exploring whether to use predictive modeling approaches to improve the efficacy of its direct mail program. For a recent mailing, the company selected 20,000 customers in Pennsylvania, New York, and Ohio from its database and included with their regular mailing a specially produced brochure for the book, The Art History of Florence. This resulted in a 9.03% response rate (1,806 orders) for the purchase of the book. BBBC then developed a database to calibrate a response model to identify the factors that influenced these purchases.

Bookbinders would like to evaluate three different modeling methods: linear regression, a logit model, and support vector machines. While they want to isolate the factors that most influenced customers to buy a book, they also want to develop a highly accurate model in order to identify which customers they should target in their next mail campaign in the Midwest. The allocated cost of the mailing for these campaigns is $0.65/addressee (including postage) for the art book, and the book costs $15 to purchase and ship. The company allocates overhead to each book at 45% of cost. The selling price of the book is $31.95.

In this study, we will explore which method is able to most accurately identify customers who are most likely to purchase a book and which attributes are most influential in predicting customer purchases. In the remainder of the report, we will discuss additional literature on existing methodologies used in this domain, followed by a discussion of the modeling techniques used in this case study and a description of the data set. Finally, we will present our findings from our models and provide recommendations to help Bookbinders strategically target customers.

Methodology

In our analysis, we will consider two types of classification methods, logistic regression and support vector machines. Logistic regression is a parametric modeling technique that performs under several major assumptions. The first assumption is that there is a linear relationship between the independent variables and the log odds of the response variable. Additionally, the response variable should be categorical, either binary or ordinal, with no high correlations (multicollinearity) among the predictors. The final assumption is that the observations are independent and identically distributed.

Support vector machines (SVMs) on the other hand are a nonparametric statistical learning method, which means that they do not have the same assumptions about the distribution of the data as logistic regression. They involve finding a hyperplane that separates the data as well as possible and can be extended to non-linear class boundaries by adjusting the support vector machine’s kernel. They do require numerical variables in fitting the model, so for categorical features, we can recode them to represent 1 if the attribute exists and 0 otherwise.

We also tried fitting a linear regression model for Bookbinders’ third choice. However, since we would like to estimate the probability of whether a customer will purchase a book, our results were poor. Linear regression is not well-suited for classification tasks because it will estimate probabilities that are less than 0 and greater than 1, which violates the axioms of probability. For optimal results, linear regression requires a continuous outcome variable.

The Data

Before beginning our analysis, we wanted to explore the relationships between the variables in the data set. We can see in the distribution plots of customer orders below, there appears to be a strong relationship in the time between a customer’s first and last purchase with a positive correlation of 0.81. There are no variables with very high correlations above 0.9, indicating a potential multicollinearity issue.

We will be using a subset of the data available to Bookbinders. The data consists of customer responses from Bookbinders previous marketing campaign. Our outcome variable, Choice, is a categorical variable with the value 1 indicating a book was purchased and 0 indicating a non-purchase. Bookbinders also selected nine attributes they thought might explain the observed choice behavior. These features pertain to the customer’s purchasing history, including the frequency of purchases, the total amount spent on BBBC books, and the genre of books purchased (art books, cookbooks, do-it-yourself books, children and youth books), as well as the customer’s gender.

Prior to our analysis, both the training and testing data sets were already split, with 1,600 observations in the training set (bbtrain) and 2,300 observations in the test set (bbtest). Overall, the data sets were quite clean, there were no missing values in either the training or the test set. However, there is one limitation of the data, our target variable, Choice, contains many more instances of customers who did not purchase a book than those who did. In the data set used to train the model, only about 25% of customers ordered a book, disproportionately representing the customers who did not make a purchase. Class imbalance can negatively affect the precision and recall accuracy of the models, especially in support vector machines.

Based on the bar graph of purchase orders by gender, we can see the unbalanced distribution in our target variable. Overall, about 34% of women purchased a book while only about 21% of men did.

When we checked the structure of the data, we saw that there was a variable representing the observation numbers. We decided to remove this column as it is not relevant to our analysis. We then factored our response variable and recoded Gender from binary 0 and 1 values to Female and Male classes respectively before fitting the logistic regression model. Below is a summary of the data used to fit the models.

## tibble [1,600 x 11] (S3: tbl_df/tbl/data.frame)
##  $ Choice          : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Gender          : num [1:1600] 1 1 1 1 0 1 1 0 1 1 ...
##  $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
##  $ Frequency       : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
##  $ Last_purchase   : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
##  $ First_purchase  : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
##  $ P_Child         : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
##  $ P_Youth         : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
##  $ P_Cook          : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
##  $ P_DIY           : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
##  $ P_Art           : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...

Findings

The first modeling technique we evaluated was a support vector machine. We used both linear and non-linear kernels in order to compare the performance of each method. We first tried a Radial Basis Function (RBF) kernel. We tuned the parameters of the model by setting the gamma and cost using the tune.svm() function. We used 8 values for cost in the sequence of {0.1, 1, 10, 100, 200, 400, 500, 1000} and 8 values for gamma in the sequence of {0.01, 0.1, 0.5, 1, 2, 3, 4, 5}. The tune.svm() function will learn SVMs for 8x8 = 64 possible combinations of gamma and cost and store the best parameters in an object called best.model.

We then used the best parameters for gamma and cost to fit our SVM. Based on the results below, the Radial SVM achieved an overall prediction accuracy of 90.3%, with a high specificity at 96.1% but a much lower sensitivity at 30.9%. This means that it is correctly classifying customers who did not make a purchase 96.1% of the time but it is misclassifying customers who did make a purchase nearly 70% of the time.

# Accuracy of Radial SVM
confusionMatrix(as.factor(rad.preds), as.factor(bbtest$Choice), positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2014  141
##          1   82   63
##                                           
##                Accuracy : 0.903           
##                  95% CI : (0.8902, 0.9148)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9222367       
##                                           
##                   Kappa : 0.3102          
##                                           
##  Mcnemar's Test P-Value : 0.0001028       
##                                           
##             Sensitivity : 0.30882         
##             Specificity : 0.96088         
##          Pos Pred Value : 0.43448         
##          Neg Pred Value : 0.93457         
##              Prevalence : 0.08870         
##          Detection Rate : 0.02739         
##    Detection Prevalence : 0.06304         
##       Balanced Accuracy : 0.63485         
##                                           
##        'Positive' Class : 1               
## 

To try to minimize the misclassification rate, we looked at two other types of kernels, a linear kernel and a polynomial kernel. Following the same process above, we tuned the linear SVM with a sequence of different values for the cost parameter and tuned the polynomial SVM using varying degrees and coefficients. However, both the linear and polynomial SVMs performed slightly worse than the radial kernel with a decrease in sensitivity to about 28%, indicating that this method might not be the best choice for our data.

# Accuracy of Linear SVM
confusionMatrix(as.factor(lin.preds), as.factor(bbtest$Choice), positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2011  147
##          1   85   57
##                                           
##                Accuracy : 0.8991          
##                  95% CI : (0.8861, 0.9111)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9802          
##                                           
##                   Kappa : 0.2768          
##                                           
##  Mcnemar's Test P-Value : 6.206e-05       
##                                           
##             Sensitivity : 0.27941         
##             Specificity : 0.95945         
##          Pos Pred Value : 0.40141         
##          Neg Pred Value : 0.93188         
##              Prevalence : 0.08870         
##          Detection Rate : 0.02478         
##    Detection Prevalence : 0.06174         
##       Balanced Accuracy : 0.61943         
##                                           
##        'Positive' Class : 1               
## 

The next model we wanted to consider is a logistic regression model. Due to the imbalance in our data, we looked at the Sensitivity and Specificity plot to find the optimal cutoff level to classify the probabilities. The point where the two lines intersect indicates the decision threshold that makes the fewest number of false positive and false negative classifications. For us, this occurs at 0.23 in the plot below.

We used this threshold as our decision cutoff, which means that if the probability is equal to or greater than 0.23, the prediction will be classified as a 1 (purchase) and if it is less than that value, it will be classified as a 0 (no purchase). Although the overall accuracy decreased, the sensitivity greatly improved to 71.6% using this threshold.

# Accuracy of Logistic model
confusionMatrix(as.factor(pred.glm), as.factor(bbtest$Choice), positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1540   58
##          1  556  146
##                                          
##                Accuracy : 0.733          
##                  95% CI : (0.7145, 0.751)
##     No Information Rate : 0.9113         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.2143         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.71569        
##             Specificity : 0.73473        
##          Pos Pred Value : 0.20798        
##          Neg Pred Value : 0.96370        
##              Prevalence : 0.08870        
##          Detection Rate : 0.06348        
##    Detection Prevalence : 0.30522        
##       Balanced Accuracy : 0.72521        
##                                          
##        'Positive' Class : 1              
## 

Another advantage of using the logistic model over SVMs is the probabilistic interpretations of the covariates. Based on the size of the coefficients and the significance of the predictors, we can determine which features are most influential in predicting whether a customer will purchase a book. Our logistic model shows that all variables have an effect on a customer’s purchasing decision except for the number of months since their first purchase, First_purchase. Based on the coefficient for Gender, women are more likely to order a book than men. In addition, the likelihood of a purchase increases with the total amount spent on BBBC books as well as with the amount of time since their last purchase. On the other hand, those with fewer purchases in the chosen period are less likely to make a purchase. The type of books ordered also affects whether a customer will place an order. Those who have purchased a higher number of art books are more likely to make a purchase, which makes sense since our data is based on customers who bought The Art History of Florence.

## 
## Call:
## glm(formula = Choice ~ ., family = binomial, data = bbtrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.38586  -0.66728  -0.43696  -0.02242   2.72238  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.3515281  0.2143839  -1.640   0.1011    
## GenderMale       -0.8632319  0.1374499  -6.280 3.38e-10 ***
## Amount_purchased  0.0018641  0.0007918   2.354   0.0186 *  
## Frequency        -0.0755142  0.0165937  -4.551 5.35e-06 ***
## Last_purchase     0.6117713  0.0938127   6.521 6.97e-11 ***
## First_purchase   -0.0147792  0.0128027  -1.154   0.2483    
## P_Child          -0.8112489  0.1167067  -6.951 3.62e-12 ***
## P_Youth          -0.6370422  0.1433778  -4.443 8.87e-06 ***
## P_Cook           -0.9230066  0.1194814  -7.725 1.12e-14 ***
## P_DIY            -0.9058697  0.1437025  -6.304 2.90e-10 ***
## P_Art             0.6861124  0.1270176   5.402 6.60e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1799.5  on 1599  degrees of freedom
## Residual deviance: 1392.2  on 1589  degrees of freedom
## AIC: 1414.2
## 
## Number of Fisher Scoring iterations: 5

Based on the results of our study, we think that Bookbinders could greatly benefit from using an in-house predictive model for their next marketing campaign. Out of the methods we considered, we believe the logistic regression model would have the most positive impact in customer response rates. Since Bookbinders already has a good customer database set up, this process could be simplified and automated for future modeling efforts. One way to implement this is to create an automated SQL query that extracts the selected variables from the data warehouse. The marketing manager could then upload this file into a dashboard and the model would produce a list of customers to send an individual brochure to in their next marketing campaign.

Conclusion

In our analysis, we considered three different modeling techniques: logistic regression, support vector machines with linear, radial, and polynomial kernels, and linear regression. Out of the chosen classification methods, the support vector machines performed the best at predicting the customers who did not purchase a book, the true negative responses, but lacked in predicting customers who did. This is likely due to the class imbalance in the training data set that disproportionately represents the number of customers who did not make a purchase. One disadvantage of support vector machines is that they are highly sensitive to the class distribution and the choice of parameters. To determine which customers the company should target, the logistic regression model was able to most accurately predict the proportion of customers who purchased a book. Our proposed logistic model achieved fairly high results in both classification performance metrics, with 71.6% sensitivity and 73.5% specificity. In order to improve our results going forward, we could try resampling the data in order to create a more balanced class distribution to train the models with. Alternatively, we could explore using a penalized version of the support vector machine algorithm or try other classification techniques that are less sensitive to class imbalance, like random forests.

References

Dasgupta, C. G., Dispensa, G. S., & Ghose, S. (1994). Comparing the predictive performance of a neural network model with some traditional market response models. International Journal of Forecasting, 10(2), 235-244. doi:10.1016/0169-2070(94)90004-3

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer.

Ładyżyński, P., Żbikowski, K., & Gawrysiak, P. (2019). Direct marketing campaigns in retail banking with the use of deep learning and random forests. Expert Systems with Applications, 134, 28-35. doi:10.1016/j.eswa.2019.05.020

Appendix

# Load libraries
library(readxl)
library(e1071)
library(caret)
library(ROCR)
library(ggplot2)
library(GGally)
library(tidyverse)
theme_set(theme_minimal())

# Load the data
bbtrain <- read_excel("/Users/Kelli/Documents/Spring/DA6813/BBBC-Train.xlsx")
bbtest <- read_excel("/Users/Kelli/Documents/Spring/DA6813/BBBC-Test.xlsx")

# Check for missing values
anyNA(bbtrain)
## [1] FALSE
anyNA(bbtest)
## [1] FALSE
# Remove first column (observation number) from data
bbtrain <- bbtrain[-c(1)]
bbtest <- bbtest[-c(1)]

# Factor response variable
bbtrain$Choice <- as.factor(bbtrain$Choice)
bbtest$Choice <- as.factor(bbtest$Choice)

# Exploratory Analysis
bbtrain %>% 
  mutate(Gender=as_factor(Gender)) %>% 
  ggpairs(ggplot2::aes(color=as_factor(Choice)),
          title = "Distributions of Customer Purchases",
          upper = list(combo=wrap("box_no_facet", alpha=0.7), discrete=wrap("barDiag", alpha=0.7)),
          diag = list(continuous=wrap("densityDiag", alpha=0.5), discrete=wrap("barDiag", alpha=0.7)),
          lower = list(combo=wrap("box_no_facet", alpha=0.7), continuous=wrap("smooth", alpha=0.5),
                       discrete=wrap("barDiag", alpha=0.7)))

ggplot(data = bbtrain, 
       aes(x = factor(ifelse(Choice == 1, "Book Purchased", "No Purchase" )), 
           fill = factor(ifelse(Gender == 0, "Female", "Male")))) +
  geom_bar(alpha = 0.6, color = "grey20", stat="count") + 
  stat_count(geom = "text", colour = "grey20", size = 3.5,
             aes(label = paste("n = ", ..count..)),
             position=position_stack(vjust=0.5)) +
  labs(title = "Number of Customers Orders by Gender", 
    x = "", y = "", fill="Gender")

# Set seed for reproducibility 
set.seed(1)

# Tuning the Radial Kernel SVM
tuned <- tune.svm(Choice~., data=bbtrain, 
                  cost=c(0.1, 1, 10, 100, 200, 400, 500, 1000), 
                  gamma=c(.01, .1, .5, 1, 2, 3, 4, 5))

# Fit SVM using the values of the best parameters 
svm.rad <- svm(Choice ~ ., data=bbtrain, gamma=tuned$best.parameters$gamma, 
               cost=tuned$best.parameters$cost)

# Summary of radial SVM
summary(svm.rad)
## 
## Call:
## svm(formula = Choice ~ ., data = bbtrain, gamma = tuned$best.parameters$gamma, 
##     cost = tuned$best.parameters$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
## 
## Number of Support Vectors:  725
## 
##  ( 346 379 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
# Make predictions on test set 
rad.preds <- predict(svm.rad, bbtest, type='response')

# Accuracy of Radial SVM
confusionMatrix(as.factor(rad.preds), as.factor(bbtest$Choice), positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2014  141
##          1   82   63
##                                           
##                Accuracy : 0.903           
##                  95% CI : (0.8902, 0.9148)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9222367       
##                                           
##                   Kappa : 0.3102          
##                                           
##  Mcnemar's Test P-Value : 0.0001028       
##                                           
##             Sensitivity : 0.30882         
##             Specificity : 0.96088         
##          Pos Pred Value : 0.43448         
##          Neg Pred Value : 0.93457         
##              Prevalence : 0.08870         
##          Detection Rate : 0.02739         
##    Detection Prevalence : 0.06304         
##       Balanced Accuracy : 0.63485         
##                                           
##        'Positive' Class : 1               
## 
# Tuning the Linear SVM
tuned <- tune(svm, Choice~., data = bbtrain, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.05, 0.1, 1, 5, 10, 100)))

# Extract the best model
svm.lin <- tuned$best.model

# Make predictions on test set 
lin.preds <- predict(svm.lin, bbtest, type='response')

# Confusion matrix
confusionMatrix(as.factor(lin.preds), as.factor(bbtest$Choice), positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2011  147
##          1   85   57
##                                           
##                Accuracy : 0.8991          
##                  95% CI : (0.8861, 0.9111)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.9802          
##                                           
##                   Kappa : 0.2768          
##                                           
##  Mcnemar's Test P-Value : 6.206e-05       
##                                           
##             Sensitivity : 0.27941         
##             Specificity : 0.95945         
##          Pos Pred Value : 0.40141         
##          Neg Pred Value : 0.93188         
##              Prevalence : 0.08870         
##          Detection Rate : 0.02478         
##    Detection Prevalence : 0.06174         
##       Balanced Accuracy : 0.61943         
##                                           
##        'Positive' Class : 1               
## 
# Tuning the Polynomial SVM
tuned <- tune(svm, Choice~., data = bbtrain, kernel = "polynomial",
              ranges = list(degree = c(2, 3, 4), 
                            coef0 = c(0.01, 0.1, 0.5, 1, 2, 3, 4, 5)))

# Extract the best model
svm.poly <- tuned$best.model

# Make predictions on test set 
poly.preds <- predict(svm.poly, bbtest, type='response')

# Confusion matrix
confusionMatrix(as.factor(poly.preds), as.factor(bbtest$Choice), positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2034  147
##          1   62   57
##                                           
##                Accuracy : 0.9091          
##                  95% CI : (0.8966, 0.9206)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 0.6597          
##                                           
##                   Kappa : 0.3077          
##                                           
##  Mcnemar's Test P-Value : 6.232e-09       
##                                           
##             Sensitivity : 0.27941         
##             Specificity : 0.97042         
##          Pos Pred Value : 0.47899         
##          Neg Pred Value : 0.93260         
##              Prevalence : 0.08870         
##          Detection Rate : 0.02478         
##    Detection Prevalence : 0.05174         
##       Balanced Accuracy : 0.62492         
##                                           
##        'Positive' Class : 1               
## 
# Recode Gender for Logistic model
bbtrain$Gender <- ifelse(bbtrain$Gender==0, "Female", "Male")
bbtest$Gender <- ifelse(bbtest$Gender==0, "Female", "Male")

# Fit logistic model
glm.fit <- glm(Choice ~ ., data = bbtrain, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Choice ~ ., family = binomial, data = bbtrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.38586  -0.66728  -0.43696  -0.02242   2.72238  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.3515281  0.2143839  -1.640   0.1011    
## GenderMale       -0.8632319  0.1374499  -6.280 3.38e-10 ***
## Amount_purchased  0.0018641  0.0007918   2.354   0.0186 *  
## Frequency        -0.0755142  0.0165937  -4.551 5.35e-06 ***
## Last_purchase     0.6117713  0.0938127   6.521 6.97e-11 ***
## First_purchase   -0.0147792  0.0128027  -1.154   0.2483    
## P_Child          -0.8112489  0.1167067  -6.951 3.62e-12 ***
## P_Youth          -0.6370422  0.1433778  -4.443 8.87e-06 ***
## P_Cook           -0.9230066  0.1194814  -7.725 1.12e-14 ***
## P_DIY            -0.9058697  0.1437025  -6.304 2.90e-10 ***
## P_Art             0.6861124  0.1270176   5.402 6.60e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1799.5  on 1599  degrees of freedom
## Residual deviance: 1392.2  on 1589  degrees of freedom
## AIC: 1414.2
## 
## Number of Fisher Scoring iterations: 5
# Make predictions for ROC plot
preds <- prediction(predict(glm.fit, bbtrain, type='response'), bbtrain$Choice)

# Plot Area under curve
auc <- round(as.numeric(performance(preds, measure = "auc")@y.values), 3)
perf <- performance(preds, "tpr","fpr")
plot(perf, colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("Area Under the Curve:", auc))

# Sensitivity vs. Specificity Plot
plot(unlist(performance(preds, "sens")@x.values), 
     unlist(performance(preds, "sens")@y.values), type="l", lwd=2, 
     ylab="Sensitivity", xlab="Cutoff", 
     main = paste("Sensitivity vs. Specificity Plot"))

par(new=TRUE)

plot(unlist(performance(preds, "spec")@x.values), 
     unlist(performance(preds, "spec")@y.values), type="l", lwd=2, col='red', 
     ylab="", xlab="")
axis(4, at=seq(0,1,0.2))
mtext("Specificity",side=4, col='red')

# Find intersection
min.diff <- which.min(abs(unlist(performance(preds, "sens")@y.values) - unlist(performance(preds, "spec")@y.values)))
min.x <- unlist(performance(preds, "sens")@x.values)[min.diff]
min.y <- unlist(performance(preds, "spec")@y.values)[min.diff]
optimal <- min.x

abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("Optimal Threshold = ", round(optimal,2)), pos = 4)

# Make predictions on test set
prob.glm <- predict.glm(glm.fit, newdata = bbtest, type="response")

# Convert probabilities to binary
pred.glm <- ifelse(prob.glm >= optimal, 1, 0)

# Check the accuracy of logistic model
confusionMatrix(as.factor(pred.glm), as.factor(bbtest$Choice), positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1540   58
##          1  556  146
##                                          
##                Accuracy : 0.733          
##                  95% CI : (0.7145, 0.751)
##     No Information Rate : 0.9113         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.2143         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.71569        
##             Specificity : 0.73473        
##          Pos Pred Value : 0.20798        
##          Neg Pred Value : 0.96370        
##              Prevalence : 0.08870        
##          Detection Rate : 0.06348        
##    Detection Prevalence : 0.30522        
##       Balanced Accuracy : 0.72521        
##                                          
##        'Positive' Class : 1              
## 
# Linear Regression
lm.fit <- lm(as.numeric(Choice)~., data = bbtrain)

# Diagnostics
par(mfrow=c(2,2))
plot(lm.fit)