To maximize the return on marketing campaigns and ensure that the most likely buyers are being targeted, businesses are using analytics to identify potential customers. A targeted marketing campaign minimizes the spending wasted on mailings sent to customers who are unlikely to purchase the product being marketed.
This case study focuses on finding an optimal model for a targeted
mailing campaign by the Bookbinders Book Club (BBBC). We use
predictive modeling approaches to assess the efficacy of a BBBC direct
mailing campaign, working with a subset of the BBBC database
consisting of 400 customers who purchased the book and 1,200 who did not.
The dependent variable for our models is Choice,
representing whether or not the customer purchased the book.
A logistic regression model resulted in the highest accuracy and was
best able to predict whether or not a customer on the mailing list would
purchase the book. The independent variables in the model are
Amount_purchased, Frequency,
First_Purchase, P_Child, P_Youth,
P_Cook, P_DIY, and Gender.
Bookbinders Book Club was established in 1986 as a book distributor and has kept a database of its 500,000 readers for marketing purposes. This case study explores part of that data with three different modeling methods to isolate the model that will bring BBBC the most profit.
The three models taken into consideration for this case are logistic regression, linear discriminant analysis (LDA), and support vector machines (SVM). The advantages and disadvantages of each method are analyzed and compared to find the best model to help BBBC minimize its spending on marketing campaigns while gaining the most profit possible.
As previously stated, this case is about finding the most cost-efficient predictive model to maximize profit in direct marketing campaigns. It demonstrates that using statistical analysis within a company can minimize wasted costs while helping profits rise each year.
Independent variables: Gender, Amount_purchased, Frequency, Last_purchase, First_purchase, P_Child, P_Youth, P_Cook, P_DIY, and P_Art.
Dependent variable: Choice (whether or not the customer purchased the book).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(MASS)
##
## Attaching package: 'MASS'
##
## The following objects are masked from 'package:openintro':
##
## housing, mammals
##
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.92 loaded
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:openintro':
##
## densityPlot
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(readr)
library(e1071)
library(readxl)
### Read in data
CS2Test <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Test.xlsx")
str(CS2Test)
## tibble [2,300 × 12] (S3: tbl_df/tbl/data.frame)
## $ Observation : num [1:2300] 1 2 3 4 5 6 7 8 9 10 ...
## $ Choice : num [1:2300] 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : num [1:2300] 0 1 1 0 1 0 1 1 1 1 ...
## $ Amount_purchased: num [1:2300] 287 215 261 24 120 66 42 233 66 199 ...
## $ Frequency : num [1:2300] 12 4 2 4 8 2 12 8 12 22 ...
## $ Last_purchase : num [1:2300] 4 1 1 1 1 4 1 2 1 1 ...
## $ First_purchase : num [1:2300] 24 4 2 4 8 16 12 12 12 22 ...
## $ P_Child : num [1:2300] 0 0 0 1 0 0 0 0 0 0 ...
## $ P_Youth : num [1:2300] 3 0 0 0 0 0 0 0 0 0 ...
## $ P_Cook : num [1:2300] 0 0 0 0 0 1 1 0 0 0 ...
## $ P_DIY : num [1:2300] 0 0 0 0 0 1 0 0 0 0 ...
## $ P_Art : num [1:2300] 1 1 1 0 1 1 0 0 0 1 ...
CS2Train <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Train.xlsx")
str(CS2Train)
## tibble [1,600 × 12] (S3: tbl_df/tbl/data.frame)
## $ Observation : num [1:1600] 1 2 3 4 5 6 7 8 9 10 ...
## $ Choice : num [1:1600] 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : num [1:1600] 1 1 1 1 0 1 1 0 1 1 ...
## $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
## $ Frequency : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
## $ Last_purchase : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
## $ First_purchase : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
## $ P_Child : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
## $ P_Youth : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
## $ P_Cook : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
## $ P_DIY : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
## $ P_Art : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...
The null hypothesis is that all coefficients in the model are equal
to zero, meaning none of the predictor variables have a statistically
significant relationship with the response variable,
Choice. The alternative hypothesis states that at least one
coefficient is not equal to zero, i.e., some predictor variables have a
significant relationship with Choice.
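One way to assess this overall hypothesis, sketched below under the assumption that m1.log is the fitted logistic regression from the modeling section, is a likelihood-ratio (deviance) test comparing the null deviance to the residual deviance reported by glm():
# Sketch: overall likelihood-ratio test for the logistic model
# (m1.log is the logistic regression fitted later in this report)
lr_stat <- m1.log$null.deviance - m1.log$deviance   # drop in deviance
lr_df   <- m1.log$df.null - m1.log$df.residual      # number of predictors
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)     # small p-value rejects H0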
At first, we create the model with the CS2Train data set and test it on the CS2Test data set. However, since there are far more Non-purchase observations than Purchase observations, that model reaches an accuracy of 66% with a sensitivity of 75% and a specificity of 65.12%.
Therefore, we decide to create a subset of this data called
df_subset with equal numbers of Purchase
and Non-purchase observations. From the balanced data we use an 80/20
split to create a training and testing dataset.
The data was taken from a recent BBBC mailing campaign sent to 20,000 customers in Pennsylvania, New York, and Ohio. From that dataset, a subset was created consisting of 400 customers who purchased the book and 1200 who did not.
We develop logistic regression, linear discriminant analysis, and support vector machine models to test which variables have an impact on whether customers buy the book. After comparing the accuracy, sensitivity, and specificity of each model, we can conclude which one fits the data best.
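As a minimal sketch of how these three measures are collected throughout this report (pred and truth here are hypothetical factor vectors with levels Nonpurchase and Purchase), caret::confusionMatrix() reports all of them at once:
# Sketch: pull accuracy, sensitivity, and specificity from caret::confusionMatrix()
# (pred and truth are hypothetical factors with levels Nonpurchase/Purchase)
cm <- caret::confusionMatrix(pred, truth, positive = "Purchase")
round(c(cm$overall["Accuracy"], cm$byClass[c("Sensitivity", "Specificity")]), 4)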
Logistic regression has the following assumptions:
- The dependent variable must be categorical and binary, which it is in this case
- The observations must be independent of one another
- Multicollinearity should be investigated and eliminated; independent variables should not be correlated with one another
- There must be a large sample size
LDA has these assumptions:
- Sample measurements are independent of each other
- Distributions are normal
- The covariance of the measurements is identical across the different classes
SVM model assumptions:
- The margin should be as large as possible
- The data are independent and identically distributed
- Normality of the data is not assumed
The main advantage of the logistic regression model in this case study is that the cut point can be changed to increase or decrease overall model accuracy, sensitivity, or specificity as needed, depending on which measure is most useful for analyzing the data. This allowed us to find a reasonable balance among the three measures. The disadvantage of the LDA and SVM models in this case was that they were less accurate than the logistic regression even before we changed the cut point. The flexibility of the logistic regression model made it more appropriate for analyzing this dataset.
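As a rough sketch of this flexibility (it reuses the predprob and CS2Test_new objects created in the balanced-data modeling section below; the list of cut points is illustrative), several cut points can be compared on the same predicted probabilities:
# Sketch: compare several cut points on the same predicted probabilities
for (cut in c(0.3, 0.4, 0.48, 0.5, 0.6)) {
  pred <- factor(ifelse(predprob >= cut, "Purchase", "Nonpurchase"),
                 levels = c("Nonpurchase", "Purchase"))
  cm <- caret::confusionMatrix(pred, CS2Test_new$Choice_new, positive = "Purchase")
  print(round(c(cutoff = cut, cm$overall["Accuracy"],
                cm$byClass[c("Sensitivity", "Specificity")]), 3))
}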
BBBC should develop expertise in multiple methods. For this specific dataset, it is important to have some knowledge of logistic regression to analyze and understand the data. Future datasets will not be identical to this one, so it would also be beneficial to understand other methods to help analyze and achieve the best results possible for the company.
One way to automate and simplify the logistic regression model for future use would be to build a script. Since the data has been analyzed once already, it would be worthwhile to copy the step-by-step process into a script and then simply update the dataset for future use. Along with having a script for running the regression, a schedule could also be used to determine when to reanalyze the data and update the model, as well as when the results of the testing will be reported to stakeholders.
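A minimal sketch of such a script is shown below; the function name and default cutoff are illustrative choices, not part of the original analysis:
# Sketch of a reusable scoring function for future campaigns
# (score_campaign is a hypothetical helper; 0.48 is the cutoff used later)
score_campaign <- function(train, test, cutoff = 0.48) {
  fit  <- glm(Choice_new ~ ., data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  pred <- factor(ifelse(prob >= cutoff, "Purchase", "Nonpurchase"),
                 levels = levels(test$Choice_new))
  caret::confusionMatrix(pred, test$Choice_new, positive = "Purchase")
}
# Example call once a refreshed dataset is available:
# score_campaign(CS2Train_new, CS2Test_new)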
Using the str() function, we see that two variables are stored as
numeric when they should be categorical. We proceed to create two new
columns:
- Gender_new, a categorical variable with 2 factors: Male and Female
- Choice_new, a categorical variable with 2 factors: Purchase and Nonpurchase
# Create new column for gender and make it a factor
CS2Train$Gender_new <- ifelse(CS2Train$Gender == 0, "Female", "Male")
CS2Train$Gender_new <- as.factor(CS2Train$Gender_new)
CS2Test$Gender_new <- ifelse(CS2Test$Gender == 0, "Female", "Male")
CS2Test$Gender_new <- as.factor(CS2Test$Gender_new)
# Create new column for choice and make it a factor
CS2Train$Choice_new <- ifelse(CS2Train$Choice == 0, "Nonpurchase", "Purchase")
CS2Train$Choice_new <- as.factor(CS2Train$Choice_new)
CS2Test$Choice_new <- ifelse(CS2Test$Choice == 0, "Nonpurchase", "Purchase")
CS2Test$Choice_new <- as.factor(CS2Test$Choice_new)
Data distribution: by creating a histogram of each numeric variable,
we can see that only Amount_purchased has an approximately normal
distribution while the other variables are right-skewed, which can
drastically affect the results of a linear regression model, if we decide
to use one.
CS2Train$Last_purchase <- as.numeric(CS2Train$Last_purchase)
par(mfrow=c(3,3))
hist(CS2Train$Amount_purchased, xlab = "Amount Purchased")
hist(CS2Train$Frequency, xlab = "Total Num. Purchases in Study")
hist(CS2Train$Last_purchase, xlab = "Months Since Last Purchase")
hist(CS2Train$First_purchase, xlab = "Months Since First Purchase")
hist(CS2Train$P_Child, xlab = "No. Children's Books Purchased")
hist(CS2Train$P_Youth, xlab = "No. Youth Books Purchased")
hist(CS2Train$P_Cook, xlab = "No. Cook Books Purchased")
hist(CS2Train$P_DIY, xlab = "No. DIY Books Purchased")
hist(CS2Train$P_Art, xlab = "No. Art Books Purchased")
From the correlation plot and VIF values, we notice that
Last_purchase has a high VIF for both the training and testing
data, so we remove that variable from the dataset. We also remove
P_Art because it is highly correlated with the
dependent variable, as well as Gender, Choice,
and Observation, since we no longer need them in the dataset.
# Remove Gender and Choice (numeric versions replaced by the new factor columns) and Observation (an ID we do not need)
# Remove P_Art because it correlates with the dependent variable
# Remove Last_purchase because it has a high VIF (about 15 for training and 14 for testing)
CS2Train = dplyr::select(CS2Train, -c(Gender,Choice,Observation,P_Art, Last_purchase))
CS2Test = dplyr::select(CS2Test, -c(Gender,Choice,Observation,P_Art, Last_purchase))
### Create correlation plot
# Create correlation plot (must have numerical values)
train_num = select_if(CS2Train, is.numeric)
M1 = cor(train_num)
corrplot(M1, method = "number") # Last_purchase and First_purchase have a 0.81 correlation, and both also correlate with P_Child and P_Cook
Because both the training and testing datasets have unequal
numbers of Purchase and Nonpurchase observations,
which yields a reasonable logistic regression model but poor LDA and SVM
sensitivity for predicting a Purchase response, we
create a new balanced subset from the training data (CS2Train2) with equal
numbers of Purchase and
Nonpurchase observations.
### Logistic Regression model with original data
# Fit a logistic regression model for CS2Train
m1.log = glm(Choice_new ~ ., data = CS2Train, family = binomial)
summary(m1.log)
##
## Call:
## glm(formula = Choice_new ~ ., family = binomial, data = CS2Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.07505 -0.74193 -0.51191 0.02636 2.69320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0804606 0.2036865 0.395 0.692827
## Amount_purchased 0.0030341 0.0007417 4.091 4.30e-05 ***
## Frequency -0.1743342 0.0141715 -12.302 < 2e-16 ***
## First_purchase 0.0859290 0.0090296 9.516 < 2e-16 ***
## P_Child -0.5508914 0.0864517 -6.372 1.86e-10 ***
## P_Youth -0.3854852 0.1164069 -3.312 0.000928 ***
## P_Cook -0.7124793 0.0910308 -7.827 5.00e-15 ***
## P_DIY -0.5875044 0.1168082 -5.030 4.91e-07 ***
## Gender_newMale -0.7896791 0.1295524 -6.095 1.09e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1536.6 on 1591 degrees of freedom
## AIC: 1554.6
##
## Number of Fisher Scoring iterations: 5
vif(m1.log)
## Amount_purchased Frequency First_purchase P_Child
## 1.199103 1.957906 5.617250 1.950201
## P_Youth P_Cook P_DIY Gender_new
## 1.292000 2.055848 1.454268 1.018119
# Make prediction for logistic regression model for CS2Test
predprob = predict.glm(m1.log, newdata = CS2Test, type = "response")
predclass_log = ifelse(predprob >= 0.228,"Purchase", "Nonpurchase")
caret::confusionMatrix(as.factor(predclass_log), as.factor(CS2Test$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Nonpurchase Purchase
## Nonpurchase 1365 51
## Purchase 731 153
##
## Accuracy : 0.66
## 95% CI : (0.6402, 0.6794)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1602
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75000
## Specificity : 0.65124
## Pos Pred Value : 0.17308
## Neg Pred Value : 0.96398
## Prevalence : 0.08870
## Detection Rate : 0.06652
## Detection Prevalence : 0.38435
## Balanced Accuracy : 0.70062
##
## 'Positive' Class : Purchase
##
# testing LDA model with original data:
ori.lda = lda(Choice_new ~ ., data = CS2Train)
origpred_lda = predict(ori.lda, newdata = CS2Test)
caret::confusionMatrix(as.factor(origpred_lda$class), as.factor(CS2Test$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Nonpurchase Purchase
## Nonpurchase 2003 162
## Purchase 93 42
##
## Accuracy : 0.8891
## 95% CI : (0.8756, 0.9017)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.1906
##
## Mcnemar's Test P-Value : 2.06e-05
##
## Sensitivity : 0.20588
## Specificity : 0.95563
## Pos Pred Value : 0.31111
## Neg Pred Value : 0.92517
## Prevalence : 0.08870
## Detection Rate : 0.01826
## Detection Prevalence : 0.05870
## Balanced Accuracy : 0.58076
##
## 'Positive' Class : Purchase
##
Here we start to create a new subset of data:
# load data again so we have columns that we deleted earlier
CS2Test2 <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Test.xlsx")
CS2Train2 <- read_excel("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/BookBinder Case Study/BBBC-Train.xlsx")
CS2Train2$Gender_new <- ifelse(CS2Train2$Gender == 0, "Female", "Male")
CS2Train2$Gender_new <- as.factor(CS2Train2$Gender_new)
CS2Test2$Gender_new <- ifelse(CS2Test2$Gender == 0, "Female", "Male")
CS2Test2$Gender_new <- as.factor(CS2Test2$Gender_new)
CS2Train2$Choice_new <- ifelse(CS2Train2$Choice == 0, "Nonpurchase", "Purchase")
CS2Train2$Choice_new <- as.factor(CS2Train2$Choice_new)
CS2Test2$Choice_new <- ifelse(CS2Test2$Choice == 0, "Nonpurchase", "Purchase")
CS2Test2$Choice_new <- as.factor(CS2Test2$Choice_new)
# Create a subset of the dataset with only purchase and nonpurchase categories
set.seed(1)
df_subset <- filter(CS2Train2, Choice_new == "Purchase" | Choice_new == "Nonpurchase")
# Determine the number of observations for each category
num_purchase <- sum(df_subset$Choice_new == "Purchase")
num_nonpurchase <- sum(df_subset$Choice_new == "Nonpurchase")
# Sample the larger category (if any) to have the same number of observations as the smaller category
if (num_purchase > num_nonpurchase) {
df_purchase <- filter(df_subset, Choice_new == "Purchase")
df_nonpurchase <- filter(df_subset, Choice_new == "Nonpurchase")
df_purchase <- sample_n(df_purchase, num_nonpurchase)
} else {
df_purchase <- filter(df_subset, Choice_new == "Purchase")
df_nonpurchase <- filter(df_subset, Choice_new == "Nonpurchase")
df_nonpurchase <- sample_n(df_nonpurchase, num_purchase)
}
# Combine the two subsets into a new dataset using rbind()
df_new <- rbind(df_purchase, df_nonpurchase)
set.seed(1)
tr_ind = sample(nrow(df_new), 0.8*nrow(df_new), replace = F)
CS2Train_new = df_new[tr_ind,]
CS2Test_new = df_new[-tr_ind,]
# Check that df_purchase and df_nonpurchase have the same number of rows
nrow(df_purchase)
## [1] 400
nrow(df_nonpurchase)
## [1] 400
df_new = dplyr::select(df_new, -c(Gender,Choice,Observation,P_Art, Last_purchase))
CS2Test_new = dplyr::select(CS2Test_new, -c(Gender,Choice,Observation,P_Art, Last_purchase))
CS2Train_new = dplyr::select(CS2Train_new, -c(Gender,Choice,Observation,P_Art, Last_purchase))
# Logistic Regression model
# Fit a logistic regression model on the balanced training data
m1.log = glm(Choice_new ~ ., data = CS2Train_new, family = binomial) # logistic model with CS2Train_new
summary(m1.log)
##
## Call:
## glm(formula = Choice_new ~ ., family = binomial, data = CS2Train_new)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.38526 -0.98555 -0.00276 0.96207 2.46864
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.297872 0.282061 4.601 4.20e-06 ***
## Amount_purchased 0.002672 0.001043 2.562 0.010420 *
## Frequency -0.181851 0.019409 -9.369 < 2e-16 ***
## First_purchase 0.081910 0.012755 6.422 1.35e-10 ***
## P_Child -0.502231 0.116713 -4.303 1.68e-05 ***
## P_Youth -0.240124 0.158149 -1.518 0.128929
## P_Cook -0.785590 0.132580 -5.925 3.12e-09 ***
## P_DIY -0.559802 0.167079 -3.351 0.000807 ***
## Gender_newMale -0.728652 0.184407 -3.951 7.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 887.23 on 639 degrees of freedom
## Residual deviance: 740.93 on 631 degrees of freedom
## AIC: 758.93
##
## Number of Fisher Scoring iterations: 4
# Make predictions with the logistic regression model on CS2Test_new
predprob = predict.glm(m1.log, newdata = CS2Test_new, type = "response")
predclass_log = ifelse(predprob >= 0.48,"Purchase", "Nonpurchase")
caret::confusionMatrix(as.factor(predclass_log), as.factor(CS2Test_new$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Nonpurchase Purchase
## Nonpurchase 53 29
## Purchase 27 51
##
## Accuracy : 0.65
## 95% CI : (0.5707, 0.7236)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 9.155e-05
##
## Kappa : 0.3
##
## Mcnemar's Test P-Value : 0.8937
##
## Sensitivity : 0.6375
## Specificity : 0.6625
## Pos Pred Value : 0.6538
## Neg Pred Value : 0.6463
## Prevalence : 0.5000
## Detection Rate : 0.3187
## Detection Prevalence : 0.4875
## Balanced Accuracy : 0.6500
##
## 'Positive' Class : Purchase
##
Based on the results, the fitted logistic regression model for the log-odds of a purchase is: logit(P(Purchase)) = 1.298 + 0.003 Amount_purchased - 0.182 Frequency + 0.082 First_purchase - 0.502 P_Child - 0.240 P_Youth - 0.786 P_Cook - 0.560 P_DIY - 0.729 Gender_newMale
The model suggests that as total money spent (Amount_purchased) and months since first purchase (First_purchase) increase, customers are more likely to buy The Art History of Florence (AHF), while more purchases in other categories, such as youth and cook books, decrease the chance of buying AHF.
The logistic model also has a moderate accuracy of 65%, with a sensitivity of 63.75% and a specificity of 66.25% for predicting the purchase of AHF.
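Because these coefficients are on the log-odds scale, a quick way to read them (a sketch using the fitted m1.log above) is to exponentiate them into odds ratios:
# Sketch: convert the log-odds coefficients into odds ratios
round(exp(coef(m1.log)), 3)
# e.g., exp(-0.729) is about 0.48, so being male roughly halves the odds of a
# purchase, holding the other predictors constant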
m1.lda = lda(Choice_new ~ ., data = CS2Train_new)
m1.lda
## Call:
## lda(Choice_new ~ ., data = CS2Train_new)
##
## Prior probabilities of groups:
## Nonpurchase Purchase
## 0.5 0.5
##
## Group means:
## Amount_purchased Frequency First_purchase P_Child P_Youth P_Cook
## Nonpurchase 194.875 13.70625 23.06875 0.753125 0.309375 0.78125
## Purchase 218.650 8.90625 22.69375 0.775000 0.371875 0.67500
## P_DIY Gender_newMale
## Nonpurchase 0.36250 0.659375
## Purchase 0.38125 0.540625
##
## Coefficients of linear discriminants:
## LD1
## Amount_purchased 0.002840555
## Frequency -0.174529019
## First_purchase 0.078326151
## P_Child -0.498301513
## P_Youth -0.239746786
## P_Cook -0.747554348
## P_DIY -0.558689379
## Gender_newMale -0.735903929
predclass_lda = predict(m1.lda, newdata = CS2Test_new)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(CS2Test_new$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Nonpurchase Purchase
## Nonpurchase 53 30
## Purchase 27 50
##
## Accuracy : 0.6438
## 95% CI : (0.5643, 0.7178)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.0001719
##
## Kappa : 0.2875
##
## Mcnemar's Test P-Value : 0.7910815
##
## Sensitivity : 0.6250
## Specificity : 0.6625
## Pos Pred Value : 0.6494
## Neg Pred Value : 0.6386
## Prevalence : 0.5000
## Detection Rate : 0.3125
## Detection Prevalence : 0.4813
## Balanced Accuracy : 0.6438
##
## 'Positive' Class : Purchase
##
The LDA model also shows a positive relationship between
Amount_purchased and First_purchase and the
likelihood of buying AHF.
Compared to the logistic regression model, the LDA model performs slightly worse overall. Its accuracy is 64.38%, its sensitivity is slightly lower than the logistic regression model at 62.50%, and its specificity is the same at 66.25%.
form1 = Choice_new ~.
tuned = tune.svm(form1, data=CS2Train_new, gamma = seq(0.1, .1, by= 0.01), cost = seq(.1,1, by = .1)) # note: this gamma sequence contains only the single value 0.1
mysvm = svm(formula = form1, data = CS2Train_new, gamma = tuned$best.parameters$gamma, cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = form1, data = CS2Train_new, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 497
##
## ( 246 251 )
##
##
## Number of Classes: 2
##
## Levels:
## Nonpurchase Purchase
mysvm
##
## Call:
## svm(formula = form1, data = CS2Train_new, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 497
svmpredict = predict(mysvm, newdata = CS2Test_new, type = 'response')
caret::confusionMatrix(as.factor(svmpredict), as.factor(CS2Test_new$Choice_new), positive = "Purchase")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Nonpurchase Purchase
## Nonpurchase 55 33
## Purchase 25 47
##
## Accuracy : 0.6375
## 95% CI : (0.5579, 0.7119)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.0003147
##
## Kappa : 0.275
##
## Mcnemar's Test P-Value : 0.3580197
##
## Sensitivity : 0.5875
## Specificity : 0.6875
## Pos Pred Value : 0.6528
## Neg Pred Value : 0.6250
## Prevalence : 0.5000
## Detection Rate : 0.2938
## Detection Prevalence : 0.4500
## Balanced Accuracy : 0.6375
##
## 'Positive' Class : Purchase
##
The SVM model produces an accuracy of 63.75%, which is lower than both the logistic regression and LDA models. Its sensitivity (predicting a “Purchase”) is 58.75% and its specificity (predicting a “Nonpurchase”) is 68.75%.
Overall, the logistic regression model is the best of the three, with slightly higher accuracy and sensitivity. The SVM model has the worst sensitivity, which is below 60%.
First, we calculate the profit on each book sold, which comes to $10.20 per book.
# Profit of each book
mailing = 0.65
manu_cost = 15
overhead_cost = 0.45*15
cost_book = manu_cost + overhead_cost
print(paste("Cost per book is: ", cost_book))
## [1] "Cost per book is: 21.75"
selling_price = 31.95
profit_book = selling_price - cost_book
print(paste("Profit per book is: ", profit_book))
## [1] "Profit per book is: 10.2"
Based on the past mailing campaign, only 9.03% of customers buy the book when we mail to everyone. Using the same rate for this year's campaign, the profit from mailing to all 50,000 customers without any prediction model is calculated below; it comes to only $13,553.
# Cost and expected response if we mail all 50,000 customers with no model
mail_cost_nomodel = mailing*50000
mail_cost_nomodel
## [1] 32500
response_no_model = 0.0903*50000
response_no_model
## [1] 4515
profit_no_model = response_no_model*profit_book - mail_cost_nomodel
print(paste("The profit if we use no model is: ", profit_no_model))
## [1] "The profit if we use no model is: 13553"
However, when we use the models we built, the projected profit increases significantly:
- Logistic regression model: $138,187.50
- LDA model: $135,312.50
- SVM model: $127,312.50
# Profit from log model
log_num_mail = ((27+51)/160)*50000
log_buy = (51/160)*50000
log_profit = log_buy*profit_book - log_num_mail
print(paste("The profit with logistic regression model is: ", log_profit))
## [1] "The profit with logistic regression model is: 138187.5"
# Profit from lda model
lda_num_mail = ((27+50)/160)*50000
lda_buy = (50/160)*50000
lda_profit = lda_buy*profit_book - lda_num_mail
print(paste("The profit with LDA model is: ", lda_profit))
## [1] "The profit with LDA model is: 135312.5"
# Profit from SVM model (sensitivity is 58.75%)
svm_num_mail = ((25+47)/160)*50000
svm_buy = (47/160)*50000
svm_profit = svm_buy*profit_book - svm_num_mail
print(paste("The profit with SVM model is: ", svm_profit))
## [1] "The profit with SVM model is: 127312.5"
Overall, using these three models, BBBC can earn as much as $100,000
more in profit than when not using any model. From the logistic regression model, we
know that total money spent (Amount_purchased) and months
since first purchase (First_purchase) are positively
associated with a Purchase response. Based on the
coefficient for Gender, women are more likely to order the
book than men. Therefore, BBBC should target customers who are women, have
spent the most money, and made their first purchase a long time ago.
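As a sketch of how this recommendation could be put into practice (new_customers is a hypothetical data frame with the same predictor columns as CS2Train_new, and 0.48 is the cutoff used earlier), the fitted model can score the full mailing list so that only predicted purchasers receive the mailing:
# Sketch: score a hypothetical new mailing list and keep predicted purchasers
new_customers$pred_prob <- predict(m1.log, newdata = new_customers, type = "response")
mail_list <- subset(new_customers, pred_prob >= 0.48)
nrow(mail_list)   # number of customers who would receive the mailing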
A linear regression model is not appropriate because the dependent
variable is categorical, so it would provide poor results. Since the
dependent variable does not have the same number of observations in each class, we need
to create a subset of the data in which Purchase and
Nonpurchase have the same number of observations. Logistic
regression performs slightly better than the other two models, LDA and
SVM. If we know how to apply these models in the real business
world, as in the case of the BBBC mailing campaign, we can generate much
higher profit while minimizing cost.