Read in the train and test XLSX sheets.
BBBC.train <- read_excel("~/2021 Spring MSDA/DA-6813 Data Analytics Applications/Case Study 2/BBBC-Train.xlsx")
BBBC.test <- read_excel("~/2021 Spring MSDA/DA-6813 Data Analytics Applications/Case Study 2/BBBC-Test.xlsx")
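For reference, here is a sketch of the setup chunk this analysis assumes; these are the packages whose functions (read_excel, vif, ols_step_both_p, hoslem.test, PseudoR2, prediction, tune.svm, confusionMatrix) appear below:
library(readxl)            # read_excel()
library(car)               # vif()
library(olsrr)             # ols_step_both_p()
library(ResourceSelection) # hoslem.test()
library(DescTools)         # PseudoR2()
library(ROCR)              # prediction(), performance()
library(e1071)             # svm(), tune.svm()
library(caret)             # confusionMatrix()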
The client has provided us with more testing observations than training observations: the training dataset numbers 1,600 observations and the testing dataset 2,300. A larger testing dataset may yield a more accurate estimate of how model performance generalizes to the full population from which the data were drawn.
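A quick dimension check confirms those counts:
dim(BBBC.train) # 1600 rows
dim(BBBC.test)  # 2300 rows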
Observation is just a row identifier and is not useful to us, so we remove it.
Both Choice and Gender are binary categorical variables, so we set them to factors early on.
BBBC.train$Observation <- NULL
BBBC.test$Observation <- NULL
BBBC.train$Choice <- as.factor(BBBC.train$Choice)
BBBC.train$Gender <- as.factor(BBBC.train$Gender)
BBBC.test$Choice <- as.factor(BBBC.test$Choice)
BBBC.test$Gender <- as.factor(BBBC.test$Gender)
#Create pseudo-train and pseudo-test as 70/30 split from BBBC.train
#These will be used to train the GLM
set.seed(2021)
pseudo.index <- sample(1:nrow(BBBC.train), 0.7*nrow(BBBC.train))
BBBC.pseudo.train <- BBBC.train[pseudo.index, ]
BBBC.pseudo.test <- BBBC.train[-pseudo.index, ]
BBBC.pseudo.test.true <- BBBC.pseudo.test$Choice
Let’s take a quick look at the variables in BBBC.train and see whether there are any relationships:
pairs(BBBC.train)
Looking at a correlation table might provide useful detail as well:
cor(BBBC.train[sapply(BBBC.train, is.numeric)])
## Amount_purchased Frequency Last_purchase First_purchase
## Amount_purchased 1.00000000 0.0136664846 0.44070127 0.3748139
## Frequency 0.01366648 1.0000000000 -0.04194328 0.4459457
## Last_purchase 0.44070127 -0.0419432803 1.00000000 0.8146747
## First_purchase 0.37481393 0.4459457457 0.81467469 1.0000000
## P_Child 0.29931372 -0.0433279437 0.67913392 0.5448208
## P_Youth 0.18755727 -0.0095854745 0.45325891 0.3678921
## P_Cook 0.30425340 0.0004968833 0.67250539 0.5710548
## P_DIY 0.22331539 -0.0089634125 0.55816739 0.4620188
## P_Art 0.27248948 -0.0613754066 0.53433415 0.4420821
## P_Child P_Youth P_Cook P_DIY P_Art
## Amount_purchased 0.29931372 0.187557270 0.3042533969 0.223315392 0.27248948
## Frequency -0.04332794 -0.009585474 0.0004968833 -0.008963412 -0.06137541
## Last_purchase 0.67913392 0.453258910 0.6725053933 0.558167395 0.53433415
## First_purchase 0.54482083 0.367892128 0.5710547918 0.462018843 0.44208206
## P_Child 1.00000000 0.174826719 0.2947065185 0.253837077 0.22451285
## P_Youth 0.17482672 1.000000000 0.1816566401 0.188683456 0.14175122
## P_Cook 0.29470652 0.181656640 1.0000000000 0.271725126 0.19168076
## P_DIY 0.25383708 0.188683456 0.2717251256 1.000000000 0.20779106
## P_Art 0.22451285 0.141751220 0.1916807611 0.207791065 1.00000000
The highest correlation is between First_purchase and Last_purchase, which makes sense: for a one-time customer, the months since first purchase and the months since last purchase are the same observation, while repeat customers will have a large First_purchase and a smaller Last_purchase. This collinearity may influence the results of our models.
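If we’d rather pull the strongest pairwise relationship out programmatically than scan the matrix by eye, a short sketch:
cors <- cor(BBBC.train[sapply(BBBC.train, is.numeric)])
diag(cors) <- 0 # zero out the trivial self-correlations
which(abs(cors) == max(abs(cors)), arr.ind = TRUE) # First_purchase vs Last_purchase
max(abs(cors)) # ~0.81, per the table above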
Let’s examine histograms of the non-categorical variables to get an idea of how the independent variables are distributed across the entire BBBC.train dataset.
par(mfrow=c(3,3))
hist(BBBC.train$Amount_purchased, xlab = "Amount Purchased")
hist(BBBC.train$Frequency, xlab = "Total Num. Purchases in Study")
hist(BBBC.train$Last_purchase, xlab = "Months Since Last Purchase")
hist(BBBC.train$First_purchase, xlab = "Months Since First Purchase")
hist(BBBC.train$P_Child, xlab = "No. Children's Books Purchased")
hist(BBBC.train$P_Youth, xlab = "No. Youth Books Purchased")
hist(BBBC.train$P_Cook, xlab = "No. Cook Books Purchased")
hist(BBBC.train$P_DIY, xlab = "No. DIY Books Purchased")
hist(BBBC.train$P_Art, xlab = "No. Art Books Purchased")
The independent variable Amount_purchased is about the only one that looks normally distributed. All of the others show a distinct right skew, which may affect a linear regression model.
Linear regression has certain assumptions:
* Linearity of the relationship between predictors and response
* Independence of the errors
* Homoscedasticity (constant error variance)
* Normality of the errors
In a linear regression model, the predicted dependent variable can range from negative to positive infinity. This is not compatible with a binary dependent variable, so the method is not well suited to this case.
Linear regression modeling may be useful to BBBC if they were interested in forecasting revenue levels from a particular genre or title, and in drawing inferences about which variables most affect genre profits.
Creating an LM with all variables.
lm1 <- lm(as.numeric(Choice) ~ ., data = BBBC.train)
summary(lm1)
##
## Call:
## lm(formula = as.numeric(Choice) ~ ., data = BBBC.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9603 -0.2462 -0.1161 0.1622 1.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3642284 0.0307411 44.378 < 2e-16 ***
## Gender1 -0.1309205 0.0200303 -6.536 8.48e-11 ***
## Amount_purchased 0.0002736 0.0001110 2.464 0.0138 *
## Frequency -0.0090868 0.0021791 -4.170 3.21e-05 ***
## Last_purchase 0.0970286 0.0135589 7.156 1.26e-12 ***
## First_purchase -0.0020024 0.0018160 -1.103 0.2704
## P_Child -0.1262584 0.0164011 -7.698 2.41e-14 ***
## P_Youth -0.0963563 0.0201097 -4.792 1.81e-06 ***
## P_Cook -0.1414907 0.0166064 -8.520 < 2e-16 ***
## P_DIY -0.1352313 0.0197873 -6.834 1.17e-11 ***
## P_Art 0.1178494 0.0194427 6.061 1.68e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared: 0.2401, Adjusted R-squared: 0.2353
## F-statistic: 50.2 on 10 and 1589 DF, p-value: < 2.2e-16
OK, we see a lot of variables that are significant. Are any of these going to cause problems from a multicollinearity point of view?
vif(lm1)
## Gender Amount_purchased Frequency Last_purchase
## 1.005801 1.248066 3.253860 18.770402
## First_purchase P_Child P_Youth P_Cook
## 9.685333 3.360349 1.775022 3.324928
## P_DIY P_Art
## 2.016910 2.273771
Yep. Last_purchase has a really high value. Let’s remove it and see if things get a little cleaner for our model.
lm1.novif <- lm(as.numeric(Choice) ~ . - Last_purchase, BBBC.train)
vif(lm1.novif)
## Gender Amount_purchased Frequency First_purchase
## 1.005634 1.235982 2.651820 7.182666
## P_Child P_Youth P_Cook P_DIY
## 1.949849 1.307915 2.009609 1.457362
## P_Art
## 1.634878
OK, let’s call that good and run with that model.
summary(lm1.novif)
##
## Call:
## lm(formula = as.numeric(Choice) ~ . - Last_purchase, data = BBBC.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0018 -0.2482 -0.1277 0.1567 1.1035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3926595 0.0309609 44.981 < 2e-16 ***
## Gender1 -0.1290720 0.0203424 -6.345 2.89e-10 ***
## Amount_purchased 0.0003518 0.0001122 3.135 0.001753 **
## Frequency -0.0157943 0.0019980 -7.905 4.97e-15 ***
## First_purchase 0.0046036 0.0015884 2.898 0.003803 **
## P_Child -0.0502183 0.0126891 -3.958 7.90e-05 ***
## P_Youth -0.0225339 0.0175326 -1.285 0.198888
## P_Cook -0.0667467 0.0131127 -5.090 4.00e-07 ***
## P_DIY -0.0606486 0.0170835 -3.550 0.000396 ***
## P_Art 0.1916012 0.0167447 11.443 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3847 on 1590 degrees of freedom
## Multiple R-squared: 0.2156, Adjusted R-squared: 0.2111
## F-statistic: 48.55 on 9 and 1590 DF, p-value: < 2.2e-16
Let’s look at the diagnostics plots of the initial linear model to see if we have any violations of the model’s assumptions:
par(mfrow=c(2,2))
plot(lm1.novif, which=c(1:4))
It looks like there is non-normality in the residuals, visible in the Q-Q plot.
Creating an LM using stepwise model selection; let’s go with a selection threshold of 0.05 and see if it produces a cleaner model.
lm1.stepwise <- ols_step_both_p(lm1, pent = 0.05, prem = 0.05, details = F)
lm1.stepwise
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------------
## 1 P_Art addition 0.128 0.127 227.4360 1649.2034 0.4046
## 2 Frequency addition 0.170 0.169 142.0340 1572.6125 0.3949
## 3 Gender addition 0.188 0.186 106.5840 1539.7192 0.3908
## 4 P_Cook addition 0.200 0.198 82.3710 1516.8351 0.3879
## 5 P_DIY addition 0.204 0.201 77.1250 1511.8878 0.3871
## 6 Amount_purchased addition 0.208 0.205 70.8160 1505.8838 0.3863
## 7 P_Child addition 0.211 0.208 64.8840 1500.2051 0.3855
## 8 Last_purchase addition 0.228 0.224 31.7060 1467.7006 0.3815
## 9 P_Youth addition 0.239 0.235 10.2160 1446.2388 0.3788
## -----------------------------------------------------------------------------------------------
Create the stepwise LM.
lm1.step <- lm(as.numeric(Choice)~ P_Art + Frequency + Gender + P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase + P_Youth, BBBC.train)
summary(lm1.step)
##
## Call:
## lm(formula = as.numeric(Choice) ~ P_Art + Frequency + Gender +
## P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase +
## P_Youth, data = BBBC.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9802 -0.2452 -0.1157 0.1655 1.0595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3727367 0.0297590 46.128 < 2e-16 ***
## P_Art 0.1150034 0.0192719 5.967 2.97e-09 ***
## Frequency -0.0110830 0.0012128 -9.138 < 2e-16 ***
## Gender1 -0.1316464 0.0200208 -6.575 6.56e-11 ***
## P_Cook -0.1433497 0.0165218 -8.676 < 2e-16 ***
## P_DIY -0.1365578 0.0197520 -6.914 6.82e-12 ***
## Amount_purchased 0.0002742 0.0001110 2.470 0.0136 *
## P_Child -0.1275991 0.0163571 -7.801 1.11e-14 ***
## Last_purchase 0.0894288 0.0116772 7.658 3.25e-14 ***
## P_Youth -0.0973642 0.0200903 -4.846 1.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1590 degrees of freedom
## Multiple R-squared: 0.2395, Adjusted R-squared: 0.2352
## F-statistic: 55.63 on 9 and 1590 DF, p-value: < 2.2e-16
Quick interpretation:
* P_Art: Customers who purchase more art books might purchase more copies of The Art History of Florence (TAHF)
* Frequency: Customers who purchased more books during the study might purchase fewer copies of TAHF
* Gender1 (male): On average, the gentlemen might purchase fewer copies of TAHF than the ladies
* P_Cook: Customers who purchase more cooking books might purchase fewer copies of TAHF
* P_DIY: Customers who purchase more DIY books might purchase fewer copies of TAHF
* Amount_purchased: Customers who spent more on books might be more likely to purchase TAHF
* P_Child: Customers who purchase more children’s books might purchase fewer copies of TAHF
* Last_purchase: Customers who haven’t purchased a book in a while might purchase more copies of TAHF
* P_Youth: Customers who purchase more youth books might purchase fewer copies of TAHF

Diagnostics of the stepwise model:
par(mfrow=c(2,2))
plot(lm1.step, which=c(1:4))
Check VIF:
vif(lm1.step)
## P_Art Frequency Gender P_Cook
## 2.233698 1.007855 1.004715 3.290657
## P_DIY Amount_purchased P_Child Last_purchase
## 2.009454 1.248033 3.341880 13.920175
## P_Youth
## 1.771355
We still see that Last_purchase has a very high VIF value, and the QQ plot shows a non-normal trace in the data.
Linear regression is not a good method to provide inferences on whether a customer will buy a book or not. The model does not take into consideration the binary nature of the variable Choice, and in fact can extend to either positive or negative infinity. A linear regression model may be useful for illustrating total revenue trends of P_Art, but it is not a good choice for classification or prediction.
Given that the data is already pushing the limits of the assumptions of linear regression, we really should NOT be putting money on a linear regression model!
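As a quick check of the unboundedness concern, we can count how many fitted values from lm1 fall outside the 1–2 coding of as.numeric(Choice) (a sketch; output not shown):
# Fitted values from the LM are unbounded; the valid coding range here is [1, 2]
range(fitted(lm1))
sum(fitted(lm1) < 1 | fitted(lm1) > 2) # number of out-of-range "probabilities"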
The logistic regression model is better suited to binary classification and prediction than linear regression, and it does not carry the same assumptions. Logistic regression has the following assumptions:
* The dependent variable is binary
* Observations are independent of one another
* There is little or no multicollinearity among the predictors
* Continuous predictors are linearly related to the log-odds of the outcome
Constructing base logistic regression model and examining the variance inflation factors:
glm.base <- glm(Choice ~ ., BBBC.pseudo.train, family = "binomial")
vif(glm.base)
## Gender Amount_purchased Frequency Last_purchase
## 1.019640 1.251838 2.717041 16.997938
## First_purchase P_Child P_Youth P_Cook
## 9.356885 2.821611 1.708766 3.104496
## P_DIY P_Art
## 2.044643 1.924196
Last_purchase has high VIF, so we need to remove it first, then evaluate VIF again.
glm.base<-glm(Choice ~ . - Last_purchase, BBBC.pseudo.train, family="binomial")
vif(glm.base)
## Gender Amount_purchased Frequency First_purchase
## 1.016868 1.238338 2.331275 7.062058
## P_Child P_Youth P_Cook P_DIY
## 1.887517 1.352018 2.057100 1.511540
## P_Art
## 1.637684
First_purchase has the highest VIF this time, so we remove it, too.
glm.base <- glm(Choice ~ . - Last_purchase - First_purchase, BBBC.pseudo.train, family = "binomial")
vif(glm.base)
## Gender Amount_purchased Frequency P_Child
## 1.018346 1.234808 1.019653 1.247260
## P_Youth P_Cook P_DIY P_Art
## 1.057154 1.251634 1.188582 1.258966
The VIF scores look good. Let’s proceed with this as our base full model and run stepwise model selection.
#We're building these on the pseudo.train dataset
glm.null <- glm(Choice ~ 1, BBBC.pseudo.train, family = "binomial")
glm.full <- glm(Choice ~ . - Last_purchase - First_purchase, BBBC.pseudo.train, family = "binomial")
glm.step <- step(glm.null, scope = list(upper = glm.full),
direction = "both", test="Chisq", trace = F)
summary(glm.step)
##
## Call:
## glm(formula = Choice ~ P_Art + Frequency + Gender + P_Cook +
## P_DIY + P_Child + Amount_purchased, family = "binomial",
## data = BBBC.pseudo.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2847 -0.6829 -0.4688 -0.1649 2.8352
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1808844 0.2399106 -0.754 0.45087
## P_Art 1.2319402 0.1198115 10.282 < 2e-16 ***
## Frequency -0.0886377 0.0123896 -7.154 8.42e-13 ***
## Gender1 -0.8779051 0.1610590 -5.451 5.01e-08 ***
## P_Cook -0.2551691 0.0898045 -2.841 0.00449 **
## P_DIY -0.3468313 0.1314464 -2.639 0.00833 **
## P_Child -0.1816776 0.0884965 -2.053 0.04008 *
## Amount_purchased 0.0018811 0.0009267 2.030 0.04236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1234.9 on 1119 degrees of freedom
## Residual deviance: 1001.6 on 1112 degrees of freedom
## AIC: 1017.6
##
## Number of Fisher Scoring iterations: 5
Interpretation: the coefficients are on the log-odds scale, so exponentiating them gives odds ratios. For example, each additional art book purchased multiplies the odds of buying The Art History of Florence by exp(1.232) ≈ 3.4, while male customers (Gender1) have roughly exp(-0.878) ≈ 0.42 times the odds of female customers, holding the other predictors constant. The signs agree with the linear models above: art-book buyers and bigger spenders are more likely to buy, while frequent buyers and buyers of cooking, DIY, and children’s books are less likely.
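The odds ratios can be computed directly from the fitted model:
# Odds ratios: exp() of the logistic regression coefficients
round(exp(coef(glm.step)), 3)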
Goodness-of-fit Test
hoslem.test(glm.step$y, fitted(glm.step), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.step$y, fitted(glm.step)
## X-squared = 7.1344, df = 8, p-value = 0.5222
The null hypothesis of the Hosmer-Lemeshow test is that the model is adequate and fits the data well; the alternative hypothesis is that it does not. Here the p-value (0.5222) indicates we fail to reject the null hypothesis, so we can say that our stepwise logistic model fits the data well.
Check Cook’s distance to see if there are any influential points:
plot(glm.step, which = 4)
R labels the three most extreme points by default, but none of them exceed a Cook’s distance of roughly 0.03. Given we have no guidance on influential points, let’s leave them as is.
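To see the values behind the three labeled points, a quick sketch:
# Three largest Cook's distances; per the plot, all fall below ~0.03
head(sort(cooks.distance(glm.step), decreasing = TRUE), 3)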
Checking the model’s pseudo-R^2:
glm.pseudoR2 <- PseudoR2(glm.step, which = c("McFadden","Nagel","CoxSnell"))
round(glm.pseudoR2, 3)
## McFadden Nagelkerke CoxSnell
## 0.189 0.281 0.188
The Pseudo R^2 of this model ranges from 18.9% for McFadden to 28.1% for Nagelkerke.
Evaluate the predictive performance of the glm.step model on the pseudo-test set (the 30% holdout from BBBC.train) with a selection threshold of 0.5, use a confusion matrix to evaluate model accuracy, and tune the model if needed before applying it to the test set.
glm.step.probs <- predict(glm.step, BBBC.pseudo.test, type = "response")
glm.step.preds <- rep(0, length(glm.step.probs))
glm.step.preds[glm.step.probs > 0.5] = 1
caret::confusionMatrix(as.factor(glm.step.preds), BBBC.pseudo.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 326 80
## 1 23 51
##
## Accuracy : 0.7854
## 95% CI : (0.7459, 0.8213)
## No Information Rate : 0.7271
## P-Value [Acc > NIR] : 0.001995
##
## Kappa : 0.3743
##
## Mcnemar's Test P-Value : 3.432e-08
##
## Sensitivity : 0.3893
## Specificity : 0.9341
## Pos Pred Value : 0.6892
## Neg Pred Value : 0.8030
## Prevalence : 0.2729
## Detection Rate : 0.1062
## Detection Prevalence : 0.1542
## Balanced Accuracy : 0.6617
##
## 'Positive' Class : 1
##
The unoptimized model’s overall accuracy is 78.5%, but it scores poorly at predicting “1” Choice from the reference “1” Choice, with a Sensitivity (Recall) of only 38.9%. It does a decent job at predicting “0” Choice from the reference “0” Choice, with a Specificity of 93.4%.
In the case of applying a predictive marketing model to a profit-making endeavour, we would rely upon shipping only to cases where the model predicts the Choice would be “1”. This corresponds to the Pos Pred Value in the Confusion Matrix above, given as TP/(TP+FP). That value is 68.9% on the training data.
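We can verify this directly from the confusion matrix above:
# Pos Pred Value = TP / (TP + FP) = 51 / (51 + 23)
51 / (51 + 23) # ~0.689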
Evaluating this un-tuned, un-optimized model against the testing set:
glm.step.probs <- predict(glm.step, BBBC.test, type = "response")
glm.step.preds <- rep(0, length(glm.step.probs))
glm.step.preds[glm.step.probs > 0.5] = 1
caret::confusionMatrix(as.factor(glm.step.preds), BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1990 130
## 1 106 74
##
## Accuracy : 0.8974
## 95% CI : (0.8843, 0.9095)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9904
##
## Kappa : 0.3297
##
## Mcnemar's Test P-Value : 0.1343
##
## Sensitivity : 0.36275
## Specificity : 0.94943
## Pos Pred Value : 0.41111
## Neg Pred Value : 0.93868
## Prevalence : 0.08870
## Detection Rate : 0.03217
## Detection Prevalence : 0.07826
## Balanced Accuracy : 0.65609
##
## 'Positive' Class : 1
##
In the case of applying a predictive marketing model to a profit-making endeavour, we would rely upon shipping only to cases where the model predicts the Choice would be “1”. This corresponds to the Pos Pred Value in the Confusion Matrix above, given as TP/(TP+FP). That value is 41.1% on the testing data, and the overall model accuracy is 89.74%.
If we run with an un-optimized GLM, what would our profit be?
VC.mail.posting <- 0.65 # cost to mail one customer
VC.book <- 15 # cost of the book
VC.book.OH <- VC.book*0.45 # overhead at 45% of book cost
P.book <- 31.95 # selling price of the book
Prof.book <- P.book - VC.mail.posting - VC.book - VC.book.OH # profit per book sold (net of its own mailing cost)
N.Cust <- 2300 # customers in the test set
N.booksold <- 74 # true positives: mailed and bought
N.mailing <- 106 # false positives: mailed but did not buy
Profit.unOpt <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.unOpt <- Profit.unOpt / N.Cust
Profit.unOpt
## [1] 637.8
Prof.per.Cust.unOpt
## [1] 0.2773043
If BBBC used the unoptimized model’s predictions (with no Sensitivity/Specificity tuning) and marketed only to those people, they would pull in a profit of $637.80 from the test set, or about 28 cents per customer.
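Since this same calculation recurs for every model below, a small hypothetical helper makes the bookkeeping explicit (a sketch; the document recomputes the quantities inline each time, and the name mailing.profit is our own):
# Profit = (books sold x per-book profit) - (wasted mailings x mailing cost)
mailing.profit <- function(n_sold, n_wasted_mail,
                           prof_book = Prof.book,
                           mail_cost = VC.mail.posting) {
  n_sold * prof_book - n_wasted_mail * mail_cost
}
mailing.profit(74, 106) # 637.8, matching Profit.unOpt above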
Tune the selection threshold to optimize both Sensitivity and Specificity before predicting against the test set.
Unleash the Tuning Kraken!
#ROC Curve and AUC
pr = predict(glm.step, BBBC.pseudo.test, type = "response") # predicted probabilities from the stepwise GLM on the pseudo-test set
response = BBBC.pseudo.test$Choice
pred <- prediction(pr, response) #Predicted Probability and True Classification
# area under curve
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
# some important statistics
false.rates <-performance(pred, "fpr","fnr")
accuracy <-performance(pred, "acc","err")
perf <- performance(pred, "tpr","fpr")
#plotting the ROC curve and computing AUC
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
# computing threshold for cutoff to best trade off sensitivity and specificity
#first sensitivity
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE) # plot another line in same plot
#second specificity
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
#find where the lines intersect
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x #this is the optimal point to best trade off sensitivity and specificity
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 4)
Our optimal prediction threshold is 0.22. Now that we have tuned our GLM on the training set, let’s see how it performs on the test set (fingers crossed)!
glm.step.probs <- predict(glm.step, BBBC.test, type = "response")
glm.step.preds <- rep(0, length(glm.step.probs))
glm.step.preds[glm.step.probs > 0.22] = 1
caret::confusionMatrix(as.factor(glm.step.preds), BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1462 55
## 1 634 149
##
## Accuracy : 0.7004
## 95% CI : (0.6813, 0.7191)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1876
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.73039
## Specificity : 0.69752
## Pos Pred Value : 0.19029
## Neg Pred Value : 0.96374
## Prevalence : 0.08870
## Detection Rate : 0.06478
## Detection Prevalence : 0.34043
## Balanced Accuracy : 0.71396
##
## 'Positive' Class : 1
##
The optimized GLM model’s overall accuracy is 70.0%, and its optimized Sensitivity and Specificity are 73.0% and 69.8%, respectively. By optimizing our Sensitivity and Specificity, we see that our Pos Pred Value has dropped to 19.0%.
Estimated profits using an optimized model:
N.Cust <- 2300
N.booksold <- 149
N.mailing <- 634
Profit.Opt <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.Opt <- Profit.Opt / N.Cust
Profit.Opt
## [1] 1010.85
Prof.per.Cust.Opt
## [1] 0.4395
Even though our Pos Pred Value has dropped to 19.0%, the Sensitivity/Specificity-optimized model yields more profit than the un-tuned one. If BBBC used the optimized GLM predictions to direct its mailing, it would bring in a profit of $1,010.85, or about 44 cents per customer.
Percent improvement of the optimized over the un-optimized GLM:
#The % improvement in profit for using optimized.
glm.tuned.pct.imp <- ((Profit.Opt - Profit.unOpt) / Profit.unOpt) * 100
glm.tuned.pct.imp
## [1] 58.49012
How many “1”s are in the test set?
Test.Choice.1 <- sum(BBBC.test$Choice == 1)
Test.Choice.1
## [1] 204
Test.Choice.0 <- sum(BBBC.test$Choice == 0)
Test.Choice.0
## [1] 2096
Test.Choice.0+Test.Choice.1
## [1] 2300
Training.Choice.1 <- sum(BBBC.train$Choice == 1)
Training.Choice.1
## [1] 400
Training.Choice.0 <- sum(BBBC.train$Choice == 0)
Training.Choice.0
## [1] 1200
Training.Choice.0+Training.Choice.1
## [1] 1600
What if Bookbinders didn’t bother using a model and sent the offer to the entire list?
N.Cust <- 2300
N.booksold <- 204
N.mailing <- 2096
Profit.NoMod <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.NoMod <- Profit.NoMod / N.Cust
Profit.NoMod
## [1] 585.8
Prof.per.Cust.NoMod
## [1] 0.2546957
If BBBC sent the mailing list out to its entire customer base in the test set, its profit would be $585.80, or 25 cents per customer.
That means the optimized model’s percent boost in profit to Bookbinders is:
#The % improvement in profit of the optimized model over no model.
glm.tuned.pct.boost <- ((Profit.Opt - Profit.NoMod) / Profit.NoMod) * 100
glm.tuned.pct.boost
glm.tuned.pct.boost
## [1] 72.55889
If BBBC uses a tuned GLM to selectively market The Art History of Florence, they may expect approximately 73% more profit than a general distribution campaign.
For a 50,000-customer campaign, the profit boost amounts to:
50000*(Prof.per.Cust.Opt - Prof.per.Cust.NoMod)
## [1] 9240.217
Applying the test-sample results to a population of 50,000, BBBC may experience an additional $9,240.22 in profit.
There is another method of tuning: choose the cutoff that minimizes the model’s overall error rate.
First we define a cut sequence, the array of selection thresholds we wish to evaluate.
cut.seq <- seq(0.01, 0.99, by = 0.01)
Create a new glm intended for cut-sequence evaluation (note that it drops Amount_purchased relative to glm.step).
seq.glm <- glm(Choice ~ P_Art + Frequency + Gender + P_Cook + P_Child + P_DIY, family = "binomial", data = BBBC.pseudo.train)
Run the new glm through an optimizing loop that finds the cutoff minimizing the error rate on the pseudo-test set.
est.prob <- predict(seq.glm, newdata=BBBC.pseudo.test[ ,-1], type="response") # predicted probabilities on pseudo-test; these don't depend on the cutoff, so compute once
err <- c()
for(i in 1:length(cut.seq)){
  est.class <- ifelse(est.prob > cut.seq[i], 1, 0) # if p-hat > cut-off then class=1, else class=0
  err[i] <- mean(BBBC.pseudo.test.true != est.class, na.rm=TRUE) # error rate at this cutoff
}
par(mfrow=c(1,1))
plot(cut.seq, err, type="l")
min(err)
## [1] 0.2104167
cut.seq[which.min(err)]
## [1] 0.55
final.cut = cut.seq[which.min(err)]
seq.glm.probs <- predict(seq.glm, BBBC.test, type = "response")
seq.glm.preds <- rep(0, length(seq.glm.probs))
seq.glm.preds[seq.glm.probs > final.cut] = 1
caret::confusionMatrix(as.factor(seq.glm.preds), BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2023 145
## 1 73 59
##
## Accuracy : 0.9052
## 95% CI : (0.8925, 0.9169)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.856
##
## Kappa : 0.3026
##
## Mcnemar's Test P-Value : 1.519e-06
##
## Sensitivity : 0.28922
## Specificity : 0.96517
## Pos Pred Value : 0.44697
## Neg Pred Value : 0.93312
## Prevalence : 0.08870
## Detection Rate : 0.02565
## Detection Prevalence : 0.05739
## Balanced Accuracy : 0.62719
##
## 'Positive' Class : 1
##
N.Cust <- 2300
N.booksold <- 59
N.mailing <- 73
Profit.Opt <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.Opt <- Profit.Opt / N.Cust
Profit.Opt
## [1] 516
Prof.per.Cust.Opt
## [1] 0.2243478
By optimizing the model to minimize error, the model’s accuracy is now 90.5%, but this is because the model is indicating “0” for the majority of observations in the testing dataset.
The profit here is $516.00, or 22 cents per customer.
For the GLM models, profits and profit per customer on the BBBC.test dataset break down like this:
* No model, mail to everyone: $585.80 profit, about 25 cents per customer
* Un-optimized GLM (0.5 cutoff): $637.80 profit, about 28 cents per customer
* Sensitivity/Specificity-tuned GLM (0.22 cutoff): $1,010.85 profit, about 44 cents per customer
* Error-minimizing GLM (0.55 cutoff): $516.00 profit, about 22 cents per customer

SVM has the following advantages:
* It is effective in high-dimensional feature spaces
* Kernel functions allow flexible, non-linear decision boundaries
* The maximum-margin objective tends to generalize well and resist overfitting
SVM has the following disadvantages:
* Tuning performance parameters is critical, since the technique is very sensitive to them
  + Every time a new kernel is evaluated, the model must be re-tuned, which can take time and is computationally expensive
* No interpretation is possible; we cannot derive general “rules of thumb” for who is more or less likely to make a purchase
Tuning an SVM on the training dataset and unleashing it upon the testing dataset. The tuning component runs through a sequence of tuning parameters (gamma and cost). The starting kernel is linear, the simplest; the next would be RBF (radial basis function).
set.seed(1)
tuned.svm <- tune.svm(Choice ~ ., data=BBBC.train, kernel = 'linear',
gamma = seq(0.01, 0.1, by = 0.025),
cost = seq(0.1, 1.2, by=0.1), scale=TRUE)
tuned.svm$best.parameters
## gamma cost
## 25 0.01 0.7
Creating the SVM using the tuned parameters:
svm1 <- svm(Choice ~ ., data = BBBC.train, kernel = 'linear',
gamma = tuned.svm$best.parameters$gamma,
cost = tuned.svm$best.parameters$cost)
Checking predictions for SVM1 against the training data:
svm.preds <- predict(svm1, BBBC.train)
caret::confusionMatrix(svm.preds, BBBC.train$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1145 267
## 1 55 133
##
## Accuracy : 0.7988
## 95% CI : (0.7783, 0.8181)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 2.363e-06
##
## Kappa : 0.3482
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.33250
## Specificity : 0.95417
## Pos Pred Value : 0.70745
## Neg Pred Value : 0.81091
## Prevalence : 0.25000
## Detection Rate : 0.08313
## Detection Prevalence : 0.11750
## Balanced Accuracy : 0.64333
##
## 'Positive' Class : 1
##
Overall training accuracy of the tuned SVM is 79.9%. Sensitivity and specificity are ~33% and ~95%, respectively.
Running SVM on testing data:
svm.preds <- predict(svm1, BBBC.test)
caret::confusionMatrix(svm.preds, BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2011 147
## 1 85 57
##
## Accuracy : 0.8991
## 95% CI : (0.8861, 0.9111)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9802
##
## Kappa : 0.2768
##
## Mcnemar's Test P-Value : 6.206e-05
##
## Sensitivity : 0.27941
## Specificity : 0.95945
## Pos Pred Value : 0.40141
## Neg Pred Value : 0.93188
## Prevalence : 0.08870
## Detection Rate : 0.02478
## Detection Prevalence : 0.06174
## Balanced Accuracy : 0.61943
##
## 'Positive' Class : 1
##
Tuned linear kernel SVM achieves overall testing accuracy of 89.9%, with Sensitivity and Specificity equal to 27.9% and 95.9%, respectively.
Profit numbers for the linear kernel SVM:
N.Cust <- 2300
N.booksold <- 57
N.mailing <- 85
Profit.LSVM <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.LSVM <- Profit.LSVM / N.Cust
Profit.LSVM
## [1] 489.1
Prof.per.Cust.LSVM
## [1] 0.2126522
Using a tuned linear SVM as the model, BBBC would realize a profit of $489.10, or about 21 cents per customer. This is less than what they would realize using a general mailing campaign with no model at all.
Tuning on RBF:
set.seed(1)
tuned.svm.rad <- tune.svm(Choice ~ ., data=BBBC.train, kernel = 'radial',
gamma = seq(0.01, 0.1, by = 0.01),
cost = seq(0.1, 1.2, by=0.25), scale=TRUE)
tuned.svm.rad$best.parameters
## gamma cost
## 35 0.05 0.85
svm2 <- svm(Choice ~ ., data = BBBC.train, kernel = 'radial',
gamma = tuned.svm.rad$best.parameters$gamma,
cost = tuned.svm.rad$best.parameters$cost)
svm.preds.rad <- predict(svm2, BBBC.test)
caret::confusionMatrix(svm.preds.rad, BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2054 166
## 1 42 38
##
## Accuracy : 0.9096
## 95% CI : (0.8971, 0.921)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.6327
##
## Kappa : 0.2291
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.18627
## Specificity : 0.97996
## Pos Pred Value : 0.47500
## Neg Pred Value : 0.92523
## Prevalence : 0.08870
## Detection Rate : 0.01652
## Detection Prevalence : 0.03478
## Balanced Accuracy : 0.58312
##
## 'Positive' Class : 1
##
Tuned radial basis function kernel SVM achieves overall testing accuracy of approximately 91%, with Sensitivity and Specificity of approximately 19% and 98%, respectively.
Profit numbers:
N.Cust <- 2300
N.booksold <- 38
N.mailing <- 42
Profit.RSVM <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.RSVM <- Profit.RSVM / N.Cust
Profit.RSVM
## [1] 335.6
Prof.per.Cust.RSVM
## [1] 0.145913
Despite the high overall accuracy, the tuned RBF SVM model does not really help BBBC increase its profits. The profit realized by following the tuned RBF SVM’s advice is $335.60, or about 15 cents per customer. This is getting worse, not better!
Evaluating a polynomial kernel SVM:
set.seed(1)
tuned.svm.poly <- tune.svm(Choice ~ ., data=BBBC.train, kernel = 'polynomial',
gamma = seq(0.01, 0.1, by = 0.01),
cost = seq(0.1, 1.2, by=0.25), scale=TRUE)
tuned.svm.poly$best.parameters
## gamma cost
## 40 0.1 0.85
svm3 <- svm(Choice ~ ., data = BBBC.train, kernel = 'polynomial',
gamma = tuned.svm.poly$best.parameters$gamma,
cost = tuned.svm.poly$best.parameters$cost)
svm.preds.poly <- predict(svm3, BBBC.test)
caret::confusionMatrix(svm.preds.poly, BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2049 168
## 1 47 36
##
## Accuracy : 0.9065
## 95% CI : (0.8939, 0.9181)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.8013
##
## Kappa : 0.2104
##
## Mcnemar's Test P-Value : 2.747e-16
##
## Sensitivity : 0.17647
## Specificity : 0.97758
## Pos Pred Value : 0.43373
## Neg Pred Value : 0.92422
## Prevalence : 0.08870
## Detection Rate : 0.01565
## Detection Prevalence : 0.03609
## Balanced Accuracy : 0.57702
##
## 'Positive' Class : 1
##
The SVM with polynomial kernel has an overall testing accuracy of 91%, with Sensitivity and Specificity of 17.6% and 97.8%, respectively.
SVM with poly kernel profit numbers:
N.Cust <- 2300
N.booksold <- 36
N.mailing <- 47
Profit.PSVM <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.PSVM <- Profit.PSVM / N.Cust
Profit.PSVM
## [1] 313.25
Prof.per.Cust.PSVM
## [1] 0.1361957
The more complex the SVM gets, the deeper into the hole it digs. At this point BBBC would realize a profit of $313.25, or about 14 cents per customer, from the test set.
For a general mailing campaign, where no model is used at all, BBBC would realize a potential profit of $585.80, or about 25 cents per customer.
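To close, a sketch collecting the test-set profit figures computed above into one comparison table (the numbers are the ones already derived; the data frame itself is our own summary):
# Test-set profit summary for each approach, per the calculations above
profit.summary <- data.frame(
  Model = c("No model (mail everyone)", "GLM, 0.5 cutoff",
            "GLM, ROC-tuned 0.22 cutoff", "GLM, error-min 0.55 cutoff",
            "SVM, linear kernel", "SVM, radial kernel", "SVM, polynomial kernel"),
  Profit = c(585.80, 637.80, 1010.85, 516.00, 489.10, 335.60, 313.25)
)
profit.summary$Per.Customer <- round(profit.summary$Profit / 2300, 4)
profit.summary[order(-profit.summary$Profit), ] # the ROC-tuned GLM leads by a wide margin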