Read in the train and test XLSX sheets.
BBBC.train <- read_excel("~/2021 Spring MSDA/DA-6813 Data Analytics Applications/Case Study 2/BBBC-Train.xlsx")
BBBC.test <- read_excel("~/2021 Spring MSDA/DA-6813 Data Analytics Applications/Case Study 2/BBBC-Test.xlsx")
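For reference, here is a sketch of the setup chunk this analysis assumes; these are the packages whose functions (read_excel, vif, ols_step_both_p, hoslem.test, PseudoR2, prediction, tune.svm, confusionMatrix) appear below:
library(readxl)            # read_excel()
library(car)               # vif()
library(olsrr)             # ols_step_both_p()
library(ResourceSelection) # hoslem.test()
library(DescTools)         # PseudoR2()
library(ROCR)              # prediction(), performance()
library(e1071)             # svm(), tune.svm()
library(caret)             # confusionMatrix()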
The client has provided us with more testing observations than training observations: the training dataset numbers 1,600 observations and the testing dataset 2,300. A larger testing dataset may yield a more accurate estimate of how model performance generalizes to the full population from which the data were drawn.
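A quick dimension check confirms those counts:
dim(BBBC.train) # 1600 rows
dim(BBBC.test)  # 2300 rows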
Observation is just a row identifier and is not useful to us, so we remove it.
Both Choice and Gender are binary categorical variables, so we set them to factors early on.
BBBC.train$Observation <- NULL
BBBC.test$Observation <- NULL
BBBC.train$Choice <- as.factor(BBBC.train$Choice)
BBBC.train$Gender <- as.factor(BBBC.train$Gender)
BBBC.test$Choice <- as.factor(BBBC.test$Choice)
BBBC.test$Gender <- as.factor(BBBC.test$Gender)
#Create pseudo-train and pseudo-test as 70/30 split from BBBC.train
#These will be used to train the GLM
set.seed(2021)
pseudo.index <- sample(1:nrow(BBBC.train), 0.7*nrow(BBBC.train))
BBBC.pseudo.train <- BBBC.train[pseudo.index, ]
BBBC.pseudo.test <- BBBC.train[-pseudo.index, ]
BBBC.pseudo.test.true <- BBBC.pseudo.test$Choice
Let’s take a quick look at the variables in BBBC.train and see whether there are any relationships:
pairs(BBBC.train)
Looking at a correlation table might provide useful detail as well:
cor(BBBC.train[sapply(BBBC.train, is.numeric)])
## Amount_purchased Frequency Last_purchase First_purchase
## Amount_purchased 1.00000000 0.0136664846 0.44070127 0.3748139
## Frequency 0.01366648 1.0000000000 -0.04194328 0.4459457
## Last_purchase 0.44070127 -0.0419432803 1.00000000 0.8146747
## First_purchase 0.37481393 0.4459457457 0.81467469 1.0000000
## P_Child 0.29931372 -0.0433279437 0.67913392 0.5448208
## P_Youth 0.18755727 -0.0095854745 0.45325891 0.3678921
## P_Cook 0.30425340 0.0004968833 0.67250539 0.5710548
## P_DIY 0.22331539 -0.0089634125 0.55816739 0.4620188
## P_Art 0.27248948 -0.0613754066 0.53433415 0.4420821
## P_Child P_Youth P_Cook P_DIY P_Art
## Amount_purchased 0.29931372 0.187557270 0.3042533969 0.223315392 0.27248948
## Frequency -0.04332794 -0.009585474 0.0004968833 -0.008963412 -0.06137541
## Last_purchase 0.67913392 0.453258910 0.6725053933 0.558167395 0.53433415
## First_purchase 0.54482083 0.367892128 0.5710547918 0.462018843 0.44208206
## P_Child 1.00000000 0.174826719 0.2947065185 0.253837077 0.22451285
## P_Youth 0.17482672 1.000000000 0.1816566401 0.188683456 0.14175122
## P_Cook 0.29470652 0.181656640 1.0000000000 0.271725126 0.19168076
## P_DIY 0.25383708 0.188683456 0.2717251256 1.000000000 0.20779106
## P_Art 0.22451285 0.141751220 0.1916807611 0.207791065 1.00000000
The highest correlation is between First_purchase and Last_purchase, which makes sense: for a one-time customer, the months since first purchase and the months since last purchase are the same observation, while repeat customers will have a large First_purchase and a smaller Last_purchase. This collinearity may influence the results of our models.
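If we’d rather pull the strongest pairwise relationship out programmatically than scan the matrix by eye, a short sketch:
cors <- cor(BBBC.train[sapply(BBBC.train, is.numeric)])
diag(cors) <- 0 # zero out the trivial self-correlations
which(abs(cors) == max(abs(cors)), arr.ind = TRUE) # First_purchase vs Last_purchase
max(abs(cors)) # ~0.81, per the table above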
Let’s examine histograms of the non-categorical variables to get an idea of how the independent variables are distributed across the entire BBBC.train dataset.
par(mfrow=c(3,3))
hist(BBBC.train$Amount_purchased, xlab = "Amount Purchased")
hist(BBBC.train$Frequency, xlab = "Total Num. Purchases in Study")
hist(BBBC.train$Last_purchase, xlab = "Months Since Last Purchase")
hist(BBBC.train$First_purchase, xlab = "Months Since First Purchase")
hist(BBBC.train$P_Child, xlab = "No. Children's Books Purchased")
hist(BBBC.train$P_Youth, xlab = "No. Youth Books Purchased")
hist(BBBC.train$P_Cook, xlab = "No. Cook Books Purchased")
hist(BBBC.train$P_DIY, xlab = "No. DIY Books Purchased")
hist(BBBC.train$P_Art, xlab = "No. Art Books Purchased")
The independent variable Amount_purchased is about the only one that looks normally distributed. All of the others show a distinct right skew, which may affect a linear regression model.
Linear regression has certain assumptions:
* Linearity of the relationship between predictors and response
* Independence of the errors
* Homoscedasticity (constant error variance)
* Normality of the errors
In a linear regression model, the predicted dependent variable can range from negative to positive infinity. This is not compatible with a binary dependent variable, so the method is not well suited to this case.
Linear regression modeling may be useful to BBBC if they were interested in forecasting revenue levels from a particular genre or title, and in drawing inferences about which variables most affect genre profits.
Creating an LM with all variables.
lm1 <- lm(as.numeric(Choice) ~ ., data = BBBC.train)
summary(lm1)
##
## Call:
## lm(formula = as.numeric(Choice) ~ ., data = BBBC.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9603 -0.2462 -0.1161 0.1622 1.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3642284 0.0307411 44.378 < 2e-16 ***
## Gender1 -0.1309205 0.0200303 -6.536 8.48e-11 ***
## Amount_purchased 0.0002736 0.0001110 2.464 0.0138 *
## Frequency -0.0090868 0.0021791 -4.170 3.21e-05 ***
## Last_purchase 0.0970286 0.0135589 7.156 1.26e-12 ***
## First_purchase -0.0020024 0.0018160 -1.103 0.2704
## P_Child -0.1262584 0.0164011 -7.698 2.41e-14 ***
## P_Youth -0.0963563 0.0201097 -4.792 1.81e-06 ***
## P_Cook -0.1414907 0.0166064 -8.520 < 2e-16 ***
## P_DIY -0.1352313 0.0197873 -6.834 1.17e-11 ***
## P_Art 0.1178494 0.0194427 6.061 1.68e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared: 0.2401, Adjusted R-squared: 0.2353
## F-statistic: 50.2 on 10 and 1589 DF, p-value: < 2.2e-16
OK, we see a lot of variables that are significant. Are any of these going to cause problems from a multicollinearity point of view?
vif(lm1)
## Gender Amount_purchased Frequency Last_purchase
## 1.005801 1.248066 3.253860 18.770402
## First_purchase P_Child P_Youth P_Cook
## 9.685333 3.360349 1.775022 3.324928
## P_DIY P_Art
## 2.016910 2.273771
Yep. Last_purchase has a really high value. Let’s remove it and see if things get a little cleaner for our model.
lm1.novif <- lm(as.numeric(Choice) ~ . - Last_purchase, BBBC.train)
vif(lm1.novif)
## Gender Amount_purchased Frequency First_purchase
## 1.005634 1.235982 2.651820 7.182666
## P_Child P_Youth P_Cook P_DIY
## 1.949849 1.307915 2.009609 1.457362
## P_Art
## 1.634878
OK, let’s call that good and run with that model.
summary(lm1.novif)
##
## Call:
## lm(formula = as.numeric(Choice) ~ . - Last_purchase, data = BBBC.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0018 -0.2482 -0.1277 0.1567 1.1035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3926595 0.0309609 44.981 < 2e-16 ***
## Gender1 -0.1290720 0.0203424 -6.345 2.89e-10 ***
## Amount_purchased 0.0003518 0.0001122 3.135 0.001753 **
## Frequency -0.0157943 0.0019980 -7.905 4.97e-15 ***
## First_purchase 0.0046036 0.0015884 2.898 0.003803 **
## P_Child -0.0502183 0.0126891 -3.958 7.90e-05 ***
## P_Youth -0.0225339 0.0175326 -1.285 0.198888
## P_Cook -0.0667467 0.0131127 -5.090 4.00e-07 ***
## P_DIY -0.0606486 0.0170835 -3.550 0.000396 ***
## P_Art 0.1916012 0.0167447 11.443 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3847 on 1590 degrees of freedom
## Multiple R-squared: 0.2156, Adjusted R-squared: 0.2111
## F-statistic: 48.55 on 9 and 1590 DF, p-value: < 2.2e-16
Let’s look at the diagnostics plots of the initial linear model to see if we have any violations of the model’s assumptions:
par(mfrow=c(2,2))
plot(lm1.novif, which=c(1:4))
It looks like there is non-normality in the residuals, visible in the Q-Q plot.
Creating an LM using stepwise model selection; let’s go with a selection threshold of 0.05 and see if it produces a cleaner model.
lm1.stepwise <- ols_step_both_p(lm1, pent = 0.05, prem = 0.05, details = F)
lm1.stepwise
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------------
## 1 P_Art addition 0.128 0.127 227.4360 1649.2034 0.4046
## 2 Frequency addition 0.170 0.169 142.0340 1572.6125 0.3949
## 3 Gender addition 0.188 0.186 106.5840 1539.7192 0.3908
## 4 P_Cook addition 0.200 0.198 82.3710 1516.8351 0.3879
## 5 P_DIY addition 0.204 0.201 77.1250 1511.8878 0.3871
## 6 Amount_purchased addition 0.208 0.205 70.8160 1505.8838 0.3863
## 7 P_Child addition 0.211 0.208 64.8840 1500.2051 0.3855
## 8 Last_purchase addition 0.228 0.224 31.7060 1467.7006 0.3815
## 9 P_Youth addition 0.239 0.235 10.2160 1446.2388 0.3788
## -----------------------------------------------------------------------------------------------
Create the stepwise LM.
lm1.step <- lm(as.numeric(Choice)~ P_Art + Frequency + Gender + P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase + P_Youth, BBBC.train)
summary(lm1.step)
##
## Call:
## lm(formula = as.numeric(Choice) ~ P_Art + Frequency + Gender +
## P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase +
## P_Youth, data = BBBC.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9802 -0.2452 -0.1157 0.1655 1.0595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3727367 0.0297590 46.128 < 2e-16 ***
## P_Art 0.1150034 0.0192719 5.967 2.97e-09 ***
## Frequency -0.0110830 0.0012128 -9.138 < 2e-16 ***
## Gender1 -0.1316464 0.0200208 -6.575 6.56e-11 ***
## P_Cook -0.1433497 0.0165218 -8.676 < 2e-16 ***
## P_DIY -0.1365578 0.0197520 -6.914 6.82e-12 ***
## Amount_purchased 0.0002742 0.0001110 2.470 0.0136 *
## P_Child -0.1275991 0.0163571 -7.801 1.11e-14 ***
## Last_purchase 0.0894288 0.0116772 7.658 3.25e-14 ***
## P_Youth -0.0973642 0.0200903 -4.846 1.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1590 degrees of freedom
## Multiple R-squared: 0.2395, Adjusted R-squared: 0.2352
## F-statistic: 55.63 on 9 and 1590 DF, p-value: < 2.2e-16
Quick interpretation:
* P_Art: Customers who purchase more art books might purchase more copies of The Art History of Florence (TAHF)
* Frequency: Customers who purchased more books during the study might purchase fewer copies of TAHF
* Gender1 (male): On average, the gentlemen might purchase fewer copies of TAHF than the ladies
* P_Cook: Customers who purchase more cooking books might purchase fewer copies of TAHF
* P_DIY: Customers who purchase more DIY books might purchase fewer copies of TAHF
* Amount_purchased: Customers who spent more on books might be more likely to purchase TAHF
* P_Child: Customers who purchase more children’s books might purchase fewer copies of TAHF
* Last_purchase: Customers who haven’t purchased a book in a while might purchase more copies of TAHF
* P_Youth: Customers who purchase more youth books might purchase fewer copies of TAHF

Diagnostics of the stepwise model:
par(mfrow=c(2,2))
plot(lm1.step, which=c(1:4))
Check VIF:
vif(lm1.step)
## P_Art Frequency Gender P_Cook
## 2.233698 1.007855 1.004715 3.290657
## P_DIY Amount_purchased P_Child Last_purchase
## 2.009454 1.248033 3.341880 13.920175
## P_Youth
## 1.771355
We still see that Last_purchase has a very high VIF value, and the QQ plot shows a non-normal trace in the data.
Linear regression is not a good method to provide inferences on whether a customer will buy a book or not. The model does not take into consideration the binary nature of the variable Choice, and in fact can extend to either positive or negative infinity. A linear regression model may be useful for illustrating total revenue trends of P_Art, but it is not a good choice for classification or prediction.
Given that the data is already pushing the limits of the assumptions of linear regression, we really should NOT be putting money on a linear regression model!
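As a quick check of the unboundedness concern, we can count how many fitted values from lm1 fall outside the 1–2 coding of as.numeric(Choice) (a sketch; output not shown):
# Fitted values from the LM are unbounded; the valid coding range here is [1, 2]
range(fitted(lm1))
sum(fitted(lm1) < 1 | fitted(lm1) > 2) # number of out-of-range "probabilities"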
The logistic regression model is better suited to binary classification and prediction than linear regression, and it does not carry the same assumptions. Logistic regression has the following assumptions:
* The dependent variable is binary
* Observations are independent of one another
* There is little or no multicollinearity among the predictors
* Continuous predictors are linearly related to the log-odds of the outcome
Constructing base logistic regression model and examining the variance inflation factors:
glm.base <- glm(Choice ~ ., BBBC.pseudo.train, family = "binomial")
vif(glm.base)
## Gender Amount_purchased Frequency Last_purchase
## 1.019640 1.251838 2.717041 16.997938
## First_purchase P_Child P_Youth P_Cook
## 9.356885 2.821611 1.708766 3.104496
## P_DIY P_Art
## 2.044643 1.924196
Last_purchase has high VIF, so we need to remove it first, then evaluate VIF again.
glm.base<-glm(Choice ~ . - Last_purchase, BBBC.pseudo.train, family="binomial")
vif(glm.base)
## Gender Amount_purchased Frequency First_purchase
## 1.016868 1.238338 2.331275 7.062058
## P_Child P_Youth P_Cook P_DIY
## 1.887517 1.352018 2.057100 1.511540
## P_Art
## 1.637684
First_purchase has the highest VIF this time, so we remove it, too.
glm.base <- glm(Choice ~ . - Last_purchase - First_purchase, BBBC.pseudo.train, family = "binomial")
vif(glm.base)
## Gender Amount_purchased Frequency P_Child
## 1.018346 1.234808 1.019653 1.247260
## P_Youth P_Cook P_DIY P_Art
## 1.057154 1.251634 1.188582 1.258966
The VIF scores look good. Let’s proceed with this as our base full model and run stepwise model selection.
#We're building these on the pseudo.train dataset
glm.null <- glm(Choice ~ 1, BBBC.pseudo.train, family = "binomial")
glm.full <- glm(Choice ~ . - Last_purchase - First_purchase, BBBC.pseudo.train, family = "binomial")
glm.step <- step(glm.null, scope = list(upper = glm.full),
direction = "both", test="Chisq", trace = F)
summary(glm.step)
##
## Call:
## glm(formula = Choice ~ P_Art + Frequency + Gender + P_Cook +
## P_DIY + P_Child + Amount_purchased, family = "binomial",
## data = BBBC.pseudo.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2847 -0.6829 -0.4688 -0.1649 2.8352
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1808844 0.2399106 -0.754 0.45087
## P_Art 1.2319402 0.1198115 10.282 < 2e-16 ***
## Frequency -0.0886377 0.0123896 -7.154 8.42e-13 ***
## Gender1 -0.8779051 0.1610590 -5.451 5.01e-08 ***
## P_Cook -0.2551691 0.0898045 -2.841 0.00449 **
## P_DIY -0.3468313 0.1314464 -2.639 0.00833 **
## P_Child -0.1816776 0.0884965 -2.053 0.04008 *
## Amount_purchased 0.0018811 0.0009267 2.030 0.04236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1234.9 on 1119 degrees of freedom
## Residual deviance: 1001.6 on 1112 degrees of freedom
## AIC: 1017.6
##
## Number of Fisher Scoring iterations: 5
Interpretation: the coefficients are on the log-odds scale, so exponentiating them gives odds ratios. For example, each additional art book purchased multiplies the odds of buying The Art History of Florence by exp(1.232) ≈ 3.4, while male customers (Gender1) have roughly exp(-0.878) ≈ 0.42 times the odds of female customers, holding the other predictors constant. The signs agree with the linear models above: art-book buyers and bigger spenders are more likely to buy, while frequent buyers and buyers of cooking, DIY, and children’s books are less likely.
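The odds ratios can be computed directly from the fitted model:
# Odds ratios: exp() of the logistic regression coefficients
round(exp(coef(glm.step)), 3)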
Goodness-of-fit Test
hoslem.test(glm.step$y, fitted(glm.step), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.step$y, fitted(glm.step)
## X-squared = 7.1344, df = 8, p-value = 0.5222
The null hypothesis of the Hosmer-Lemeshow test is that the model is adequate and fits the data well; the alternative hypothesis is that it does not. Here the p-value (0.5222) indicates we fail to reject the null hypothesis, so we can say that our stepwise logistic model fits the data well.
Check Cook’s distance to see if there are any influential points:
plot(glm.step, which = 4)
R labels the three most extreme points by default, but none of them exceed a Cook’s distance of roughly 0.03. Given we have no guidance on influential points, let’s leave them as is.
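To see the values behind the three labeled points, a quick sketch:
# Three largest Cook's distances; per the plot, all fall below ~0.03
head(sort(cooks.distance(glm.step), decreasing = TRUE), 3)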
Checking the model’s pseudo-R^2:
glm.pseudoR2 <- PseudoR2(glm.step, which = c("McFadden","Nagel","CoxSnell"))
round(glm.pseudoR2, 3)
## McFadden Nagelkerke CoxSnell
## 0.189 0.281 0.188
The Pseudo R^2 of this model ranges from 18.9% for McFadden to 28.1% for Nagelkerke.
Evaluate the predictive performance of the glm.step model on the pseudo-test set (the 30% holdout from BBBC.train) with a selection threshold of 0.5, use a confusion matrix to evaluate model accuracy, and tune the model if needed before applying it to the test set.
glm.step.probs <- predict(glm.step, BBBC.pseudo.test, type = "response")
glm.step.preds <- rep(0, length(glm.step.probs))
glm.step.preds[glm.step.probs > 0.5] = 1
caret::confusionMatrix(as.factor(glm.step.preds), BBBC.pseudo.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 326 80
## 1 23 51
##
## Accuracy : 0.7854
## 95% CI : (0.7459, 0.8213)
## No Information Rate : 0.7271
## P-Value [Acc > NIR] : 0.001995
##
## Kappa : 0.3743
##
## Mcnemar's Test P-Value : 3.432e-08
##
## Sensitivity : 0.3893
## Specificity : 0.9341
## Pos Pred Value : 0.6892
## Neg Pred Value : 0.8030
## Prevalence : 0.2729
## Detection Rate : 0.1062
## Detection Prevalence : 0.1542
## Balanced Accuracy : 0.6617
##
## 'Positive' Class : 1
##
The unoptimized model’s overall accuracy is 78.5%, but it scores poorly at predicting “1” Choice from the reference “1” Choice, with a Sensitivity (Recall) of only 38.9%. It does a decent job at predicting “0” Choice from the reference “0” Choice, with a Specificity of 93.4%.
In the case of applying a predictive marketing model to a profit-making endeavour, we would rely upon shipping only to cases where the model predicts the Choice would be “1”. This corresponds to the Pos Pred Value in the Confusion Matrix above, given as TP/(TP+FP). That value is 68.9% on the training data.
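We can verify this directly from the confusion matrix above:
# Pos Pred Value = TP / (TP + FP) = 51 / (51 + 23)
51 / (51 + 23) # ~0.689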
Evaluating this un-tuned, un-optimized model against the testing set:
glm.step.probs <- predict(glm.step, BBBC.test, type = "response")
glm.step.preds <- rep(0, length(glm.step.probs))
glm.step.preds[glm.step.probs > 0.5] = 1
caret::confusionMatrix(as.factor(glm.step.preds), BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1990 130
## 1 106 74
##
## Accuracy : 0.8974
## 95% CI : (0.8843, 0.9095)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9904
##
## Kappa : 0.3297
##
## Mcnemar's Test P-Value : 0.1343
##
## Sensitivity : 0.36275
## Specificity : 0.94943
## Pos Pred Value : 0.41111
## Neg Pred Value : 0.93868
## Prevalence : 0.08870
## Detection Rate : 0.03217
## Detection Prevalence : 0.07826
## Balanced Accuracy : 0.65609
##
## 'Positive' Class : 1
##
In the case of applying a predictive marketing model to a profit-making endeavour, we would rely upon shipping only to cases where the model predicts the Choice would be “1”. This corresponds to the Pos Pred Value in the Confusion Matrix above, given as TP/(TP+FP). That value is 41.1% on the testing data, and the overall model accuracy is 89.74%.
If we run with an un-optimized GLM, what would our profit be?
VC.mail.posting <- 0.65 # cost to mail one customer
VC.book <- 15 # cost of the book
VC.book.OH <- VC.book*0.45 # overhead at 45% of book cost
P.book <- 31.95 # selling price of the book
Prof.book <- P.book - VC.mail.posting - VC.book - VC.book.OH # profit per book sold (net of its own mailing cost)
N.Cust <- 2300 # customers in the test set
N.booksold <- 74 # true positives: mailed and bought
N.mailing <- 106 # false positives: mailed but did not buy
Profit.unOpt <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.unOpt <- Profit.unOpt / N.Cust
Profit.unOpt
## [1] 637.8
Prof.per.Cust.unOpt
## [1] 0.2773043
If BBBC used the unoptimized model’s predictions (with no Sensitivity/Specificity tuning) and marketed only to those people, they would pull in a profit of $637.80 from the test set, or about 28 cents per customer.
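Since this same calculation recurs for every model below, a small hypothetical helper makes the bookkeeping explicit (a sketch; the document recomputes the quantities inline each time, and the name mailing.profit is our own):
# Profit = (books sold x per-book profit) - (wasted mailings x mailing cost)
mailing.profit <- function(n_sold, n_wasted_mail,
                           prof_book = Prof.book,
                           mail_cost = VC.mail.posting) {
  n_sold * prof_book - n_wasted_mail * mail_cost
}
mailing.profit(74, 106) # 637.8, matching Profit.unOpt above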
Tune the selection threshold to optimize both Sensitivity and Specificity before predicting against the test set.
Unleash the Tuning Kraken!
#ROC Curve and AUC
pr = predict(glm.step, BBBC.pseudo.test, type = "response") # predicted probabilities from the stepwise GLM on the pseudo-test set
response = BBBC.pseudo.test$Choice
pred <- prediction(pr, response) #Predicted Probability and True Classification
# area under curve
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
# some important statistics
false.rates <-performance(pred, "fpr","fnr")
accuracy <-performance(pred, "acc","err")
perf <- performance(pred, "tpr","fpr")
#plotting the ROC curve and computing AUC
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
# computing threshold for cutoff to best trade off sensitivity and specificity
#first sensitivity
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE) # plot another line in same plot
#second specificity
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
#find where the lines intersect
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x #this is the optimal point to best trade off sensitivity and specificity
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 4)
Our optimal prediction threshold is 0.22. Now that we have tuned our GLM on the training set, let’s see how it performs on the test set (fingers crossed)!
glm.step.probs <- predict(glm.step, BBBC.test, type = "response")
glm.step.preds <- rep(0, length(glm.step.probs))
glm.step.preds[glm.step.probs > 0.22] = 1
caret::confusionMatrix(as.factor(glm.step.preds), BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1462 55
## 1 634 149
##
## Accuracy : 0.7004
## 95% CI : (0.6813, 0.7191)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1876
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.73039
## Specificity : 0.69752
## Pos Pred Value : 0.19029
## Neg Pred Value : 0.96374
## Prevalence : 0.08870
## Detection Rate : 0.06478
## Detection Prevalence : 0.34043
## Balanced Accuracy : 0.71396
##
## 'Positive' Class : 1
##
The optimized GLM model’s overall accuracy is 70.0%, and its optimized Sensitivity and Specificity are 73.0% and 69.8%, respectively. By optimizing our Sensitivity and Specificity, we see that our Pos Pred Value has dropped to 19.0%.
Estimated profits using an optimized model:
N.Cust <- 2300
N.booksold <- 149
N.mailing <- 634
Profit.Opt <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.Opt <- Profit.Opt / N.Cust
Profit.Opt
## [1] 1010.85
Prof.per.Cust.Opt
## [1] 0.4395
Even though our Pos Pred Value has dropped to 19.0%, the Sensitivity/Specificity-optimized model yields more profit than the un-tuned one. If BBBC used the optimized GLM predictions to direct its mailing, it would bring in a profit of $1,010.85, or about 44 cents per customer.
Percent improvement of the optimized over the un-optimized GLM:
#The % improvement in profit for using optimized.
glm.tuned.pct.imp <- ((Profit.Opt - Profit.unOpt) / Profit.unOpt) * 100
glm.tuned.pct.imp
## [1] 58.49012
How many “1”s are in the test set?
Test.Choice.1 <- sum(BBBC.test$Choice == 1)
Test.Choice.1
## [1] 204
Test.Choice.0 <- sum(BBBC.test$Choice == 0)
Test.Choice.0
## [1] 2096
Test.Choice.0+Test.Choice.1
## [1] 2300
Training.Choice.1 <- sum(BBBC.train$Choice == 1)
Training.Choice.1
## [1] 400
Training.Choice.0 <- sum(BBBC.train$Choice == 0)
Training.Choice.0
## [1] 1200
Training.Choice.0+Training.Choice.1
## [1] 1600
What if Bookbinders didn’t bother using a model and sent the offer to the entire list?
N.Cust <- 2300
N.booksold <- 204
N.mailing <- 2096
Profit.NoMod <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.NoMod <- Profit.NoMod / N.Cust
Profit.NoMod
## [1] 585.8
Prof.per.Cust.NoMod
## [1] 0.2546957
If BBBC sent the mailing list out to its entire customer base in the test set, its profit would be $585.80, or 25 cents per customer.
That means the optimized model’s percent boost in profit to Bookbinders is:
#The % improvement in profit of the optimized model over no model.
glm.tuned.pct.boost <- ((Profit.Opt - Profit.NoMod) / Profit.NoMod) * 100
glm.tuned.pct.boost
glm.tuned.pct.boost
## [1] 72.55889
If BBBC uses a tuned GLM to selectively market The Art History of Florence, they may expect approximately 73% more profit than a general distribution campaign.
For a 50,000-customer campaign, the profit boost amounts to:
50000*(Prof.per.Cust.Opt - Prof.per.Cust.NoMod)
## [1] 9240.217
Applying the test-sample results to a population of 50,000, BBBC may experience an additional $9,240.22 in profit.
There is another method of tuning: choose the cutoff that minimizes the model’s overall error rate.
First we define a cut sequence, the array of selection thresholds we wish to evaluate.
cut.seq <- seq(0.01, 0.99, by = 0.01)
Create a new glm intended for cut-sequence evaluation (note that it drops Amount_purchased relative to glm.step).
seq.glm <- glm(Choice ~ P_Art + Frequency + Gender + P_Cook + P_Child + P_DIY, family = "binomial", data = BBBC.pseudo.train)
Run the new glm through an optimizing loop that finds the cutoff minimizing the error rate on the pseudo-test set.
est.prob <- predict(seq.glm, newdata=BBBC.pseudo.test[ ,-1], type="response") # predicted probabilities on pseudo-test; these don't depend on the cutoff, so compute once
err <- c()
for(i in 1:length(cut.seq)){
  est.class <- ifelse(est.prob > cut.seq[i], 1, 0) # if p-hat > cut-off then class=1, else class=0
  err[i] <- mean(BBBC.pseudo.test.true != est.class, na.rm=TRUE) # error rate at this cutoff
}
par(mfrow=c(1,1))
plot(cut.seq, err, type="l")
min(err)
## [1] 0.2104167
cut.seq[which.min(err)]
## [1] 0.55
final.cut = cut.seq[which.min(err)]
seq.glm.probs <- predict(seq.glm, BBBC.test, type = "response")
seq.glm.preds <- rep(0, length(seq.glm.probs))
seq.glm.preds[seq.glm.probs > final.cut] = 1
caret::confusionMatrix(as.factor(seq.glm.preds), BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2023 145
## 1 73 59
##
## Accuracy : 0.9052
## 95% CI : (0.8925, 0.9169)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.856
##
## Kappa : 0.3026
##
## Mcnemar's Test P-Value : 1.519e-06
##
## Sensitivity : 0.28922
## Specificity : 0.96517
## Pos Pred Value : 0.44697
## Neg Pred Value : 0.93312
## Prevalence : 0.08870
## Detection Rate : 0.02565
## Detection Prevalence : 0.05739
## Balanced Accuracy : 0.62719
##
## 'Positive' Class : 1
##
N.Cust <- 2300
N.booksold <- 59
N.mailing <- 73
Profit.Opt <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.Opt <- Profit.Opt / N.Cust
Profit.Opt
## [1] 516
Prof.per.Cust.Opt
## [1] 0.2243478
By optimizing the model to minimize error, the model’s accuracy is now 90.5%, but this is because the model is indicating “0” for the majority of observations in the testing dataset.
The profit here is $516.00, or 22 cents per customer.
For the GLM models, profits and profit per customer on the BBBC.test dataset break down like this:
* No model, mail to everyone: $585.80 profit, about 25 cents per customer
* Un-optimized GLM (0.5 cutoff): $637.80 profit, about 28 cents per customer
* Sensitivity/Specificity-tuned GLM (0.22 cutoff): $1,010.85 profit, about 44 cents per customer
* Error-minimizing GLM (0.55 cutoff): $516.00 profit, about 22 cents per customer

SVM has the following advantages:
* It is effective in high-dimensional feature spaces
* Kernel functions allow flexible, non-linear decision boundaries
* The maximum-margin objective tends to generalize well and resist overfitting
SVM has the following disadvantages:
* Tuning performance parameters is critical, since the technique is very sensitive to them
  + Every time a new kernel is evaluated, the model must be re-tuned, which can take time and is computationally expensive
* No interpretation is possible; we cannot derive general “rules of thumb” for who is more or less likely to make a purchase
Tuning an SVM on the training dataset and unleashing it upon the testing dataset. The tuning component runs through a sequence of tuning parameters (gamma and cost). The starting kernel is linear, the simplest; the next would be RBF (radial basis function).
set.seed(1)
tuned.svm <- tune.svm(Choice ~ ., data=BBBC.train, kernel = 'linear',
gamma = seq(0.01, 0.1, by = 0.025),
cost = seq(0.1, 1.2, by=0.1), scale=TRUE)
tuned.svm$best.parameters
## gamma cost
## 25 0.01 0.7
Creating the SVM using the tuned parameters:
svm1 <- svm(Choice ~ ., data = BBBC.train, kernel = 'linear',
gamma = tuned.svm$best.parameters$gamma,
cost = tuned.svm$best.parameters$cost)
Checking predictions for SVM1 against the training data:
svm.preds <- predict(svm1, BBBC.train)
caret::confusionMatrix(svm.preds, BBBC.train$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1145 267
## 1 55 133
##
## Accuracy : 0.7988
## 95% CI : (0.7783, 0.8181)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 2.363e-06
##
## Kappa : 0.3482
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.33250
## Specificity : 0.95417
## Pos Pred Value : 0.70745
## Neg Pred Value : 0.81091
## Prevalence : 0.25000
## Detection Rate : 0.08313
## Detection Prevalence : 0.11750
## Balanced Accuracy : 0.64333
##
## 'Positive' Class : 1
##
Overall training accuracy of the tuned SVM is 79.9%. Sensitivity and specificity are ~33% and ~95%, respectively.
Running SVM on testing data:
svm.preds <- predict(svm1, BBBC.test)
caret::confusionMatrix(svm.preds, BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2011 147
## 1 85 57
##
## Accuracy : 0.8991
## 95% CI : (0.8861, 0.9111)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9802
##
## Kappa : 0.2768
##
## Mcnemar's Test P-Value : 6.206e-05
##
## Sensitivity : 0.27941
## Specificity : 0.95945
## Pos Pred Value : 0.40141
## Neg Pred Value : 0.93188
## Prevalence : 0.08870
## Detection Rate : 0.02478
## Detection Prevalence : 0.06174
## Balanced Accuracy : 0.61943
##
## 'Positive' Class : 1
##
Tuned linear kernel SVM achieves overall testing accuracy of 89.9%, with Sensitivity and Specificity equal to 27.9% and 95.9%, respectively.
Profit numbers for the linear kernel SVM:
N.Cust <- 2300
N.booksold <- 57
N.mailing <- 85
Profit.LSVM <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.LSVM <- Profit.LSVM / N.Cust
Profit.LSVM
## [1] 489.1
Prof.per.Cust.LSVM
## [1] 0.2126522
Using a tuned linear SVM as the model, BBBC would realize a profit of $489.10, or about 21 cents per customer. This is less than what they would realize using a general mailing campaign with no model at all.
Tuning on RBF:
set.seed(1)
tuned.svm.rad <- tune.svm(Choice ~ ., data=BBBC.train, kernel = 'radial',
gamma = seq(0.01, 0.1, by = 0.01),
cost = seq(0.1, 1.2, by=0.25), scale=TRUE)
tuned.svm.rad$best.parameters
## gamma cost
## 35 0.05 0.85
svm2 <- svm(Choice ~ ., data = BBBC.train, kernel = 'radial',
gamma = tuned.svm.rad$best.parameters$gamma,
cost = tuned.svm.rad$best.parameters$cost)
svm.preds.rad <- predict(svm2, BBBC.test)
caret::confusionMatrix(svm.preds.rad, BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2054 166
## 1 42 38
##
## Accuracy : 0.9096
## 95% CI : (0.8971, 0.921)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.6327
##
## Kappa : 0.2291
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.18627
## Specificity : 0.97996
## Pos Pred Value : 0.47500
## Neg Pred Value : 0.92523
## Prevalence : 0.08870
## Detection Rate : 0.01652
## Detection Prevalence : 0.03478
## Balanced Accuracy : 0.58312
##
## 'Positive' Class : 1
##
Tuned radial basis function kernel SVM achieves overall testing accuracy of approximately 91%, with Sensitivity and Specificity of approximately 19% and 98%, respectively.
Profit numbers:
N.Cust <- 2300
N.booksold <- 38
N.mailing <- 42
Profit.RSVM <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.RSVM <- Profit.RSVM / N.Cust
Profit.RSVM
## [1] 335.6
Prof.per.Cust.RSVM
## [1] 0.145913
Despite the high overall accuracy, the tuned RBF SVM model does not really help BBBC increase its profits. The profit realized by following the tuned RBF SVM’s advice is $335.60, or about 15 cents per customer. This is getting worse, not better!
Evaluating a polynomial kernel SVM:
set.seed(1)
tuned.svm.poly <- tune.svm(Choice ~ ., data=BBBC.train, kernel = 'polynomial',
gamma = seq(0.01, 0.1, by = 0.01),
cost = seq(0.1, 1.2, by=0.25), scale=TRUE)
tuned.svm.poly$best.parameters
## gamma cost
## 40 0.1 0.85
svm3 <- svm(Choice ~ ., data = BBBC.train, kernel = 'polynomial',
gamma = tuned.svm.poly$best.parameters$gamma,
cost = tuned.svm.poly$best.parameters$cost)
svm.preds.poly <- predict(svm3, BBBC.test)
caret::confusionMatrix(svm.preds.poly, BBBC.test$Choice, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2049 168
## 1 47 36
##
## Accuracy : 0.9065
## 95% CI : (0.8939, 0.9181)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.8013
##
## Kappa : 0.2104
##
## Mcnemar's Test P-Value : 2.747e-16
##
## Sensitivity : 0.17647
## Specificity : 0.97758
## Pos Pred Value : 0.43373
## Neg Pred Value : 0.92422
## Prevalence : 0.08870
## Detection Rate : 0.01565
## Detection Prevalence : 0.03609
## Balanced Accuracy : 0.57702
##
## 'Positive' Class : 1
##
The SVM with polynomial kernel has an overall testing accuracy of 91%, with Sensitivity and Specificity of 17.6% and 97.8%, respectively.
SVM with poly kernel profit numbers:
N.Cust <- 2300
N.booksold <- 36
N.mailing <- 47
Profit.PSVM <- N.booksold * (Prof.book) - N.mailing * (VC.mail.posting)
Prof.per.Cust.PSVM <- Profit.PSVM / N.Cust
Profit.PSVM
## [1] 313.25
Prof.per.Cust.PSVM
## [1] 0.1361957
The more complex the SVM gets, the deeper into the hole it digs. At this point BBBC would realize a profit of $313.25, or about 14 cents per customer, from the test set.
For a general mailing campaign, where no model is used at all, BBBC would realize a potential profit of $585.80, or about 25 cents per customer.
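To close, a sketch collecting the test-set profit figures computed above into one comparison table (the numbers are the ones already derived; the data frame itself is our own summary):
# Test-set profit summary for each approach, per the calculations above
profit.summary <- data.frame(
  Model = c("No model (mail everyone)", "GLM, 0.5 cutoff",
            "GLM, ROC-tuned 0.22 cutoff", "GLM, error-min 0.55 cutoff",
            "SVM, linear kernel", "SVM, radial kernel", "SVM, polynomial kernel"),
  Profit = c(585.80, 637.80, 1010.85, 516.00, 489.10, 335.60, 313.25)
)
profit.summary$Per.Customer <- round(profit.summary$Profit / 2300, 4)
profit.summary[order(-profit.summary$Profit), ] # the ROC-tuned GLM leads by a wide margin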