Loading in our libraries
Installing the 'readxl' package so we can read the Excel files into R.
train_data <- read_excel("~/R-Studio/DA6813 Applications/Case Study 2/BBBC-Train.xlsx")
test_data <- read_excel("~/R-Studio/DA6813 Applications/Case Study 2/BBBC-Test.xlsx")
Looking at the structure and first few observations of the data
str(train_data)
## tibble [1,600 × 12] (S3: tbl_df/tbl/data.frame)
## $ Observation : num [1:1600] 1 2 3 4 5 6 7 8 9 10 ...
## $ Choice : num [1:1600] 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : num [1:1600] 1 1 1 1 0 1 1 0 1 1 ...
## $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
## $ Frequency : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
## $ Last_purchase : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
## $ First_purchase : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
## $ P_Child : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
## $ P_Youth : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
## $ P_Cook : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
## $ P_DIY : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
## $ P_Art : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...
Variable definitions:
Choice: 1 = purchased, 0 = did not purchase.
Gender: 0 = female, 1 = male.
Amount_purchased: total dollar amount purchased at BBBC.
Frequency: number of purchases within a given time frame.
Last_purchase: months since the most recent purchase.
First_purchase: months since the first purchase in the BBBC system.
P_Child: number of children's books purchased.
P_Youth: number of youth books purchased.
P_Cook: number of cookbooks purchased.
P_DIY: number of DIY books purchased.
P_Art: number of art books purchased.
head(train_data)
## # A tibble: 6 × 12
## Observation Choice Gender Amount_purchased Frequency Last_purchase
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 113 8 1
## 2 2 1 1 418 6 11
## 3 3 1 1 336 18 6
## 4 4 1 1 180 16 5
## 5 5 1 0 320 2 3
## 6 6 1 1 268 4 1
## # ℹ 6 more variables: First_purchase <dbl>, P_Child <dbl>, P_Youth <dbl>,
## # P_Cook <dbl>, P_DIY <dbl>, P_Art <dbl>
train_data = train_data %>%
dplyr::select(-Observation)
Removing Observation because it just records the row number, which is not useful information. Now checking for missing values.
anyNA(train_data)
## [1] FALSE
No missing values in this data set.
train_lm = lm(Choice~., data = train_data)
summary(train_lm)
##
## Call:
## lm(formula = Choice ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9603 -0.2462 -0.1161 0.1622 1.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3642284 0.0307411 11.848 < 2e-16 ***
## Gender -0.1309205 0.0200303 -6.536 8.48e-11 ***
## Amount_purchased 0.0002736 0.0001110 2.464 0.0138 *
## Frequency -0.0090868 0.0021791 -4.170 3.21e-05 ***
## Last_purchase 0.0970286 0.0135589 7.156 1.26e-12 ***
## First_purchase -0.0020024 0.0018160 -1.103 0.2704
## P_Child -0.1262584 0.0164011 -7.698 2.41e-14 ***
## P_Youth -0.0963563 0.0201097 -4.792 1.81e-06 ***
## P_Cook -0.1414907 0.0166064 -8.520 < 2e-16 ***
## P_DIY -0.1352313 0.0197873 -6.834 1.17e-11 ***
## P_Art 0.1178494 0.0194427 6.061 1.68e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared: 0.2401, Adjusted R-squared: 0.2353
## F-statistic: 50.2 on 10 and 1589 DF, p-value: < 2.2e-16
This linear model is not useful for our task, which is to find the customers who said "Yes" (Choice = 1). The intercept is 0.364, meaning that when every predictor is 0 the model predicts a "choice" of 0.364, a value that is neither 0 nor 1 and cannot be read directly as a class. Most coefficients are negative; only Amount_purchased, Last_purchase, and P_Art push the prediction up. More fundamentally, a linear model's predictions are not bounded to [0, 1], so we should run a logistic regression instead, since we care about the odds of a customer purchasing The Art History of Florence.
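In fact, because Choice is 0/1, the maximum residual of 1.0588 in the summary above implies a fitted value below 0 (either 1 - 1.0588 or 0 - 1.0588), already outside [0, 1]. A quick one-line check (a sketch; output not shown here):
range(fitted(train_lm)) # linear-model predictions are not bounded to [0, 1]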
We can also build diagnostic plots to show how badly the linear-model assumptions are violated.
par(mfrow=c(2,2))
plot(train_lm, which=c(1:4))
In the Residuals vs Fitted plot, the points should be randomly scattered but instead fall along parallel lines. This is likely because our target variable, Choice, takes only the values 0 and 1. The Q-Q plot shows the residuals are not normally distributed. The Scale-Location plot shows heteroscedasticity: the variance is unequal and the points trace quadratic-looking curves. The Cook's Distance plot flags a few potentially influential points.
Now I will convert Choice and Gender into factors so the classification code below runs correctly. The same conversion is applied to the test set.
train_data$Choice = as.factor(train_data$Choice)
train_data$Gender = as.factor(train_data$Gender)
test_data$Choice = as.factor(test_data$Choice)
test_data$Gender = as.factor(test_data$Gender)
train_glm1 = glm(Choice ~ . , data = train_data, family = binomial)
summary(train_glm1)
##
## Call:
## glm(formula = Choice ~ ., family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3515281 0.2143839 -1.640 0.1011
## Gender1 -0.8632319 0.1374499 -6.280 3.38e-10 ***
## Amount_purchased 0.0018641 0.0007918 2.354 0.0186 *
## Frequency -0.0755142 0.0165937 -4.551 5.35e-06 ***
## Last_purchase 0.6117713 0.0938127 6.521 6.97e-11 ***
## First_purchase -0.0147792 0.0128027 -1.154 0.2483
## P_Child -0.8112489 0.1167067 -6.951 3.62e-12 ***
## P_Youth -0.6370422 0.1433778 -4.443 8.87e-06 ***
## P_Cook -0.9230066 0.1194814 -7.725 1.12e-14 ***
## P_DIY -0.9058697 0.1437025 -6.304 2.90e-10 ***
## P_Art 0.6861124 0.1270176 5.402 6.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1392.2 on 1589 degrees of freedom
## AIC: 1414.2
##
## Number of Fisher Scoring iterations: 5
The initial AIC is 1414.2. Now checking for multicollinearity in the model.
vif(train_glm1)
## Gender Amount_purchased Frequency Last_purchase
## 1.023359 1.232172 2.490447 17.706670
## First_purchase P_Child P_Youth P_Cook
## 9.247748 2.992269 1.761546 3.229097
## P_DIY P_Art
## 1.992698 1.938089
Last_purchase has a VIF of 17.7, which exceeds the common cutoffs of 5 and 10, so I will remove it and rerun the logit model.
train_glm2 = glm(Choice ~ .-Last_purchase , data = train_data, family = binomial)
summary(train_glm2)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1489829 0.2095375 -0.711 0.477079
## Gender1 -0.8302649 0.1350384 -6.148 7.83e-10 ***
## Amount_purchased 0.0022691 0.0007747 2.929 0.003399 **
## Frequency -0.1194992 0.0152620 -7.830 4.89e-15 ***
## First_purchase 0.0306235 0.0108454 2.824 0.004748 **
## P_Child -0.3456948 0.0908420 -3.805 0.000142 ***
## P_Youth -0.1789417 0.1226235 -1.459 0.144489
## P_Cook -0.4578299 0.0950443 -4.817 1.46e-06 ***
## P_DIY -0.4265209 0.1209960 -3.525 0.000423 ***
## P_Art 1.0778036 0.1144995 9.413 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1437.0 on 1590 degrees of freedom
## AIC: 1457
##
## Number of Fisher Scoring iterations: 5
AIC went UP to 1457 after removing Last_purchase. Possible question: should we always just go with the lowest-AIC model? Checking VIF now.
vif(train_glm2)
## Gender Amount_purchased Frequency First_purchase
## 1.021977 1.220305 2.173240 6.886806
## P_Child P_Youth P_Cook P_DIY
## 1.904631 1.320305 2.060140 1.462770
## P_Art
## 1.603865
First_purchase is the highest at 6.88, which gives us the option of removing it if we set our cutoff at 5. I will remove it and see if AIC goes up again; if it does, I will continue with train_glm2 instead.
train_glm3 = glm(Choice ~ .-Last_purchase-First_purchase , data = train_data, family = binomial)
summary(train_glm3)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase - First_purchase, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.286380 0.202966 -1.411 0.15825
## Gender1 -0.811948 0.134579 -6.033 1.61e-09 ***
## Amount_purchased 0.002406 0.000771 3.120 0.00181 **
## Frequency -0.088625 0.010385 -8.534 < 2e-16 ***
## P_Child -0.194796 0.072207 -2.698 0.00698 **
## P_Youth -0.031928 0.109605 -0.291 0.77082
## P_Cook -0.292392 0.072998 -4.005 6.19e-05 ***
## P_DIY -0.279282 0.108094 -2.584 0.00977 **
## P_Art 1.245842 0.099062 12.576 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1445.0 on 1591 degrees of freedom
## AIC: 1463
##
## Number of Fisher Scoring iterations: 5
AIC went UP again, to 1463, so I will use train_glm2 for the rest of the logistic regression classification.
probabilities = predict(train_glm2,newdata = test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07913 0.00000 1.00000
pred <- prediction(probabilities,test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
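Since this ROC/optimal-cutoff block gets repeated for every model below, it could be wrapped in a small helper. This is just a sketch (roc_cutoff is a name I am introducing), assuming ROCR is loaded:
roc_cutoff = function(probs, truth) {
  pred = prediction(probs, truth)
  auc = round(as.numeric(performance(pred, measure = "auc")@y.values), 3)
  sens = performance(pred, "sens") # sensitivity at each cutoff
  spec = performance(pred, "spec") # specificity at each cutoff
  # cutoff where sensitivity and specificity are closest together
  min.diff = which.min(abs(unlist(sens@y.values) - unlist(spec@y.values)))
  list(auc = auc, optimal = unlist(sens@x.values)[min.diff])
}
roc_cutoff(probabilities, test_data$Choice) # should return the ~0.23 cutoff used below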
Because our data set is unbalanced, we will need to either rebalance it or change the cutoff probability for classifying yes/no.
pr_class = ifelse(probabilities>0.23,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1491 60
## 1 605 144
##
## Accuracy : 0.7109
## 95% CI : (0.6919, 0.7293)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1892
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7114
## Specificity : 0.7059
## Pos Pred Value : 0.9613
## Neg Pred Value : 0.1923
## Prevalence : 0.9113
## Detection Rate : 0.6483
## Detection Prevalence : 0.6743
## Balanced Accuracy : 0.7086
##
## 'Positive' Class : 0
##
Accuracy: 0.7109 Sensitivity: 0.7114 Specificity: 0.7059
Getting an optimal cutoff this far below 0.5 shows how imbalanced the data are, so we will calculate the imbalance ratio.
choicecount = table(train_data$Choice)
choicecount
##
## 0 1
## 1200 400
So only 400 of the 1,600 customers in our training set would purchase The Art History of Florence. We can downsample the data to balance the customers who would purchase the book against those who would not. This is a point to revisit; first I will run an SVM model to see if it gives better initial results.
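For when we do rebalance further down, caret offers a one-line alternative to the manual dplyr undersampling used later. A sketch, assuming the caret package is available (balanced is a name I am introducing):
# downSample() randomly samples the majority class down to the minority count
balanced = caret::downSample(x = train_data[, names(train_data) != "Choice"],
                             y = train_data$Choice, yname = "Choice")
table(balanced$Choice) # should show 400 / 400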
Now I will try glm1, since it had the lowest AIC, to see if it is a better model.
probabilities = predict(train_glm1,newdata = test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.09 0.00 1.00
pred <- prediction(probabilities,test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
As before, the class imbalance means we should classify with the adjusted cutoff rather than 0.5.
pr_class = ifelse(probabilities>0.22,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1505 56
## 1 591 148
##
## Accuracy : 0.7187
## 95% CI : (0.6998, 0.737)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2031
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7180
## Specificity : 0.7255
## Pos Pred Value : 0.9641
## Neg Pred Value : 0.2003
## Prevalence : 0.9113
## Detection Rate : 0.6543
## Detection Prevalence : 0.6787
## Balanced Accuracy : 0.7218
##
## 'Positive' Class : 0
##
Accuracy: 0.7187 Sensitivity: 0.7180 Specificity: 0.7255
glm1 gave slightly better results than glm2, and its confusion matrix produces the best profit in the calculation below.
set.seed(1)
tuned = tune.svm(Choice~., data= train_data, gamma = seq(0.01, .1, by = 0.01),
cost = seq(0.1, 1, by = 0.1))
Setting the seed to 1 to make the results reproducible.
tuned$best.parameters
## gamma cost
## 75 0.05 0.8
The best parameters are a gamma of 0.05 and a cost of 0.8.
tuned$performances
## gamma cost error dispersion
## 1 0.01 0.1 0.250000 0.03461093
## 2 0.02 0.1 0.250000 0.03461093
## 3 0.03 0.1 0.245625 0.03633013
## 4 0.04 0.1 0.242500 0.03665720
## 5 0.05 0.1 0.240000 0.03574602
## 6 0.06 0.1 0.240000 0.03387579
## 7 0.07 0.1 0.239375 0.03461720
## 8 0.08 0.1 0.240625 0.03611445
## 9 0.09 0.1 0.242500 0.03606033
## 10 0.10 0.1 0.243750 0.03560002
## 11 0.01 0.2 0.246875 0.03550846
## 12 0.02 0.2 0.226875 0.04083014
## 13 0.03 0.2 0.218750 0.03608439
## 14 0.04 0.2 0.216875 0.03294698
## 15 0.05 0.2 0.217500 0.02987241
## 16 0.06 0.2 0.221875 0.03244252
## 17 0.07 0.2 0.221250 0.03148523
## 18 0.08 0.2 0.221875 0.03149212
## 19 0.09 0.2 0.221875 0.03149212
## 20 0.10 0.2 0.221250 0.03036239
## 21 0.01 0.3 0.231875 0.03891430
## 22 0.02 0.3 0.216250 0.03309813
## 23 0.03 0.3 0.216250 0.03243583
## 24 0.04 0.3 0.215625 0.03121526
## 25 0.05 0.3 0.218750 0.03104097
## 26 0.06 0.3 0.220000 0.03030516
## 27 0.07 0.3 0.221250 0.03036239
## 28 0.08 0.3 0.220625 0.03034094
## 29 0.09 0.3 0.222500 0.03175973
## 30 0.10 0.3 0.221875 0.03162964
## 31 0.01 0.4 0.220625 0.03536148
## 32 0.02 0.4 0.216875 0.03090785
## 33 0.03 0.4 0.217500 0.03001736
## 34 0.04 0.4 0.218125 0.02982152
## 35 0.05 0.4 0.218750 0.03186887
## 36 0.06 0.4 0.218750 0.03173239
## 37 0.07 0.4 0.220625 0.03076710
## 38 0.08 0.4 0.219375 0.03179388
## 39 0.09 0.4 0.219375 0.03082348
## 40 0.10 0.4 0.220625 0.02902495
## 41 0.01 0.5 0.216250 0.03078826
## 42 0.02 0.5 0.215000 0.02963903
## 43 0.03 0.5 0.217500 0.03001736
## 44 0.04 0.5 0.216875 0.03160218
## 45 0.05 0.5 0.216875 0.03160218
## 46 0.06 0.5 0.217500 0.03170502
## 47 0.07 0.5 0.216250 0.03216710
## 48 0.08 0.5 0.216250 0.03078826
## 49 0.09 0.5 0.219375 0.02832874
## 50 0.10 0.5 0.219375 0.02642633
## 51 0.01 0.6 0.213125 0.02675279
## 52 0.02 0.6 0.215625 0.03022629
## 53 0.03 0.6 0.216250 0.03092891
## 54 0.04 0.6 0.215625 0.03079530
## 55 0.05 0.6 0.215625 0.03079530
## 56 0.06 0.6 0.213125 0.03011119
## 57 0.07 0.6 0.213750 0.03001736
## 58 0.08 0.6 0.218125 0.02755203
## 59 0.09 0.6 0.218750 0.02825971
## 60 0.10 0.6 0.220000 0.02807480
## 61 0.01 0.7 0.214375 0.02701112
## 62 0.02 0.7 0.216250 0.02993047
## 63 0.03 0.7 0.216250 0.03092891
## 64 0.04 0.7 0.215625 0.03051212
## 65 0.05 0.7 0.213750 0.03087272
## 66 0.06 0.7 0.213125 0.03011119
## 67 0.07 0.7 0.217500 0.02598744
## 68 0.08 0.7 0.216875 0.02602916
## 69 0.09 0.7 0.218125 0.02675279
## 70 0.10 0.7 0.219375 0.02848154
## 71 0.01 0.8 0.214375 0.02842052
## 72 0.02 0.8 0.216875 0.03076710
## 73 0.03 0.8 0.216250 0.03050501
## 74 0.04 0.8 0.215000 0.02829041
## 75 0.05 0.8 0.212500 0.02946278
## 76 0.06 0.8 0.215625 0.02655739
## 77 0.07 0.8 0.215000 0.02719528
## 78 0.08 0.8 0.216250 0.02571883
## 79 0.09 0.8 0.218125 0.02786531
## 80 0.10 0.8 0.219375 0.02739405
## 81 0.01 0.9 0.215000 0.02874698
## 82 0.02 0.9 0.217500 0.03016160
## 83 0.03 0.9 0.216250 0.02978511
## 84 0.04 0.9 0.214375 0.02826739
## 85 0.05 0.9 0.212500 0.02810570
## 86 0.06 0.9 0.216250 0.02687419
## 87 0.07 0.9 0.215000 0.02605416
## 88 0.08 0.9 0.216875 0.02602916
## 89 0.09 0.9 0.218750 0.02618709
## 90 0.10 0.9 0.218125 0.02576099
## 91 0.01 1.0 0.214375 0.02857282
## 92 0.02 1.0 0.216875 0.03062571
## 93 0.03 1.0 0.213750 0.02807480
## 94 0.04 1.0 0.213125 0.02996671
## 95 0.05 1.0 0.213750 0.02729087
## 96 0.06 1.0 0.215000 0.02622022
## 97 0.07 1.0 0.215625 0.02589542
## 98 0.08 1.0 0.217500 0.02822897
## 99 0.09 1.0 0.216875 0.02619538
## 100 0.10 1.0 0.217500 0.02479079
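Rather than scanning the full grid by eye, e1071 also provides a plot method for tune objects that draws the cross-validation error surface over the gamma/cost grid (not shown here):
plot(tuned) # contour plot of CV error across the tuning grid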
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.8
##
## Number of Support Vectors: 782
##
## ( 371 411 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2054 166
## 1 42 38
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2054 166
## 1 42 38
##
## Accuracy : 0.9096
## 95% CI : (0.8971, 0.921)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.6327
##
## Kappa : 0.2291
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9800
## Specificity : 0.1863
## Pos Pred Value : 0.9252
## Neg Pred Value : 0.4750
## Prevalence : 0.9113
## Detection Rate : 0.8930
## Detection Prevalence : 0.9652
## Balanced Accuracy : 0.5831
##
## 'Positive' Class : 0
##
Accuracy: 0.9096 Sensitivity: 0.9800 Specificity: 0.1863
The SVM model has the best accuracy and sensitivity of the models so far, but its specificity is poor, so we can try downsampling the data to improve on the earlier models. We also need to explicitly explain our results.
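One alternative to downsampling worth noting: e1071's svm() accepts a class.weights argument that penalizes errors on the minority class more heavily. A sketch, not run here (the 3x weight is my assumption, roughly matching the 1200:400 imbalance):
svm_weighted = svm(Choice ~ ., data = train_data,
                   gamma = tuned$best.parameters$gamma,
                   cost = tuned$best.parameters$cost,
                   class.weights = c("0" = 1, "1" = 3))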
We can also try model selection on our data using stepwise selection by AIC (done further below).
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45
book_selling_price = 31.95
response_rate = 0.0903 # this is the variable that needs to change based on the model
profit_per_book = book_selling_price - (company_book_price + overhead)
total_customers = 50000
profit = (total_customers * response_rate) * profit_per_book - (total_customers * cost)
profit
## [1] 13553
A baseline mailing like BBBC's prior trial would net $13,553. We will now estimate each model's profit by taking its true positives out of the 2,300 test observations and plugging them into the profit formula.
Questions to ask in class regarding the profit formula: should we count only the people we predicted accurately, or everyone who made a purchase, as the prompt suggests?
We scale the profit proportionally by 50000/2300, since our test set has 2,300 observations and we need to adjust the number of people we mail based on what the model predicts.
My main concern with our formula is that it drops every 0 response, which makes sense for the question being asked. In BBBC's 20,000-person trial, though, 9.03% (~1,800 people) purchased the book, and I assume their count includes everyone who bought, not just correct predictions, so our formula does not line up with theirs. Does our formula still work, given that our main goal is to mail only the people most likely to purchase?
total_customers = 148 + 56 #TP + FN: everyone in the test set who actually purchased
response_rate = 148/total_customers #TP/(TP+FN): share of actual purchasers the model caught
multiplier = 50000/2300 # 50000 / test data set size
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 29934.78
Our response rate is TP/(TP + FN): of everyone in the test set who actually purchased the book, this is the share our model correctly predicted. We drop all the other responses from the test set because those people do not make us profit, and we use a multiplier of 50000/(test-set size) to scale the result up to the full mailing.
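To avoid retyping the formula for every model below, it could be wrapped in a helper. A sketch (model_profit is a name I am introducing; the defaults come from the case figures above):
model_profit = function(tp, fn, profit_per_book = 10.2, mail_cost = 0.65,
                        test_size = 2300, pool = 50000) {
  purchasers = tp + fn          # everyone in the test set who actually bought
  multiplier = pool / test_size # scale the test set up to the 50,000 mailing
  (tp * profit_per_book - purchasers * mail_cost) * multiplier
}
model_profit(148, 56) # glm1: 29934.78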
Currently, $29,934.78 from glm1 (TP = 148, FN = 56) is my best profit margin. I will rebalance the data and see if that changes my results.
# Undersampling train_data so both levels of Choice are equally represented
set.seed(1)
# Calculate the number of instances in each class
counts <- table(train_data$Choice)
# Identify the minority and majority classes
minority_class <- names(counts)[which.min(counts)]
majority_class <- names(counts)[which.max(counts)]
# Determine the number of instances in the minority class
minority_count <- counts[minority_class]
# Undersample the majority class
undersampled_train_data <- train_data %>%
group_by(Choice) %>%
sample_n(minority_count)
# Now undersampled_train_data contains an equal number of instances from both classes
undersampled_train_data$Choice = as.factor(undersampled_train_data$Choice)
Dr. Roy said we need to balance the test set as well, so I will do that now.
# Doing the same undersampling for test_data
# Calculate the number of instances in each class
counts <- table(test_data$Choice)
# Identify the minority and majority classes
minority_class <- names(counts)[which.min(counts)]
majority_class <- names(counts)[which.max(counts)]
# Determine the number of instances in the minority class
minority_count <- counts[minority_class]
# Undersample the majority class
undersampled_test_data <- test_data %>%
group_by(Choice) %>%
sample_n(minority_count)
# Now undersampled_test_data contains an equal number of instances from both classes
undersampled_test_data$Choice = as.factor(undersampled_test_data$Choice)
undersampled_test_size = nrow(undersampled_test_data)
undersampled_test_size
## [1] 408
Storing the undersampled test-set size in a variable for later use.
und_train_glm1 = glm(Choice ~ . , data = undersampled_train_data, family = binomial)
summary(und_train_glm1)
##
## Call:
## glm(formula = Choice ~ ., family = binomial, data = undersampled_train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7065398 0.2617611 2.699 0.006951 **
## Gender1 -0.8029147 0.1716667 -4.677 2.91e-06 ***
## Amount_purchased 0.0020443 0.0009902 2.064 0.038973 *
## Frequency -0.0793464 0.0204223 -3.885 0.000102 ***
## Last_purchase 0.4578916 0.1160305 3.946 7.94e-05 ***
## First_purchase -0.0067712 0.0160598 -0.422 0.673301
## P_Child -0.6544372 0.1397434 -4.683 2.83e-06 ***
## P_Youth -0.4593219 0.1792462 -2.563 0.010392 *
## P_Cook -0.8477684 0.1525216 -5.558 2.72e-08 ***
## P_DIY -0.8355141 0.1802842 -4.634 3.58e-06 ***
## P_Art 0.8204335 0.1692281 4.848 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1109.04 on 799 degrees of freedom
## Residual deviance: 863.32 on 789 degrees of freedom
## AIC: 885.32
##
## Number of Fisher Scoring iterations: 4
AIC is 885.32 for the first undersampled logistic regression.
vif(und_train_glm1)
## Gender Amount_purchased Frequency Last_purchase
## 1.037176 1.277417 2.820257 17.376208
## First_purchase P_Child P_Youth P_Cook
## 9.598640 2.866003 1.812315 3.482302
## P_DIY P_Art
## 1.931103 1.921508
Last_purchase again has a VIF of about 17.4, so I will remove it to reduce the multicollinearity.
und_train_glm2 = glm(Choice ~ .-Last_purchase , data = undersampled_train_data, family = binomial)
summary(und_train_glm2)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase, family = binomial,
## data = undersampled_train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.874031 0.254078 3.440 0.000582 ***
## Gender1 -0.784309 0.169746 -4.620 3.83e-06 ***
## Amount_purchased 0.002319 0.000977 2.373 0.017635 *
## Frequency -0.114884 0.018296 -6.279 3.41e-10 ***
## First_purchase 0.028265 0.013196 2.142 0.032198 *
## P_Child -0.323674 0.108465 -2.984 0.002844 **
## P_Youth -0.081304 0.149327 -0.544 0.586118
## P_Cook -0.482244 0.118626 -4.065 4.80e-05 ***
## P_DIY -0.502835 0.155436 -3.235 0.001216 **
## P_Art 1.131269 0.151564 7.464 8.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1109.04 on 799 degrees of freedom
## Residual deviance: 880.11 on 790 degrees of freedom
## AIC: 900.11
##
## Number of Fisher Scoring iterations: 4
AIC rose to 900.11.
vif(und_train_glm2)
## Gender Amount_purchased Frequency First_purchase
## 1.037407 1.278347 2.329260 6.516841
## P_Child P_Youth P_Cook P_DIY
## 1.809505 1.270922 2.107647 1.475285
## P_Art
## 1.533222
First_purchase has a VIF of about 6.5, so we could take it out as well. I will check the AIC of the next model, with both Last_purchase and First_purchase removed, to see if it is worth dropping.
und_train_glm3 = glm(Choice ~ .-Last_purchase-First_purchase ,
data = undersampled_train_data, family = binomial)
summary(und_train_glm3)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase - First_purchase, family = binomial,
## data = undersampled_train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.754020 0.246809 3.055 0.002250 **
## Gender1 -0.772564 0.169374 -4.561 5.08e-06 ***
## Amount_purchased 0.002431 0.000973 2.499 0.012471 *
## Frequency -0.085913 0.012039 -7.136 9.61e-13 ***
## P_Child -0.191758 0.088457 -2.168 0.030173 *
## P_Youth 0.045496 0.135068 0.337 0.736242
## P_Cook -0.321962 0.090464 -3.559 0.000372 ***
## P_DIY -0.360495 0.140322 -2.569 0.010198 *
## P_Art 1.282755 0.135910 9.438 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1109.04 on 799 degrees of freedom
## Residual deviance: 884.76 on 791 degrees of freedom
## AIC: 902.76
##
## Number of Fisher Scoring iterations: 4
AIC went up again, to 902.76. I will run all three models and see what happens.
probabilities = predict(und_train_glm1,newdata = undersampled_test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4583 1.0000 1.0000
pred <- prediction(probabilities,undersampled_test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.46,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 146 57
## 1 58 147
##
## Accuracy : 0.7181
## 95% CI : (0.6718, 0.7613)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4363
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7157
## Specificity : 0.7206
## Pos Pred Value : 0.7192
## Neg Pred Value : 0.7171
## Prevalence : 0.5000
## Detection Rate : 0.3578
## Detection Prevalence : 0.4975
## Balanced Accuracy : 0.7181
##
## 'Positive' Class : 0
##
Acc: 0.7181 Sens: 0.7157 Spec: 0.7206 TP: 147 FN: 57
probabilities = predict(und_train_glm2,newdata = undersampled_test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4681 1.0000 1.0000
pred <- prediction(probabilities,undersampled_test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.47,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 147 57
## 1 57 147
##
## Accuracy : 0.7206
## 95% CI : (0.6743, 0.7636)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4412
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7206
## Specificity : 0.7206
## Pos Pred Value : 0.7206
## Neg Pred Value : 0.7206
## Prevalence : 0.5000
## Detection Rate : 0.3603
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.7206
##
## 'Positive' Class : 0
##
Acc: 0.7206 Sens: 0.7206 Spec: 0.7206 TP: 147 FN: 57
probabilities = predict(und_train_glm3,newdata = undersampled_test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.473 1.000 1.000
pred <- prediction(probabilities,undersampled_test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.48,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 144 59
## 1 60 145
##
## Accuracy : 0.7083
## 95% CI : (0.6616, 0.752)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4167
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7059
## Specificity : 0.7108
## Pos Pred Value : 0.7094
## Neg Pred Value : 0.7073
## Prevalence : 0.5000
## Detection Rate : 0.3529
## Detection Prevalence : 0.4975
## Balanced Accuracy : 0.7083
##
## 'Positive' Class : 0
##
Acc: 0.7083 Sens: 0.7059 Spec: 0.7108 TP: 145 FN: 59
Undersampled logits 1 and 2 had identical TP and FN counts; undersampled logit 3 was slightly worse. I will plug the best two into the profit formula below.
total_customers = 147 + 57 #TP + FN: everyone in the test set who actually purchased
response_rate = 147/total_customers #TP/(TP+FN): share of actual purchasers the model caught
multiplier = 50000/2300 # 50000 / test data set size
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 29713.04
If we instead use a multiplier of 50000/408 to match the downsampled test set, the profit would be about $167,500. But we will keep the 50000/2300 multiplier, because it seems unrealistic for roughly 70% of the Midwest mailing list to purchase the book.
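A quick arithmetic check of that figure (not run above): with the undersampled test set of 408, the multiplier jumps from ~21.7 to ~122.5.
(147 * 10.2 - 204 * 0.65) * (50000 / 408) # ~167500, vs 29713 with 50000/2300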
set.seed(1)
tuned = tune.svm(Choice~., data= undersampled_train_data, gamma = seq(0.01, .1, by = 0.01),
cost = seq(0.1, 1, by = 0.1))
Setting the seed to 1 to make the results reproducible.
tuned$best.parameters
## gamma cost
## 71 0.01 0.8
The best parameters are a gamma of 0.01 and a cost of 0.8.
tuned$performances
## gamma cost error dispersion
## 1 0.01 0.1 0.34875 0.03408018
## 2 0.02 0.1 0.33000 0.03184162
## 3 0.03 0.1 0.32375 0.03653860
## 4 0.04 0.1 0.32500 0.03996526
## 5 0.05 0.1 0.31750 0.04005205
## 6 0.06 0.1 0.31375 0.03653860
## 7 0.07 0.1 0.31125 0.03972562
## 8 0.08 0.1 0.31250 0.04409586
## 9 0.09 0.1 0.31000 0.04743416
## 10 0.10 0.1 0.31375 0.04505013
## 11 0.01 0.2 0.32125 0.03775377
## 12 0.02 0.2 0.31375 0.04267529
## 13 0.03 0.2 0.31250 0.04330127
## 14 0.04 0.2 0.31250 0.03864008
## 15 0.05 0.2 0.31250 0.04082483
## 16 0.06 0.2 0.31500 0.04362084
## 17 0.07 0.2 0.31750 0.04571956
## 18 0.08 0.2 0.31750 0.04495368
## 19 0.09 0.2 0.31750 0.04866267
## 20 0.10 0.2 0.31250 0.04639804
## 21 0.01 0.3 0.31125 0.04059026
## 22 0.02 0.3 0.30375 0.04450733
## 23 0.03 0.3 0.31500 0.04199868
## 24 0.04 0.3 0.31875 0.04573854
## 25 0.05 0.3 0.31875 0.04759858
## 26 0.06 0.3 0.31625 0.04931827
## 27 0.07 0.3 0.31500 0.05230785
## 28 0.08 0.3 0.31375 0.05050096
## 29 0.09 0.3 0.31125 0.05152197
## 30 0.10 0.3 0.30250 0.05130248
## 31 0.01 0.4 0.31375 0.04543387
## 32 0.02 0.4 0.31000 0.04556741
## 33 0.03 0.4 0.32000 0.04758034
## 34 0.04 0.4 0.31375 0.04803428
## 35 0.05 0.4 0.31000 0.05197489
## 36 0.06 0.4 0.31125 0.05219155
## 37 0.07 0.4 0.31125 0.05573063
## 38 0.08 0.4 0.31125 0.05185785
## 39 0.09 0.4 0.30375 0.05304937
## 40 0.10 0.4 0.30125 0.04839436
## 41 0.01 0.5 0.30250 0.04851976
## 42 0.02 0.5 0.31125 0.04427267
## 43 0.03 0.5 0.31250 0.05651942
## 44 0.04 0.5 0.31000 0.05130248
## 45 0.05 0.5 0.31000 0.05394184
## 46 0.06 0.5 0.30625 0.05781015
## 47 0.07 0.5 0.30875 0.05714565
## 48 0.08 0.5 0.30500 0.05502525
## 49 0.09 0.5 0.30625 0.05376453
## 50 0.10 0.5 0.30500 0.05143766
## 51 0.01 0.6 0.30125 0.04543387
## 52 0.02 0.6 0.31500 0.05296750
## 53 0.03 0.6 0.31000 0.05886661
## 54 0.04 0.6 0.31375 0.05696307
## 55 0.05 0.6 0.30875 0.05591723
## 56 0.06 0.6 0.31125 0.05573063
## 57 0.07 0.6 0.31250 0.05204165
## 58 0.08 0.6 0.30625 0.06187184
## 59 0.09 0.6 0.30500 0.06433420
## 60 0.10 0.6 0.30250 0.05737305
## 61 0.01 0.7 0.30250 0.04479893
## 62 0.02 0.7 0.31000 0.06476453
## 63 0.03 0.7 0.31625 0.05684103
## 64 0.04 0.7 0.31125 0.05964304
## 65 0.05 0.7 0.30875 0.05775006
## 66 0.06 0.7 0.31000 0.05329426
## 67 0.07 0.7 0.31000 0.05945353
## 68 0.08 0.7 0.30375 0.06010696
## 69 0.09 0.7 0.30375 0.05864500
## 70 0.10 0.7 0.30500 0.05839283
## 71 0.01 0.8 0.30125 0.03972562
## 72 0.02 0.8 0.31125 0.06652328
## 73 0.03 0.8 0.31375 0.05478810
## 74 0.04 0.8 0.30875 0.06265259
## 75 0.05 0.8 0.31000 0.05886661
## 76 0.06 0.8 0.31375 0.05787019
## 77 0.07 0.8 0.30500 0.05627314
## 78 0.08 0.8 0.30500 0.05957022
## 79 0.09 0.8 0.30625 0.05628857
## 80 0.10 0.8 0.30500 0.05658082
## 81 0.01 0.9 0.30125 0.04016027
## 82 0.02 0.9 0.31000 0.06003471
## 83 0.03 0.9 0.30750 0.06593347
## 84 0.04 0.9 0.31000 0.06341004
## 85 0.05 0.9 0.31250 0.05863020
## 86 0.06 0.9 0.31000 0.05197489
## 87 0.07 0.9 0.30375 0.05714565
## 88 0.08 0.9 0.30500 0.06129392
## 89 0.09 0.9 0.30875 0.05434266
## 90 0.10 0.9 0.30750 0.05927806
## 91 0.01 1.0 0.30625 0.04497299
## 92 0.02 1.0 0.31250 0.05951190
## 93 0.03 1.0 0.31000 0.06202598
## 94 0.04 1.0 0.31625 0.06375136
## 95 0.05 1.0 0.31125 0.06022239
## 96 0.06 1.0 0.30625 0.05179085
## 97 0.07 1.0 0.30625 0.05781015
## 98 0.08 1.0 0.30375 0.05981743
## 99 0.09 1.0 0.31000 0.05945353
## 100 0.10 1.0 0.31000 0.05945353
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.8
##
## Number of Support Vectors: 612
##
## ( 307 305 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 160 77
## 1 44 127
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 160 77
## 1 44 127
##
## Accuracy : 0.7034
## 95% CI : (0.6565, 0.7474)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4069
##
## Mcnemar's Test P-Value : 0.003625
##
## Sensitivity : 0.7843
## Specificity : 0.6225
## Pos Pred Value : 0.6751
## Neg Pred Value : 0.7427
## Prevalence : 0.5000
## Detection Rate : 0.3922
## Detection Prevalence : 0.5809
## Balanced Accuracy : 0.7034
##
## 'Positive' Class : 0
##
Acc: 0.7034 Sens: 0.7843 Spec: 0.6225 TP: 127 FN: 77
total_customers = 127 + 77 #TP + FN: everyone in the test set who actually purchased
response_rate = 127/total_customers #TP/(TP+FN): share of actual purchasers the model caught
multiplier = 50000/2300 # 50000 / test data set size
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 25278.26
Now we lost roughly $4,400 with the undersampled SVM… I will focus on the odds ratios to see who the ideal customer is before building more models.
Our best profit so far comes from glm1, with the undersampled logits 1 & 2 close behind.
For interpreting the ideal customer I will use the odds ratios from train_glm2, the model with the collinear Last_purchase term removed.
round(exp(train_glm2$coefficients),3)
## (Intercept) Gender1 Amount_purchased Frequency
## 0.862 0.436 1.002 0.887
## First_purchase P_Child P_Youth P_Cook
## 1.031 0.708 0.836 0.633
## P_DIY P_Art
## 0.653 2.938
Odds ratios from train_glm2, holding all other variables constant:
Gender1: 0.436. Males have 0.436 times the odds of purchasing; equivalently, females have 1/0.436 ≈ 2.29 times the odds, about 129% higher.
Amount_purchased: 1.002, a 0.2% increase in the odds per additional dollar purchased.
Frequency: 0.887, an 11.3% decrease in the odds per additional purchase in the time frame.
First_purchase: 1.031, a 3.1% increase in the odds per additional month since the first purchase.
P_Child: 0.708, a 29.2% decrease in the odds per additional children's book purchased.
P_Youth: 0.836, a 16.4% decrease per additional youth book.
P_Cook: 0.633, a 36.7% decrease per additional cookbook.
P_DIY: 0.653, a 34.7% decrease per additional DIY book.
P_Art: 2.938, a 193.8% increase in the odds per additional art book.
So the customer most likely to purchase The Art History of Florence is a long-standing female customer who buys art books but few children's, youth, cooking, or DIY books, and who does not purchase especially frequently overall.
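To put some uncertainty around these odds ratios, confint() on a glm gives profile-likelihood intervals. A sketch (this assumes MASS, which lda() needs anyway, is among the loaded libraries):
round(exp(cbind(OR = coef(train_glm2), confint(train_glm2))), 3)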
model.null = glm(Choice ~ 1, data=train_data, family = binomial) # null model : no predictor
model.full = glm(Choice ~ ., data=train_data, family = binomial) # full model: all predictors
step.models.AIC<-step(model.null, scope = list(upper=model.full),
direction="both",test="Chisq", trace = F)
summary(step.models.AIC)
##
## Call:
## glm(formula = Choice ~ P_Art + Frequency + Gender + P_Cook +
## P_DIY + Amount_purchased + P_Child + Last_purchase + P_Youth,
## family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2833949 0.2062721 -1.374 0.1695
## P_Art 0.6643371 0.1255243 5.292 1.21e-07 ***
## Frequency -0.0903261 0.0106304 -8.497 < 2e-16 ***
## Gender1 -0.8660575 0.1373268 -6.307 2.85e-10 ***
## P_Cook -0.9330131 0.1190073 -7.840 4.51e-15 ***
## P_DIY -0.9101106 0.1433591 -6.348 2.17e-10 ***
## Amount_purchased 0.0018357 0.0007908 2.321 0.0203 *
## P_Child -0.8181807 0.1163377 -7.033 2.02e-12 ***
## Last_purchase 0.5536689 0.0784519 7.057 1.70e-12 ***
## P_Youth -0.6424923 0.1432548 -4.485 7.29e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1393.5 on 1590 degrees of freedom
## AIC: 1413.5
##
## Number of Fisher Scoring iterations: 5
Looks like stepwise selection dropped First_purchase.
probabilities = predict(step.models.AIC,newdata = test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08696 0.00000 1.00000
pred <- prediction(probabilities,test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.22,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1503 56
## 1 593 148
##
## Accuracy : 0.7178
## 95% CI : (0.6989, 0.7361)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2023
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7171
## Specificity : 0.7255
## Pos Pred Value : 0.9641
## Neg Pred Value : 0.1997
## Prevalence : 0.9113
## Detection Rate : 0.6535
## Detection Prevalence : 0.6778
## Balanced Accuracy : 0.7213
##
## 'Positive' Class : 0
##
Acc: 0.7178 Sens: 0.7171 Spec: 0.7255 TP: 148 FN: 56
train_lda = lda(Choice ~ ., data = train_data)
train_lda
## Call:
## lda(Choice ~ ., data = train_data)
##
## Prior probabilities of groups:
## 0 1
## 0.75 0.25
##
## Group means:
## Gender1 Amount_purchased Frequency Last_purchase First_purchase P_Child
## 0 0.6975 194.4167 13.33667 2.951667 22.54667 0.7341667
## 1 0.5425 220.4125 9.24500 3.940000 22.66500 0.7550000
## P_Youth P_Cook P_DIY P_Art
## 0 0.3275 0.7841667 0.3933333 0.2733333
## 1 0.3675 0.6875000 0.3850000 0.8800000
##
## Coefficients of linear discriminants:
## LD1
## Gender1 -0.707433464
## Amount_purchased 0.001478478
## Frequency -0.049100848
## Last_purchase 0.524297499
## First_purchase -0.010820025
## P_Child -0.682241865
## P_Youth -0.520664659
## P_Cook -0.764550263
## P_DIY -0.730726880
## P_Art 0.636803402
p1 <- predict(train_lda, train_data)$class
tab <- table(Predicted = p1, Actual = train_data$Choice)
tab
## Actual
## Predicted 0 1
## 0 1125 243
## 1 75 157
sum(diag(tab))/sum(tab)
## [1] 0.80125
p2 <- predict(train_lda, test_data)$class
tab1 <- table(Predicted = p2, Actual = test_data$Choice)
tab1
## Actual
## Predicted 0 1
## 0 1968 127
## 1 128 77
sum(diag(tab1))/sum(tab1)
## [1] 0.8891304
Acc: 0.8891. Treating purchasers (class 1) as positive, sensitivity is only 0.3775 and specificity is 0.9389: the model catches just 77 of the 204 purchasers (TP: 77, FN: 127). This model will perform poorly on profit because of all the missed purchasers.
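A quick application of the profit formula to the LDA counts backs this up (a sketch using the same constants as above):
(77 * 10.2 - 204 * 0.65) * (50000 / 2300) # ~14191, barely above the $13,553 baseline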
2/26/24: I looked into flipping the 0/1 coding, based on feedback Dr. Roy gave another group in class, but I am fairly sure ours is correct; we had high accuracy, just low specificity. I also went back and updated the profit formula for the undersampled groups, since the test set is now undersampled per his feedback. This increased those profits roughly fivefold, because the test set shrank from 2,300 to 408 observations. That would make undersampled logits 1 & 2 the best models at about $167,500 profit, but I am unsure that is correct.
I feel the only way to improve the current models would be to build another model with different features, but every time I have removed variables the AIC has gone up and the models usually become slightly worse, missing a few more actual purchasers.
Currently, the profit formula is the following:
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 127 + 77 #TP + FN: actual purchasers - from the undersampled radial SVM
response_rate = 127/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 25278.26
After settling on the 50000/2300 multiplier, because it is unrealistic to expect a 72% purchase rate from the random Midwest BBBC subscribers, our best model is glm1 at $29,934.78, which is comparable to the undersampled glm1 at $29,713.04.
total_profit = c(29934.78, 29047.83, 29713.04, 29713.04, 29269.57, 25278.26,
27273.91, 25056.52, 24834.78) #profit per model, matching model_seq below
model_seq = c('glm1', 'glm2', 'und_glm1','und_glm2', 'und_glm3', 'und_svm_radial',
'und_svm_linear', 'und_svm_mononomial', 'und_svm_sigmoid') #using this as a y variable for graph
df = data.frame(total_profit,model_seq)
# barchart with added parameters
barplot(total_profit,
main = "Total Profit Generated Per Model",
xlab = "Model Name",
ylab = "Profit in $",
names.arg = model_seq,
col = "darkred")
my_colors <- c("#FF5733", "#33FF57", "#3366FF", "#FF33FF")
library(viridis)
## Loading required package: viridisLite
dfplot = ggplot(data=df,
aes(x=model_seq, y = total_profit, fill = model_seq)) +
geom_col(color = 'black', show.legend = F)+
geom_text(aes(label=total_profit), hjust = 1)+
labs(title = "Total Profit Generated Per Model",
x = 'Model Name',
y = 'Profit in $' ) +
scale_fill_viridis_d(option = "D", end = 0.9) +
theme_minimal()
dfplot + coord_flip()
Excluding the full-data radial SVM because it performed poorly on profit (~$5,500, worse than our ~$13,553 baseline) and it distorted the graph's scale.
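For reference, the excluded full-data radial SVM had TP = 38 and FN = 166, and plugging those into the same formula confirms the rough figure quoted above:
(38 * 10.2 - 204 * 0.65) * (50000 / 2300) # ~5543, well below the mail-everyone baseline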
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'linear')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.8
##
## Number of Support Vectors: 519
##
## ( 259 260 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 158 68
## 1 46 136
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 158 68
## 1 46 136
##
## Accuracy : 0.7206
## 95% CI : (0.6743, 0.7636)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4412
##
## Mcnemar's Test P-Value : 0.0492
##
## Sensitivity : 0.7745
## Specificity : 0.6667
## Pos Pred Value : 0.6991
## Neg Pred Value : 0.7473
## Prevalence : 0.5000
## Detection Rate : 0.3873
## Detection Prevalence : 0.5539
## Balanced Accuracy : 0.7206
##
## 'Positive' Class : 0
##
Undersampled SVM, linear kernel. Acc: 0.7206 Sens: 0.7745 Spec: 0.6667 TP: 136 FN: 68
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 136 + 68 #TP + FN: actual purchasers - from the undersampled linear-kernel SVM
response_rate = 136/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 27273.91
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'polynomial', degree = 1)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "polynomial", degree = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.8
## degree: 1
## coef.0: 0
##
## Number of Support Vectors: 648
##
## ( 323 325 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 163 78
## 1 41 126
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 163 78
## 1 41 126
##
## Accuracy : 0.7083
## 95% CI : (0.6616, 0.752)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4167
##
## Mcnemar's Test P-Value : 0.0009665
##
## Sensitivity : 0.7990
## Specificity : 0.6176
## Pos Pred Value : 0.6763
## Neg Pred Value : 0.7545
## Prevalence : 0.5000
## Detection Rate : 0.3995
## Detection Prevalence : 0.5907
## Balanced Accuracy : 0.7083
##
## 'Positive' Class : 0
##
Undersampled SVM, polynomial (degree 1) kernel. Acc: 0.7083 Sens: 0.7990 Spec: 0.6176 TP: 126 FN: 78
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 126 + 78 #TP + FN: actual purchasers - from the undersampled polynomial (degree 1) SVM
response_rate = 126/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 25056.52
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'sigmoid')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 0.8
## coef.0: 0
##
## Number of Support Vectors: 647
##
## ( 323 324 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 163 79
## 1 41 125
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 163 79
## 1 41 125
##
## Accuracy : 0.7059
## 95% CI : (0.6591, 0.7497)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4118
##
## Mcnemar's Test P-Value : 0.0007312
##
## Sensitivity : 0.7990
## Specificity : 0.6127
## Pos Pred Value : 0.6736
## Neg Pred Value : 0.7530
## Prevalence : 0.5000
## Detection Rate : 0.3995
## Detection Prevalence : 0.5931
## Balanced Accuracy : 0.7059
##
## 'Positive' Class : 0
##
Undersampled SVM, sigmoid kernel. Acc: 0.7059 Sens: 0.7990 Spec: 0.6127 TP: 125 FN: 79
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 125 + 79 #TP + FN: actual purchasers - from the undersampled sigmoid SVM
response_rate = 125/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 24834.78
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'linear')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.8
##
## Number of Support Vectors: 737
##
## ( 367 370 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2011 147
## 1 85 57
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2011 147
## 1 85 57
##
## Accuracy : 0.8991
## 95% CI : (0.8861, 0.9111)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9802
##
## Kappa : 0.2768
##
## Mcnemar's Test P-Value : 6.206e-05
##
## Sensitivity : 0.9594
## Specificity : 0.2794
## Pos Pred Value : 0.9319
## Neg Pred Value : 0.4014
## Prevalence : 0.9113
## Detection Rate : 0.8743
## Detection Prevalence : 0.9383
## Balanced Accuracy : 0.6194
##
## 'Positive' Class : 0
##
Full-data SVM, linear kernel. Acc: 0.8991 Sens: 0.9594 Spec: 0.2794 TP: 57 FN: 147
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 57 + 147 #TP + FN: actual purchasers - from the full-data linear-kernel SVM
response_rate = 57/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 9756.522
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'polynomial', degree = 1)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "polynomial", degree = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.8
## degree: 1
## coef.0: 0
##
## Number of Support Vectors: 793
##
## ( 395 398 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2070 176
## 1 26 28
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2070 176
## 1 26 28
##
## Accuracy : 0.9122
## 95% CI : (0.8999, 0.9234)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.4601
##
## Kappa : 0.1869
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9876
## Specificity : 0.1373
## Pos Pred Value : 0.9216
## Neg Pred Value : 0.5185
## Prevalence : 0.9113
## Detection Rate : 0.9000
## Detection Prevalence : 0.9765
## Balanced Accuracy : 0.5624
##
## 'Positive' Class : 0
##
Full-data SVM, polynomial (degree 1) kernel. Acc: 0.9122 Sens: 0.9876 Spec: 0.1373 TP: 28 FN: 176
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 28 + 176 #TP + FN: actual purchasers - from the full-data polynomial (degree 1) SVM
response_rate = 28/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 3326.087
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'sigmoid')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 0.8
## coef.0: 0
##
## Number of Support Vectors: 794
##
## ( 395 399 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2070 176
## 1 26 28
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2070 176
## 1 26 28
##
## Accuracy : 0.9122
## 95% CI : (0.8999, 0.9234)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.4601
##
## Kappa : 0.1869
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9876
## Specificity : 0.1373
## Pos Pred Value : 0.9216
## Neg Pred Value : 0.5185
## Prevalence : 0.9113
## Detection Rate : 0.9000
## Detection Prevalence : 0.9765
## Balanced Accuracy : 0.5624
##
## 'Positive' Class : 0
##
Full-data SVM, sigmoid kernel. Acc: 0.9122 Sens: 0.9876 Spec: 0.1373 TP: 28 FN: 176
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 28 + 176 #TP + FN: actual purchasers - from the full-data sigmoid SVM
response_rate = 28/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 3326.087