Loading in our libraries
Installing the 'readxl' package so we can read the Excel files into R.
train_data <- read_excel("~/R-Studio/DA6813 Applications/Case Study 2/BBBC-Train.xlsx")
test_data <- read_excel("~/R-Studio/DA6813 Applications/Case Study 2/BBBC-Test.xlsx")
Looking at the structure and first few observations of the data
str(train_data)
## tibble [1,600 × 12] (S3: tbl_df/tbl/data.frame)
## $ Observation : num [1:1600] 1 2 3 4 5 6 7 8 9 10 ...
## $ Choice : num [1:1600] 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : num [1:1600] 1 1 1 1 0 1 1 0 1 1 ...
## $ Amount_purchased: num [1:1600] 113 418 336 180 320 268 198 280 393 138 ...
## $ Frequency : num [1:1600] 8 6 18 16 2 4 2 6 12 10 ...
## $ Last_purchase : num [1:1600] 1 11 6 5 3 1 12 2 11 7 ...
## $ First_purchase : num [1:1600] 8 66 32 42 18 4 62 12 50 38 ...
## $ P_Child : num [1:1600] 0 0 2 2 0 0 2 0 3 2 ...
## $ P_Youth : num [1:1600] 1 2 0 0 0 0 3 2 0 3 ...
## $ P_Cook : num [1:1600] 0 3 1 0 0 0 2 0 3 0 ...
## $ P_DIY : num [1:1600] 0 2 1 1 1 0 1 0 0 0 ...
## $ P_Art : num [1:1600] 0 3 2 1 2 0 2 0 2 1 ...
Variable definitions:
Choice: 1 = purchased, 0 = did not purchase.
Gender: 0 = female, 1 = male.
Amount_purchased: total dollar amount purchased at BBBC.
Frequency: number of purchases within a given time frame.
Last_purchase: months since the most recent purchase.
First_purchase: months since the first purchase in the BBBC system.
P_Child: number of children's books purchased.
P_Youth: number of youth books purchased.
P_Cook: number of cookbooks purchased.
P_DIY: number of DIY books purchased.
P_Art: number of art books purchased.
head(train_data)
## # A tibble: 6 × 12
## Observation Choice Gender Amount_purchased Frequency Last_purchase
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 113 8 1
## 2 2 1 1 418 6 11
## 3 3 1 1 336 18 6
## 4 4 1 1 180 16 5
## 5 5 1 0 320 2 3
## 6 6 1 1 268 4 1
## # ℹ 6 more variables: First_purchase <dbl>, P_Child <dbl>, P_Youth <dbl>,
## # P_Cook <dbl>, P_DIY <dbl>, P_Art <dbl>
train_data = train_data %>%
dplyr::select(-Observation)
Removing Observation because it just records the row number, which is not useful information. Now checking for missing values.
anyNA(train_data)
## [1] FALSE
No missing values in this data set.
train_lm = lm(Choice~., data = train_data)
summary(train_lm)
##
## Call:
## lm(formula = Choice ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9603 -0.2462 -0.1161 0.1622 1.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3642284 0.0307411 11.848 < 2e-16 ***
## Gender -0.1309205 0.0200303 -6.536 8.48e-11 ***
## Amount_purchased 0.0002736 0.0001110 2.464 0.0138 *
## Frequency -0.0090868 0.0021791 -4.170 3.21e-05 ***
## Last_purchase 0.0970286 0.0135589 7.156 1.26e-12 ***
## First_purchase -0.0020024 0.0018160 -1.103 0.2704
## P_Child -0.1262584 0.0164011 -7.698 2.41e-14 ***
## P_Youth -0.0963563 0.0201097 -4.792 1.81e-06 ***
## P_Cook -0.1414907 0.0166064 -8.520 < 2e-16 ***
## P_DIY -0.1352313 0.0197873 -6.834 1.17e-11 ***
## P_Art 0.1178494 0.0194427 6.061 1.68e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared: 0.2401, Adjusted R-squared: 0.2353
## F-statistic: 50.2 on 10 and 1589 DF, p-value: < 2.2e-16
This linear model is not useful for our task, which is to find the customers who said "Yes" (Choice = 1). The intercept is 0.364, meaning that when every predictor is 0 the model predicts a "choice" of 0.364, a value that is neither 0 nor 1 and cannot be read directly as a class. Most coefficients are negative; only Amount_purchased, Last_purchase, and P_Art push the prediction up. More fundamentally, a linear model's predictions are not bounded to [0, 1], so we should run a logistic regression instead, since we care about the odds of a customer purchasing The Art History of Florence.
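In fact, because Choice is 0/1, the maximum residual of 1.0588 in the summary above implies a fitted value below 0 (either 1 - 1.0588 or 0 - 1.0588), already outside [0, 1]. A quick one-line check (a sketch; output not shown here):
range(fitted(train_lm)) # linear-model predictions are not bounded to [0, 1]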
We can also build diagnostic plots to show how badly the linear-model assumptions are violated.
par(mfrow=c(2,2))
plot(train_lm, which=c(1:4))
In the Residuals vs Fitted plot, the points should be randomly scattered but instead fall along parallel lines. This is likely because our target variable, Choice, takes only the values 0 and 1. The Q-Q plot shows the residuals are not normally distributed. The Scale-Location plot shows heteroscedasticity: the variance is unequal and the points trace quadratic-looking curves. The Cook's Distance plot flags a few potentially influential points.
Now I will convert Choice and Gender into factors so the classification code below runs correctly. The same conversion is applied to the test set.
train_data$Choice = as.factor(train_data$Choice)
train_data$Gender = as.factor(train_data$Gender)
test_data$Choice = as.factor(test_data$Choice)
test_data$Gender = as.factor(test_data$Gender)
train_glm1 = glm(Choice ~ . , data = train_data, family = binomial)
summary(train_glm1)
##
## Call:
## glm(formula = Choice ~ ., family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3515281 0.2143839 -1.640 0.1011
## Gender1 -0.8632319 0.1374499 -6.280 3.38e-10 ***
## Amount_purchased 0.0018641 0.0007918 2.354 0.0186 *
## Frequency -0.0755142 0.0165937 -4.551 5.35e-06 ***
## Last_purchase 0.6117713 0.0938127 6.521 6.97e-11 ***
## First_purchase -0.0147792 0.0128027 -1.154 0.2483
## P_Child -0.8112489 0.1167067 -6.951 3.62e-12 ***
## P_Youth -0.6370422 0.1433778 -4.443 8.87e-06 ***
## P_Cook -0.9230066 0.1194814 -7.725 1.12e-14 ***
## P_DIY -0.9058697 0.1437025 -6.304 2.90e-10 ***
## P_Art 0.6861124 0.1270176 5.402 6.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1392.2 on 1589 degrees of freedom
## AIC: 1414.2
##
## Number of Fisher Scoring iterations: 5
The initial AIC is 1414.2. Now checking for multicollinearity in the model.
vif(train_glm1)
## Gender Amount_purchased Frequency Last_purchase
## 1.023359 1.232172 2.490447 17.706670
## First_purchase P_Child P_Youth P_Cook
## 9.247748 2.992269 1.761546 3.229097
## P_DIY P_Art
## 1.992698 1.938089
Last_purchase has a VIF of 17.7, which exceeds the common cutoffs of 5 and 10, so I will remove it and rerun the logit model.
train_glm2 = glm(Choice ~ .-Last_purchase , data = train_data, family = binomial)
summary(train_glm2)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1489829 0.2095375 -0.711 0.477079
## Gender1 -0.8302649 0.1350384 -6.148 7.83e-10 ***
## Amount_purchased 0.0022691 0.0007747 2.929 0.003399 **
## Frequency -0.1194992 0.0152620 -7.830 4.89e-15 ***
## First_purchase 0.0306235 0.0108454 2.824 0.004748 **
## P_Child -0.3456948 0.0908420 -3.805 0.000142 ***
## P_Youth -0.1789417 0.1226235 -1.459 0.144489
## P_Cook -0.4578299 0.0950443 -4.817 1.46e-06 ***
## P_DIY -0.4265209 0.1209960 -3.525 0.000423 ***
## P_Art 1.0778036 0.1144995 9.413 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1437.0 on 1590 degrees of freedom
## AIC: 1457
##
## Number of Fisher Scoring iterations: 5
AIC went UP to 1457 after removing Last_purchase. Possible question: should we always just go with the lowest-AIC model? Checking VIF now.
vif(train_glm2)
## Gender Amount_purchased Frequency First_purchase
## 1.021977 1.220305 2.173240 6.886806
## P_Child P_Youth P_Cook P_DIY
## 1.904631 1.320305 2.060140 1.462770
## P_Art
## 1.603865
First_purchase is the highest at 6.88, which gives us the option of removing it if we set our cutoff at 5. I will remove it and see if AIC goes up again; if it does, I will continue with train_glm2 instead.
train_glm3 = glm(Choice ~ .-Last_purchase-First_purchase , data = train_data, family = binomial)
summary(train_glm3)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase - First_purchase, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.286380 0.202966 -1.411 0.15825
## Gender1 -0.811948 0.134579 -6.033 1.61e-09 ***
## Amount_purchased 0.002406 0.000771 3.120 0.00181 **
## Frequency -0.088625 0.010385 -8.534 < 2e-16 ***
## P_Child -0.194796 0.072207 -2.698 0.00698 **
## P_Youth -0.031928 0.109605 -0.291 0.77082
## P_Cook -0.292392 0.072998 -4.005 6.19e-05 ***
## P_DIY -0.279282 0.108094 -2.584 0.00977 **
## P_Art 1.245842 0.099062 12.576 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1445.0 on 1591 degrees of freedom
## AIC: 1463
##
## Number of Fisher Scoring iterations: 5
AIC went UP again, to 1463, so I will use train_glm2 for the rest of the logistic regression classification.
probabilities = predict(train_glm2,newdata = test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07913 0.00000 1.00000
pred <- prediction(probabilities,test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
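Since this ROC/optimal-cutoff block gets repeated for every model below, it could be wrapped in a small helper. This is just a sketch (roc_cutoff is a name I am introducing), assuming ROCR is loaded:
roc_cutoff = function(probs, truth) {
  pred = prediction(probs, truth)
  auc = round(as.numeric(performance(pred, measure = "auc")@y.values), 3)
  sens = performance(pred, "sens") # sensitivity at each cutoff
  spec = performance(pred, "spec") # specificity at each cutoff
  # cutoff where sensitivity and specificity are closest together
  min.diff = which.min(abs(unlist(sens@y.values) - unlist(spec@y.values)))
  list(auc = auc, optimal = unlist(sens@x.values)[min.diff])
}
roc_cutoff(probabilities, test_data$Choice) # should return the ~0.23 cutoff used below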
Because our data set is unbalanced, we will need to either rebalance it or change the cutoff probability for classifying yes/no.
pr_class = ifelse(probabilities>0.23,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1491 60
## 1 605 144
##
## Accuracy : 0.7109
## 95% CI : (0.6919, 0.7293)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1892
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7114
## Specificity : 0.7059
## Pos Pred Value : 0.9613
## Neg Pred Value : 0.1923
## Prevalence : 0.9113
## Detection Rate : 0.6483
## Detection Prevalence : 0.6743
## Balanced Accuracy : 0.7086
##
## 'Positive' Class : 0
##
Accuracy: 0.7109 Sensitivity: 0.7114 Specificity: 0.7059
Getting an optimal cutoff this far below 0.5 shows how imbalanced the data are, so we will calculate the imbalance ratio.
choicecount = table(train_data$Choice)
choicecount
##
## 0 1
## 1200 400
So only 400 of the 1,600 customers in our training set would purchase The Art History of Florence. We can downsample the data to balance the customers who would purchase the book against those who would not. This is a point to revisit; first I will run an SVM model to see if it gives better initial results.
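For when we do rebalance further down, caret offers a one-line alternative to the manual dplyr undersampling used later. A sketch, assuming the caret package is available (balanced is a name I am introducing):
# downSample() randomly samples the majority class down to the minority count
balanced = caret::downSample(x = train_data[, names(train_data) != "Choice"],
                             y = train_data$Choice, yname = "Choice")
table(balanced$Choice) # should show 400 / 400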
Now I will try glm1, since it had the lowest AIC, to see if it is a better model.
probabilities = predict(train_glm1,newdata = test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.09 0.00 1.00
pred <- prediction(probabilities,test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
As before, the class imbalance means we should classify with the adjusted cutoff rather than 0.5.
pr_class = ifelse(probabilities>0.22,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1505 56
## 1 591 148
##
## Accuracy : 0.7187
## 95% CI : (0.6998, 0.737)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2031
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7180
## Specificity : 0.7255
## Pos Pred Value : 0.9641
## Neg Pred Value : 0.2003
## Prevalence : 0.9113
## Detection Rate : 0.6543
## Detection Prevalence : 0.6787
## Balanced Accuracy : 0.7218
##
## 'Positive' Class : 0
##
Accuracy: 0.7187 Sensitivity: 0.7180 Specificity: 0.7255
glm1 gave slightly better results than glm2, and its confusion matrix produces the best profit in the calculation below.
set.seed(1)
tuned = tune.svm(Choice~., data= train_data, gamma = seq(0.01, .1, by = 0.01),
cost = seq(0.1, 1, by = 0.1))
Setting the seed to 1 to make the results reproducible.
tuned$best.parameters
## gamma cost
## 75 0.05 0.8
The best parameters are a gamma of 0.05 and a cost of 0.8.
tuned$performances
## gamma cost error dispersion
## 1 0.01 0.1 0.250000 0.03461093
## 2 0.02 0.1 0.250000 0.03461093
## 3 0.03 0.1 0.245625 0.03633013
## 4 0.04 0.1 0.242500 0.03665720
## 5 0.05 0.1 0.240000 0.03574602
## 6 0.06 0.1 0.240000 0.03387579
## 7 0.07 0.1 0.239375 0.03461720
## 8 0.08 0.1 0.240625 0.03611445
## 9 0.09 0.1 0.242500 0.03606033
## 10 0.10 0.1 0.243750 0.03560002
## 11 0.01 0.2 0.246875 0.03550846
## 12 0.02 0.2 0.226875 0.04083014
## 13 0.03 0.2 0.218750 0.03608439
## 14 0.04 0.2 0.216875 0.03294698
## 15 0.05 0.2 0.217500 0.02987241
## 16 0.06 0.2 0.221875 0.03244252
## 17 0.07 0.2 0.221250 0.03148523
## 18 0.08 0.2 0.221875 0.03149212
## 19 0.09 0.2 0.221875 0.03149212
## 20 0.10 0.2 0.221250 0.03036239
## 21 0.01 0.3 0.231875 0.03891430
## 22 0.02 0.3 0.216250 0.03309813
## 23 0.03 0.3 0.216250 0.03243583
## 24 0.04 0.3 0.215625 0.03121526
## 25 0.05 0.3 0.218750 0.03104097
## 26 0.06 0.3 0.220000 0.03030516
## 27 0.07 0.3 0.221250 0.03036239
## 28 0.08 0.3 0.220625 0.03034094
## 29 0.09 0.3 0.222500 0.03175973
## 30 0.10 0.3 0.221875 0.03162964
## 31 0.01 0.4 0.220625 0.03536148
## 32 0.02 0.4 0.216875 0.03090785
## 33 0.03 0.4 0.217500 0.03001736
## 34 0.04 0.4 0.218125 0.02982152
## 35 0.05 0.4 0.218750 0.03186887
## 36 0.06 0.4 0.218750 0.03173239
## 37 0.07 0.4 0.220625 0.03076710
## 38 0.08 0.4 0.219375 0.03179388
## 39 0.09 0.4 0.219375 0.03082348
## 40 0.10 0.4 0.220625 0.02902495
## 41 0.01 0.5 0.216250 0.03078826
## 42 0.02 0.5 0.215000 0.02963903
## 43 0.03 0.5 0.217500 0.03001736
## 44 0.04 0.5 0.216875 0.03160218
## 45 0.05 0.5 0.216875 0.03160218
## 46 0.06 0.5 0.217500 0.03170502
## 47 0.07 0.5 0.216250 0.03216710
## 48 0.08 0.5 0.216250 0.03078826
## 49 0.09 0.5 0.219375 0.02832874
## 50 0.10 0.5 0.219375 0.02642633
## 51 0.01 0.6 0.213125 0.02675279
## 52 0.02 0.6 0.215625 0.03022629
## 53 0.03 0.6 0.216250 0.03092891
## 54 0.04 0.6 0.215625 0.03079530
## 55 0.05 0.6 0.215625 0.03079530
## 56 0.06 0.6 0.213125 0.03011119
## 57 0.07 0.6 0.213750 0.03001736
## 58 0.08 0.6 0.218125 0.02755203
## 59 0.09 0.6 0.218750 0.02825971
## 60 0.10 0.6 0.220000 0.02807480
## 61 0.01 0.7 0.214375 0.02701112
## 62 0.02 0.7 0.216250 0.02993047
## 63 0.03 0.7 0.216250 0.03092891
## 64 0.04 0.7 0.215625 0.03051212
## 65 0.05 0.7 0.213750 0.03087272
## 66 0.06 0.7 0.213125 0.03011119
## 67 0.07 0.7 0.217500 0.02598744
## 68 0.08 0.7 0.216875 0.02602916
## 69 0.09 0.7 0.218125 0.02675279
## 70 0.10 0.7 0.219375 0.02848154
## 71 0.01 0.8 0.214375 0.02842052
## 72 0.02 0.8 0.216875 0.03076710
## 73 0.03 0.8 0.216250 0.03050501
## 74 0.04 0.8 0.215000 0.02829041
## 75 0.05 0.8 0.212500 0.02946278
## 76 0.06 0.8 0.215625 0.02655739
## 77 0.07 0.8 0.215000 0.02719528
## 78 0.08 0.8 0.216250 0.02571883
## 79 0.09 0.8 0.218125 0.02786531
## 80 0.10 0.8 0.219375 0.02739405
## 81 0.01 0.9 0.215000 0.02874698
## 82 0.02 0.9 0.217500 0.03016160
## 83 0.03 0.9 0.216250 0.02978511
## 84 0.04 0.9 0.214375 0.02826739
## 85 0.05 0.9 0.212500 0.02810570
## 86 0.06 0.9 0.216250 0.02687419
## 87 0.07 0.9 0.215000 0.02605416
## 88 0.08 0.9 0.216875 0.02602916
## 89 0.09 0.9 0.218750 0.02618709
## 90 0.10 0.9 0.218125 0.02576099
## 91 0.01 1.0 0.214375 0.02857282
## 92 0.02 1.0 0.216875 0.03062571
## 93 0.03 1.0 0.213750 0.02807480
## 94 0.04 1.0 0.213125 0.02996671
## 95 0.05 1.0 0.213750 0.02729087
## 96 0.06 1.0 0.215000 0.02622022
## 97 0.07 1.0 0.215625 0.02589542
## 98 0.08 1.0 0.217500 0.02822897
## 99 0.09 1.0 0.216875 0.02619538
## 100 0.10 1.0 0.217500 0.02479079
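Rather than scanning the full grid by eye, e1071 also provides a plot method for tune objects that draws the cross-validation error surface over the gamma/cost grid (not shown here):
plot(tuned) # contour plot of CV error across the tuning grid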
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.8
##
## Number of Support Vectors: 782
##
## ( 371 411 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2054 166
## 1 42 38
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2054 166
## 1 42 38
##
## Accuracy : 0.9096
## 95% CI : (0.8971, 0.921)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.6327
##
## Kappa : 0.2291
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9800
## Specificity : 0.1863
## Pos Pred Value : 0.9252
## Neg Pred Value : 0.4750
## Prevalence : 0.9113
## Detection Rate : 0.8930
## Detection Prevalence : 0.9652
## Balanced Accuracy : 0.5831
##
## 'Positive' Class : 0
##
Accuracy: 0.9096 Sensitivity: 0.9800 Specificity: 0.1863
The SVM model has the best accuracy and sensitivity of the models so far, but its specificity is poor, so we can try downsampling the data to improve on the earlier models. We also need to explicitly explain our results.
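One alternative to downsampling worth noting: e1071's svm() accepts a class.weights argument that penalizes errors on the minority class more heavily. A sketch, not run here (the 3x weight is my assumption, roughly matching the 1200:400 imbalance):
svm_weighted = svm(Choice ~ ., data = train_data,
                   gamma = tuned$best.parameters$gamma,
                   cost = tuned$best.parameters$cost,
                   class.weights = c("0" = 1, "1" = 3))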
We can also try model selection on our data using stepwise selection by AIC (done further below).
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45
book_selling_price = 31.95
response_rate = 0.0903 # this is the variable that needs to change based on the model
profit_per_book = book_selling_price - (company_book_price + overhead)
total_customers = 50000
profit = (total_customers * response_rate) * profit_per_book - (total_customers * cost)
profit
## [1] 13553
A baseline mailing like BBBC's prior trial would net $13,553. We will now estimate each model's profit by taking its true positives out of the 2,300 test observations and plugging them into the profit formula.
Questions to ask in class regarding the profit formula: should we count only the people we predicted accurately, or everyone who made a purchase, as the prompt suggests?
We scale the profit proportionally by 50000/2300, since our test set has 2,300 observations and we need to adjust the number of people we mail based on what the model predicts.
My main concern with our formula is that it drops every 0 response, which makes sense for the question being asked. In BBBC's 20,000-person trial, though, 9.03% (~1,800 people) purchased the book, and I assume their count includes everyone who bought, not just correct predictions, so our formula does not line up with theirs. Does our formula still work, given that our main goal is to mail only the people most likely to purchase?
total_customers = 148 + 56 #TP + FN: everyone in the test set who actually purchased
response_rate = 148/total_customers #TP/(TP+FN): share of actual purchasers the model caught
multiplier = 50000/2300 # 50000 / test data set size
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 29934.78
Our response rate is TP/(TP + FN): of everyone in the test set who actually purchased the book, this is the share our model correctly predicted. We drop all the other responses from the test set because those people do not make us profit, and we use a multiplier of 50000/(test-set size) to scale the result up to the full mailing.
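To avoid retyping the formula for every model below, it could be wrapped in a helper. A sketch (model_profit is a name I am introducing; the defaults come from the case figures above):
model_profit = function(tp, fn, profit_per_book = 10.2, mail_cost = 0.65,
                        test_size = 2300, pool = 50000) {
  purchasers = tp + fn          # everyone in the test set who actually bought
  multiplier = pool / test_size # scale the test set up to the 50,000 mailing
  (tp * profit_per_book - purchasers * mail_cost) * multiplier
}
model_profit(148, 56) # glm1: 29934.78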
Currently, $29,934.78 from glm1 (TP = 148, FN = 56) is my best profit margin. I will rebalance the data and see if that changes my results.
# Undersampling train_data so both levels of Choice are equally represented
set.seed(1)
# Calculate the number of instances in each class
counts <- table(train_data$Choice)
# Identify the minority and majority classes
minority_class <- names(counts)[which.min(counts)]
majority_class <- names(counts)[which.max(counts)]
# Determine the number of instances in the minority class
minority_count <- counts[minority_class]
# Undersample the majority class
undersampled_train_data <- train_data %>%
group_by(Choice) %>%
sample_n(minority_count)
# Now undersampled_train_data contains an equal number of instances from both classes
undersampled_train_data$Choice = as.factor(undersampled_train_data$Choice)
Dr. Roy said we need to balance the test set as well, so I will do that now.
# Doing the same undersampling for test_data
# Calculate the number of instances in each class
counts <- table(test_data$Choice)
# Identify the minority and majority classes
minority_class <- names(counts)[which.min(counts)]
majority_class <- names(counts)[which.max(counts)]
# Determine the number of instances in the minority class
minority_count <- counts[minority_class]
# Undersample the majority class
undersampled_test_data <- test_data %>%
group_by(Choice) %>%
sample_n(minority_count)
# Now undersampled_test_data contains an equal number of instances from both classes
undersampled_test_data$Choice = as.factor(undersampled_test_data$Choice)
undersampled_test_size = nrow(undersampled_test_data)
undersampled_test_size
## [1] 408
Storing the undersampled test-set size in a variable for later use.
und_train_glm1 = glm(Choice ~ . , data = undersampled_train_data, family = binomial)
summary(und_train_glm1)
##
## Call:
## glm(formula = Choice ~ ., family = binomial, data = undersampled_train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7065398 0.2617611 2.699 0.006951 **
## Gender1 -0.8029147 0.1716667 -4.677 2.91e-06 ***
## Amount_purchased 0.0020443 0.0009902 2.064 0.038973 *
## Frequency -0.0793464 0.0204223 -3.885 0.000102 ***
## Last_purchase 0.4578916 0.1160305 3.946 7.94e-05 ***
## First_purchase -0.0067712 0.0160598 -0.422 0.673301
## P_Child -0.6544372 0.1397434 -4.683 2.83e-06 ***
## P_Youth -0.4593219 0.1792462 -2.563 0.010392 *
## P_Cook -0.8477684 0.1525216 -5.558 2.72e-08 ***
## P_DIY -0.8355141 0.1802842 -4.634 3.58e-06 ***
## P_Art 0.8204335 0.1692281 4.848 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1109.04 on 799 degrees of freedom
## Residual deviance: 863.32 on 789 degrees of freedom
## AIC: 885.32
##
## Number of Fisher Scoring iterations: 4
AIC is 885.32 for the first undersampled logistic regression.
vif(und_train_glm1)
## Gender Amount_purchased Frequency Last_purchase
## 1.037176 1.277417 2.820257 17.376208
## First_purchase P_Child P_Youth P_Cook
## 9.598640 2.866003 1.812315 3.482302
## P_DIY P_Art
## 1.931103 1.921508
Last_purchase again has a VIF of about 17.4, so I will remove it to reduce the multicollinearity.
und_train_glm2 = glm(Choice ~ .-Last_purchase , data = undersampled_train_data, family = binomial)
summary(und_train_glm2)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase, family = binomial,
## data = undersampled_train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.874031 0.254078 3.440 0.000582 ***
## Gender1 -0.784309 0.169746 -4.620 3.83e-06 ***
## Amount_purchased 0.002319 0.000977 2.373 0.017635 *
## Frequency -0.114884 0.018296 -6.279 3.41e-10 ***
## First_purchase 0.028265 0.013196 2.142 0.032198 *
## P_Child -0.323674 0.108465 -2.984 0.002844 **
## P_Youth -0.081304 0.149327 -0.544 0.586118
## P_Cook -0.482244 0.118626 -4.065 4.80e-05 ***
## P_DIY -0.502835 0.155436 -3.235 0.001216 **
## P_Art 1.131269 0.151564 7.464 8.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1109.04 on 799 degrees of freedom
## Residual deviance: 880.11 on 790 degrees of freedom
## AIC: 900.11
##
## Number of Fisher Scoring iterations: 4
AIC rose to 900.11.
vif(und_train_glm2)
## Gender Amount_purchased Frequency First_purchase
## 1.037407 1.278347 2.329260 6.516841
## P_Child P_Youth P_Cook P_DIY
## 1.809505 1.270922 2.107647 1.475285
## P_Art
## 1.533222
First_purchase has a VIF of about 6.5, so we could take it out as well. I will check the AIC of the next model, with both Last_purchase and First_purchase removed, to see if it is worth dropping.
und_train_glm3 = glm(Choice ~ .-Last_purchase-First_purchase ,
data = undersampled_train_data, family = binomial)
summary(und_train_glm3)
##
## Call:
## glm(formula = Choice ~ . - Last_purchase - First_purchase, family = binomial,
## data = undersampled_train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.754020 0.246809 3.055 0.002250 **
## Gender1 -0.772564 0.169374 -4.561 5.08e-06 ***
## Amount_purchased 0.002431 0.000973 2.499 0.012471 *
## Frequency -0.085913 0.012039 -7.136 9.61e-13 ***
## P_Child -0.191758 0.088457 -2.168 0.030173 *
## P_Youth 0.045496 0.135068 0.337 0.736242
## P_Cook -0.321962 0.090464 -3.559 0.000372 ***
## P_DIY -0.360495 0.140322 -2.569 0.010198 *
## P_Art 1.282755 0.135910 9.438 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1109.04 on 799 degrees of freedom
## Residual deviance: 884.76 on 791 degrees of freedom
## AIC: 902.76
##
## Number of Fisher Scoring iterations: 4
AIC went up again, to 902.76. I will run all three models and see what happens.
probabilities = predict(und_train_glm1,newdata = undersampled_test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4583 1.0000 1.0000
pred <- prediction(probabilities,undersampled_test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.46,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 146 57
## 1 58 147
##
## Accuracy : 0.7181
## 95% CI : (0.6718, 0.7613)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4363
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7157
## Specificity : 0.7206
## Pos Pred Value : 0.7192
## Neg Pred Value : 0.7171
## Prevalence : 0.5000
## Detection Rate : 0.3578
## Detection Prevalence : 0.4975
## Balanced Accuracy : 0.7181
##
## 'Positive' Class : 0
##
Acc: 0.7181 Sens: 0.7157 Spec: 0.7206 TP: 147 FN: 57
probabilities = predict(und_train_glm2,newdata = undersampled_test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4681 1.0000 1.0000
pred <- prediction(probabilities,undersampled_test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.47,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 147 57
## 1 57 147
##
## Accuracy : 0.7206
## 95% CI : (0.6743, 0.7636)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4412
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7206
## Specificity : 0.7206
## Pos Pred Value : 0.7206
## Neg Pred Value : 0.7206
## Prevalence : 0.5000
## Detection Rate : 0.3603
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.7206
##
## 'Positive' Class : 0
##
Acc: 0.7206 Sens: 0.7206 Spec: 0.7206 TP: 147 FN: 57
probabilities = predict(und_train_glm3,newdata = undersampled_test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.473 1.000 1.000
pred <- prediction(probabilities,undersampled_test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.48,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 144 59
## 1 60 145
##
## Accuracy : 0.7083
## 95% CI : (0.6616, 0.752)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4167
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7059
## Specificity : 0.7108
## Pos Pred Value : 0.7094
## Neg Pred Value : 0.7073
## Prevalence : 0.5000
## Detection Rate : 0.3529
## Detection Prevalence : 0.4975
## Balanced Accuracy : 0.7083
##
## 'Positive' Class : 0
##
Acc: 0.7083 Sens: 0.7059 Spec: 0.7108 TP: 145 FN: 59
Undersampled logits 1 and 2 had identical TP and FN counts; undersampled logit 3 was slightly worse. I will plug the best two into the profit formula below.
total_customers = 147 + 57 #TP + FN: everyone in the test set who actually purchased
response_rate = 147/total_customers #TP/(TP+FN): share of actual purchasers the model caught
multiplier = 50000/2300 # 50000 / test data set size
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 29713.04
If we instead use a multiplier of 50000/408 to match the downsampled test set, the profit would be about $167,500. But we will keep the 50000/2300 multiplier, because it seems unrealistic for roughly 70% of the Midwest mailing list to purchase the book.
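A quick arithmetic check of that figure (not run above): with the undersampled test set of 408, the multiplier jumps from ~21.7 to ~122.5.
(147 * 10.2 - 204 * 0.65) * (50000 / 408) # ~167500, vs 29713 with 50000/2300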
set.seed(1)
tuned = tune.svm(Choice~., data= undersampled_train_data, gamma = seq(0.01, .1, by = 0.01),
cost = seq(0.1, 1, by = 0.1))
Setting the seed to 1 to make the results reproducible.
tuned$best.parameters
## gamma cost
## 71 0.01 0.8
The best parameters are a gamma of 0.01 and a cost of 0.8.
tuned$performances
## gamma cost error dispersion
## 1 0.01 0.1 0.34875 0.03408018
## 2 0.02 0.1 0.33000 0.03184162
## 3 0.03 0.1 0.32375 0.03653860
## 4 0.04 0.1 0.32500 0.03996526
## 5 0.05 0.1 0.31750 0.04005205
## 6 0.06 0.1 0.31375 0.03653860
## 7 0.07 0.1 0.31125 0.03972562
## 8 0.08 0.1 0.31250 0.04409586
## 9 0.09 0.1 0.31000 0.04743416
## 10 0.10 0.1 0.31375 0.04505013
## 11 0.01 0.2 0.32125 0.03775377
## 12 0.02 0.2 0.31375 0.04267529
## 13 0.03 0.2 0.31250 0.04330127
## 14 0.04 0.2 0.31250 0.03864008
## 15 0.05 0.2 0.31250 0.04082483
## 16 0.06 0.2 0.31500 0.04362084
## 17 0.07 0.2 0.31750 0.04571956
## 18 0.08 0.2 0.31750 0.04495368
## 19 0.09 0.2 0.31750 0.04866267
## 20 0.10 0.2 0.31250 0.04639804
## 21 0.01 0.3 0.31125 0.04059026
## 22 0.02 0.3 0.30375 0.04450733
## 23 0.03 0.3 0.31500 0.04199868
## 24 0.04 0.3 0.31875 0.04573854
## 25 0.05 0.3 0.31875 0.04759858
## 26 0.06 0.3 0.31625 0.04931827
## 27 0.07 0.3 0.31500 0.05230785
## 28 0.08 0.3 0.31375 0.05050096
## 29 0.09 0.3 0.31125 0.05152197
## 30 0.10 0.3 0.30250 0.05130248
## 31 0.01 0.4 0.31375 0.04543387
## 32 0.02 0.4 0.31000 0.04556741
## 33 0.03 0.4 0.32000 0.04758034
## 34 0.04 0.4 0.31375 0.04803428
## 35 0.05 0.4 0.31000 0.05197489
## 36 0.06 0.4 0.31125 0.05219155
## 37 0.07 0.4 0.31125 0.05573063
## 38 0.08 0.4 0.31125 0.05185785
## 39 0.09 0.4 0.30375 0.05304937
## 40 0.10 0.4 0.30125 0.04839436
## 41 0.01 0.5 0.30250 0.04851976
## 42 0.02 0.5 0.31125 0.04427267
## 43 0.03 0.5 0.31250 0.05651942
## 44 0.04 0.5 0.31000 0.05130248
## 45 0.05 0.5 0.31000 0.05394184
## 46 0.06 0.5 0.30625 0.05781015
## 47 0.07 0.5 0.30875 0.05714565
## 48 0.08 0.5 0.30500 0.05502525
## 49 0.09 0.5 0.30625 0.05376453
## 50 0.10 0.5 0.30500 0.05143766
## 51 0.01 0.6 0.30125 0.04543387
## 52 0.02 0.6 0.31500 0.05296750
## 53 0.03 0.6 0.31000 0.05886661
## 54 0.04 0.6 0.31375 0.05696307
## 55 0.05 0.6 0.30875 0.05591723
## 56 0.06 0.6 0.31125 0.05573063
## 57 0.07 0.6 0.31250 0.05204165
## 58 0.08 0.6 0.30625 0.06187184
## 59 0.09 0.6 0.30500 0.06433420
## 60 0.10 0.6 0.30250 0.05737305
## 61 0.01 0.7 0.30250 0.04479893
## 62 0.02 0.7 0.31000 0.06476453
## 63 0.03 0.7 0.31625 0.05684103
## 64 0.04 0.7 0.31125 0.05964304
## 65 0.05 0.7 0.30875 0.05775006
## 66 0.06 0.7 0.31000 0.05329426
## 67 0.07 0.7 0.31000 0.05945353
## 68 0.08 0.7 0.30375 0.06010696
## 69 0.09 0.7 0.30375 0.05864500
## 70 0.10 0.7 0.30500 0.05839283
## 71 0.01 0.8 0.30125 0.03972562
## 72 0.02 0.8 0.31125 0.06652328
## 73 0.03 0.8 0.31375 0.05478810
## 74 0.04 0.8 0.30875 0.06265259
## 75 0.05 0.8 0.31000 0.05886661
## 76 0.06 0.8 0.31375 0.05787019
## 77 0.07 0.8 0.30500 0.05627314
## 78 0.08 0.8 0.30500 0.05957022
## 79 0.09 0.8 0.30625 0.05628857
## 80 0.10 0.8 0.30500 0.05658082
## 81 0.01 0.9 0.30125 0.04016027
## 82 0.02 0.9 0.31000 0.06003471
## 83 0.03 0.9 0.30750 0.06593347
## 84 0.04 0.9 0.31000 0.06341004
## 85 0.05 0.9 0.31250 0.05863020
## 86 0.06 0.9 0.31000 0.05197489
## 87 0.07 0.9 0.30375 0.05714565
## 88 0.08 0.9 0.30500 0.06129392
## 89 0.09 0.9 0.30875 0.05434266
## 90 0.10 0.9 0.30750 0.05927806
## 91 0.01 1.0 0.30625 0.04497299
## 92 0.02 1.0 0.31250 0.05951190
## 93 0.03 1.0 0.31000 0.06202598
## 94 0.04 1.0 0.31625 0.06375136
## 95 0.05 1.0 0.31125 0.06022239
## 96 0.06 1.0 0.30625 0.05179085
## 97 0.07 1.0 0.30625 0.05781015
## 98 0.08 1.0 0.30375 0.05981743
## 99 0.09 1.0 0.31000 0.05945353
## 100 0.10 1.0 0.31000 0.05945353
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.8
##
## Number of Support Vectors: 612
##
## ( 307 305 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 160 77
## 1 44 127
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 160 77
## 1 44 127
##
## Accuracy : 0.7034
## 95% CI : (0.6565, 0.7474)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4069
##
## Mcnemar's Test P-Value : 0.003625
##
## Sensitivity : 0.7843
## Specificity : 0.6225
## Pos Pred Value : 0.6751
## Neg Pred Value : 0.7427
## Prevalence : 0.5000
## Detection Rate : 0.3922
## Detection Prevalence : 0.5809
## Balanced Accuracy : 0.7034
##
## 'Positive' Class : 0
##
Acc: 0.7034 Sens: 0.7843 Spec: 0.6225 TP: 127 FN: 77
total_customers = 127 + 77 #TP + FN: everyone in the test set who actually purchased
response_rate = 127/total_customers #TP/(TP+FN): share of actual purchasers the model caught
multiplier = 50000/2300 # 50000 / test data set size
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 25278.26
Now we lost roughly $4,400 with the undersampled SVM… I will focus on the odds ratios to see who the ideal customer is before building more models.
Our best profit so far comes from glm1, with the undersampled logits 1 & 2 close behind.
For interpreting the ideal customer I will use the odds ratios from train_glm2, the model with the collinear Last_purchase term removed.
round(exp(train_glm2$coefficients),3)
## (Intercept) Gender1 Amount_purchased Frequency
## 0.862 0.436 1.002 0.887
## First_purchase P_Child P_Youth P_Cook
## 1.031 0.708 0.836 0.633
## P_DIY P_Art
## 0.653 2.938
Odds ratios from train_glm2, holding all other variables constant:
Gender1: 0.436. Males have 0.436 times the odds of purchasing; equivalently, females have 1/0.436 ≈ 2.29 times the odds, about 129% higher.
Amount_purchased: 1.002, a 0.2% increase in the odds per additional dollar purchased.
Frequency: 0.887, an 11.3% decrease in the odds per additional purchase in the time frame.
First_purchase: 1.031, a 3.1% increase in the odds per additional month since the first purchase.
P_Child: 0.708, a 29.2% decrease in the odds per additional children's book purchased.
P_Youth: 0.836, a 16.4% decrease per additional youth book.
P_Cook: 0.633, a 36.7% decrease per additional cookbook.
P_DIY: 0.653, a 34.7% decrease per additional DIY book.
P_Art: 2.938, a 193.8% increase in the odds per additional art book.
So the customer most likely to purchase The Art History of Florence is a long-standing female customer who buys art books but few children's, youth, cooking, or DIY books, and who does not purchase especially frequently overall.
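To put some uncertainty around these odds ratios, confint() on a glm gives profile-likelihood intervals. A sketch (this assumes MASS, which lda() needs anyway, is among the loaded libraries):
round(exp(cbind(OR = coef(train_glm2), confint(train_glm2))), 3)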
model.null = glm(Choice ~ 1, data=train_data, family = binomial) # null model : no predictor
model.full = glm(Choice ~ ., data=train_data, family = binomial) # full model: all predictors
step.models.AIC<-step(model.null, scope = list(upper=model.full),
direction="both",test="Chisq", trace = F)
summary(step.models.AIC)
##
## Call:
## glm(formula = Choice ~ P_Art + Frequency + Gender + P_Cook +
## P_DIY + Amount_purchased + P_Child + Last_purchase + P_Youth,
## family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2833949 0.2062721 -1.374 0.1695
## P_Art 0.6643371 0.1255243 5.292 1.21e-07 ***
## Frequency -0.0903261 0.0106304 -8.497 < 2e-16 ***
## Gender1 -0.8660575 0.1373268 -6.307 2.85e-10 ***
## P_Cook -0.9330131 0.1190073 -7.840 4.51e-15 ***
## P_DIY -0.9101106 0.1433591 -6.348 2.17e-10 ***
## Amount_purchased 0.0018357 0.0007908 2.321 0.0203 *
## P_Child -0.8181807 0.1163377 -7.033 2.02e-12 ***
## Last_purchase 0.5536689 0.0784519 7.057 1.70e-12 ***
## P_Youth -0.6424923 0.1432548 -4.485 7.29e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1799.5 on 1599 degrees of freedom
## Residual deviance: 1393.5 on 1590 degrees of freedom
## AIC: 1413.5
##
## Number of Fisher Scoring iterations: 5
Looks like stepwise selection dropped First_purchase.
probabilities = predict(step.models.AIC,newdata = test_data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
summary(predicted.classes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08696 0.00000 1.00000
pred <- prediction(probabilities,test_data$Choice)
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE)
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 3)
pr_class = ifelse(probabilities>0.22,1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1503 56
## 1 593 148
##
## Accuracy : 0.7178
## 95% CI : (0.6989, 0.7361)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2023
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7171
## Specificity : 0.7255
## Pos Pred Value : 0.9641
## Neg Pred Value : 0.1997
## Prevalence : 0.9113
## Detection Rate : 0.6535
## Detection Prevalence : 0.6778
## Balanced Accuracy : 0.7213
##
## 'Positive' Class : 0
##
Acc: 0.7178 Sens: 0.7171 Spec: 0.7255 TP: 148 FN: 56
train_lda = lda(Choice ~ ., data = train_data)
train_lda
## Call:
## lda(Choice ~ ., data = train_data)
##
## Prior probabilities of groups:
## 0 1
## 0.75 0.25
##
## Group means:
## Gender1 Amount_purchased Frequency Last_purchase First_purchase P_Child
## 0 0.6975 194.4167 13.33667 2.951667 22.54667 0.7341667
## 1 0.5425 220.4125 9.24500 3.940000 22.66500 0.7550000
## P_Youth P_Cook P_DIY P_Art
## 0 0.3275 0.7841667 0.3933333 0.2733333
## 1 0.3675 0.6875000 0.3850000 0.8800000
##
## Coefficients of linear discriminants:
## LD1
## Gender1 -0.707433464
## Amount_purchased 0.001478478
## Frequency -0.049100848
## Last_purchase 0.524297499
## First_purchase -0.010820025
## P_Child -0.682241865
## P_Youth -0.520664659
## P_Cook -0.764550263
## P_DIY -0.730726880
## P_Art 0.636803402
p1 <- predict(train_lda, train_data)$class
tab <- table(Predicted = p1, Actual = train_data$Choice)
tab
## Actual
## Predicted 0 1
## 0 1125 243
## 1 75 157
sum(diag(tab))/sum(tab)
## [1] 0.80125
p2 <- predict(train_lda, test_data)$class
tab1 <- table(Predicted = p2, Actual = test_data$Choice)
tab1
## Actual
## Predicted 0 1
## 0 1968 127
## 1 128 77
sum(diag(tab1))/sum(tab1)
## [1] 0.8891304
Acc: 0.8891. Treating purchasers (class 1) as positive, sensitivity is only 0.3775 and specificity is 0.9389: the model catches just 77 of the 204 purchasers (TP: 77, FN: 127). This model will perform poorly on profit because of all the missed purchasers.
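A quick application of the profit formula to the LDA counts backs this up (a sketch using the same constants as above):
(77 * 10.2 - 204 * 0.65) * (50000 / 2300) # ~14191, barely above the $13,553 baseline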
2/26/24: I looked into flipping the 0/1 coding, based on feedback Dr. Roy gave another group in class, but I am fairly sure ours is correct; we had high accuracy, just low specificity. I also went back and updated the profit formula for the undersampled groups, since the test set is now undersampled per his feedback. This increased those profits roughly fivefold, because the test set shrank from 2,300 to 408 observations. That would make undersampled logits 1 & 2 the best models at about $167,500 profit, but I am unsure that is correct.
I feel the only way to improve the current models would be to build another model with different features, but every time I have removed variables the AIC has gone up and the models usually become slightly worse, missing a few more actual purchasers.
Currently, the profit formula is the following:
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 127 + 77 #TP + FN: actual purchasers - from the undersampled radial SVM
response_rate = 127/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 25278.26
After settling on the 50000/2300 multiplier, because it is unrealistic to expect a 72% purchase rate from the random Midwest BBBC subscribers, our best model is glm1 at $29,934.78, which is comparable to the undersampled glm1 at $29,713.04.
total_profit = c(29934.78, 29047.83, 29713.04, 29713.04, 29269.57, 25278.26,
27273.91, 25056.52, 24834.78) #profit per model, matching model_seq below
model_seq = c('glm1', 'glm2', 'und_glm1','und_glm2', 'und_glm3', 'und_svm_radial',
'und_svm_linear', 'und_svm_mononomial', 'und_svm_sigmoid') #using this as a y variable for graph
df = data.frame(total_profit,model_seq)
# barchart with added parameters
barplot(total_profit,
main = "Total Profit Generated Per Model",
xlab = "Model Name",
ylab = "Profit in $",
names.arg = model_seq,
col = "darkred")
my_colors <- c("#FF5733", "#33FF57", "#3366FF", "#FF33FF")
library(viridis)
## Loading required package: viridisLite
dfplot = ggplot(data=df,
aes(x=model_seq, y = total_profit, fill = model_seq)) +
geom_col(color = 'black', show.legend = F)+
geom_text(aes(label=total_profit), hjust = 1)+
labs(title = "Total Profit Generated Per Model",
x = 'Model Name',
y = 'Profit in $' ) +
scale_fill_viridis_d(option = "D", end = 0.9) +
theme_minimal()
dfplot + coord_flip()
Excluding the full-data radial SVM because it performed poorly on profit (~$5,500, worse than our ~$13,553 baseline) and it distorted the graph's scale.
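For reference, the excluded full-data radial SVM had TP = 38 and FN = 166, and plugging those into the same formula confirms the rough figure quoted above:
(38 * 10.2 - 204 * 0.65) * (50000 / 2300) # ~5543, well below the mail-everyone baseline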
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'linear')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.8
##
## Number of Support Vectors: 519
##
## ( 259 260 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 158 68
## 1 46 136
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 158 68
## 1 46 136
##
## Accuracy : 0.7206
## 95% CI : (0.6743, 0.7636)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4412
##
## Mcnemar's Test P-Value : 0.0492
##
## Sensitivity : 0.7745
## Specificity : 0.6667
## Pos Pred Value : 0.6991
## Neg Pred Value : 0.7473
## Prevalence : 0.5000
## Detection Rate : 0.3873
## Detection Prevalence : 0.5539
## Balanced Accuracy : 0.7206
##
## 'Positive' Class : 0
##
Undersampled SVM, linear kernel. Acc: 0.7206 Sens: 0.7745 Spec: 0.6667 TP: 136 FN: 68
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 136 + 68 #TP + FN: actual purchasers - from the undersampled linear-kernel SVM
response_rate = 136/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 27273.91
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'polynomial', degree = 1)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "polynomial", degree = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.8
## degree: 1
## coef.0: 0
##
## Number of Support Vectors: 648
##
## ( 323 325 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 163 78
## 1 41 126
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 163 78
## 1 41 126
##
## Accuracy : 0.7083
## 95% CI : (0.6616, 0.752)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4167
##
## Mcnemar's Test P-Value : 0.0009665
##
## Sensitivity : 0.7990
## Specificity : 0.6176
## Pos Pred Value : 0.6763
## Neg Pred Value : 0.7545
## Prevalence : 0.5000
## Detection Rate : 0.3995
## Detection Prevalence : 0.5907
## Balanced Accuracy : 0.7083
##
## 'Positive' Class : 0
##
Undersampled SVM, polynomial (degree 1) kernel. Acc: 0.7083 Sens: 0.7990 Spec: 0.6176 TP: 126 FN: 78
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 126 + 78 #TP + FN: actual purchasers - from the undersampled polynomial (degree 1) SVM
response_rate = 126/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 25056.52
mysvm = svm(Choice~., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'sigmoid')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = undersampled_train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 0.8
## coef.0: 0
##
## Number of Support Vectors: 647
##
## ( 323 324 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, undersampled_test_data, type = 'response')
table(pred = svm_pred, test = undersampled_test_data$Choice)
## test
## pred 0 1
## 0 163 79
## 1 41 125
caret::confusionMatrix(as.factor(svm_pred),as.factor(undersampled_test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 163 79
## 1 41 125
##
## Accuracy : 0.7059
## 95% CI : (0.6591, 0.7497)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4118
##
## Mcnemar's Test P-Value : 0.0007312
##
## Sensitivity : 0.7990
## Specificity : 0.6127
## Pos Pred Value : 0.6736
## Neg Pred Value : 0.7530
## Prevalence : 0.5000
## Detection Rate : 0.3995
## Detection Prevalence : 0.5931
## Balanced Accuracy : 0.7059
##
## 'Positive' Class : 0
##
Undersampled SVM, sigmoid kernel. Acc: 0.7059 Sens: 0.7990 Spec: 0.6127 TP: 125 FN: 79
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 125 + 79 #TP + FN: actual purchasers - from the undersampled sigmoid SVM
response_rate = 125/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 24834.78
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'linear')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.8
##
## Number of Support Vectors: 737
##
## ( 367 370 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2011 147
## 1 85 57
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2011 147
## 1 85 57
##
## Accuracy : 0.8991
## 95% CI : (0.8861, 0.9111)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.9802
##
## Kappa : 0.2768
##
## Mcnemar's Test P-Value : 6.206e-05
##
## Sensitivity : 0.9594
## Specificity : 0.2794
## Pos Pred Value : 0.9319
## Neg Pred Value : 0.4014
## Prevalence : 0.9113
## Detection Rate : 0.8743
## Detection Prevalence : 0.9383
## Balanced Accuracy : 0.6194
##
## 'Positive' Class : 0
##
Full-data SVM, linear kernel. Acc: 0.8991 Sens: 0.9594 Spec: 0.2794 TP: 57 FN: 147
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 57 + 147 #TP + FN: actual purchasers - from the full-data linear-kernel SVM
response_rate = 57/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 9756.522
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'polynomial', degree = 1)
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "polynomial", degree = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.8
## degree: 1
## coef.0: 0
##
## Number of Support Vectors: 793
##
## ( 395 398 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2070 176
## 1 26 28
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2070 176
## 1 26 28
##
## Accuracy : 0.9122
## 95% CI : (0.8999, 0.9234)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.4601
##
## Kappa : 0.1869
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9876
## Specificity : 0.1373
## Pos Pred Value : 0.9216
## Neg Pred Value : 0.5185
## Prevalence : 0.9113
## Detection Rate : 0.9000
## Detection Prevalence : 0.9765
## Balanced Accuracy : 0.5624
##
## 'Positive' Class : 0
##
Full-data SVM, polynomial (degree 1) kernel. Acc: 0.9122 Sens: 0.9876 Spec: 0.1373 TP: 28 FN: 176
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 28 + 176 #TP + FN: actual purchasers - from the full-data polynomial (degree 1) SVM
response_rate = 28/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 3326.087
mysvm = svm(Choice~., data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, kernel = 'sigmoid')
summary(mysvm)
##
## Call:
## svm(formula = Choice ~ ., data = train_data, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost, kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 0.8
## coef.0: 0
##
## Number of Support Vectors: 794
##
## ( 395 399 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
svm_pred = predict(mysvm, test_data, type = 'response')
table(pred = svm_pred, test = test_data$Choice)
## test
## pred 0 1
## 0 2070 176
## 1 26 28
caret::confusionMatrix(as.factor(svm_pred),as.factor(test_data$Choice))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2070 176
## 1 26 28
##
## Accuracy : 0.9122
## 95% CI : (0.8999, 0.9234)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 0.4601
##
## Kappa : 0.1869
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9876
## Specificity : 0.1373
## Pos Pred Value : 0.9216
## Neg Pred Value : 0.5185
## Prevalence : 0.9113
## Detection Rate : 0.9000
## Detection Prevalence : 0.9765
## Balanced Accuracy : 0.5624
##
## 'Positive' Class : 0
##
Full-data SVM, sigmoid kernel. Acc: 0.9122 Sens: 0.9876 Spec: 0.1373 TP: 28 FN: 176
cost = 0.65 # $0.65 to mail the catalog out
company_book_price = 15
overhead = company_book_price * 0.45 #6.75
book_selling_price = 31.95
profit_per_book = book_selling_price - (company_book_price + overhead) #10.2
total_customers = 28 + 176 #TP + FN: actual purchasers - from the full-data sigmoid SVM
response_rate = 28/total_customers #TP/(TP+FN): share of actual purchasers we predicted correctly
multiplier = 50000/2300 # 50000 / test data set size - scales our profit to study
profit = ((total_customers * response_rate) * profit_per_book - (total_customers * cost)) * multiplier
profit
## [1] 3326.087