Throughout your early career as a data scientist, whether that was exploring NBA talent, guiding climate change policy investment, or learning how to create better commercials, you’ve realized you need to enhance your ability to assess the models you are building. The most important part of understanding any machine learning model is understanding its weaknesses, or better said, its vulnerabilities.

To do so, you’ve decided to revisit your last consulting gig and get a sense of how to discuss the good and the bad of these outcomes.

Part 1. Revisit both the example from the KNN in class lecture and the model you developed for the lab. Run through a series of evaluation measures we have covered in class and discuss the outcome in terms of the question you are trying to solve. Focus on Accuracy, TPR, FPR, F1, Kappa and ROC/AUC. Make sure to calculate the base rate or prevalence to provide a reference for some of these measures. (The question from in-class lecture is centered on building a classifier to target potential new customers.)

Part 2. Take a closer look at where misclassification errors are occurring; is there a pattern? Use the bias and LogLoss R functions to get a better understanding of what’s going on.

Part 3. Based on your exploration in Part 2, change the threshold using the function provided, what differences do you see in the evaluation metrics? Speak specifically to the metrics you think are best suited to address the questions you are trying to answer.

Part 4. Summarize your findings to include recommendations on how you might change each of the two KNN models based on the results. These recommendations might include gathering more data, adjusting the threshold or maybe that it’s working fine at the current level and nothing should be done. Regardless of the outcome, what should we be aware of when these models are deployed?

Evaluation: Bank Marketing Model

Data Setup

bank_data = read.csv("bank.csv",
                     check.names = FALSE, 
                     stringsAsFactors = FALSE)


bank_data[, c("age","duration","balance")] <- lapply(bank_data[, c("age","duration","balance")],function(x) scale(x))

set.seed(1982)
bank_data_train_rows = sample(1:nrow(bank_data),
                              round(0.8 * nrow(bank_data), 0),  
                              replace = FALSE)

# Setting up the Training Data
bank_data_train = bank_data[bank_data_train_rows, ]

# Setting up the Test Data                                                   
bank_data_test = bank_data[-bank_data_train_rows, ] 

library(class)      # knn()
library(caret)      # confusionMatrix()
library(tidyverse)  # tibble(), if_else(), %>%, select()
library(ROCR)       # prediction(), performance()
library(Metrics)    # bias()    -- assumed source of the bias() used below
library(MLmetrics)  # LogLoss() -- assumed source of the LogLoss() used below

set.seed(1982)
# Training the data using k = 3
bank_3NN <-  knn(train = bank_data_train[, c("age", "balance", "duration")],
               test = bank_data_test[, c("age", "balance", "duration")],    
               cl = bank_data_train[, "signed up"],
               k = 3,
               use.all = TRUE,
               prob = TRUE) 

Creating a Confusion Matrix

table(bank_3NN, bank_data_test$`signed up`)
##         
## bank_3NN    0    1
##        0 7290  780
##        1  402  254

Given this basic confusion matrix, we can manually calculate the accuracy, error rate, true positive rate, true negative rate, false negative rate, and false positive rate. However, we can also use the confusionMatrix() function from the caret package to calculate these values for us; it provides further insight into some other baseline and evaluation metrics as well.
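As a quick sanity check, the headline rates can be reproduced by hand from the four cells of the table above (a minimal sketch; the counts are copied manually from the matrix):

# Cell counts copied from the confusion matrix above
tn <- 7290; fp <- 402; fn <- 780; tp <- 254
n  <- tn + fp + fn + tp

(tp + tn) / n        # accuracy (~0.8645, matching caret below)
tp / (tp + fn)       # sensitivity / TPR
tn / (tn + fp)       # specificity / TNR
fp / (fp + tn)       # false positive rate
(tp + fn) / n        # prevalence / base rate of sign-ups (~0.1185)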

bank_conf_mat <- confusionMatrix(as.factor(bank_3NN), as.factor(bank_data_test$`signed up`), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
bank_conf_mat
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 7290  780
##          1  402  254
##                                           
##                Accuracy : 0.8645          
##                  95% CI : (0.8572, 0.8717)
##     No Information Rate : 0.8815          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2297          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.24565         
##             Specificity : 0.94774         
##          Pos Pred Value : 0.38720         
##          Neg Pred Value : 0.90335         
##              Prevalence : 0.11850         
##          Detection Rate : 0.02911         
##    Detection Prevalence : 0.07518         
##       Balanced Accuracy : 0.59669         
##                                           
##        'Positive' Class : 1               
## 

When analyzing the output, it is easiest to start with the accuracy, which is fairly high at 86.45%. However, when we move on to the sensitivity and specificity, we notice there is more to the story. Although we tout a high accuracy, there is a grave imbalance between the two: the true negative rate, or specificity, is extremely high at 94.77%, while the true positive rate, or sensitivity, is extremely low at 24.57%. Arguably, given the context of the problem, it is more advantageous to maximize the sensitivity, since it is more desirable to correctly identify those who would sign up than to accurately predict those who would not. This is why we must look beyond overall accuracy: our accuracy is so high because we are highly accurate in predicting those who would not sign up (the 0s), but we perform poorly on the group we care more about.

The likely explanation for this is that we have an imbalanced sample, as shown by the prevalence value. The prevalence tells us that only 11.85% of the sample are customers who actually signed up (class 1), so it makes intuitive sense that the model is worse at predicting that class; it simply has much less data to learn from. We also see this affecting the positive and negative predictive values, which tell us, given a predicted class, the probability that the prediction is correct. Once again, the negative predictive value is high (90.34%) while the positive predictive value is low (38.72%).

Looking at another metric, we can assess the kappa value, which acts almost as a grade of the model because it corrects the accuracy for the agreement we would expect by chance alone. At 0.2297, the kappa indicates only slight-to-fair agreement beyond chance, meaning the model is performing poorly once the class imbalance is accounted for, which reinforces our findings from the other metrics discussed previously.
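As a reference for how that grade is produced, kappa can be reproduced from the same cell counts by comparing the observed accuracy to the accuracy expected from the row and column totals alone (a minimal sketch):

# Kappa by hand: observed accuracy vs. accuracy expected by chance from the margins
tn <- 7290; fp <- 402; fn <- 780; tp <- 254
n  <- tn + fp + fn + tp

observed <- (tn + tp) / n
expected <- ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n^2
(observed - expected) / (1 - expected)   # ~0.23, matching the Kappa reported by caret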

Since we are working with imbalanced data, we should also look at the F1 score, which combines precision and recall. While we are at it, we will also calculate the false positive rate and false negative rate and comment on those as well.

Calculating the F1 Score, FPR, and FNR

tp <- 254; fp <- 402; fn <- 780; tn <- 7290

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)   # same as the sensitivity above
bank_F1   <- 2 * precision * recall / (precision + recall)
fpr       <- fp / (fp + tn)
fnr       <- fn / (fn + tp)

paste("F1 score: ", toString(round(bank_F1, 4)))
## [1] "F1 score:  0.3006"
paste("False Positive Rate: ", toString(round(fpr, 4)))
## [1] "False Positive Rate:  0.0523"
paste("False Negative Rate: ", toString(round(fnr, 4)))
## [1] "False Negative Rate:  0.7544"

The F1 score gives us the balance between precision and recall; it is defined as the harmonic mean of the two. A “good” F1 score tells us there is a good balance between precision and recall and will be close to 1; the score takes continuous values between 0 and 1. Upon calculating our F1 score, we find it to be 0.3006 (the same value caret reports later with mode = "everything"), indicating the balance between precision and recall is poor. Looking next at the FPR and FNR, which are 5.23% and 75.44% respectively, we see that we very rarely misclassify someone who would not sign up as a 1, but we miss roughly three quarters of the people who actually would sign up. Both kinds of errors have cost implications: identifying someone as likely to sign up when in reality they would not means allocating money and resources to capture a customer who will not convert, while misclassifying a willing customer as a non-sign-up means missing out on revenue. Since, arguably, neither is clearly more detrimental than the other, our primary focus should be to find a balance between the two while raising the TPR, which we will attempt to do with an ROC curve.

Now we will take a look at the bias, which for our purposes is simply the average difference between the actual outcomes and the predicted classes; if the model is unbiased, this metric will be close to zero.

prob_knn <- tibble(bank_3NN, attributes(bank_3NN)$prob)
prob_knn$prob <- if_else(prob_knn$bank_3NN == 0,
                         1 - prob_knn$`attributes(bank_3NN)$prob`,
                         prob_knn$`attributes(bank_3NN)$prob`) # convert the winning-class vote share into P(signed up = 1)
bank_eval <- data.frame(pred_class = bank_3NN, 
                        pred_prob = prob_knn$prob, 
                        target = as.numeric(bank_data_test$`signed up`))

# pred_class is a factor, so convert via as.character() to recover the 0/1 labels;
# as.numeric() on the factor itself would return the level codes 1/2 and distort the metric
bias(as.numeric(bank_data_test$`signed up`),
     as.numeric(as.character(bank_eval$pred_class)))
## [1] 0.04331882

This value is small and positive (about 0.043): on average the model predicts the positive class slightly less often than it actually occurs, which lines up with the low sensitivity we saw above, but there is no evidence of a large systematic bias. Next, and finally, we will take a look at the log loss value; ideally, it will be close to zero.
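For intuition, log loss averages -[y*log(p) + (1-y)*log(1-p)] over the test set, with the predicted probabilities clipped away from 0 and 1 so the logarithms stay finite (the LogLoss() call below does something similar internally). A minimal hand-rolled sketch:

# Hand-rolled log loss (sketch); eps keeps log(0) from blowing up on confidently wrong predictions
manual_logloss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
# manual_logloss(bank_eval$pred_prob, as.numeric(bank_data_test$`signed up`))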

LogLoss(as.numeric(bank_eval$pred_prob), as.numeric(bank_data_test$`signed up`))
## [1] 2.135454

The log loss, at roughly 2.14, also sits far from zero. Because the 3-NN vote shares are coarse (essentially 0, 1/3, 2/3, or 1), confidently wrong predictions are penalized heavily, and this is another sign that the model is not performing particularly well. We will now create an ROC curve and compute the AUC to analyze how we might alter the threshold to balance our TPR and FPR.

Creating the ROC and Calculating the AUC

bank_pred <- prediction(bank_eval$pred_prob, bank_eval$target)
bank_roc <- performance(bank_pred, "tpr", "fpr")
plot(bank_roc, colorize = TRUE)

auc_value <- performance(bank_pred, 'auc')
paste("AUC: ", auc_value@y.values)
## [1] "AUC:  0.692644698050978"

Looking at the ROC curve, we see that we are not performing terribly far from the random-guess line. Our AUC of 0.6926 bolsters this, since it is not far off from the random-guess AUC of 0.5. The model is performing moderately to poorly and is a clear candidate for improvement.

To improve the model, we should consider a different threshold. Looking at the ROC curve, it appears that at a threshold of approximately 0.3-0.4 we could achieve a TPR of roughly 60% while keeping the FPR under control. Thus, to try to improve the model, we will first change the threshold to approximately 0.35 and see how this affects the evaluation metrics, in particular the true positive rate, which is what we are most interested in.
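To make the choice of cutoff less of an eyeball exercise, the cutoffs behind the ROC curve can be pulled directly from the ROCR performance object (a minimal sketch; output not shown here):

# Each row pairs a probability cutoff with the TPR/FPR it produces on the test set
roc_points <- data.frame(cutoff = bank_roc@alpha.values[[1]],
                         fpr    = bank_roc@x.values[[1]],
                         tpr    = bank_roc@y.values[[1]])
roc_points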

Using a New Threshold

adjust_thres <- function(x, y, z) {
  # x = predicted probabilities, y = threshold, z = test outcomes (as a factor)
  thres <- as.factor(ifelse(x > y, 1, 0))
  confusionMatrix(thres, z, positive = "1", dnn = c("Prediction", "Actual"), mode = "everything")
}

adjust_thres(bank_eval$pred_prob, 0.35, as.factor(bank_eval$target))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 7290  780
##          1  402  254
##                                           
##                Accuracy : 0.8645          
##                  95% CI : (0.8572, 0.8717)
##     No Information Rate : 0.8815          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2297          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.24565         
##             Specificity : 0.94774         
##          Pos Pred Value : 0.38720         
##          Neg Pred Value : 0.90335         
##               Precision : 0.38720         
##                  Recall : 0.24565         
##                      F1 : 0.30059         
##              Prevalence : 0.11850         
##          Detection Rate : 0.02911         
##    Detection Prevalence : 0.07518         
##       Balanced Accuracy : 0.59669         
##                                           
##        'Positive' Class : 1               
## 

Notice that every metric is identical to the original model. This happens because the 3-NN vote share for class 1 can essentially only take the values 0, 1/3, 2/3, and 1, so any threshold between 1/3 and 2/3 reproduces the default majority-vote classification; a cutoff of 0.35 therefore changes nothing. To actually move the sensitivity we need to drop the threshold below 1/3, so that a single neighbor voting "signed up" is enough to flag a customer.
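One quick way to confirm this is to look at how few distinct predicted probabilities the 3-NN model actually produces (a sketch; output not shown here):

# With k = 3, the class-1 vote share is essentially limited to 0, 1/3, 2/3 and 1
table(round(bank_eval$pred_prob, 2))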

Confusion Matrix Visual

# Confusion matrix with descriptive labels 
# Using the caret library 
lvs <- c("Wouldn't Sign Up","Would Sign Up")
# 7,692 test customers did not sign up; 1,034 did
truth <- factor(rep(lvs, times = c(7692, 1034)),
levels = rev(lvs))
# of the non-sign-ups, 7,290 were predicted "Wouldn't" and 402 "Would";
# of the sign-ups, 780 were predicted "Wouldn't" and 254 "Would"
pred <- factor(
c(rep(lvs, times = c(7290, 402)), rep(lvs, times = c(780, 254))), levels = rev(lvs))
xtab <- table(pred, truth)
cm <- confusionMatrix(pred, truth)
cm$table
##                   Reference
## Prediction         Would Sign Up Wouldn't Sign Up
##   Would Sign Up              254              402
##   Wouldn't Sign Up           780             7290
fourfoldplot(cm$table)


Before settling on a second threshold, it is worth recalling how the ROC curve and the AUC work together and what a “good” versus “bad” ROC curve looks like. The ROC curve simply plots the TPR against the FPR across every possible cutoff, with the diagonal representing random guessing, and the AUC summarizes the whole curve in a single number; we use the same framework for both the bank model here and the commercial model that follows.


To improve our model we therefore need a cutoff below the 1/3 vote share. Looking at the ROC curve, at a threshold of approximately 0.3 we achieve a TPR of roughly 60% while still keeping the FPR controlled. Thus, we will change the threshold to 0.3 and see how this affects the evaluation metrics, namely the true positive rate, which is what we are most interested in.

Using a New Threshold


adjust_thres(bank_eval$pred_prob, 0.3, as.factor(bank_eval$target))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 6060  440
##          1 1632  594
##                                           
##                Accuracy : 0.7625          
##                  95% CI : (0.7535, 0.7714)
##     No Information Rate : 0.8815          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2417          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.57447         
##             Specificity : 0.78783         
##          Pos Pred Value : 0.26685         
##          Neg Pred Value : 0.93231         
##               Precision : 0.26685         
##                  Recall : 0.57447         
##                      F1 : 0.36442         
##              Prevalence : 0.11850         
##          Detection Rate : 0.06807         
##    Detection Prevalence : 0.25510         
##       Balanced Accuracy : 0.68115         
##                                           
##        'Positive' Class : 1               
## 

Looking at this new threshold, we should note a few things. First, we see a non-trivial decrease in accuracy; however, more notably and more importantly, we see a great improvement in the sensitivity, which is what we were aiming to increase. Compared to our preliminary model, which had a sensitivity (TPR) of only 24.57%, we have now improved this metric to 57.45%. As a trade-off, the specificity drops to 78.78%. This should not pose a serious issue, as the sensitivity is, in practice, more valuable to us than the specificity. Furthermore, looking at the kappa value, we do not see a drastic improvement (0.2417 versus 0.2297), so the model is still performing only moderately at best. Another thing to note is that the model now has a non-trivial false positive rate (roughly 21%), which once again has monetary implications, since we would be spending money trying to capture customers who will not actually sign up.
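To make that trade-off concrete, here is a purely hypothetical back-of-the-envelope comparison; the per-contact cost and per-sign-up revenue below are invented for illustration only, while the confusion-matrix counts come from the outputs above:

# Hypothetical economics (assumed figures, not from the data):
cost_per_contact   <- 10    # cost to pursue each predicted sign-up
revenue_per_signup <- 100   # revenue from each captured sign-up

net_value <- function(tp, fp) tp * revenue_per_signup - (tp + fp) * cost_per_contact

net_value(tp = 254, fp = 402)    # default (majority-vote) threshold
net_value(tp = 594, fp = 1632)   # threshold of 0.3

Under these assumed numbers the lower threshold roughly doubles the net value, but the conclusion obviously depends on the real costs, which the client would need to supply.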

For further improvement, the main issue we face is the grave imbalance of the data. Since we have so few examples of the class of interest, it is difficult to improve the model substantially through tuning alone; the model is facing a systemic data issue. To mitigate this, our primary recommendation is to collect more data on class 1, the customers who signed up, recognizing that this may be a difficult endeavor and may simply require time. When the model is deployed, we should also be aware that the chosen threshold encodes a business trade-off between wasted outreach (false positives) and missed customers (false negatives), and that performance should be monitored as new data arrives.

Evaluation: TV Commercial Model

Data Setup

set.seed(3)
tv_comm <- read.csv('tv_commercial_datasets_CNN_Cleaned.csv', stringsAsFactors = FALSE)
cnn <- t(read.csv('cnn_commmercial_label.csv', header = FALSE))
colnames(tv_comm) <- cnn
tv_comm2 <- tv_comm %>%
  select(-ends_with('var'))
scale_comm <- scale(tv_comm2[ ,1:(ncol(tv_comm2)-1)])
tv_comm2 <- cbind(scale_comm, Commercial = tv_comm2[ ,ncol(tv_comm2)])
subset_knn <- tv_comm2[ ,-c(2,3,6,7,8)]
sample.data <- sample(1:nrow(subset_knn),
                              round(0.7 * nrow(subset_knn), 0),  #<- multiply the number of rows by 0.7 and round the decimals
                              replace = FALSE)
knn_train <- subset_knn[sample.data, ]
knn_test <- subset_knn[-sample.data, ]
tv_3NN <-  knn(train = knn_train[ ,-ncol(knn_train)],
               test = knn_test[ ,-ncol(knn_train)],  
               cl = knn_train[, "Commercial"],
               k = 3,
               use.all = TRUE,
               prob = TRUE)

Creating a Confusion Matrix

conf_mat <- table(tv_3NN, knn_test[ ,"Commercial"])
conf_mat
##       
## tv_3NN   -1    1
##     -1 1428  688
##     1  1102 3546

We can choose to create a basic confusion matrix using the table function, which allows us to manually calculate the error rate, true positive rate, true negative rate, false negative rate, and false positive rate. However, just like with the bank example, we can instead use the confusionMatrix() function from the caret package to calculate these values for us. It also provides further insight into some other baseline and evaluation metrics.

confusionMatrix(as.factor(tv_3NN), as.factor(knn_test[ ,'Commercial']), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1428  688
##         1  1102 3546
##                                           
##                Accuracy : 0.7354          
##                  95% CI : (0.7247, 0.7458)
##     No Information Rate : 0.626           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4156          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8375          
##             Specificity : 0.5644          
##          Pos Pred Value : 0.7629          
##          Neg Pred Value : 0.6749          
##              Prevalence : 0.6260          
##          Detection Rate : 0.5242          
##    Detection Prevalence : 0.6872          
##       Balanced Accuracy : 0.7010          
##                                           
##        'Positive' Class : 1               
## 

In this case, we can begin with the accuracy, which is fairly high at 73.54%. We can then look at the sensitivity and specificity, which are critical for understanding the model. With class 1 (commercial) treated as the positive class, sensitivity is the true positive rate for commercials and specificity is the true negative rate for non-commercial segments (class -1). Our sensitivity is fairly high at 83.75%, but our specificity is noticeably lower at 56.44%. Although both values matter, in this context we could argue that maximizing the specificity is the more desirable goal, since our initial intent was to correctly identify non-commercial segments rather than just commercials: in the TV/commercial model we analyzed in the lab, the client's goal was to make their commercials seem more like actual TV shows.

Looking at metrics beyond overall accuracy is critical because it gives a different perspective on how the model functions. Our accuracy is high largely because we are quite good at predicting commercials (the 1s) and only fair at predicting non-commercials (the -1s). The prevalence here is 62.60%, meaning 62.60% of the test observations are actual commercials; unlike the bank data, the sample is reasonably balanced, though still imbalanced (in a perfect world the prevalence would sit near 50%). The prevalence also explains why the sensitivity is high: the model has far more commercial examples to learn from, so it is intuitively better at predicting that class.

To continue, we can look at the kappa value, which acts almost as a grade of the model. Here the kappa is 0.4156, which we can interpret as moderate agreement beyond chance. This means the model is performing moderately, which reinforces our findings from the other metrics discussed previously.

Since we are still working with somewhat imbalanced data, we should also look at the F1 score, which combines precision and recall, and we will calculate the false positive and false negative rates along the way.

Confusion Matrix Visual

# Confusion matrix with labels 
#Using the caret library 
lvs <- c("Commercial","Non Commercial")
truth <- factor(rep(lvs, times = c(4234, 2530)),
levels = rev(lvs))
pred <- factor(
c(rep(lvs, times = c(3546, 688)), rep(lvs, times = c(1102, 1428))), levels = rev(lvs))
xtab <- table(pred, truth)
cm <- confusionMatrix(pred, truth)
cm$table
##                 Reference
## Prediction       Non Commercial Commercial
##   Non Commercial           1428        688
##   Commercial               1102       3546
fourfoldplot(cm$table)

Calculating the F1 Score, FPR, and FNR

tp2 <- 3546; fp2 <- 1102; fn2 <- 688; tn2 <- 1428

precision2 <- tp2 / (tp2 + fp2)
recall2    <- tp2 / (tp2 + fn2)   # same as the sensitivity above
tv_F1      <- 2 * precision2 * recall2 / (precision2 + recall2)
fpr2       <- fp2 / (fp2 + tn2)
fnr2       <- fn2 / (fn2 + tp2)

paste("F1 score: ", toString(round(tv_F1, 4)))
## [1] "F1 score:  0.7985"
paste("False Positive Rate: ", toString(round(fpr2, 4)))
## [1] "False Positive Rate:  0.4356"
paste("False Negative Rate: ", toString(round(fnr2, 4)))
## [1] "False Negative Rate:  0.1625"

We find an F1 score of 0.7985 (rounded). Since the F1 score measures the balance between precision and recall and we want it as close to 1 as possible, this is actually fairly good, although ideally we would like the balance to be higher still. The FPR and FNR are 43.56% and 16.25% respectively: we misclassify nearly half of the true non-commercial segments as commercials, which is consistent with the low specificity, while we miss a much smaller share of the actual commercials.

prob_knn2 <- tibble(tv_3NN, attributes(tv_3NN)$prob)
prob_knn2 <- as.data.frame(prob_knn2)
prob_knn2$prob <- if_else(prob_knn2$tv_3NN == -1,
                         1-prob_knn2$`attributes(tv_3NN)$prob`, prob_knn2$`attributes(tv_3NN)$prob`) # convert the winning-class vote share into P(commercial = 1)

tv_eval <- data.frame(pred_class = tv_3NN, 
                        pred_prob = prob_knn2$prob, 
                        target = as.numeric(knn_test[ ,"Commercial"]))

# as with the bank model, convert the factor back to its -1/1 labels before computing the bias;
# as.numeric() on the factor itself would return the level codes 1/2
bias(as.numeric(knn_test[,"Commercial"]),
     as.numeric(as.character(tv_eval$pred_class)))
## [1] -0.1224128

This value of about -0.12 tells us that, on average, the predictions sit slightly above the actual labels; in other words, the model predicts the commercial class (+1) a bit more often than it actually occurs (detection prevalence of 68.72% versus a prevalence of 62.60%). The bias is modest rather than severe. We can continue on to the log loss value to get a better sense of the model.

LogLoss(as.numeric(tv_eval$pred_prob), 
        as.numeric(knn_test[ ,"Commercial"]))
## [1] 1.971747

Ideally, we also want the log loss to be as close to 0 as possible, and we can see that is not the case here. One caveat: the labels in this dataset are coded -1/1 rather than 0/1, while log loss expects 0/1 targets, so the absolute value above should be interpreted with some caution (recoding the labels first, as sketched below, would be cleaner). Either way, this is another metric suggesting the model is not performing particularly well. We can move on to the ROC curve and AUC value to analyze how we might alter the threshold to balance our TPR and FPR.
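A minimal sketch of that recoding, assuming we simply map -1 to 0 before calling LogLoss() (output not shown here):

# Recode -1/1 labels to 0/1 so the log loss formula treats them as proper targets
tv_target01 <- ifelse(as.numeric(knn_test[, "Commercial"]) == 1, 1, 0)
LogLoss(as.numeric(tv_eval$pred_prob), tv_target01)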

Creating the ROC and Calculating the AUC

tv_pred <- prediction(tv_eval$pred_prob, tv_eval$target)
tv_roc <- performance(tv_pred, "tpr", "fpr")
plot(tv_roc, colorize = TRUE)

auc_value2 <- performance(tv_pred, 'auc')
paste("AUC: ", auc_value2@y.values)
## [1] "AUC:  0.749691048000284"

We can see here that we have a relatively well performing ROC curve that is not close to the random guess line. Our calculation of AUC allows us to better understand how our ROC curve performs. We have an AUC value of around 0.7497. A value close to .5 would represent a bad classifier, whereas a score of 1 would represent a good classifier. Here, we can see that a value of .7497 is closer to 1 than it is to .5, so we can say that we have “acceptable discrimination” for the classifier. There is, however, room for improvement.

To improve our model, we should look at a different threshold value. Looking at the ROC curve, it appears that at a threshold of approximately 0.6 we are achieving a TPR of approximately 80% while controlling for our FPR. Thus, to try to improve our model, we will change the threshold to approximately 0.6 and see how this affects the model, namely the specificity, which is what we argued matters most here.

Using a New Threshold

thres <- as.factor(ifelse(tv_eval$pred_prob > 0.6, 1, -1))
confusionMatrix(thres, as.factor(tv_eval$target), positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1430  690
##         1  1100 3544
##                                           
##                Accuracy : 0.7354          
##                  95% CI : (0.7247, 0.7458)
##     No Information Rate : 0.626           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4158          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8370          
##             Specificity : 0.5652          
##          Pos Pred Value : 0.7631          
##          Neg Pred Value : 0.6745          
##               Precision : 0.7631          
##                  Recall : 0.8370          
##                      F1 : 0.7984          
##              Prevalence : 0.6260          
##          Detection Rate : 0.5240          
##    Detection Prevalence : 0.6866          
##       Balanced Accuracy : 0.7011          
##                                           
##        'Positive' Class : 1               
## 

Looking at the new threshold, there are a few important things to note. First, the accuracy remains 73.54% in both models. The specificity rises only marginally, from 56.44% to 56.52%, even though that is the rate we were aiming to increase (true non-commercials correctly identified as non-commercials), while the sensitivity slips slightly from 83.75% to 83.70%. As with the bank model, there is a trade-off between the two rates, but here the new threshold barely moves either one. The reason is that the 3-NN vote share for the commercial class essentially only takes the values 0, 1/3, 2/3, and 1, so a cutoff of 0.6 selects almost exactly the same observations as the default majority vote; unlike the bank model, where dropping the threshold below 1/3 changed the classifications substantially, a 0.6 cutoff simply cannot.

Looking at the kappa value, we see 0.4158, which is barely different from the original 0.4156. This still indicates that the model is performing only moderately.
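If the client truly values specificity above all else, the only remaining cutoff to try with 3-NN vote shares is one above 2/3, so that a segment is labeled a commercial only when all three neighbors agree; a sketch using the same thresholding approach (output not shown here, and the cost in sensitivity would likely be substantial):

# Label a segment a commercial only on a unanimous 3-NN vote
thres_high <- as.factor(ifelse(tv_eval$pred_prob > 0.7, 1, -1))
confusionMatrix(thres_high, as.factor(tv_eval$target),
                positive = "1", dnn = c("Prediction", "Actual"), mode = "everything")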

One thing that could be improved in the future is the class balance. Although the commercial data are more balanced than the bank data, we could still improve the model by collecting more data on class -1, the non-commercial segments, which are the minority class here. Getting the prevalence closer to 50% would make the data more balanced, but a prevalence of 0.6260 is only moderately imbalanced, and we believe the model is doing reasonably well as is. When the model is deployed, the main thing to watch is the relatively low specificity, that is, how often true TV-show segments end up flagged as commercials.