Objective

Throughout your early career as a data scientist, whether that meant exploring NBA talent, guiding climate change policy investment, or figuring out how to create better commercials, you've realized you need to enhance your ability to assess the models you are building. The most important part of understanding any machine learning model is understanding its weaknesses, or better said, its vulnerabilities.

In doing so, you've decided to revisit your recent consulting gigs and get a sense of how to discuss the good and the bad of their outcomes.

Customer Classification Analysis

The goal of this kNN model is to predict whether a potential customer will subscribe to the client's service. When thinking about what metrics might be useful for this case, I believe sensitivity (true positive rate) and log loss would provide the most valuable information: the client will want a model that is great at identifying prospective customers (true positive rate), and they wouldn't want wrong classifications made with high confidence (log loss), because those could translate to lost revenue.

Building the Model

Calculating base rate of classifying a customer:
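
A minimal sketch of that calculation, assuming the data live in a data frame called `customers` with a 0/1 outcome column `signed_up` (both names are hypothetical):

```r
# Base rate: the proportion of positive (subscribed) cases in the data.
# A random guess of "subscribed" would be correct this often.
base_rate <- sum(customers$signed_up == 1) / nrow(customers)
base_rate
```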

This means that at random, we have an 11.6% chance of correctly picking out a subscribed individual.

After building the kNN model and using a function to assess the optimal number of neighbors k, this elbow plot is created:
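
The original tuning function isn't shown; here is a sketch of how such a plot might be produced with `class::knn()`, assuming hypothetical `train_x`/`test_x` feature matrices and `train_y`/`test_y` label vectors:

```r
library(class)

# Test-set accuracy for a range of candidate k values.
k_values <- seq(1, 21, by = 2)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

# Elbow plot: look for the k where accuracy gains flatten out.
plot(k_values, accuracy, type = "b", xlab = "k (neighbors)", ylab = "Accuracy")
```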

The marginal improvement in accuracy drops off at a k value of 5-7, so we'll run our model with k = 5 nearest neighbors. The confusion matrix for k = 5 is below:

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 7391  788
##          1  301  246
##                                           
##                Accuracy : 0.8752          
##                  95% CI : (0.8681, 0.8821)
##     No Information Rate : 0.8815          
##     P-Value [Acc > NIR] : 0.9663          
##                                           
##                   Kappa : 0.2497          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.23791         
##             Specificity : 0.96087         
##          Pos Pred Value : 0.44973         
##          Neg Pred Value : 0.90366         
##              Prevalence : 0.11850         
##          Detection Rate : 0.02819         
##    Detection Prevalence : 0.06269         
##       Balanced Accuracy : 0.59939         
##                                           
##        'Positive' Class : 1               
## 

Metrics Analysis

  • Accuracy = 87.52%

  • Sensitivity (true positive rate) = 23.79%

  • False positive rate = 3.91%

  • F1 Score = 0.31

  • Kappa = 0.2497

  • AUC = 0.728

  • Bias = -0.944

  • Log Loss = 1.505

From the confusion matrix above, the accuracy of our model with k = 5 is 87.52%. That sounds great on its own, but accuracy can paint a misleading picture because it lumps positive and negative cases together; note that it is actually below the no-information rate of 88.15%, so always guessing "won't subscribe" would be slightly more accurate. To dig deeper, we'll look at the true positive rate (or sensitivity) and the false positive rate (1 - specificity).

The sensitivity is 23.79%, which is very poor. On the other hand, the false positive rate is 3.91%, which is excellent. Together these metrics tell us the model is terrible at correctly classifying a customer who will sign up, but excellent at identifying when a customer won't. Applying this model in real life therefore might not be a good idea if the bank wants to find new customers; if, however, they want to identify who isn't a good fit as a customer, the model does that well.

The F1 score (the harmonic mean of precision and recall) is 0.31, which is poor. The F1 score balances precision (how often a predicted positive is actually positive) against recall (the proportion of actual positives the model catches), so a poor F1 is further confirmation that our model is deficient at classifying positive outcomes (prospective customers).
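
As a sanity check, these rates can be recomputed directly from the counts in the matrix above (TP = 246, FN = 788, FP = 301, TN = 7391):

```r
tp <- 246; fn <- 788; fp <- 301; tn <- 7391

sensitivity <- tp / (tp + fn)   # 0.2379, the true positive rate
fpr         <- fp / (fp + tn)   # 0.0391, the false positive rate
precision   <- tp / (tp + fp)   # 0.4497, the positive predictive value
f1 <- 2 * precision * sensitivity / (precision + sensitivity)   # ~0.31
```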

Another metric, Kappa, indicates how much better our classifier performs than a classifier that just guesses at random; here it equals 0.2497. That is also poor, indicating our model is not much better than random guessing (for classifying positive cases).
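
Kappa can likewise be recomputed from the matrix: it compares the observed accuracy with the accuracy expected by chance given the row and column totals.

```r
n <- 7391 + 788 + 301 + 246              # total test cases
p_obs <- (7391 + 246) / n                # observed accuracy, 0.8752
# Chance agreement if predictions were independent of the actual classes:
p_exp <- ((7391 + 788) * (7391 + 301) + (301 + 246) * (788 + 246)) / n^2
kappa <- (p_obs - p_exp) / (1 - p_exp)   # ~0.25
```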

Another useful metric uses the ROC (receiver operating characteristic) curve, which plots sensitivity against 1 - specificity at varying cutoff thresholds (the probability threshold the model uses to classify a case as positive). An AUC (area under the curve) value is calculated from the ROC; the value for this model is 0.728, which is only a fair rating, as we want a value > 0.8.
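
A common way to get the ROC curve and AUC in R is the `pROC` package; a sketch assuming hypothetical `test_y` (actual labels) and `probs` (the model's predicted probability of the positive class):

```r
library(pROC)

# Build the ROC object from actual labels and predicted probabilities,
# then plot the curve and extract the area under it.
roc_obj <- roc(response = test_y, predictor = probs)
plot(roc_obj)
auc(roc_obj)
```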

The bias metric is -0.944, which is poor, as an unbiased model would read zero. The negative value means the model tends to classify an outcome as a non-customer, which is consistent with the very low false positive rate.

Log loss is another useful metric: it measures the uncertainty of the probabilities the model produces by comparing them to the actual classifications. It is beneficial because it heavily penalizes instances of high confidence (high probability) in an incorrect classification. The value for this model is 1.505, which is poor, as we want a number close to zero. This means there is a lot of uncertainty in our model's classification probabilities.
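
Concretely, log loss is the average negative log-probability the model assigned to the true class; a small sketch (probabilities are clipped away from 0 and 1 so the log stays finite):

```r
# y: actual 0/1 labels; p: predicted probability of the positive class
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
```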

After assessing the above metrics, because the goal of the model is to predict potential customers (the positive case), the metrics that we care about the most include:

  • Sensitivity (high)

  • F1 Score (high)

  • False Positive Rate (low)

To try to reach the goals above, we can change the threshold at which the model commits to a positive classification (typically 0.5). Referencing the ROC curve to gauge the performance of other thresholds, lowering the threshold to around 0.1 makes the model perform in a much more balanced way than the previous threshold of 0.5. A new confusion matrix is produced for the updated model:
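
The original code isn't shown; a sketch of that re-thresholding with caret's `confusionMatrix()`, again using the hypothetical `probs` and `test_y` from above:

```r
library(caret)

# Classify as a prospective customer whenever P(subscribe) >= 0.1
# instead of the default 0.5 cutoff.
pred_01 <- factor(ifelse(probs >= 0.1, 1, 0), levels = c(0, 1))
confusionMatrix(pred_01, factor(test_y, levels = c(0, 1)),
                positive = "1", mode = "everything")
```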

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 5383  317
##          1 2309  717
##                                           
##                Accuracy : 0.6991          
##                  95% CI : (0.6893, 0.7087)
##     No Information Rate : 0.8815          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2144          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.69342         
##             Specificity : 0.69982         
##          Pos Pred Value : 0.23695         
##          Neg Pred Value : 0.94439         
##               Precision : 0.23695         
##                  Recall : 0.69342         
##                      F1 : 0.35320         
##              Prevalence : 0.11850         
##          Detection Rate : 0.08217         
##    Detection Prevalence : 0.34678         
##       Balanced Accuracy : 0.69662         
##                                           
##        'Positive' Class : 1               
## 

Although the accuracy drops to ~70%, the sensitivity (previously ~24%) increases to ~70%, and the specificity (previously ~96%) drops to ~70%, which isn't too bad. The primary advantage of lowering the threshold is the increased true positive rate, meaning more correct classifications of potential customers.

Summary of Findings

Overall, based on the above metrics, it is evident that our model is deficient at predicting customers who are likely to sign up but excellent at classifying customers who won't. This is primarily due to an imbalanced dataset that contains many more negative cases (customers who didn't sign up) than positive cases (customers who did): only ~5,000 of ~43,000 individuals in the dataset signed up, so training was heavily imbalanced, leading to a poor rate of classifying prospective customers. If we lower the model's cutoff threshold to 0.1, we can increase the true positive rate to ~70%, at the sacrifice of a higher false positive rate.

I would not recommend that the bank use this model (unless they only want to classify customers who will not sign up), but if they obtain a more balanced dataset with more "signed up" cases, the model could be retrained and would likely improve at classifying customers who will sign up for their program. If they want to proceed with the current model, I would recommend changing the threshold to 0.1; even then, the model is still not an excellent classifier for finding potential customers, given an average true positive rate and a ~30% false positive rate.
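
If the bank does pursue retraining, one simple way to rebalance the training data is caret's `downSample()`, which drops majority-class rows until the classes are even (hypothetical `train_x`/`train_y` again, with `train_y` as a factor; up-sampling or synthetic approaches like SMOTE are alternatives):

```r
library(caret)

# Down-sample the majority class so both classes appear equally often.
# Returns a data frame with the outcome in a column named "Class".
balanced <- downSample(x = train_x, y = train_y)
table(balanced$Class)
```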

Commercial Classification Analysis

The goal of this kNN model is to correctly classify whether a TV "snippet" is a commercial or a non-commercial.

Building the Model

Calculating base rate of classifying a commercial:

This means that at random, we have a 63.92% chance of correctly classifying a commercial.

After building the kNN model and using a function to assess the optimal number of neighbors k (as in the sketch earlier), this elbow plot is created:

Although the marginal improvement in accuracy begins to flatten at k = 5, the curve peaks at k = 11, giving us the highest accuracy; therefore, the optimal choice for our model is k = 11.

The confusion matrix for k=11 is below:

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1283  497
##         1  1103 3881
##                                           
##                Accuracy : 0.7635          
##                  95% CI : (0.7531, 0.7735)
##     No Information Rate : 0.6473          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4502          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8865          
##             Specificity : 0.5377          
##          Pos Pred Value : 0.7787          
##          Neg Pred Value : 0.7208          
##              Prevalence : 0.6473          
##          Detection Rate : 0.5738          
##    Detection Prevalence : 0.7368          
##       Balanced Accuracy : 0.7121          
##                                           
##        'Positive' Class : 1               
## 

Metrics Analysis

  • Accuracy = 76.35%

  • Sensitivity (true positive rate) = 88.65%

  • False positive rate = 46.23%

  • F1 Score = 0.8291

  • Kappa = 0.4502

  • AUC = 0.778

  • Bias = -1.442

  • Log Loss = -0.0719

Looking first at the metrics the confusion matrix outputs, the accuracy is fair at 76.35%, and the sensitivity is great at 88.65%. However, the false positive rate, 46.23%, is poor. This tells us the model is excellent at classifying positive outcomes (commercials), but that is probably because it tends to classify most outcomes as commercials. So far the model looks okay, but let's dig a little deeper.

Calculating the F1 score and Kappa fills out the picture. The F1 comes out to 0.8291, which again tells us the model is strong at predicting positive outcomes (commercials). The Kappa statistic of 0.4502, while not great, is an okay measure and tells us our model separates commercials from non-commercials considerably better than random guessing.

The AUC value derived from the ROC curve is 0.778, which is a fair measure but is most likely dragged down by the high false positive rate. Based on the metrics so far, it's evident that the model is biased toward classifying snippets as commercials rather than non-commercials, likely due to an imbalanced dataset with ~14.4k of ~22.5k outcomes being commercials. Let's look at bias and log loss, then see if we can lower the false positive rate by changing the model's threshold.

The bias of the model is -1.442, which is a poor reading and tells us the model has a strong tendency to classify an outcome as a commercial over a non-commercial (consistent with the high false positive rate). Another useful metric, log loss, tells us whether the model makes incorrect classifications with high confidence. The value for this model is -0.0719, which is excellent: when the model misclassifies a case, it is rarely wrong with high confidence in its prediction.

After assessing the above metrics, because the TV provider wants to classify commercials (the positive case), these metrics are the most important:

  • Sensitivity (high)

  • False Positive Rate (low)

  • F1 Score (high)

Finally, let's adjust the probability threshold at which the model commits to a positive classification. By referencing the ROC curve, we can find a threshold that lowers the false positive rate while keeping the true positive rate high. Adjusting the threshold to 0.65 produces this new confusion matrix:
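
One way to scan candidate cutoffs is `pROC::coords()` on an ROC object built as in the earlier sketch (here, hypothetically, from the commercial model's labels and probabilities); it can report sensitivity and specificity at a given threshold, or suggest the cutoff that maximizes their sum:

```r
library(pROC)

# Sensitivity/specificity at a specific cutoff, e.g. 0.65...
coords(roc_obj, x = 0.65, input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))

# ...or let pROC pick the threshold maximizing Youden's J statistic.
coords(roc_obj, x = "best", best.method = "youden")
```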

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1712 1302
##         1   674 3076
##                                           
##                Accuracy : 0.7079          
##                  95% CI : (0.6969, 0.7187)
##     No Information Rate : 0.6473          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3964          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7026          
##             Specificity : 0.7175          
##          Pos Pred Value : 0.8203          
##          Neg Pred Value : 0.5680          
##               Precision : 0.8203          
##                  Recall : 0.7026          
##                      F1 : 0.7569          
##              Prevalence : 0.6473          
##          Detection Rate : 0.4548          
##    Detection Prevalence : 0.5544          
##       Balanced Accuracy : 0.7101          
##                                           
##        'Positive' Class : 1               
## 

The sensitivity drops to ~70%, but the false positive rate decreases to ~28%, which is much better than the previous value of ~46%. Precision stays high and recall remains reasonable (yielding a fair F1 score of 0.76), indicating the model still predicts positive cases well while keeping the false positive rate down. This is what we want in this situation, as the TV provider wants to classify commercials (the positive case).

Summary of Findings

Overall, the original model performed okay, with a high true positive rate but also a high false positive rate. By adjusting the model's threshold to 0.65, the false positive rate dropped to ~28% while the true positive rate remained high at ~70%. The original error was likely due to an imbalanced dataset with ~14.4k of ~22.5k outcomes being commercials. The TV provider could therefore either retrain the model on a more balanced set (more non-commercial cases) or proceed with the original model at a threshold of 0.65, minimizing the false positive rate while maintaining decent positive-classification metrics.