##
## subsetted2_3NN 0 1
## 0 48 8
## 1 9 50
## [1] 48 50
## [1] 0.8521739
## [1] 0.862069
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 48 8
## 1 9 50
##
## Accuracy : 0.8522
## 95% CI : (0.7739, 0.9115)
## No Information Rate : 0.5043
## P-Value [Acc > NIR] : 5.11e-15
##
## Kappa : 0.7043
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8621
## Specificity : 0.8421
## Pos Pred Value : 0.8475
## Neg Pred Value : 0.8571
## Prevalence : 0.5043
## Detection Rate : 0.4348
## Detection Prevalence : 0.5130
## Balanced Accuracy : 0.8521
##
## 'Positive' Class : 1
##
## k accuracy
## 1 1 0.8521739
## 2 3 0.8521739
## 3 5 0.8608696
## 4 7 0.8782609
## 5 9 0.8521739
## 6 11 0.8521739
## 7 13 0.8608696
## 8 15 0.8434783
## 9 17 0.8347826
## 10 19 0.8434783
## 11 21 0.8521739
From both the graph and the data frame output above, we can conclude that the optimal value is k = 7: that is the point at which accuracy peaks, at about 87.8%.
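The tuning table above can be produced with a simple loop over odd values of k. A minimal sketch, assuming normalized feature matrices `train_x`/`test_x` and label vectors `train_y`/`test_y` (hypothetical names, not necessarily those used in this report):

```r
# Fit class::knn at each odd k and record test-set accuracy.
library(class)

ks <- seq(1, 21, by = 2)
accuracy <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)                 # proportion classified correctly
})

acc_df <- data.frame(k = ks, accuracy = accuracy)
acc_df[which.max(acc_df$accuracy), ]   # row with peak accuracy (k = 7 here)
```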
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 50 7
## 1 7 51
##
## Accuracy : 0.8783
## 95% CI : (0.8042, 0.9318)
## No Information Rate : 0.5043
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7565
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8793
## Specificity : 0.8772
## Pos Pred Value : 0.8793
## Neg Pred Value : 0.8772
## Prevalence : 0.5043
## Detection Rate : 0.4435
## Detection Prevalence : 0.5043
## Balanced Accuracy : 0.8783
##
## 'Positive' Class : 1
##
As we can see above, moving to k = 7 raised the accuracy from about 85.2% to 87.8%, and the sensitivity (TPR) from about 86.2% to 87.9%.
Sensitivity appears to peak around k = 12 rather than at k = 7. This matches the pattern already observed: both accuracy and sensitivity improve as k grows toward a moderate value, then level off or decline. Thus, to produce the best model for these two metrics, a moderately large value of the hyperparameter k should be selected rather than the smallest one.
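Sensitivity across the same grid can be traced the same way; a hedged sketch reusing the hypothetical objects from the previous snippet:

```r
# Sensitivity (TPR) for positive class "1" at each k: TP / (TP + FN).
sens <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  sum(pred == "1" & test_y == "1") / sum(test_y == "1")
})
plot(ks, sens, type = "b", xlab = "k", ylab = "Sensitivity")
```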
## k-Nearest Neighbors
##
## 462 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 462, 462, 462, 462, 462, 462, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8587674 0.7169795
## 7 0.8683216 0.7361332
## 9 0.8721566 0.7437954
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
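The resampling summary above is the print method of a caret `train` object. A sketch of the call that would generate it, assuming a training data frame `train_df` with a factor outcome column `class` (hypothetical names):

```r
# Tune k over {5, 7, 9} with 25 bootstrap resamples; caret keeps the k with
# the highest resampled accuracy (k = 9 above).
library(caret)

knn_fit <- train(
  class ~ ., data = train_df,
  method    = "knn",
  tuneGrid  = data.frame(k = c(5, 7, 9)),
  trControl = trainControl(method = "boot", number = 25)
)
knn_fit
```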
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 52 4
## 1 5 53
##
## Accuracy : 0.9211
## 95% CI : (0.8554, 0.9633)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8421
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9298
## Specificity : 0.9123
## Pos Pred Value : 0.9138
## Neg Pred Value : 0.9286
## Prevalence : 0.5000
## Detection Rate : 0.4649
## Detection Prevalence : 0.5088
## Balanced Accuracy : 0.9211
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 52 4
## 1 5 53
##
## Accuracy : 0.9211
## 95% CI : (0.8554, 0.9633)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8421
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9298
## Specificity : 0.9123
## Pos Pred Value : 0.9138
## Neg Pred Value : 0.9286
## Precision : 0.9138
## Recall : 0.9298
## F1 : 0.9217
## Prevalence : 0.5000
## Detection Rate : 0.4649
## Detection Prevalence : 0.5088
## Balanced Accuracy : 0.9211
##
## 'Positive' Class : 1
##
At a 30% threshold (50% is the default), accuracy drops to 89.47% while sensitivity increases to 94.74%, so lowering the threshold does raise the metric of interest. However, in the context of the data, while the TPR should be high for detecting spam accounts (class 1), it should not be so high that every potentially suspicious account is flagged as spam; that would significantly reduce the number of authentic, non-spam messages reaching users. Thus it is also helpful to choose a threshold that preserves specificity, the rate at which non-spam accounts are correctly identified as non-spam. Specificity is about 91% at the 50% threshold but only 84.21% at the 30% threshold, so the threshold should be moved only as far as the sensitivity gain justifies the loss in specificity. At a threshold of exactly 55%, sensitivity is 92.98% and specificity is 91.23%, so all three metrics reach reasonably high values; the F1 score is also relatively high at 92.17%, further supporting the 55% threshold. Overall, raising the threshold from 50% to 55% means fewer observations are classified as spam (1), which may allow more authentic messages through while still effectively blocking actual spam content.
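Moving the threshold only requires predicting class probabilities and cutting them at the new value instead of 0.50. A minimal sketch, assuming the `knn_fit` model above and a hypothetical test frame `test_df`:

```r
# Classify as spam ("1") only when P(spam) exceeds the chosen cutoff.
probs <- predict(knn_fit, newdata = test_df, type = "prob")

pred_55 <- factor(ifelse(probs[["1"]] > 0.55, "1", "0"), levels = c("0", "1"))

# mode = "everything" adds Precision/Recall/F1, as in the output above.
confusionMatrix(pred_55, test_df$class, positive = "1", mode = "everything")
```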
## [[1]]
## [1] 0.9447522
The area beneath the ROC curve is about 94%. This is a high AUC value, indicating that the classifier separates the two classes well across thresholds: the true positive rate on the y-axis stays high relative to the false positive rate, so the sensitivity is maximized as desired.
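The `[[1]]` wrapper in the output above suggests ROCR, whose `performance()` returns values in a list slot. A sketch of the AUC computation under that assumption:

```r
library(ROCR)

roc_pred <- prediction(probs[["1"]], test_df$class)  # scores vs. true labels
plot(performance(roc_pred, "tpr", "fpr"))            # ROC curve
performance(roc_pred, measure = "auc")@y.values      # AUC, printed as [[1]]
```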
## [1] 0.6017782
Interpreting a high AUC alongside a poorer log loss and F1: the log loss is about 0.60, which corresponds to an average (geometric-mean) predicted probability of roughly 55% on the true class, since e^{-0.60} ≈ 0.55. Log loss should be minimized; a value this size indicates that even when other metrics are strong, the classifier's probabilities are not well calibrated, and it is penalized heavily whenever it is confidently wrong.
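A sketch of the log-loss computation with MLmetrics, including the exp(-LogLoss) back-transformation that recovers the geometric-mean probability placed on the true class:

```r
library(MLmetrics)

# y_true must be numeric 0/1 for MLmetrics::LogLoss.
ll <- LogLoss(y_pred = probs[["1"]],
              y_true = as.numeric(as.character(test_df$class)))
ll          # ~0.60
exp(-ll)    # ~0.55: average probability placed on the correct class
```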
In the context of health data, sensitivity (TPR) is the most important metric: it captures the rate of correct classifications of the presence of a disease, out of all instances where the disease is truly present. Sensitivity was therefore analyzed alongside several other metrics.

# Data Cleaning
##
## subsetted2_3NN 0 1
## 0 60 18
## 1 8 28
## [1] 60 28
## [1] 0.7719298
## [1] 0.6086957
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 60 18
## 1 8 28
##
## Accuracy : 0.7719
## 95% CI : (0.684, 0.8453)
## No Information Rate : 0.5965
## P-Value [Acc > NIR] : 5.911e-05
##
## Kappa : 0.5089
##
## Mcnemar's Test P-Value : 0.07756
##
## Sensitivity : 0.6087
## Specificity : 0.8824
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.7692
## Prevalence : 0.4035
## Detection Rate : 0.2456
## Detection Prevalence : 0.3158
## Balanced Accuracy : 0.7455
##
## 'Positive' Class : 1
##
## k accuracy
## 1 1 0.8157895
## 2 3 0.7719298
## 3 5 0.7456140
## 4 7 0.7192982
## 5 9 0.6754386
## 6 11 0.6842105
## 7 13 0.6929825
## 8 15 0.6929825
## 9 17 0.6842105
## 10 19 0.6754386
## 11 21 0.6666667
As demonstrated by both the accuracy table and the graph, k = 1 is the optimal hyperparameter for this dataset: accuracy peaks there at about 81.6% and declines fairly steadily as k grows.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 63 16
## 1 5 30
##
## Accuracy : 0.8158
## 95% CI : (0.7323, 0.8822)
## No Information Rate : 0.5965
## P-Value [Acc > NIR] : 4.534e-07
##
## Kappa : 0.6019
##
## Mcnemar's Test P-Value : 0.0291
##
## Sensitivity : 0.6522
## Specificity : 0.9265
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.7975
## Prevalence : 0.4035
## Detection Rate : 0.2632
## Detection Prevalence : 0.3070
## Balanced Accuracy : 0.7893
##
## 'Positive' Class : 1
##
As demonstrated, both accuracy and sensitivity are maximized at k = 1 for this data: accuracy increased from about 77.2% to 81.6%, and sensitivity from about 60.9% to 65.2%. Smaller values of k therefore make this classifier most effective on the chosen metrics.
The graph reinforces this: at k = 1, sensitivity peaks at around 0.65. Thus k = 1 is the chosen hyperparameter for maximizing both accuracy and sensitivity.
## k-Nearest Neighbors
##
## 456 samples
## 31 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 456, 456, 456, 456, 456, 456, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7151613 0.3619684
## 7 0.7032444 0.3318062
## 9 0.6923029 0.3046253
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 62 19
## 1 9 23
##
## Accuracy : 0.7522
## 95% CI : (0.6622, 0.8286)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 0.00353
##
## Kappa : 0.4424
##
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.5476
## Specificity : 0.8732
## Pos Pred Value : 0.7188
## Neg Pred Value : 0.7654
## Prevalence : 0.3717
## Detection Rate : 0.2035
## Detection Prevalence : 0.2832
## Balanced Accuracy : 0.7104
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 26 2
## 1 45 40
##
## Accuracy : 0.5841
## 95% CI : (0.4876, 0.676)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 0.8576
##
## Kappa : 0.2635
##
## Mcnemar's Test P-Value : 8.993e-10
##
## Sensitivity : 0.9524
## Specificity : 0.3662
## Pos Pred Value : 0.4706
## Neg Pred Value : 0.9286
## Precision : 0.4706
## Recall : 0.9524
## F1 : 0.6299
## Prevalence : 0.3717
## Detection Rate : 0.3540
## Detection Prevalence : 0.7522
## Balanced Accuracy : 0.6593
##
## 'Positive' Class : 1
##
With the breast cancer data, where diagnosis = 1 indicates a malignant classification, it is best to maximize the sensitivity (TPR) of the model: the rate at which the classifier correctly identifies malignant cases. In much of healthcare, sensitivity is the metric to maximize. At the default threshold of 50%, the sensitivity is merely 54.76%; upon lowering the threshold to 20%, it increases to 95.24%. When more observations are eligible to be classified as malignant, the true positive rate rises, though at a cost to specificity (which falls from 87.32% to 36.62%). The recommendation is therefore to lower the threshold for such a classifier in order to push sensitivity toward its maximum: in health data, the heavy consequences associated with false negatives justify this trade-off.
## [[1]]
## [1] 0.8262911
The area beneath the ROC curve is about 82.6%. This is a reasonably high AUC value, indicating that the classifier separates benign from malignant cases fairly well across thresholds, so a high true positive rate (sensitivity) can be achieved without an excessive false positive rate.
## [1] 6.588941
A log loss of about 6.59 is extremely high, indicating that when the classifier is wrong it is often confidently wrong, even though sensitivity may be maximized. With small k, k-NN emits hard probabilities of exactly 0 or 1, so each confidently wrong prediction incurs a very large penalty; with a test set of only about 113 observations, a handful of such misses is enough to inflate the mean dramatically (for reference, e^{-6.59} ≈ 0.0014 average probability on the correct class). The small dataset of roughly 500 rows compounds this, since a few observations can skew log loss despite otherwise strong metrics.
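The explosion is easy to reproduce: with k = 5 the predicted probabilities are multiples of 1/5, so a malignant case whose five nearest neighbors are all benign receives p = 0 and, after the usual clipping, contributes -log(1e-15) ≈ 34.5 to the sum. A hand-rolled version for illustration (MLmetrics applies a similar clip internally):

```r
# Binary log loss with probability clipping to keep log() finite.
log_loss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
log_loss(c(0, 0.2, 1), c(1, 1, 1))  # one hard miss dominates the average
```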