This dataset involves people who are experiencing symptoms of COVID-19 such as fever, tiredness, dry cough, and sore throat, as well as physical characteristics such as sex and age. We want to apply KNN to this dataset to determine whether we can create a model that predicts if a patient will test positive for COVID-19.
Data Source: https://github.com/nshomron/covidpred/tree/master/data
The dataset was mostly clean, but was recoded where appropriate. Furthermore, the full dataset was too large, so a subset was created to accommodate the RAM that RStudio allocates to projects.
## negative and positive covid result counts: 48979 1021
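As a rough sketch of how the subset and the counts above could be produced (the file name and sample size are assumptions based on the linked repository):

library(tidyverse)
# Read the raw testing data from the repository linked above (file name is an assumption)
corona_raw <- read_csv("corona_tested_individuals_ver_006.english.csv")
# Randomly subsample so the data fits in the RAM RStudio allocates to projects
set.seed(1)
corona <- slice_sample(corona_raw, n = 50000)
# Tabulate negative vs. positive test results; this is where counts like 48979 and 1021 come from
table(corona$corona_result)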
The variable test_indicationContact is identified as the most important for improving our predictions.
## CART
##
## 40001 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 40001, 40001, 40001, 40001, 40001, 40001, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.9871196 0.6080399
## 0.00244798 0.9873291 0.6195929
## 0.38678091 0.9837316 0.3473416
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.00244798.
## Accuracy: 0.9873291
## test_indicationContact with confirmed
## 653.9973
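caret labels the model above CART; a minimal sketch of the calls that could produce this training summary and the variable importance output (object names and the model formula are assumptions):

library(caret)
set.seed(1)
# caret's default resampling is the bootstrap with 25 reps, matching the output above;
# tuneLength = 3 tries three cp values for the tree
tree_fit <- train(corona_result ~ ., data = train, method = "rpart", tuneLength = 3)
tree_fit
# Rank the predictors; test_indicationContact dominates with importance ~654
varImp(tree_fit)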
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9768 93
## 1 27 111
##
## Accuracy : 0.988
## 95% CI : (0.9857, 0.99)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 8.997e-11
##
## Kappa : 0.6432
##
## Mcnemar's Test P-Value : 2.963e-09
##
## Sensitivity : 0.5441
## Specificity : 0.9972
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.9906
## Prevalence : 0.0204
## Detection Rate : 0.0111
## Detection Prevalence : 0.0138
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
## [1] 0.0120012
From the above we can see that our true positive rate (sensitivity) is 54%, while the false positive rate (1 - specificity) is about 0.3%, which we want to be low. The accuracy is 98.8% and the error rate is 1.2%. Aside from sensitivity, these are all very good numbers, indicating an accurate model with low error and few false positives.
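These rates can be recomputed directly from the confusion matrix counts; a quick check:

# Counts from the confusion matrix above (rows = prediction, columns = actual)
TN <- 9768; FN <- 93; FP <- 27; TP <- 111
TP / (TP + FN)                       # sensitivity / TPR: 111/204 = 0.5441
FP / (FP + TN)                       # false positive rate: 27/9795 = 0.00276
(TP + TN) / (TP + TN + FP + FN)      # accuracy: 9879/9999 = 0.988
1 - (TP + TN) / (TP + TN + FP + FN)  # error rate: 0.0120012, the value printed above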
We plotted the true positive rate against the false positive rate (an ROC-style plot) and color coded it.
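A minimal sketch of how such a plot could be drawn with the pROC package (the package choice and object names are assumptions):

library(pROC)
# Build an ROC object from the actual labels and the predicted probability of class "1"
roc_obj <- roc(test$corona_result, corona_eval_prob$`1`)
plot(roc_obj)  # sensitivity (TPR) against 1 - specificity (FPR)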
## False Positive rate: .00276 or 27/9795
## Positive Predictive Value: .804 or 111/138
## LogLoss: 4.592146
## F1 Score: 0.993895
The LogLoss score is found to be 4.592146, while the F1 score is found to be 0.993895.
The LogLoss should ideally be 0, so this is not a good score; it indicates the model assigns confident probabilities to the wrong class for some observations, which is also reflected in the low sensitivity/true positive rate found above.
The F1 score is derived from the confusion matrix: a precision score and a sensitivity (recall) score are computed and then combined as their harmonic mean. The ideal value is 1, indicating low false positives and low false negatives, so 0.993895 looks very good; note, however, that this value reflects the majority negative class, while caret's F1 for the positive class is only 0.6491, consistent with the low sensitivity found above.
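Both scores follow directly from their definitions; a sketch of the arithmetic, using counts and rates taken from the outputs above:

# LogLoss: mean of -[y*log(p) + (1-y)*log(1-p)] over the test observations
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip probabilities to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# F1: harmonic mean of precision and recall
f1 <- function(precision, recall) 2 * precision * recall / (precision + recall)
f1(0.8043, 0.5441)        # positive class "1": 0.6491, as caret reports
f1(9768/9861, 9768/9795)  # negative class "0": ~0.9939, the score quoted above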
When you look at the confusion matrix shown above, it becomes obvious that a significant portion of the people with positive covid test results are being classified as negative: almost half of the actual positives in the test set (93 of 204). Why is this? We suspect it likely has to do with the variable previously singled out as most important, test_indicationContact.
test_indicationContact with confirmed 653.9973
As you can see, if a person indicated that their reason for testing was confirmed contact with someone infected with covid, then they are astronomically more likely to have contracted the disease themselves, and no other variable carried comparable weight. Based on this, we are willing to bet that the people who landed in the false negative category were misclassified because of their lack of contact with a covid-positive person.
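Before walking through the thresholds below, here is a plausible reconstruction of the adjust_thres helper (the body is an assumption, but it is consistent with the confusionMatrix() call visible in the warning messages):

library(caret)
# x: predicted probability of class "1"; y: probability threshold; z: actual labels
adjust_thres <- function(x, y, z) {
  thres <- as.factor(ifelse(x > y, 1, 0))
  confusionMatrix(thres, z, positive = "1", dnn = c("Prediction", "Actual"))
}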
Threshold of 0
adjust_thres(corona_eval_prob$`1`,0, test$corona_result)
## Warning in confusionMatrix.default(thres, z, positive = "1", dnn =
## c("Prediction", : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 0 0
## 1 9795 204
##
## Accuracy : 0.0204
## 95% CI : (0.0177, 0.0234)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.00000
## Specificity : 0.00000
## Pos Pred Value : 0.02040
## Neg Pred Value : NaN
## Precision : 0.02040
## Recall : 1.00000
## F1 : 0.03999
## Prevalence : 0.02040
## Detection Rate : 0.02040
## Detection Prevalence : 1.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
Threshold of 0.01
adjust_thres(corona_eval_prob$`1`,.01, test$corona_result)
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9768 93
## 1 27 111
##
## Accuracy : 0.988
## 95% CI : (0.9857, 0.99)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 8.997e-11
##
## Kappa : 0.6432
##
## Mcnemar's Test P-Value : 2.963e-09
##
## Sensitivity : 0.5441
## Specificity : 0.9972
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.9906
## Precision : 0.8043
## Recall : 0.5441
## F1 : 0.6491
## Prevalence : 0.0204
## Detection Rate : 0.0111
## Detection Prevalence : 0.0138
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
The thresholds from 0.02 through 0.75 below all produce the same results as the 0.01 threshold above.
Threshold of 0.02
adjust_thres(corona_eval_prob$`1`,.02, test$corona_result)
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9768 93
## 1 27 111
##
## Accuracy : 0.988
## 95% CI : (0.9857, 0.99)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 8.997e-11
##
## Kappa : 0.6432
##
## Mcnemar's Test P-Value : 2.963e-09
##
## Sensitivity : 0.5441
## Specificity : 0.9972
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.9906
## Precision : 0.8043
## Recall : 0.5441
## F1 : 0.6491
## Prevalence : 0.0204
## Detection Rate : 0.0111
## Detection Prevalence : 0.0138
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
Threshold of 0.1
adjust_thres(corona_eval_prob$`1`,.10, test$corona_result)
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9768 93
## 1 27 111
##
## Accuracy : 0.988
## 95% CI : (0.9857, 0.99)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 8.997e-11
##
## Kappa : 0.6432
##
## Mcnemar's Test P-Value : 2.963e-09
##
## Sensitivity : 0.5441
## Specificity : 0.9972
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.9906
## Precision : 0.8043
## Recall : 0.5441
## F1 : 0.6491
## Prevalence : 0.0204
## Detection Rate : 0.0111
## Detection Prevalence : 0.0138
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
Threshold of 0.5
adjust_thres(corona_eval_prob$`1`,.50, test$corona_result)
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9768 93
## 1 27 111
##
## Accuracy : 0.988
## 95% CI : (0.9857, 0.99)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 8.997e-11
##
## Kappa : 0.6432
##
## Mcnemar's Test P-Value : 2.963e-09
##
## Sensitivity : 0.5441
## Specificity : 0.9972
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.9906
## Precision : 0.8043
## Recall : 0.5441
## F1 : 0.6491
## Prevalence : 0.0204
## Detection Rate : 0.0111
## Detection Prevalence : 0.0138
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
Threshold of 0.75
adjust_thres(corona_eval_prob$`1`,.75, test$corona_result)
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9768 93
## 1 27 111
##
## Accuracy : 0.988
## 95% CI : (0.9857, 0.99)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 8.997e-11
##
## Kappa : 0.6432
##
## Mcnemar's Test P-Value : 2.963e-09
##
## Sensitivity : 0.5441
## Specificity : 0.9972
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.9906
## Precision : 0.8043
## Recall : 0.5441
## F1 : 0.6491
## Prevalence : 0.0204
## Detection Rate : 0.0111
## Detection Prevalence : 0.0138
## Balanced Accuracy : 0.7707
##
## 'Positive' Class : 1
##
These two (thresholds of 0.8 and 1) are also the same, predicting every observation negative.
Threshold of 0.8
adjust_thres(corona_eval_prob$`1`,.8, test$corona_result)
## Warning in confusionMatrix.default(thres, z, positive = "1", dnn =
## c("Prediction", : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9795 204
## 1 0 0
##
## Accuracy : 0.9796
## 95% CI : (0.9766, 0.9823)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 0.5186
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.9796
## Precision : NA
## Recall : 0.0000
## F1 : NA
## Prevalence : 0.0204
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
Threshold of 1
adjust_thres(corona_eval_prob$`1`,1, test$corona_result)
## Warning in confusionMatrix.default(thres, z, positive = "1", dnn =
## c("Prediction", : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 9795 204
## 1 0 0
##
## Accuracy : 0.9796
## 95% CI : (0.9766, 0.9823)
## No Information Rate : 0.9796
## P-Value [Acc > NIR] : 0.5186
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.9796
## Precision : NA
## Recall : 0.0000
## F1 : NA
## Prevalence : 0.0204
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
Across every threshold there are only three distinct confusion matrices: one where everything is predicted negative, one where everything is predicted positive (neither of which is useful as a classifier), and the one in between (whose metrics we went over above). Because I was confused by this, I looked directly at corona_eval_prob to see what the problem was. When I did, I realized that this "optimal" KNN algorithm, created from all of Corona_Data's training data, was essentially using only the "test_indication" variable to determine whether someone would have covid. It did work, in that it was more accurate than just guessing; however, while the FPR was amazing, the TPR was just okay. This is similar to the example we did in class where the funfetti variable had too much emphasis.
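One quick way to confirm this (a sketch, assuming corona_eval_prob holds the model's class probabilities) is to tabulate the distinct probabilities the model emits; a model leaning on a single variable produces only a handful of values, so every threshold that falls between the same two values yields an identical confusion matrix:

# Distinct predicted probabilities for class "1" and how often each occurs
table(round(corona_eval_prob$`1`, 4))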
If I were to change something in the Covid Indicator KNN model, I would probably drop the "test_indication" variable, since that variable alone was driving the predictions. An alternative is to limit the number of negative covid results in the data so the positive-to-negative ratio is less extreme: in the full data it was 58729/60000 negative to 1271/60000 positive (roughly 98:2). We could delete negative covid_results to bring the ratio closer to 75:25, or even 50:50, and then create a new model from that data. On top of this, I had to subset the data at the beginning because the RAM I allocated to RStudio could not handle the whole dataset, even at the maximum allowed allotment. I would call this model a failure: when it was wrong, it was wrong by a lot, and it effectively used only one variable to compute its predictions. Lastly, if we learned anything from this specific model, it is that if you know for sure you came in contact with a covid-positive person, you should definitely take a covid test.
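A sketch of the proposed rebalancing using caret's downSample() (object and column names are assumptions; the outcome must be a factor):

library(caret)
set.seed(1)
# Keep every positive case and randomly drop negatives until the classes are 50:50
balanced <- downSample(x = corona_data[, names(corona_data) != "corona_result"],
                       y = corona_data$corona_result,
                       yname = "corona_result")
table(balanced$corona_result)  # equal counts of 0 and 1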