Part 1

First, we read in the data and clean it. This particular dataset has no NAs or odd characters, so there isn't much cleaning to do.
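
A minimal sketch of this step, assuming the Kaggle CSV has been downloaded locally (the file name below is a placeholder):

# Read in the downloaded Kaggle file; the file name is a placeholder
heart <- read.csv("heart_failure_clinical_records_dataset.csv")

# Quick checks: count of missing values and the structure of the columns
sum(is.na(heart))
str(heart)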

Implementation

This data set involves several variables that are considered to affect the heart and can sometimes even lead to heart failure. We want to apply KNN to this dataset to determine whether we can create a model that predicts if a patient will have heart failure or not.

Data Source: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
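
A minimal caret sketch of the set-up described above, not the exact code used in this write-up: the seed, split proportion, and tuning choices are assumptions, heart_knn is a hypothetical name (the fitted object referred to later is called heart_tree), and DEATH_EVENT is the outcome column in the Kaggle file.

library(caret)

set.seed(42)                                       # assumed seed, for a reproducible split
heart$DEATH_EVENT <- as.factor(heart$DEATH_EVENT)  # outcome: 1 = death during follow-up

# Hold out a test set, then fit a KNN classifier on scaled predictors
idx       <- createDataPartition(heart$DEATH_EVENT, p = 0.8, list = FALSE)
train_set <- heart[idx, ]
test_set  <- heart[-idx, ]

heart_knn <- train(DEATH_EVENT ~ ., data = train_set,
                   method     = "knn",
                   preProcess = c("center", "scale"),
                   trControl  = trainControl(method = "cv", number = 10))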

Accuracy

Based on this evaluation, time is the most important variable to focus on when trying to reduce model error.

# And now, just for fun: a barplot of how much importance each variable carries in the model
library(RColorBrewer)
coul <- brewer.pal(5, "Set2")  # a five-colour palette from the Set2 scheme
barplot(heart_tree$finalModel$variable.importance, col = coul)

Confusion Matrix
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 37  7
##          1  3 12
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.678           
##     P-Value [Acc > NIR] : 0.006651        
##                                           
##                   Kappa : 0.5891          
##                                           
##  Mcnemar's Test P-Value : 0.342782        
##                                           
##             Sensitivity : 0.6316          
##             Specificity : 0.9250          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.8409          
##              Prevalence : 0.3220          
##          Detection Rate : 0.2034          
##    Detection Prevalence : 0.2542          
##       Balanced Accuracy : 0.7783          
##                                           
##        'Positive' Class : 1               
## 
## [1] 0.1694915

From the above we can see that our true positive rate (sensitivity) is 63.2%, and our false positive rate (1 - specificity) is 7.5%, which we want to be low. The accuracy is 83.05% and the error rate is 16.95%.
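
As a quick sanity check, these rates can be recomputed directly from the counts in the confusion matrix above:

# Counts from the confusion matrix, with class 1 (death) as positive
TP <- 12; FN <- 7; FP <- 3; TN <- 37

TP / (TP + FN)                       # sensitivity / TPR   = 0.6316
FP / (FP + TN)                       # false positive rate = 0.075
(TP + TN) / (TP + TN + FP + FN)      # accuracy            = 0.8305
1 - (TP + TN) / (TP + TN + FP + FN)  # error rate          = 0.1695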

ROCR

We used ROCR to plot the true positive rate against the false positive rate (the ROC curve) and color-coded the curve by cutoff.

## [[1]]
## [1] 0.7782895
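
A minimal ROCR sketch of this plot; probs (the predicted probabilities of class 1) is a hypothetical name, while z reuses the label vector that appears in the warnings further down.

library(ROCR)

# probs = predicted probability of death (class 1); z = actual labels
pred_obj <- prediction(probs, z)
perf     <- performance(pred_obj, "tpr", "fpr")
plot(perf, colorize = TRUE)              # ROC curve, colour-coded by cutoff

performance(pred_obj, "auc")@y.values    # area under the curve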
LogLoss and F1 score
## [1] 1.36322
## [1] 0.8809524

The LogLoss score is found to be 1.36322 while the F1 score is found to be 0.8809524.

Ideally the LogLoss would be close to 0, so this could be further improved; the relatively high value indicates some uncertainty in the model's predicted probabilities.

The F1 score is derived from the confusion matrix: the matrix yields precision and sensitivity (recall) scores, and the F1 score is their harmonic mean. This number is fairly good, since the ideal value is 1, indicating low false positives and low false negatives.
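
For reference, both metrics can be written out directly. The 0.8809524 reported above matches the F1 obtained when class 0 is treated as the positive class; with class 1 as positive (as in the confusion matrix output), F1 comes out to 0.7059. The LogLoss line assumes hypothetical vectors probs (predicted probabilities) and y (actual 0/1 labels).

# F1 as the harmonic mean of precision and recall, from the matrix counts
f1 <- function(precision, recall) 2 * precision * recall / (precision + recall)
f1(12 / 15, 12 / 19)   # class 1 as positive: 0.7059
f1(37 / 44, 37 / 40)   # class 0 as positive: 0.8809524

# Binary LogLoss, assuming probs = predicted P(death) and y = actual 0/1 labels
logloss <- function(y, probs) -mean(y * log(probs) + (1 - y) * log(1 - probs))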

Part 2

When you look at the confusion matrix shown above, the pattern seems similar to our COVID indicator KNN model in that a large portion of the true positive cases are predicted incorrectly by our model; in this case, 7 of the 19 actual deaths are predicted as survivals, which is over a third. However, unlike our other model, there are numerous important variables in this KNN model, though time is by far the most important according to the variable importance list.

Part 3

Threshold Adjustment
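
Each set below is produced by applying a different cutoff to the predicted probability of death and rebuilding the confusion matrix. A minimal sketch, reusing the thres and z names that appear in the warnings below; the probability vector probs is a hypothetical name.

library(caret)

cutoff <- 0.2                                    # repeated for 0.5, 0.8, 0.9, ...
thres  <- factor(ifelse(probs >= cutoff, 1, 0))  # explicit levels = c(0, 1) would avoid the warnings below
confusionMatrix(thres, z, positive = "1", dnn = c("Prediction", "Actual"))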

First Set

All predicted deaths (not accurate)

## Warning in confusionMatrix.default(thres, z, positive = "1", dnn =
## c("Prediction", : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0  0  0
##          1 40 19
##                                           
##                Accuracy : 0.322           
##                  95% CI : (0.2062, 0.4564)
##     No Information Rate : 0.678           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 6.984e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.3220          
##          Neg Pred Value :    NaN          
##               Precision : 0.3220          
##                  Recall : 1.0000          
##                      F1 : 0.4872          
##              Prevalence : 0.3220          
##          Detection Rate : 0.3220          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 
Second Set

The same confusion matrix we evaluated previously, which appears at every threshold in this set

Threshold of 0.2

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 37  7
##          1  3 12
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.678           
##     P-Value [Acc > NIR] : 0.006651        
##                                           
##                   Kappa : 0.5891          
##                                           
##  Mcnemar's Test P-Value : 0.342782        
##                                           
##             Sensitivity : 0.6316          
##             Specificity : 0.9250          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.8409          
##               Precision : 0.8000          
##                  Recall : 0.6316          
##                      F1 : 0.7059          
##              Prevalence : 0.3220          
##          Detection Rate : 0.2034          
##    Detection Prevalence : 0.2542          
##       Balanced Accuracy : 0.7783          
##                                           
##        'Positive' Class : 1               
## 

Threshold of 0.5

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 37  7
##          1  3 12
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.678           
##     P-Value [Acc > NIR] : 0.006651        
##                                           
##                   Kappa : 0.5891          
##                                           
##  Mcnemar's Test P-Value : 0.342782        
##                                           
##             Sensitivity : 0.6316          
##             Specificity : 0.9250          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.8409          
##               Precision : 0.8000          
##                  Recall : 0.6316          
##                      F1 : 0.7059          
##              Prevalence : 0.3220          
##          Detection Rate : 0.2034          
##    Detection Prevalence : 0.2542          
##       Balanced Accuracy : 0.7783          
##                                           
##        'Positive' Class : 1               
## 

Threshold of 0.8

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 37  7
##          1  3 12
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.678           
##     P-Value [Acc > NIR] : 0.006651        
##                                           
##                   Kappa : 0.5891          
##                                           
##  Mcnemar's Test P-Value : 0.342782        
##                                           
##             Sensitivity : 0.6316          
##             Specificity : 0.9250          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.8409          
##               Precision : 0.8000          
##                  Recall : 0.6316          
##                      F1 : 0.7059          
##              Prevalence : 0.3220          
##          Detection Rate : 0.2034          
##    Detection Prevalence : 0.2542          
##       Balanced Accuracy : 0.7783          
##                                           
##        'Positive' Class : 1               
## 
Third Set

Zero predicted deaths (not accurate) using a threshold of 0.9

## Warning in confusionMatrix.default(thres, z, positive = "1", dnn =
## c("Prediction", : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 40 19
##          1  0  0
##                                           
##                Accuracy : 0.678           
##                  95% CI : (0.5436, 0.7938)
##     No Information Rate : 0.678           
##     P-Value [Acc > NIR] : 0.5618          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 3.636e-05       
##                                           
##             Sensitivity : 0.000           
##             Specificity : 1.000           
##          Pos Pred Value :   NaN           
##          Neg Pred Value : 0.678           
##               Precision :    NA           
##                  Recall : 0.000           
##                      F1 :    NA           
##              Prevalence : 0.322           
##          Detection Rate : 0.000           
##    Detection Prevalence : 0.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : 1               
## 
Deductions

Unfortunately, we have a case much like the COVID indicator with this KNN model: it relies far too heavily on one variable to make its predictions. Yet again there are only three possible confusion matrices for any given threshold: all FALSE (not accurate), all TRUE (not accurate), and the one sitting in the middle of those two (whose metrics we went over above). Once again the FPR was quite good, but this time the TPR was also decent.

Part 4

This data set needs more data collection to create a larger sample size; there are fewer than 300 entries in the entire set. On top of this, when the model was wrong, it was wrong by a large margin, and changing the threshold did not help, since the threshold used initially was already the most accurate. The major significant finding was that a shorter time (the follow-up period for the patient) is a significant indicator of death from heart failure.

Both of these models had the same flaw, and because of that we believe it is essential to check the variable importance of every KNN model that is created, just to make sure it isn't placing too much emphasis on a single variable when trying to build the optimal model.
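
With caret this check is a one-liner; a sketch, assuming the fitted train object is heart_tree as above:

library(caret)

# Model-specific variable importance for the fitted caret object
varImp(heart_tree)
plot(varImp(heart_tree))   # quick visual check for a single dominant variable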