Heart Attack

Heart attacks are life-threatening events that afflict many people each year. From data gathered on Kaggle, it may be possible to determine which attributes increase the likelihood of a heart attack. The data include each patient’s age, resting blood pressure (in mm Hg), cholesterol (in mg/dl), and maximum heart rate achieved. Through k-nearest neighbors classification, we hope to build a model that accurately predicts whether an individual is at increased risk of a heart attack.

The data used can be found here: Heart Attack Analysis & Prediction

Part 1: Model Building

The initial base rates for those at increased risk (1) and those not at increased risk (0) are as follows.

## [1] "Positive Base Rate:  0.544554455445545"

## [1] "Negative Base Rate:  0.455445544554455"

For the model to be effective, it must identify patients at increased risk of a heart attack more reliably than a naive baseline. In this case, if every patient were labelled as being at increased risk, that labelling would be correct nearly 55 percent of the time. The model needs to beat this rate when it assigns a positive diagnosis.
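
The base rates above can be reproduced with a few lines of Python. This is only a sketch: the class counts (165 positives and 138 negatives out of 303 patients) are an assumption inferred from the reported rates and split sizes, not values read from the data, and `base_rates` is a hypothetical helper name.

```python
from collections import Counter

def base_rates(labels):
    """Return the share of observations that fall in each class."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

# Assumed class counts: 165 positives and 138 negatives out of 303 patients,
# inferred from the reported base rates rather than read from the data.
labels = [1] * 165 + [0] * 138
rates = base_rates(labels)

# Labelling every patient with the majority class is right this often.
majority_baseline = max(rates.values())
print(rates[1], rates[0], majority_baseline)
```

Any model worth keeping should beat `majority_baseline` on held-out data.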

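Check to ensure there is an appropriate number of data points in the training data and the testing data (training share, then training count, then testing count).

## [1] 0.8019802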

## [1] 243

## [1] 60
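
The split sizes above are consistent with a stratified 80/20 split that rounds each class's training share up. The sketch below illustrates that idea in Python; it is not the code used for this report, the class counts are inferred from the base rates, and `stratified_split` and its parameters are hypothetical names.

```python
import math
import random
from collections import defaultdict

def stratified_split(labels, train_share=0.8, seed=0):
    """Return (train_idx, test_idx), sampling train_share of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round the training count up within each class; the inner round()
        # guards against floating-point noise such as 132.00000000000001.
        n_train = math.ceil(round(train_share * len(indices), 9))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)

# Assumed class counts inferred from the reported base rates.
labels = [1] * 165 + [0] * 138
train_idx, test_idx = stratified_split(labels)
print(len(train_idx) / len(labels), len(train_idx), len(test_idx))
```

Splitting within each class keeps the share of positive cases roughly equal in the training and testing data.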

Evaluation metrics

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 17  5
##          1 10 28
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6214, 0.8528)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.001116        
##                                           
##                   Kappa : 0.4863          
##                                           
##  Mcnemar's Test P-Value : 0.301700        
##                                           
##             Sensitivity : 0.8485          
##             Specificity : 0.6296          
##          Pos Pred Value : 0.7368          
##          Neg Pred Value : 0.7727          
##              Prevalence : 0.5500          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.6333          
##       Balanced Accuracy : 0.7391          
##                                           
##        'Positive' Class : 1               
##

The confusion matrix above shows that the model is more accurate than random assignment, with an accuracy of 75 percent. Its sensitivity is also relatively high: it identifies 84.85 percent of true positives. The model is weaker on the negative class, however. Its false positive rate (1 - specificity) is about 37 percent. In this case, the over-assignment of a positive diagnosis may be warranted. If there are legitimate concerns over a patient’s health, warning them preemptively of a heart attack may save their life. This is certainly preferable to the opposite outcome, in which the model produces a large number of false negatives. The Kappa score of 0.4863 indicates that the model agrees with the true labels moderately better than chance alone would.
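
The headline statistics can be recomputed directly from the four cells of the confusion matrix. The following Python sketch does so; `binary_metrics` is a hypothetical helper, and the Kappa calculation uses the standard Cohen's kappa formula.

```python
def binary_metrics(tp, fp, fn, tn):
    """Summary statistics for a 2x2 confusion matrix (positive class = 1)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "kappa": kappa,
    }

# Cells read from the matrix above: rows are predictions, columns are actuals.
metrics = binary_metrics(tp=28, fp=10, fn=5, tn=17)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```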

## [[1]]
## [1] 0.7777778

With an AUC of about 0.78, the model is fairly good at ranking patients who are at increased risk of a heart attack above those who are not.
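
AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The Python sketch below computes it that way on a small made-up example; the scores and targets are hypothetical and are not drawn from the heart attack data.

```python
def auc(scores, targets):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, t in zip(scores, targets) if t == 1]
    negatives = [s for s, t in zip(scores, targets) if t == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores: number of positive votes among nine neighbors.
votes = [8, 6, 5, 5, 1]
targets = [1, 1, 0, 1, 0]
example_auc = auc(votes, targets)
print(example_auc)
```

Because a k-nearest neighbors vote produces only a handful of distinct scores, ties are common and the half-credit rule matters more than it would for a model with continuous scores.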

## [1] "LogLoss: 1.11534182738525"

For a model’s predicted probabilities to be considered reliable, Log Loss should be fairly close to zero. A Log Loss of about 1.12 is a poor result: it is higher than the roughly 0.69 that a model assigning a probability of 0.5 to every patient would score. This implies that when the model is wrong, it tends to be confidently wrong.
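
To see why confident mistakes dominate this metric, the Python sketch below implements binary Log Loss and evaluates a few cases. The clipping constant `eps` is an assumption; the report does not say how probabilities of exactly 0 or 1 were handled, although a k-nearest neighbors vote can produce them.

```python
import math

def log_loss(probs, targets, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

# A constant 0.5 prediction scores ln(2), about 0.693, whatever the labels.
coin_flip = log_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# One of nine neighbors voting positive for a truly positive patient.
confident_miss = log_loss([1 / 9], [1])

# A unanimous wrong vote is clipped to eps and penalized very heavily.
unanimous_miss = log_loss([0.0], [1])

print(coin_flip, confident_miss, unanimous_miss)
```

A single unanimous wrong vote contributes far more to the average than dozens of mildly wrong predictions, which is one plausible reason the Log Loss here exceeds the coin-flip value even though accuracy is 75 percent.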

Part 2: Mis-Classification

##    pred_class pred_prob target
## 1           1 0.5555556      1
## 2           1 0.5555556      1
## 3           1 0.7777778      1
## 4           0 0.1111111      1
## 5           1 0.8888889      1
## 6           1 0.6666667      1
## 7           1 0.6666667      1
## 8           1 0.6666667      1
## 9           1 0.5555556      1
## 10          0 0.1111111      1

As indicated by the confusion matrix, most of the errors are false positives (10 of the 15 misclassified patients), which, in this case, is likely better than too many false negatives. Because this model predicts from medical data, over-predicting positive cases is likely better than under-predicting them. One way to reduce the false positives would be to raise the threshold for predicting the positive class. Given the high Log Loss value, the threshold will likely need to increase substantially to see a shift from false positives to true negatives.
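
The effect of raising the threshold can be illustrated with the ten rows printed above. In the Python sketch below, the default threshold of 0.5 reproduces the `pred_class` column, while 0.6 is a hypothetical raised threshold chosen only for illustration; the report does not state which value was used.

```python
# pred_prob and target for the ten rows shown above (ninths of a 9-vote tally).
pred_prob = [5/9, 5/9, 7/9, 1/9, 8/9, 6/9, 6/9, 6/9, 5/9, 1/9]
target = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

def classify(probs, threshold):
    """Predict the positive class when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def false_negatives(predicted, actual):
    return sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)

default_pred = classify(pred_prob, 0.5)
raised_pred = classify(pred_prob, 0.6)  # hypothetical threshold

print(default_pred, false_negatives(default_pred, target))
print(raised_pred, false_negatives(raised_pred, target))
```

All ten of these patients are truly positive, so in this small sample a higher threshold can only convert true positives into false negatives. Any gain in true negatives has to come from rows not shown here.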

Part 3: Threshold Change

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 21 13
##          1  6 20
##                                           
##                Accuracy : 0.6833          
##                  95% CI : (0.5504, 0.7974)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.02461         
##                                           
##                   Kappa : 0.375           
##                                           
##  Mcnemar's Test P-Value : 0.16867         
##                                           
##             Sensitivity : 0.6061          
##             Specificity : 0.7778          
##          Pos Pred Value : 0.7692          
##          Neg Pred Value : 0.6176          
##               Precision : 0.7692          
##                  Recall : 0.6061          
##                      F1 : 0.6780          
##              Prevalence : 0.5500          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.4333          
##       Balanced Accuracy : 0.6919          
##                                           
##        'Positive' Class : 1               
##

The higher threshold arguably made the model worse. Accuracy fell from 75 percent to about 68 percent because fewer observations were predicted correctly. While more true negatives were predicted (21 rather than 17), fewer true positives were predicted (20 rather than 28). The sensitivity, which is likely the most important metric in this case, decreased from nearly 85 percent to approximately 61 percent. Of the two models, the first is likely the better choice.

Part 4: Further Development

To improve the quality of this model, researchers should consider including other variables, such as fasting blood sugar and blood oxidization, which could help predict whether an individual is at higher risk of experiencing a heart attack. A change in the type of model may also be beneficial: the dataset includes some categorical variables that may assist with prediction, such as sex and type of chest pain, and these do not fit naturally into a distance-based method. Additionally, k-nearest neighbors assumes that similar observations lie close to one another, so that they can be identified as a group. In this case, individuals predisposed to heart attacks likely share similar traits with one another. While the assumption may generally hold, real data will almost never satisfy it perfectly.
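
The proximity assumption is easiest to see in a bare-bones implementation. The Python sketch below is a minimal k-nearest neighbors vote using Euclidean distance; it is not the code behind this report, and the training points are hypothetical. The predicted probabilities printed earlier are multiples of 1/9, which is consistent with k = 9, although the value of k is not stated.

```python
import math
from collections import Counter

def knn_predict_proba(train_x, train_y, query, k):
    """Share of the k nearest training points that belong to each class."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return {label: count / k for label, count in votes.items()}

# Hypothetical, already-standardized features for five patients.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1]

probs = knn_predict_proba(train_x, train_y, query=(0.9, 1.0), k=3)
print(probs)
```

Because the vote depends entirely on distance, variables measured on very different scales (cholesterol in mg/dl versus age in years, for example) are usually standardized first so that no single variable dominates.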

Used Motorcycles

This model aims to predict whether a motorcycle will have high, moderate, or low value at resale. Motorcycles that are considered “high” value retain at least 80 percent of their showroom price (sell price divided by showroom price). Motorcycles of “moderate” value have a sell price that is 50 to 80 percent of their showroom price. Those that sell for less than 50 percent of their showroom price are “low” value.
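
The labelling rule described above can be written as a short function. This Python sketch assumes that a ratio of exactly 50 percent counts as moderate, which follows from “less than 50 percent” being low; `value_class` and its argument names are hypothetical.

```python
def value_class(sell_price, showroom_price):
    """Label a motorcycle by the share of its showroom price it retains."""
    ratio = sell_price / showroom_price
    if ratio >= 0.8:
        return "high"
    if ratio >= 0.5:
        return "moderate"
    return "low"

# Hypothetical prices chosen to sit on and around the class boundaries.
examples = [(90, 100), (80, 100), (65, 100), (50, 100), (30, 100)]
labels = [value_class(sell, showroom) for sell, showroom in examples]
print(labels)
```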

Part 1: Model Building

Base rates for each of the three motorcycle value classes.

## [1] "High Base Rate:  0.263578274760383"

## [1] "Moderate Base rate:  0.472843450479233"

## [1] "Low Base Rate:  0.263578274760383"

The threshold for judging whether the model is a good predictor of a motorcycle’s value depends on the class. For high-value and low-value predictions, the threshold is 26.4 percent; for moderate-value predictions, it is 47.3 percent. In other words, the model needs to predict high-value and low-value motorcycles correctly more than 26.4 percent of the time, and moderate-value motorcycles more than 47.3 percent of the time, or else it is no better than random chance.
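
For overall accuracy, the usual yardstick is the share of the most common class, reported as the No Information Rate in the output under "Evaluation metrics". The Python sketch below computes it from the actual class counts in the test set, which are taken from the column totals of that confusion matrix. The comparison with proportional guessing is an added illustration and not part of the original analysis.

```python
# Actual class counts in the test set, from the confusion matrix column totals.
test_counts = {"high": 33, "moderate": 59, "low": 33}
n_test = sum(test_counts.values())

# Always predicting the most common class is correct this often.
majority_class = max(test_counts, key=test_counts.get)
no_information_rate = test_counts[majority_class] / n_test

# Guessing each class in proportion to its frequency is correct less often.
proportional_guess_rate = sum((c / n_test) ** 2 for c in test_counts.values())

print(majority_class, no_information_rate, round(proportional_guess_rate, 4))
```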

Check to ensure there is an appropriate number of data points in the training data and the testing data.

## [1] 0.8003195

## [1] 501

## [1] 125

Evaluation metrics

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction high moderate low
##   high       19       10   0
##   moderate   14       37  12
##   low         0       12  21
## 
## Overall Statistics
##                                           
##                Accuracy : 0.616           
##                  95% CI : (0.5248, 0.7016)
##     No Information Rate : 0.472           
##     P-Value [Acc > NIR] : 0.000847        
##                                           
##                   Kappa : 0.3916          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: high Class: moderate Class: low
## Sensitivity               0.5758          0.6271     0.6364
## Specificity               0.8913          0.6061     0.8696
## Pos Pred Value            0.6552          0.5873     0.6364
## Neg Pred Value            0.8542          0.6452     0.8696
## Prevalence                0.2640          0.4720     0.2640
## Detection Rate            0.1520          0.2960     0.1680
## Detection Prevalence      0.2320          0.5040     0.2640
## Balanced Accuracy         0.7335          0.6166     0.7530

From the confusion matrix above, it is clear the model is not identifying true positives at an acceptable rate. All sensitivity values fall between 57 percent and 64 percent, which is fairly poor. On the other hand, the false positive rate (1 - specificity) is quite good for high and low predictions, at roughly 11 and 13 percent. The false positive rate for moderate predictions, however, is dreadful at approximately 39 percent. While the model is better than random chance at predicting a motorcycle’s resale value, with an accuracy of 61.6 percent against a no information rate of 47.2 percent, it is not reliable. The Kappa score of 0.3916 confirms this: it falls just short of the 0.41 to 0.60 range commonly described as moderate agreement.
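
The per-class statistics and the Kappa score follow from the 3x3 matrix by treating each class in turn as the positive class. The Python sketch below shows the arithmetic; the function names are hypothetical, and Kappa is computed with the standard Cohen's kappa formula.

```python
classes = ["high", "moderate", "low"]
# Rows are predictions and columns are actual classes, as in the matrix above.
matrix = [
    [19, 10, 0],
    [14, 37, 12],
    [0, 12, 21],
]

def per_class_stats(matrix):
    """One-vs-rest sensitivity and specificity for each class."""
    n = sum(sum(row) for row in matrix)
    stats = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(matrix[i]) - tp                  # predicted i, actually other
        fn = sum(row[i] for row in matrix) - tp   # actually i, predicted other
        tn = n - tp - fp - fn
        stats.append({"sensitivity": tp / (tp + fn),
                      "specificity": tn / (tn + fp)})
    return stats

def cohen_kappa(matrix):
    """Agreement beyond what the row and column totals predict by chance."""
    n = sum(sum(row) for row in matrix)
    size = len(matrix)
    observed = sum(matrix[i][i] for i in range(size)) / n
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / n ** 2
    return (observed - expected) / (1 - expected)

stats = per_class_stats(matrix)
kappa = cohen_kappa(matrix)
for name, s in zip(classes, stats):
    print(name, round(s["sensitivity"], 4), round(s["specificity"], 4))
print("kappa", round(kappa, 4))
```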

## [1] "F1 Score:  0.612903225806452"

F1 scores closer to 1 signify better model performance; in this case, the F1 score is about 0.61. As with the other metrics, this indicates that the model produces too many false positives and false negatives relative to its true positives to reach an acceptable F1 score. The F1 score, however, may not be an ideal test of this model, given that the data are relatively balanced across the three response classes. The F1 score is most useful for unbalanced datasets, such as data with 90 percent negative responses and 10 percent positive responses.
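
With three classes, F1 is defined per class, so it helps to compute all three. In the Python sketch below, the F1 for the "high" class computed from the confusion matrix equals the reported 0.6129, which suggests (though the report does not say so) that the printed score describes that class alone. The macro average is an added illustration.

```python
classes = ["high", "moderate", "low"]
# Rows are predictions and columns are actual classes.
matrix = [
    [19, 10, 0],
    [14, 37, 12],
    [0, 12, 21],
]

def f1_by_class(matrix, classes):
    """One-vs-rest F1 = 2TP / (2TP + FP + FN) for each class."""
    scores = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        fp = sum(matrix[i]) - tp
        fn = sum(row[i] for row in matrix) - tp
        scores[name] = 2 * tp / (2 * tp + fp + fn)
    return scores

f1 = f1_by_class(matrix, classes)
macro_f1 = sum(f1.values()) / len(f1)
print({name: round(score, 4) for name, score in f1.items()}, round(macro_f1, 4))
```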

Part 2: Mis-Classification

##    pred_class      high  moderate       low target
## 1    moderate 0.1111111 0.6666667 0.2222222      2
## 2         low 0.1111111 0.4444444 0.4444444      3
## 3    moderate 0.1111111 0.5555556 0.3333333      2
## 4    moderate 0.4545455 0.5454545 0.0000000      1
## 5        high 0.7777778 0.2222222 0.0000000      1
## 6        high 0.7777778 0.2222222 0.0000000      2
## 7    moderate 0.4444444 0.5555556 0.0000000      1
## 8    moderate 0.1111111 0.8888889 0.0000000      3
## 9         low 0.0000000 0.3333333 0.6666667      3
## 10       high 1.0000000 0.0000000 0.0000000      1

Fortunately, the current model never predicts a low-value motorcycle as high or a high-value motorcycle as low. It is, however, still relatively ineffective at predicting the value of a used motorcycle, and its performance is mediocre across all three classes of the response variable. No clear pattern in the results suggests that changing the threshold to any particular level would improve the model on any of the metrics used above.

Parts 3 and 4: Threshold and Future Development

In addition to a threshold change being largely ineffectual, a threshold change with more than two classes was not possible with the function as currently written. Given the three classes of used motorcycles, it would likely be beneficial to find more data for each class. In its current state, the model may not have enough information to assign observations to the proper class. Once again, the model assumes that used motorcycles of high, moderate, and low value are each relatively “close” together within their respective class. There may also be factors, such as damage to the body, that the model cannot account for because KNN requires quantitative variables.
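
One way a threshold could be extended to three classes is to apply it to a single class and fall back to the most probable remaining class otherwise. The Python sketch below illustrates that rule on rows 4 and 7 of the table in Part 2. It is not the function used in this report: `predict_with_floor` is a hypothetical name, the 0.4 floor is arbitrary, and reading target code 1 as "high" is an assumption based on the order of the class columns.

```python
def predict_with_floor(probs, guarded_class, floor):
    """Predict guarded_class when its probability reaches floor;
    otherwise pick the most probable of the remaining classes."""
    if probs[guarded_class] >= floor:
        return guarded_class
    others = {c: p for c, p in probs.items() if c != guarded_class}
    return max(others, key=others.get)

# Rows 4 and 7 from the table in Part 2; both have target 1 (assumed "high").
row_4 = {"high": 5 / 11, "moderate": 6 / 11, "low": 0.0}
row_7 = {"high": 4 / 9, "moderate": 5 / 9, "low": 0.0}

for row in (row_4, row_7):
    print(predict_with_floor(row, "high", floor=0.5),
          predict_with_floor(row, "high", floor=0.4))
```

Lowering the floor for "high" recovers these two rows, but it would also relabel any truly moderate motorcycle with a similar vote, so the net effect on the full test set is unknown.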

Evaluating K-Nearest Neighbors

Jay Ralyea

4/19/2020
