# Load the packages used throughout the report
library(MASS)       # lda() and qda()
library(dplyr)      # data manipulation (filter, sample_n, mutate)
library(knitr)      # kable() tables
library(pROC)       # roc() curves and AUC
library(caret)      # train(), dummyVars(), varImp()
library(class)      # knn()
library(rpart)      # decision trees
library(rpart.plot) # decision tree plotting
# Read in the heart data
heart_data <- read.csv("heart_data.csv")
# Create data subset to ensure equal representation of heart attacks and non-heart attacks
# Count the number of non-heart-attack observations
attack_no <- sum(heart_data$Heart_Attack == "No")
# Filter to all non-heart attacks
heart_data_no <- filter(heart_data, Heart_Attack == "No")
# Filter to all heart attacks and then random sample to equal non-heart attack number
set.seed(35602287)
heart_data_yes <- heart_data %>% filter(Heart_Attack == "Yes") %>% sample_n(attack_no)
# Combine the yes and no datasets
heart_data <- rbind(heart_data_no, heart_data_yes)
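As a quick check, tabulating the response confirms the classes are now balanced (111 observations each, matching Table 2):
# Verify the subset is balanced across heart attack outcomes
table(heart_data$Heart_Attack)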
This report explores the relationships between various cardiovascular risk factors and heart attack occurrence, and utilises predictive modelling techniques to ascertain the likelihood of heart attacks in hospital patients.
A subset of the heart data set (Monash Online, 2025) will be analysed in this report, as the original dataset is highly skewed towards patients who suffered heart attacks. As such, a subset has been selected containing all non-heart attack observations and a random, equally sized selection of heart attack observations.
The data subset contains 222 observations across 21 variables. Each observation contains the recordings of cardiovascular risk factor values for individual patients, and whether that patient experienced a heart attack.
The variables within the heart data set can be observed in Table 1 below:
# Use Kable to create a table of the variable names and their type
# Remove row names so that each variable name does not double as a row name
knitr::kable(data.frame(
Variable = names(heart_data),
Type = sapply(heart_data, class),
row.names = NULL), caption = "Heart Data Set Variables")
Variable | Type |
---|---|
Heart_Attack | character |
Gender | character |
Family_History | character |
Diabetes | character |
Obesity | character |
Smoking_Status | character |
Physical_Inactivity | character |
Age | integer |
Alcohol_Consumption | integer |
Systolic_BP | integer |
Total_Cholesterol | integer |
HDL_Cholesterol | integer |
LDL_Cholesterol | integer |
Triglycerides | integer |
Fasting_Glucose | integer |
BMI | numeric |
Waist_Circumference | integer |
Heart_Rate | integer |
Resting_ECG | integer |
Physical_Activity_Minutes | integer |
High_BP | character |
Table 1 highlights that 13 of the variables are numerical and 8 are categorical, including the binary response Heart_Attack, for analysis and modelling purposes. Each risk factor captures different demographic, diagnostic or lifestyle information which may contribute to the patient’s likelihood of experiencing a heart attack. The response variable indicates whether the patient experienced a heart attack.
The summary statistics for each variable can be observed in Table 2 below:
# Convert character variables to factors so that frequencies can be calculated (OpenAI 2024)
heart_data[] <- lapply(heart_data, function(x) {if (is.character(x)) as.factor(x) else x})
# Use Kable to create a summary table of the heart data
knitr::kable(summary(heart_data), caption = "Heart Data Summary Statistics", digits = 2)
Heart_Attack | Gender | Family_History | Diabetes | Obesity | Smoking_Status | Physical_Inactivity | Age | Alcohol_Consumption | Systolic_BP | Total_Cholesterol | HDL_Cholesterol | LDL_Cholesterol | Triglycerides | Fasting_Glucose | BMI | Waist_Circumference | Heart_Rate | Resting_ECG | Physical_Activity_Minutes | High_BP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No :111 | Female:109 | No :155 | No :189 | No :160 | Non-smoker:167 | No :129 | Min. :20.00 | Min. : -8.00 | Min. : 91.0 | Min. :120 | Min. :16.00 | Min. : 45.0 | Min. : 26.0 | Min. : 51.00 | Min. :15.8 | Min. : 62.00 | Min. : 51.00 | Min. :0.0000 | Min. :-45.0 | No :120 |
Yes:111 | Male :113 | Yes: 67 | Yes: 33 | Yes: 62 | Smoker : 55 | Yes: 93 | 1st Qu.:30.00 | 1st Qu.: 60.00 | 1st Qu.:116.2 | 1st Qu.:172 | 1st Qu.:42.00 | 1st Qu.:105.0 | 1st Qu.:123.2 | 1st Qu.: 89.25 | 1st Qu.:23.7 | 1st Qu.: 85.25 | 1st Qu.: 67.25 | 1st Qu.:0.0000 | 1st Qu.: 88.5 | Yes:102 |
NA | NA | NA | NA | NA | NA | NA | Median :43.00 | Median : 85.00 | Median :128.0 | Median :198 | Median :48.50 | Median :119.0 | Median :151.0 | Median : 99.00 | Median :26.4 | Median : 94.00 | Median : 74.50 | Median :0.0000 | Median :123.5 | NA |
NA | NA | NA | NA | NA | NA | NA | Mean :45.05 | Mean : 81.62 | Mean :127.5 | Mean :195 | Mean :48.17 | Mean :120.9 | Mean :148.6 | Mean : 98.49 | Mean :26.6 | Mean : 94.34 | Mean : 75.17 | Mean :0.4054 | Mean :122.7 | NA |
NA | NA | NA | NA | NA | NA | NA | 3rd Qu.:59.00 | 3rd Qu.:104.75 | 3rd Qu.:138.0 | 3rd Qu.:214 | 3rd Qu.:53.00 | 3rd Qu.:138.8 | 3rd Qu.:173.8 | 3rd Qu.:109.00 | 3rd Qu.:29.5 | 3rd Qu.:103.00 | 3rd Qu.: 82.00 | 3rd Qu.:1.0000 | 3rd Qu.:156.0 | NA |
NA | NA | NA | NA | NA | NA | NA | Max. :80.00 | Max. :150.00 | Max. :174.0 | Max. :315 | Max. :75.00 | Max. :189.0 | Max. :267.0 | Max. :132.00 | Max. :37.8 | Max. :130.00 | Max. :105.00 | Max. :2.0000 | Max. :276.0 | NA |
Table 2 shows that within the heart data subset, there are an equal number of heart attack and non-heart attack cases for balanced analysis. Demographically, there are a similar number of male and female patients, with ages ranging from 20 to 80. The mean age of 45.05 indicates that the group is predominantly middle-aged.
Regarding diagnostic categorical variables, 70% of patients do not have a family history of heart attacks, 85% are non-diabetic, 72% are not classified as obese, 75% are non-smokers, 58% are physically active and 54% do not have high blood pressure, suggesting there is a mix of patients with both healthy and “at-risk” recordings.
Alcohol consumption within the sample ranges widely, from -8 grams to 150 grams per week (the negative minimum suggests a data-entry error). The mean of 81.62 grams sits slightly below the median of 85 grams, indicating that most patients are not drinking at extreme levels. Systolic blood pressure and LDL cholesterol have closely matched means and medians, at approximately 127.75 mmHg and 120 mg/dL respectively, suggesting roughly symmetric distributions suitable for modelling both normal and elevated levels. Triglyceride recordings range from 26 mg/dL to 267 mg/dL, indicating that blood fat levels vary substantially between individuals. BMI results show a mean of 26.6 kg/m² and a median of 26.4 kg/m², indicating that the typical patient is overweight (National Heart, Lung, and Blood Institute, 2025). The mean resting ECG test score is 0.41 with a median of 0, indicating that most patients recorded normal to mild results.
In this report, predictive models will be developed to classify whether a patient may be at risk of a heart attack based on the values of the most important risk factors. The variables in this data set considered most relevant for predicting the likelihood of a heart attack, and used in the initial models below, are Gender, Age, Diabetes, Smoking_Status, Alcohol_Consumption, Systolic_BP, LDL_Cholesterol, Triglycerides, BMI and Resting_ECG.
# Use student ID to set random seed
set.seed(35602287)
# Calculate number of heart data observations for splitting
size <- nrow(heart_data)
# Create the training index for 70% of observations
train_index <- sample(1:size, size * 0.7)
# Create training data set for 70% of observations
train_df <- heart_data[train_index, ]
# Create test data set for remaining 30% of observations
test_df <- heart_data[-train_index, ]
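A quick dimension check confirms the split sizes (155 training and 67 test observations, consistent with the outputs later in the report):
# Confirm the 70/30 split sizes
nrow(train_df)
nrow(test_df)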
The logistic regression model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.
# Build logistic model using specified variables
log_mod <- glm(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data=train_df, family = binomial)
# Use the generated model to predict each of the observations in the training set
log_mod_prob <- predict(log_mod, train_df, type = "response")
# Convert prediction values into Yes or No with a 0.5 threshold
log_mod_pred <- ifelse(log_mod_prob > 0.5, "Yes", "No")
With a 0.5 threshold, the logistic model predicts 78 heart attacks and 77 non-heart attacks, which reflects the 50/50 split in the original dataset.
# Generate the confusion matrix
cm <- table(Predicted = log_mod_pred, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics
roc_curve <- roc(train_df$Heart_Attack, log_mod_prob)
# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm
## Actual
## Predicted No Yes
## No 54 23
## Yes 21 57
## Area under the curve: 0.7938
## Sensitivity = 0.7125
## Specificity = 0.72
## Misclassification Error = 0.283871
The logistic model has reasonable predictive ability, with an AUC of 0.79 from the ROC curve and balanced sensitivity (0.71) and specificity (0.72). The misclassification error is also relatively low at 0.28, indicating that the model discriminates reasonably well between at-risk and non-risk patients. For model tuning and clinical relevance, the threshold is reduced to 0.4 to prioritise sensitivity and minimise the likelihood of missing a high-risk patient.
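Before committing to the 0.4 threshold, the sensitivity/specificity trade-off can be inspected directly from the fitted ROC object; a minimal sketch using pROC::coords (an optional check, not part of the original analysis):
# Inspect sensitivity and specificity at candidate thresholds (sketch)
coords(roc_curve, x = c(0.3, 0.4, 0.5, 0.6), input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))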
# Convert prediction values into Yes or No with a 0.4 threshold
log_mod_pred2 <- ifelse(log_mod_prob > 0.4, "Yes", "No")
# Generate the confusion matrix
cm2 <- table(Predicted = log_mod_pred2, Actual = train_df$Heart_Attack)
# Calculate error rates and metrics
sensitivity2 <- cm2[2,2]/(cm2[2,2]+cm2[1,2])
specificity2 <- cm2[1,1]/(cm2[1,1]+cm2[2,1])
misclassification2 <- (cm2[2,1] + cm2[1,2])/nrow(train_df)
cm2
## Actual
## Predicted No Yes
## No 41 11
## Yes 34 69
## Sensitivity (0.4 threshold) = 0.8625
## Specificity (0.4 threshold) = 0.5466667
## Misclassification Error (0.4 threshold) = 0.2903226
At the reduced threshold, sensitivity increases to 0.86, while specificity falls to 0.55 and the misclassification error rises slightly to 0.29. This sacrifice in specificity and overall accuracy is clinically justified, because failing to detect an actual heart attack patient is more costly than misdiagnosing a non-heart attack patient.
Both forward and backward stepwise variable selection are applied to determine whether the variables incorporated in the model are all significant predictors of heart attack risk.
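The forward-selection chunk is reconstructed below, assuming step() was run on the full model with direction = "forward"; with no larger scope supplied, the search prints the starting AIC and stops at the full model, which matches the output that follows:
# Conduct forward stepwise variable selection (assumed call; with no scope,
# the search starts and ends at the full model)
heart_forward <- step(log_mod, direction = "forward")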
## Start: AIC=192.51
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption +
## Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG
# Conduct backward stepwise variable selection
heart_backward <- step(log_mod, direction = "backward")
## Start: AIC=192.51
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption +
## Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG
##
## Df Deviance AIC
## - Resting_ECG 1 170.78 190.78
## - LDL_Cholesterol 1 170.84 190.84
## - Gender 1 171.08 191.08
## - Smoking_Status 1 171.47 191.47
## - Alcohol_Consumption 1 171.47 191.47
## <none> 170.51 192.51
## - Triglycerides 1 173.28 193.28
## - BMI 1 174.81 194.81
## - Diabetes 1 176.82 196.82
## - Systolic_BP 1 183.40 203.40
## - Age 1 188.49 208.49
##
## Step: AIC=190.78
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption +
## Systolic_BP + LDL_Cholesterol + Triglycerides + BMI
##
## Df Deviance AIC
## - LDL_Cholesterol 1 171.10 189.10
## - Gender 1 171.38 189.38
## - Smoking_Status 1 171.72 189.72
## - Alcohol_Consumption 1 171.73 189.73
## <none> 170.78 190.78
## - Triglycerides 1 173.56 191.56
## - BMI 1 175.05 193.05
## - Diabetes 1 176.94 194.94
## - Systolic_BP 1 183.75 201.75
## - Age 1 189.01 207.01
##
## Step: AIC=189.1
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption +
## Systolic_BP + Triglycerides + BMI
##
## Df Deviance AIC
## - Gender 1 171.91 187.91
## - Smoking_Status 1 171.92 187.92
## - Alcohol_Consumption 1 172.04 188.04
## <none> 171.10 189.10
## - Triglycerides 1 173.96 189.96
## - BMI 1 175.41 191.41
## - Diabetes 1 177.08 193.08
## - Systolic_BP 1 184.85 200.85
## - Age 1 189.26 205.26
##
## Step: AIC=187.91
## Heart_Attack ~ Age + Diabetes + Smoking_Status + Alcohol_Consumption +
## Systolic_BP + Triglycerides + BMI
##
## Df Deviance AIC
## - Smoking_Status 1 172.77 186.77
## - Alcohol_Consumption 1 172.86 186.86
## <none> 171.91 187.91
## - Triglycerides 1 175.16 189.16
## - BMI 1 176.48 190.48
## - Diabetes 1 178.31 192.31
## - Systolic_BP 1 186.43 200.43
## - Age 1 190.12 204.12
##
## Step: AIC=186.77
## Heart_Attack ~ Age + Diabetes + Alcohol_Consumption + Systolic_BP +
## Triglycerides + BMI
##
## Df Deviance AIC
## - Alcohol_Consumption 1 173.67 185.67
## <none> 172.77 186.77
## - Triglycerides 1 176.52 188.52
## - BMI 1 177.54 189.54
## - Diabetes 1 179.09 191.09
## - Systolic_BP 1 186.93 198.93
## - Age 1 191.64 203.64
##
## Step: AIC=185.67
## Heart_Attack ~ Age + Diabetes + Systolic_BP + Triglycerides +
## BMI
##
## Df Deviance AIC
## <none> 173.67 185.67
## - Triglycerides 1 177.37 187.37
## - BMI 1 178.70 188.70
## - Diabetes 1 179.46 189.46
## - Systolic_BP 1 187.63 197.63
## - Age 1 191.92 201.92
##
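The coefficient tables below appear to come from summary() calls on the two selected models; a reconstruction of the omitted code (assumed):
# Display coefficient summaries for the forward- and backward-selected models
summary(heart_forward)
summary(heart_backward)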
## Call:
## glm(formula = Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status +
## Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides +
## BMI + Resting_ECG, family = binomial, data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.656867 2.800500 -4.520 6.2e-06 ***
## GenderMale 0.289108 0.384243 0.752 0.451805
## Age 0.049712 0.012781 3.889 0.000100 ***
## DiabetesYes 1.301108 0.537003 2.423 0.015397 *
## Smoking_StatusSmoker 0.429878 0.442048 0.972 0.330817
## Alcohol_Consumption -0.005700 0.005841 -0.976 0.329140
## Systolic_BP 0.045972 0.013769 3.339 0.000841 ***
## LDL_Cholesterol 0.004906 0.008581 0.572 0.567498
## Triglycerides 0.009351 0.005667 1.650 0.098911 .
## BMI 0.102535 0.050918 2.014 0.044036 *
## Resting_ECG -0.145819 0.282608 -0.516 0.605870
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.71 on 154 degrees of freedom
## Residual deviance: 170.51 on 144 degrees of freedom
## AIC: 192.51
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = Heart_Attack ~ Age + Diabetes + Systolic_BP + Triglycerides +
## BMI, family = binomial, data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.796540 2.667800 -4.797 1.61e-06 ***
## Age 0.048645 0.012390 3.926 8.63e-05 ***
## DiabetesYes 1.210555 0.520703 2.325 0.020080 *
## Systolic_BP 0.047211 0.013569 3.479 0.000503 ***
## Triglycerides 0.010597 0.005581 1.899 0.057581 .
## BMI 0.109158 0.050293 2.170 0.029972 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.71 on 154 degrees of freedom
## Residual deviance: 173.67 on 149 degrees of freedom
## AIC: 185.67
##
## Number of Fisher Scoring iterations: 4
Both forward and backward selection identify that Age, Diabetes, Systolic_BP and BMI are statistically significant, with p-values < 0.05. Forward selection stopped at the full model, retaining all the original variables and the original AIC of 192.51, indicating potential overfitting by keeping weaker predictors.
Backward selection retained Age, Diabetes, Systolic_BP, Triglycerides and BMI and dropped the variables with lower predictive significance. The resulting AIC is lower at 185.67, reducing noise and overfitting; however, this may sacrifice model sensitivity, leading to more at-risk patients being missed.
# Create the backward selection variables model
back_mod <- glm(Heart_Attack ~ Age + Diabetes + Systolic_BP + BMI + Triglycerides, data=train_df, family = binomial)
# Use the new model to predict each of the observations in the training set
back_mod_prob <- predict(back_mod, train_df, type = "response")
# Convert prediction values into Yes or No with a 0.4 threshold
back_mod_pred <- ifelse(back_mod_prob > 0.4, "Yes", "No")
# Generate the confusion matrix
cm_back <- table(Predicted = back_mod_pred, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics
roc_curve_back <- roc(train_df$Heart_Attack, back_mod_prob)
# Calculate error rates and metrics
sensitivity_back <- cm_back[2,2]/(cm_back[2,2]+cm_back[1,2])
specificity_back <- cm_back[1,1]/(cm_back[1,1]+cm_back[2,1])
misclassification_back <- (cm_back[2,1] + cm_back[1,2])/nrow(train_df)
cm_back
## Actual
## Predicted No Yes
## No 43 10
## Yes 32 70
## Area under the curve: 0.7842
## Sensitivity (Backward Step) = 0.875
## Specificity (Backward Step) = 0.5733333
## Misclassification Error (Backward Step) = 0.2709677
The revised backward model returns a similar AUC of 0.78 compared to 0.79 for the original model. Sensitivity increases to 0.88 in the backward model (compared to 0.86 for the full model at the 0.4 threshold), specificity increases to 0.57 (compared to 0.55), and misclassification is lower at 0.27 (compared to 0.29).
The backward model also produced fewer false negatives (10, versus 11 for the full model), meaning its performance in accurately detecting at-risk patients is greater. Clinically, this means fewer at-risk patients are missed from being flagged for further interventions in clinic to prevent heart attacks.
The revised backward model with 0.4 threshold will be included in further evaluation and comparison.
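For a visual comparison, the two ROC curves can be overlaid; a minimal sketch using pROC’s plotting methods and the roc_curve and roc_curve_back objects created above (not part of the original output):
# Overlay ROC curves for the full and backward-selected logistic models
plot(roc_curve, col = "steelblue", legacy.axes = TRUE)
lines(roc_curve_back, col = "firebrick")
legend("bottomright", legend = c("Full model (AUC 0.79)", "Backward model (AUC 0.78)"),
       col = c("steelblue", "firebrick"), lwd = 2)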
The linear discriminant analysis model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.
# Build lda model using specified variables
lda_mod <- lda(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data=train_df)
# Use the generated model to predict each of the observations in the training set
lda_mod_prob <- predict(lda_mod, train_df)
# Extract class predictions with default 0.5 threshold
lda_mod_pred <- ifelse(lda_mod_prob$posterior[,"Yes"] > 0.5, "Yes", "No")
With the default 0.5 threshold, the LDA model predicts 77 heart attacks and 78 non-heart attacks, which reflects the 50/50 split in the original dataset.
# Generate the confusion matrix
cm <- table(Predicted = lda_mod_pred, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, lda_mod_prob$posterior[, "Yes"])
# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm
## Actual
## Predicted No Yes
## No 54 24
## Yes 21 56
## Area under the curve: 0.7942
## Sensitivity = 0.7
## Specificity = 0.72
## Misclassification Error = 0.2903226
The LDA model has reasonable predictive ability, with an AUC of 0.79 from the ROC curve and balanced sensitivity (0.70) and specificity (0.72). The misclassification error is also relatively low at 0.29, showing clear discrimination between at-risk and non-risk patients. To achieve clinical relevance and reduce false-negative classifications of at-risk patients, the threshold is tuned to 0.4.
# Manually update to 0.4 threshold for classification of probabilities
lda_mod_pred2 <- ifelse(lda_mod_prob$posterior[,"Yes"] > 0.4, "Yes", "No")
# Generate the confusion matrix
cm2 <- table(Predicted = lda_mod_pred2, Actual = train_df$Heart_Attack)
# Calculate error rates and metrics
sensitivity2 <- cm2[2,2]/(cm2[2,2]+cm2[1,2])
specificity2 <- cm2[1,1]/(cm2[1,1]+cm2[2,1])
misclassification2 <- (cm2[2,1] + cm2[1,2])/nrow(train_df)
cm2
## Actual
## Predicted No Yes
## No 42 11
## Yes 33 69
## Sensitivity (0.4 threshold) = 0.8625
## Specificity (0.4 threshold) = 0.56
## Misclassification Error (0.4 threshold) = 0.283871
The 0.4 threshold returns increased sensitivity of 0.86, reduced specificity of 0.56 and a slightly lower misclassification error of 0.28. Due to the severe clinical implications of failing to detect at-risk patients, the tuned threshold is more appropriate to maximise sensitivity.
K-fold cross-validation is applied to the model to assess how effectively it generalises to unseen patient data and to ensure it is not overfitting by using more variables than required, which could lead to misdiagnosing healthy patients and neglecting intervention for at-risk patients.
# Create function for repeated k-fold cross-validation (Geeks for Geeks 2025)
cross_validate_lda <- function(data, target, k) {
set.seed(35602287) # For reproducibility
# Define the trainControl object including calculations for ROC sensitivity and specificity
train_control <- trainControl(method = "repeatedcv", number = k, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
# Train the LDA model with cross-validation
model <- train(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG,
data = data,
method = "lda",
metric = "ROC",
trControl = train_control)
return(model)
}
# Perform 10-Fold Cross-Validation
cv_model <- cross_validate_lda(train_df, "Heart_Attack", 10)
# Display the cross-validation results
print(cv_model)
## Linear Discriminant Analysis
##
## 155 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7385417 0.6410714 0.675
## ROC Sens Spec Resample
## 1 0.5781250 0.6250000 0.500 Fold01.Rep1
## 2 0.9218750 0.8750000 0.750 Fold02.Rep1
## 3 0.5000000 0.3750000 0.375 Fold03.Rep1
## 4 0.8571429 0.7142857 0.750 Fold04.Rep1
## 5 0.7343750 0.6250000 0.625 Fold05.Rep1
## 6 0.7343750 0.6250000 0.625 Fold06.Rep1
## 7 0.7321429 0.7142857 0.500 Fold07.Rep1
## 8 0.8571429 0.5714286 0.875 Fold08.Rep1
## 9 0.6071429 0.4285714 0.750 Fold09.Rep1
## 10 0.8035714 0.7142857 0.875 Fold10.Rep1
## 11 0.8750000 0.7500000 0.875 Fold01.Rep2
## 12 0.7343750 0.5000000 0.750 Fold02.Rep2
## 13 0.6875000 0.6250000 0.625 Fold03.Rep2
## 14 0.9285714 0.8571429 0.750 Fold04.Rep2
## 15 0.6875000 0.7500000 0.500 Fold05.Rep2
## 16 0.6071429 0.5714286 0.500 Fold06.Rep2
## 17 0.7678571 0.7142857 0.875 Fold07.Rep2
## 18 0.5178571 0.5714286 0.375 Fold08.Rep2
## 19 0.8928571 0.8571429 0.750 Fold09.Rep2
## 20 0.6875000 0.3750000 0.750 Fold10.Rep2
## 21 0.8906250 0.8750000 0.625 Fold01.Rep3
## 22 0.6093750 0.3750000 0.750 Fold02.Rep3
## 23 0.5625000 0.6250000 0.250 Fold03.Rep3
## 24 0.5178571 0.4285714 0.750 Fold04.Rep3
## 25 0.7857143 0.5714286 0.750 Fold05.Rep3
## 26 0.8125000 0.6250000 1.000 Fold06.Rep3
## 27 0.8750000 0.7142857 0.750 Fold07.Rep3
## 28 0.6406250 0.7500000 0.375 Fold08.Rep3
## 29 0.8392857 0.7142857 0.750 Fold09.Rep3
## 30 0.9107143 0.7142857 0.875 Fold10.Rep3
Cross-validation shows the model achieves a mean AUC of 0.74, sensitivity of 0.64 and specificity of 0.68. While the performance metrics are moderate, variability across folds indicates that some predictors may not be normally distributed or have equal variances (assumptions underpinning LDA), or are statistically weak in predictive power and degrading the model.
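The distributional concern can be spot-checked informally; a minimal sketch (an additional check, not part of the original analysis) applying Shapiro-Wilk tests to the numeric predictors, where small p-values suggest departures from the normality LDA assumes:
# Informal normality check on the numeric predictors (sketch)
num_cols <- c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides",
              "BMI", "Resting_ECG", "Alcohol_Consumption")
sapply(train_df[, num_cols], function(x) shapiro.test(x)$p.value)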
# Calculate the importance of predictors
vimp <- varImp(cv_model)
knitr::kable(vimp$importance, caption = "LDA Model Variable Importance", digits = 2)
Variable | No | Yes |
---|---|---|
Gender | 32.62 | 32.62 |
Age | 78.17 | 78.17 |
Diabetes | 23.83 | 23.83 |
Smoking_Status | 14.79 | 14.79 |
Alcohol_Consumption | 9.03 | 9.03 |
Systolic_BP | 100.00 | 100.00 |
LDL_Cholesterol | 15.97 | 15.97 |
Triglycerides | 43.36 | 43.36 |
BMI | 34.28 | 34.28 |
Resting_ECG | 0.00 | 0.00 |
From Table 3 above, Systolic_BP is indicated as the most important predictor (scaled to 100), followed by Age (78.17), Triglycerides (43.36), BMI (34.28) and Gender (32.62), with Smoking_Status, LDL_Cholesterol, Alcohol_Consumption and Resting_ECG not contributing substantially to the model.
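The scaled importance scores can also be visualised; a minimal sketch using caret’s lattice-based plot method for varImp objects (not part of the original output):
# Plot the scaled variable importance scores (sketch)
plot(vimp, main = "LDA Model Variable Importance")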
A simplified model is generated excluding the variables identified as less significant for predicting heart attack risk.
# Tune model removing insignificant variables (OpenAI 2024)
lda_refined <- lda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)
# Use the generated model to predict each of the observations in the training set
lda_ref_prob <- predict(lda_refined, train_df)
# Extract class predictions with 0.4 threshold
lda_ref_pred <- ifelse(lda_ref_prob$posterior[,"Yes"] > 0.4, "Yes", "No")
# Generate the confusion matrix
cm <- table(Predicted = lda_ref_pred, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, lda_ref_prob$posterior[, "Yes"])
# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm
## Actual
## Predicted No Yes
## No 43 12
## Yes 32 68
## Area under the curve: 0.7925
## Sensitivity (0.4 threshold) = 0.85
## Specificity (0.4 threshold) = 0.5733333
## Misclassification Error (0.4 threshold) = 0.283871
# Conduct repeated cross validation on refined simpler model (OpenAI 2024)
set.seed(35602287)
cv_model_subset <- train(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI,
data = train_df,
method = "lda",
metric = "ROC",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
)
print(cv_model_subset)
## Linear Discriminant Analysis
##
## 155 samples
## 6 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7664435 0.685119 0.7208333
The simplified LDA model demonstrates similar moderate-to-strong performance across metrics, maintaining an AUC of 0.79, sensitivity of 0.85, specificity of 0.57 and an error rate of 0.28. Under cross-validation, the simplified model returns a higher average AUC of 0.77, sensitivity of 0.69 and specificity of 0.72, conveying that this model generalises better to unseen patients due to the removal of weak heart-attack risk variables. However, the simplification led to 12 at-risk patients being misdiagnosed as non-risk, compared to the full model, which missed only 11.
Despite the small increase in misdiagnoses of at-risk patients, generalisation and reliability are crucial for new patients to be categorised accurately. Thus, the simplified model with the 0.4 threshold will be included in further evaluation and comparison.
The quadratic discriminant analysis model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.
# Build qda model using specified variables
qda_mod <- qda(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data=train_df)
# Use the generated model to predict each of the observations in the training set
qda_mod_prob <- predict(qda_mod, train_df)
# Extract class predictions with default 0.5 threshold
qda_mod_pred <- ifelse(qda_mod_prob$posterior[,"Yes"] > 0.5, "Yes", "No")
With the default 0.5 threshold, the QDA model predicts 75 heart attacks and 80 non-heart attacks, which is slightly unbalanced compared to the original dataset.
# Generate the confusion matrix
cm <- table(Predicted = qda_mod_pred, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, qda_mod_prob$posterior[, "Yes"])
# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm
## Actual
## Predicted No Yes
## No 60 20
## Yes 15 60
## Area under the curve: 0.853
## Sensitivity = 0.75
## Specificity = 0.8
## Misclassification Error = 0.2258065
The QDA model has strong predictive ability, with an AUC of 0.85 from the ROC curve and balanced sensitivity (0.75) and specificity (0.80). The misclassification error is also relatively low at 0.23, showing good discrimination between at-risk and non-risk patients. The threshold is tuned to 0.4 to test whether sensitivity can be further improved to detect a greater proportion of at-risk patients.
# Manually update to 0.4 threshold for classification of probabilities
qda_mod_pred2 <- ifelse(qda_mod_prob$posterior[,"Yes"] > 0.4, "Yes", "No")
# Generate the confusion matrix
cm2 <- table(Predicted = qda_mod_pred2, Actual = train_df$Heart_Attack)
# Calculate error rates and metrics
sensitivity2 <- cm2[2,2]/(cm2[2,2]+cm2[1,2])
specificity2 <- cm2[1,1]/(cm2[1,1]+cm2[2,1])
misclassification2 <- (cm2[2,1] + cm2[1,2])/nrow(train_df)
cm2
## Actual
## Predicted No Yes
## No 52 13
## Yes 23 67
## Sensitivity (0.4 threshold) = 0.8375
## Specificity (0.4 threshold) = 0.6933333
## Misclassification Error (0.4 threshold) = 0.2322581
The reduced threshold returns increased sensitivity of 0.84 at the expense of decreased specificity of 0.69 and a slightly greater misclassification error of 0.23. The sacrifice in specificity at the 0.4 threshold is justified to ensure greater sensitivity in accurately detecting at-risk patients and commencing clinical interventions.
K-fold cross-validation is applied to the model to confirm generalisation for unseen data and ensure that the model’s strength is maintained for new patients being assessed.
# Create function for repeated k-fold cross-validation (Geeks for Geeks 2025)
cross_validate_qda <- function(data, target, k) {
set.seed(35602287) # For reproducibility
# Define the trainControl object including calculations for ROC sensitivity and specificity
train_control <- trainControl(method = "repeatedcv", number = k, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
# Train the QDA model with cross-validation
model <- train(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG,
data = data,
method = "qda",
metric = "ROC",
trControl = train_control)
return(model)
}
# Perform 10-Fold Cross-Validation
cv_model <- cross_validate_qda(train_df, "Heart_Attack", 10)
# Display the cross-validation results
print(cv_model)
## Quadratic Discriminant Analysis
##
## 155 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ...
## Resampling results:
##
## ROC Sens Spec
## 0.6166667 0.5416667 0.5625
## ROC Sens Spec Resample
## 1 0.4531250 0.3750000 0.625 Fold01.Rep1
## 2 0.6250000 0.7500000 0.375 Fold02.Rep1
## 3 0.5625000 0.5000000 0.375 Fold03.Rep1
## 4 0.7142857 0.7142857 0.625 Fold04.Rep1
## 5 0.4531250 0.6250000 0.250 Fold05.Rep1
## 6 0.5937500 0.5000000 0.625 Fold06.Rep1
## 7 0.7142857 0.5714286 0.375 Fold07.Rep1
## 8 0.6250000 0.1428571 0.750 Fold08.Rep1
## 9 0.6428571 0.4285714 0.625 Fold09.Rep1
## 10 0.8928571 0.7142857 0.875 Fold10.Rep1
## 11 0.8437500 0.7500000 0.625 Fold01.Rep2
## 12 0.5312500 0.5000000 0.625 Fold02.Rep2
## 13 0.5781250 0.5000000 0.500 Fold03.Rep2
## 14 0.7678571 0.7142857 0.500 Fold04.Rep2
## 15 0.6718750 0.7500000 0.375 Fold05.Rep2
## 16 0.4642857 0.5714286 0.500 Fold06.Rep2
## 17 0.6071429 0.4285714 0.750 Fold07.Rep2
## 18 0.4464286 0.4285714 0.625 Fold08.Rep2
## 19 0.8571429 0.7142857 0.875 Fold09.Rep2
## 20 0.5468750 0.2500000 0.625 Fold10.Rep2
## 21 0.7031250 0.7500000 0.250 Fold01.Rep3
## 22 0.5937500 0.5000000 0.500 Fold02.Rep3
## 23 0.5000000 0.7500000 0.000 Fold03.Rep3
## 24 0.5535714 0.4285714 0.625 Fold04.Rep3
## 25 0.6250000 0.5714286 0.625 Fold05.Rep3
## 26 0.5312500 0.3750000 1.000 Fold06.Rep3
## 27 0.6428571 0.4285714 0.500 Fold07.Rep3
## 28 0.4375000 0.3750000 0.625 Fold08.Rep3
## 29 0.6607143 0.5714286 0.625 Fold09.Rep3
## 30 0.6607143 0.5714286 0.625 Fold10.Rep3
Cross-validation shows the model achieves a mean AUC of 0.62, sensitivity of 0.54 and specificity of 0.56. These metrics are much weaker than on the training data, suggesting the model does not perform well on new patient data and that insignificant variables introduce inconsistencies in diagnoses across patients.
Using the variable importance levels in Table 3, a simplified model is generated excluding insignificant predictors.
# Tune model removing insignificant variables (OpenAI 2024)
qda_refined <- qda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)
# Use the generated model to predict each of the observations in the training set
qda_ref_prob <- predict(qda_refined, train_df)
# Extract class predictions with 0.4 threshold
qda_ref_pred <- ifelse(qda_ref_prob$posterior[,"Yes"] > 0.4, "Yes", "No")
# Generate the confusion matrix
cm <- table(Predicted = qda_ref_pred, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, qda_ref_prob$posterior[, "Yes"])
# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm
## Actual
## Predicted No Yes
## No 48 13
## Yes 27 67
## Area under the curve: 0.8158
## Sensitivity (0.4 threshold) = 0.8375
## Specificity (0.4 threshold) = 0.64
## Misclassification Error (0.4 threshold) = 0.2580645
# Conduct repeated cross validation on refined simpler model (OpenAI 2024)
set.seed(35602287)
cv_model_subset <- train(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI,
data = train_df,
method = "qda",
metric = "ROC",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
)
print(cv_model_subset)
## Quadratic Discriminant Analysis
##
## 155 samples
## 6 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7029018 0.6071429 0.6166667
The simplified QDA model demonstrates slightly decreased training performance compared to the full model, with an AUC of 0.82, unchanged sensitivity of 0.84, decreased specificity of 0.64 and an increased error rate of 0.26. Under cross-validation, however, the simplified model returns a higher average AUC of 0.70, sensitivity of 0.61 and specificity of 0.62, suggesting that removing insignificant risk factors reduces overfitting. The simplification did not alter the number of at-risk patients misdiagnosed as non-risk (13 in both models).
To prioritise clinical safety and the ability to generalise accurately to new, unseen patient data, the simplified model with the 0.4 threshold will be included in further evaluation and comparison.
The k-Nearest Neighbours model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.
# Convert categorical variables to dummy variables for processing (OpenAI 2024)
dummy <- dummyVars(~ ., data = train_df[, c("Gender", "Diabetes", "Smoking_Status")])
dummy <- predict(dummy, newdata = train_df[, c("Gender", "Diabetes", "Smoking_Status")])
# Standardise numeric variables to equalise weightings
num <- scale(train_df[, c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides", "BMI","Resting_ECG", "Alcohol_Consumption")])
# Combine scaled numeric and dummy variables, ensuring output is a dataframe
train_knn <- data.frame(cbind(num, dummy))
# Train knn model using k=3
set.seed(35602287)
knn_mod <- knn(train_knn, train_knn, train_df$Heart_Attack, k = 3)
# Generate the confusion matrix
cm <- table(Predicted = knn_mod, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, as.numeric(knn_mod))
# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm
## Actual
## Predicted No Yes
## No 60 20
## Yes 15 60
## Area under the curve: 0.775
## Sensitivity (k=3) = 0.75
## Specificity (k=3) = 0.8
## Misclassification Error (k=3) = 0.2258065
Using k=3, the k-NN model displays moderate discriminatory ability between at-risk and non-risk patients, with an AUC of 0.78, relatively balanced sensitivity (0.75) and specificity (0.80), and an error rate of 0.23. While these metrics suggest the k=3 model is generally effective in detecting at-risk patients so they can be put forward for appropriate interventions, testing is conducted to identify the optimal value of k.
# Create loop to find optimal k (OpenAI 2024)
set.seed(35602287)
# Set empty lists for k value and sensitivity results to append whilst looping
k_values <- c()
sens_list <- c()
# Iterate through k values of 1-20
for (k in 1:20) {
# Train the model for that value of k
knn_mod <- knn(train_knn, train_knn, train_df$Heart_Attack, k = k)
# Generate the confusion matrix
cm <- table(Predicted = knn_mod, Actual = train_df$Heart_Attack)
# Calculate sensitivity
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
# Append k values and sensitivity calculations to lists for plotting
k_values <- append(k_values, k)
sens_list <- append(sens_list, sensitivity)
}
plot(k_values, sens_list, pch = 15, col = "pink",
     xlab = "k", ylab = "Training sensitivity")
Figure 1: Sensitivity by k-value
In Figure 1 above, k=1 produces a perfect sensitivity rate on the training data. This is expected: with k=1, each training observation is its own nearest neighbour, so the model simply memorises the training labels. Cross-validation is therefore applied to test how well the k=1 model generalises, compared with the next-strongest value, k=2.
# Conduct cross validation on k=1 (OpenAI 2024)
set.seed(35602287)
knn_cv <- train(
Heart_Attack ~ .,
data = mutate(train_knn, Heart_Attack = train_df$Heart_Attack),
method = "knn",
metric = "ROC",
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary),
tuneGrid = data.frame(k = 1)
)
print(knn_cv)
## k-Nearest Neighbors
##
## 155 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ...
## Resampling results:
##
## ROC Sens Spec
## 0.5410714 0.5696429 0.5125
##
## Tuning parameter 'k' was held constant at a value of 1
## ROC Sens Spec Resample
## 1 0.5000000 0.6250000 0.375 Fold01.Rep1
## 2 0.4375000 0.5000000 0.375 Fold02.Rep1
## 3 0.5625000 0.6250000 0.500 Fold03.Rep1
## 4 0.5446429 0.7142857 0.375 Fold04.Rep1
## 5 0.5625000 0.6250000 0.500 Fold05.Rep1
## 6 0.5625000 0.5000000 0.625 Fold06.Rep1
## 7 0.5357143 0.5714286 0.500 Fold07.Rep1
## 8 0.6696429 0.7142857 0.625 Fold08.Rep1
## 9 0.3928571 0.2857143 0.500 Fold09.Rep1
## 10 0.7321429 0.7142857 0.750 Fold10.Rep1
## 11 0.5625000 0.5000000 0.625 Fold01.Rep2
## 12 0.5000000 0.7500000 0.250 Fold02.Rep2
## 13 0.7500000 0.6250000 0.875 Fold03.Rep2
## 14 0.6696429 0.7142857 0.625 Fold04.Rep2
## 15 0.3750000 0.6250000 0.125 Fold05.Rep2
## 16 0.4017857 0.4285714 0.375 Fold06.Rep2
## 17 0.6696429 0.7142857 0.625 Fold07.Rep2
## 18 0.4732143 0.5714286 0.375 Fold08.Rep2
## 19 0.7232143 0.5714286 0.875 Fold09.Rep2
## 20 0.4375000 0.2500000 0.625 Fold10.Rep2
## 21 0.5625000 0.7500000 0.375 Fold01.Rep3
## 22 0.5625000 0.6250000 0.500 Fold02.Rep3
## 23 0.3750000 0.3750000 0.375 Fold03.Rep3
## 24 0.4642857 0.4285714 0.500 Fold04.Rep3
## 25 0.6607143 0.5714286 0.750 Fold05.Rep3
## 26 0.4375000 0.5000000 0.375 Fold06.Rep3
## 27 0.5267857 0.4285714 0.625 Fold07.Rep3
## 28 0.5000000 0.5000000 0.500 Fold08.Rep3
## 29 0.5357143 0.5714286 0.500 Fold09.Rep3
## 30 0.5446429 0.7142857 0.375 Fold10.Rep3
# Conduct cross validation on k=2 (OpenAI 2024)
knn_cv <- train(
Heart_Attack ~ .,
data = mutate(train_knn, Heart_Attack = train_df$Heart_Attack),
method = "knn",
metric = "ROC",
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary),
tuneGrid = data.frame(k = 2)
)
print(knn_cv)
## k-Nearest Neighbors
##
## 155 samples
## 13 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 139, 140, 140, 140, 139, 139, ...
## Resampling results:
##
## ROC Sens Spec
## 0.5999628 0.5404762 0.5208333
##
## Tuning parameter 'k' was held constant at a value of 2
## ROC Sens Spec Resample
## 1 0.6093750 0.5000000 0.375 Fold01.Rep1
## 2 0.7946429 1.0000000 0.625 Fold02.Rep1
## 3 0.4553571 0.4285714 0.625 Fold03.Rep1
## 4 0.7589286 0.4285714 0.750 Fold04.Rep1
## 5 0.5546875 0.5000000 0.500 Fold05.Rep1
## 6 0.8125000 0.6250000 0.625 Fold06.Rep1
## 7 0.4609375 0.5000000 0.375 Fold07.Rep1
## 8 0.5892857 0.5714286 0.375 Fold08.Rep1
## 9 0.3035714 0.2857143 0.250 Fold09.Rep1
## 10 0.6562500 0.6250000 0.500 Fold10.Rep1
## 11 0.6428571 0.5714286 0.625 Fold01.Rep2
## 12 0.6428571 0.5714286 0.500 Fold02.Rep2
## 13 0.4910714 0.5714286 0.625 Fold03.Rep2
## 14 0.6406250 0.6250000 0.875 Fold04.Rep2
## 15 0.6562500 0.7500000 0.500 Fold05.Rep2
## 16 0.7578125 0.6250000 0.500 Fold06.Rep2
## 17 0.7031250 0.6250000 0.625 Fold07.Rep2
## 18 0.5000000 0.4285714 0.625 Fold08.Rep2
## 19 0.2968750 0.3750000 0.500 Fold09.Rep2
## 20 0.7589286 0.2857143 0.500 Fold10.Rep2
## 21 0.5390625 0.5000000 0.375 Fold01.Rep3
## 22 0.8750000 0.6250000 0.750 Fold02.Rep3
## 23 0.6517857 0.8571429 0.500 Fold03.Rep3
## 24 0.5000000 0.2857143 0.375 Fold04.Rep3
## 25 0.5000000 0.4285714 0.375 Fold05.Rep3
## 26 0.6517857 0.2857143 0.500 Fold06.Rep3
## 27 0.6875000 0.7142857 0.250 Fold07.Rep3
## 28 0.5468750 0.5000000 0.750 Fold08.Rep3
## 29 0.4609375 0.3750000 0.375 Fold09.Rep3
## 30 0.5000000 0.7500000 0.500 Fold10.Rep3
Cross-validation shows lower variability across folds for k=1 with weak metrics (AUC 0.54, sensitivity 0.57 and specificity 0.51), whereas folds are more variable for k=2 with similar metrics (AUC 0.60, sensitivity 0.54 and specificity 0.52). The near-chance AUC values indicate that, despite its perfect training sensitivity, the k=1 model largely memorises the training data rather than learning patterns that generalise.
Nonetheless, the k=1 model will be carried forward for further evaluation and comparison.
The decision tree model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.
# Build the initial decision tree model using selected variables
big_tree <- rpart(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data = train_df)
# Plot the decision tree
rpart.plot(big_tree, box.palette = "Purples")
Figure 2: Decision Tree for Heart Attack Risk
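The complexity parameter table below appears to come from printcp(); a reconstruction of the omitted call (assumed):
# Display the complexity parameter table for pruning decisions
printcp(big_tree)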
##
## Classification tree:
## rpart(formula = Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status +
## Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides +
## BMI + Resting_ECG, data = train_df)
##
## Variables actually used in tree construction:
## [1] Age Alcohol_Consumption Smoking_Status
## [4] Systolic_BP Triglycerides
##
## Root node error: 75/155 = 0.48387
##
## n= 155
##
## CP nsplit rel error xerror xstd
## 1 0.306667 0 1.00000 1.13333 0.082610
## 2 0.120000 1 0.69333 0.78667 0.080600
## 3 0.036667 2 0.57333 0.61333 0.075834
## 4 0.026667 6 0.42667 0.72000 0.079092
## 5 0.020000 7 0.40000 0.78667 0.080600
## 6 0.010000 9 0.36000 0.84000 0.081534
In Figure 2 above, there are a total of 19 nodes generated with 10 terminal nodes and 9 splits in the tree. The model automatically selected Systolic_BP, Age, Alcohol_Consumption, Triglycerides and Smoking_Status as the core predictors for heart attack.
The optimal complexity parameter with the lowest xerror is identified as 0.036667 to further prune the tree.
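The same value can be extracted programmatically rather than read off the table; a one-line sketch using the tree's cptable:
# Select the cp with the lowest cross-validated error (sketch)
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]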
# Prune the tree
rpart_pruned <- prune(big_tree,cp=0.036667)
rpart.plot(rpart_pruned, box.palette = "Purples")
Figure 3: Pruned Decision Tree for Heart Attack Risk
In Figure 3 above, there are a total of 5 nodes generated with 3 terminal nodes and 2 splits in the tree based on the Systolic_BP < 127 and Age < 60 rules.
This pruned model simplifies predictions by initially checking whether the patient meets the Systolic_BP < 127 rule. The leaf node in the middle contains patients who met the Systolic_BP < 127 rule but did not meet the next-tier rule of Age < 60. Patients in this group are therefore predicted to suffer a heart attack, with a predicted probability of 0.76 for heart attack and 0.24 for no heart attack; 11% of the patients in the training sample reached this leaf node.
# Generate predictions
tree_pred <- predict(rpart_pruned,train_df,type = 'class')
# Generate the confusion matrix
cm <- table(Predicted = tree_pred, Actual = train_df$Heart_Attack)
# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, as.numeric(tree_pred))
# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm
## Actual
## Predicted No Yes
## No 44 12
## Yes 31 68
## Area under the curve: 0.7183
## Sensitivity = 0.85
## Specificity = 0.5866667
## Misclassification Error = 0.2774194
The pruned tree demonstrates moderate predictive ability, returning 0.72 AUC, 0.85 sensitivity, 0.59 specificity and 0.28 misclassification. The model favours accurate prediction of at-risk patients at the expense of misdiagnosing non-risk cases. This is preferred, as the alternative may mean that life-threatening diagnoses are missed and clinical interventions are not implemented in time.
Following the optimal complexity parameter, the pruned model will be included in further evaluation and comparison.
Each predictive model is evaluated based on performance across both the train and test datasets.
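The same four metrics are recomputed inline for each model below; a small helper of this shape could consolidate the calculations (a sketch, not part of the original code):
# Compute sensitivity, specificity, accuracy and misclassification from a
# confusion matrix with Predicted rows and Actual columns (sketch)
eval_metrics <- function(cm, n) {
  c(sensitivity = cm[2, 2] / (cm[2, 2] + cm[1, 2]),
    specificity = cm[1, 1] / (cm[1, 1] + cm[2, 1]),
    accuracy = (cm[1, 1] + cm[2, 2]) / n,
    misclassification = (cm[2, 1] + cm[1, 2]) / n)
}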
# Create the backward step logistic model again
back_mod <- glm(Heart_Attack ~ Age + Diabetes + Systolic_BP + BMI + Triglycerides, data=train_df, family = binomial)
# Use the generated model to predict each of the observations in the training & test sets
log_mod_prob_train <- predict(back_mod, train_df, type = "response")
log_mod_prob_test <- predict(back_mod, test_df, type = "response")
# Convert prediction values into Yes or No with a 0.4 threshold
log_mod_pred_train <- ifelse(log_mod_prob_train > 0.4, "Yes", "No")
log_mod_pred_test <- ifelse(log_mod_prob_test > 0.4, "Yes", "No")
# Generate the confusion matrices
cm_train <- table(Predicted = log_mod_pred_train, Actual = train_df$Heart_Attack)
cm_test <- table(Predicted = log_mod_pred_test, Actual = test_df$Heart_Attack)
# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, log_mod_prob_train)
roc_curve_test <- roc(test_df$Heart_Attack, log_mod_prob_test)
Train Dataset Performance
# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train
## Actual
## Predicted No Yes
## No 43 10
## Yes 32 70
## Area under the curve: 0.7842
## Sensitivity (train) = 0.875
## Specificity (train) = 0.5733333
## Accuracy (train) = 0.7290323
## Misclassification Error (train) = 0.2709677
Test Dataset Performance
# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test
## Actual
## Predicted No Yes
## No 21 9
## Yes 15 22
## Area under the curve: 0.6944
## Sensitivity (test) = 0.7096774
## Specificity (test) = 0.5833333
## Accuracy (test) = 0.641791
## Misclassification Error (test) = 0.358209
The logistic regression model demonstrated moderate discriminative ability, with test set sensitivity of 0.71 and specificity of 0.58, achieving an overall accuracy of 0.64 and AUC of 0.69. This indicates the model correctly identifies a majority of patients at risk but has a tendency to misclassify some non-risk patients, reflecting a balanced but imperfect trade-off between sensitivity and specificity.
# Create the simplified lda model again
lda_refined <- lda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)
# Use the generated model to predict each of the observations in the training & test sets
lda_mod_prob_train <- predict(lda_refined, train_df)
lda_mod_prob_test <- predict(lda_refined, test_df)
# Convert prediction values into Yes or No with a 0.4 threshold
lda_mod_pred_train <- ifelse(lda_mod_prob_train$posterior[,"Yes"] > 0.4, "Yes", "No")
lda_mod_pred_test <- ifelse(lda_mod_prob_test$posterior[,"Yes"] > 0.4, "Yes", "No")
# Generate the confusion matrices
cm_train <- table(Predicted = lda_mod_pred_train, Actual = train_df$Heart_Attack)
cm_test <- table(Predicted = lda_mod_pred_test, Actual = test_df$Heart_Attack)
# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, lda_mod_prob_train$posterior[,"Yes"])
roc_curve_test <- roc(test_df$Heart_Attack, lda_mod_prob_test$posterior[,"Yes"])
Train Dataset Performance
# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train
## Actual
## Predicted No Yes
## No 43 12
## Yes 32 68
## Area under the curve: 0.7925
## Sensitivity (train) = 0.85
## Specificity (train) = 0.5733333
## Accuracy (train) = 0.716129
## Misclassification Error (train) = 0.283871
Test Dataset Performance
# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test
## Actual
## Predicted No Yes
## No 18 7
## Yes 18 24
## Area under the curve: 0.6873
## Sensitivity (test) = 0.7741935
## Specificity (test) = 0.5
## Accuracy (test) = 0.6268657
## Misclassification Error (test) = 0.3731343
The LDA model yielded slightly higher sensitivity on the test set (0.77) but lower specificity (0.50), resulting in an accuracy of 0.63 and AUC of 0.69. Its performance suggests that while LDA is more aggressive at identifying potential heart attack cases, it sacrifices specificity, potentially leading to more false positives. This trade-off may be acceptable due to the hospital prioritising early detection and intervention.
# Create the refined qda model again
qda_refined <- qda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)
# Use the generated model to predict each of the observations in the training & test sets
qda_mod_prob_train <- predict(qda_refined, train_df)
qda_mod_prob_test <- predict(qda_refined, test_df)
# Convert prediction values into Yes or No with a 0.4 threshold
qda_mod_pred_train <- ifelse(qda_mod_prob_train$posterior[,"Yes"] > 0.4, "Yes", "No")
qda_mod_pred_test <- ifelse(qda_mod_prob_test$posterior[,"Yes"] > 0.4, "Yes", "No")
# Generate the confusion matrices
cm_train <- table(Predicted = qda_mod_pred_train, Actual = train_df$Heart_Attack)
cm_test <- table(Predicted = qda_mod_pred_test, Actual = test_df$Heart_Attack)
# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, qda_mod_prob_train$posterior[,"Yes"])
roc_curve_test <- roc(test_df$Heart_Attack, qda_mod_prob_test$posterior[,"Yes"])
Train Dataset Performance
# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train
## Actual
## Predicted No Yes
## No 48 13
## Yes 27 67
## Area under the curve: 0.8158
## Sensitivity (train) = 0.8375
## Specificity (train) = 0.64
## Accuracy (train) = 0.7419355
## Misclassification Error (train) = 0.2580645
Test Dataset Performance
# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test
## Actual
## Predicted No Yes
## No 19 9
## Yes 17 22
## Area under the curve: 0.6676
## Sensitivity (test) = 0.7096774
## Specificity (test) = 0.5277778
## Accuracy (test) = 0.6119403
## Misclassification Error (test) = 0.3880597
The QDA model produced comparable test sensitivity (0.71) but marginally improved specificity (0.53) relative to LDA, with overall accuracy of 0.61 and AUC of 0.67. While QDA captures non-linear relationships better than LDA, in this dataset the improvement in specificity is minor, and overall predictive accuracy remains moderate.
# Convert categorical variables to dummy variables for test dataframe
dummy <- dummyVars(~ ., data = test_df[, c("Gender", "Diabetes", "Smoking_Status")])
dummy <- predict(dummy, newdata = test_df[, c("Gender", "Diabetes", "Smoking_Status")])
# Standardise numeric variables to equalise weightings
# (note: this scales the test set with its own means and SDs; ideally the
# centring and scaling parameters from the training set would be reused)
num <- scale(test_df[, c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides", "BMI","Resting_ECG", "Alcohol_Consumption")])
# Combine scaled numeric and dummy variables, ensuring output is a dataframe
test_knn <- data.frame(cbind(num, dummy))
# Create the knn model and predictions with k=1 again
set.seed(35602287)
knn_mod_train <- knn(train_knn, train_knn, train_df$Heart_Attack, k = 1)
knn_mod_test <- knn(train_knn, test_knn, train_df$Heart_Attack, k = 1)
# Generate the confusion matrices
cm_train <- table(Predicted = knn_mod_train, Actual = train_df$Heart_Attack)
cm_test <- table(Predicted = knn_mod_test, Actual = test_df$Heart_Attack)
# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, as.numeric(knn_mod_train))
roc_curve_test <- roc(test_df$Heart_Attack, as.numeric(knn_mod_test))
Train Dataset Performance
# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train
## Actual
## Predicted No Yes
## No 75 0
## Yes 0 80
## Area under the curve: 1
## Sensitivity (train) = 1
## Specificity (train) = 1
## Accuracy (train) = 1
## Misclassification Error (train) = 0
Test Dataset Performance
# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test
## Actual
## Predicted No Yes
## No 20 12
## Yes 16 19
## Area under the curve: 0.5842
## Sensitivity (test) = 0.6129032
## Specificity (test) = 0.5555556
## Accuracy (test) = 0.5820896
## Misclassification Error (test) = 0.4179104
The k-NN model achieved perfect performance on the training data (accuracy and AUC of 1), but its test set performance declined markedly (accuracy 0.58, sensitivity 0.61, specificity 0.56, AUC 0.58). This instability limits its reliability in practice: with k=1 the model memorises the training data and generalises poorly to new patients.
# Create the pruned tree model predictions again
tree_pred_train <- predict(rpart_pruned,train_df,type = 'class')
tree_pred_test <- predict(rpart_pruned,test_df,type = 'class')
# Generate the confusion matrices
cm_train <- table(Predicted = tree_pred_train, Actual = train_df$Heart_Attack)
cm_test <- table(Predicted = tree_pred_test, Actual = test_df$Heart_Attack)
# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, as.numeric(tree_pred_train))
roc_curve_test <- roc(test_df$Heart_Attack, as.numeric(tree_pred_test))
Train Dataset Performance
# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train
## Actual
## Predicted No Yes
## No 44 12
## Yes 31 68
## Area under the curve: 0.7183
## Sensitivity (train) = 0.85
## Specificity (train) = 0.5866667
## Accuracy (train) = 0.7225806
## Misclassification Error (train) = 0.2774194
Test Dataset Performance
# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test
## Actual
## Predicted No Yes
## No 15 3
## Yes 21 28
## Area under the curve: 0.6599
## Sensitivity (test) = 0.9032258
## Specificity (test) = 0.4166667
## Accuracy (test) = 0.641791
## Misclassification Error (test) = 0.358209
The decision tree model demonstrated high sensitivity (0.90) but low specificity (0.42) on the test set, yielding an accuracy of 0.64 and AUC of 0.66. The tree's high sensitivity makes it effective for detecting high-risk patients, but the trade-off is a higher false-positive rate, which could burden clinical resources with unnecessary follow-up tests and interventions.
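If the false-positive burden were judged too high, the threshold applied to the tree's class probabilities could be raised. The sketch below (hypothetical, using a 0.6 cut-off rather than the 0.4 threshold used elsewhere in this report) illustrates the trade:
# Sketch: obtain class probabilities from the pruned tree and apply a stricter
# threshold, trading some sensitivity for improved specificity
tree_prob_test <- predict(rpart_pruned, test_df, type = "prob")[, "Yes"]
tree_pred_strict <- factor(ifelse(tree_prob_test > 0.6, "Yes", "No"), levels = c("No", "Yes"))
table(Predicted = tree_pred_strict, Actual = test_df$Heart_Attack)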
# Create a data frame with test set metrics
model_metrics <- tibble(
Model = c("Logistic Regression", "LDA", "QDA", "k-NN (k=1)", "Decision Tree"),
Sensitivity = c(0.710, 0.774, 0.710, 0.613, 0.903),
Specificity = c(0.583, 0.500, 0.528, 0.556, 0.417),
Accuracy = c(0.642, 0.627, 0.612, 0.582, 0.642),
AUC = c(0.694, 0.687, 0.668, 0.584, 0.660)
)
# Round numeric values for better readability
model_metrics <- model_metrics %>%
mutate(across(Sensitivity:Accuracy, ~ round(., 3)),
AUC = round(AUC, 3))
# Display table
kable(model_metrics, caption = "Test Set Performance Metrics Across Predictive Models")
Model | Sensitivity | Specificity | Accuracy | AUC |
---|---|---|---|---|
Logistic Regression | 0.710 | 0.583 | 0.642 | 0.694 |
LDA | 0.774 | 0.500 | 0.627 | 0.687 |
QDA | 0.710 | 0.528 | 0.612 | 0.668 |
k-NN (k=1) | 0.613 | 0.556 | 0.582 | 0.584 |
Decision Tree | 0.903 | 0.417 | 0.642 | 0.660 |
Considering the hospital's priority of identifying at-risk patients early while maintaining reasonable reliability, the LDA model provides the best balance of sensitivity and specificity, with consistent performance across the train and test datasets. While the decision tree offers higher sensitivity, its low specificity may lead to over-identification. The logistic model is stable but less sensitive, so it would miss more at-risk patients than LDA. Therefore, LDA is recommended as the primary predictive model for practical implementation, offering interpretable coefficients, stable performance, and actionable insights for clinical decision-making.
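One way to make this selection rule explicit (a hypothetical formalisation, not a computation from the original report) is to require a minimum specificity and then rank the remaining models by sensitivity:
# Sketch: screen out models with specificity below 0.5, then rank by sensitivity;
# LDA ranks first under this rule, matching the recommendation above
model_metrics %>%
  filter(Specificity >= 0.5) %>%
  arrange(desc(Sensitivity))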
Decision boundary plots help hospital staff visually categorise a new patient's heart-attack risk from that patient's values for the two most important numerical variables.
Table 3 indicates that Systolic_BP and Age are the two highest-weighted variables in the LDA and QDA models, aligning with the findings in Section 2.1, where these variables produced the most significant p-values in the logistic model. The pruned decision tree splits only on Systolic_BP and Age, so these are evidently the most important factors for that model as well.
To confirm importance levels across all numeric variables, two-sample t-tests are applied (one per variable, comparing the heart-attack and non-heart-attack groups).
# Split out all numeric variables
num_vars <- train_df[, c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides", "BMI","Resting_ECG", "Alcohol_Consumption")]
# Run a t-test to generate p-values
pvals <- apply(num_vars, 2, function(x) t.test(x ~ train_df$Heart_Attack)$p.value)
# Sort ascending so most significant value is first
sort(pvals)
## Systolic_BP Age BMI Triglycerides
## 0.0001482591 0.0008846272 0.0714552313 0.0763155803
## LDL_Cholesterol Resting_ECG Alcohol_Consumption
## 0.3056919644 0.5504942133 0.6015558653
The t-tests confirm that Systolic_BP and Age produce the lowest p-values, validating their use as the decision-boundary axes and as the two most statistically significant risk factors for predicting heart attacks.
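Because seven tests are performed at once, a multiplicity correction is worth checking; the sketch below (not part of the original analysis) applies Holm's method via p.adjust. The conclusion is unchanged: the two smallest p-values remain below 0.05 after adjustment (0.000148 × 7 ≈ 0.001 and 0.000885 × 6 ≈ 0.0053).
# Holm-adjusted p-values guard against false positives across the seven tests
sort(p.adjust(pvals, method = "holm"))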
# Create sequence for range of values for Systolic_BP and Age
gx1 <- seq(91, 174, by = 1)
gx2 <- seq(20, 80, by = 0.7)
# Create the grid with the sequenced ranges and update column names
grid <- expand.grid(gx1, gx2)
colnames(grid) <- c('Systolic_BP', 'Age')
# Replicate a single row from the training data to add all other predictors into the grid
# Avoids errors from trained models including more variables than are in the grid
template <- train_df[rep(1, nrow(grid)), ]
# Reset the values for Systolic_BP and Age to the sequenced ranges
template$Systolic_BP <- grid$Systolic_BP
template$Age <- grid$Age
# Use the trained model to predict probabilities on the grid template
# (predict.glm defaults to the link scale, so type = "response" is required
# for the 0.4 probability threshold applied below)
log_grid <- predict(back_mod, template, type = "response")
# Determine prediction classifications using the 0.4 threshold
log_grid_class <- ifelse(log_grid > 0.4, "Yes", "No")
# Plot the decision boundary
template %>% as_tibble() %>%
add_column(Class = log_grid_class) %>%
ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
geom_point(size = 1) +
scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))
Figure 4: Logistic Model Decision Boundary
The logistic regression boundary exhibits a smooth, linear separation between Systolic_BP and Age, reflecting that older patients (over about 50 years) with higher blood pressure (systolic BP above about 130) are at greater risk.
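Since the boundary is where the predicted probability equals the 0.4 threshold, its equation follows directly from the fitted coefficients. A brief sketch, assuming back_mod stores its coefficients under the raw variable names:
# Sketch: with the other predictors fixed at the template row's values, the
# boundary is the line b0' + b_BP * Systolic_BP + b_Age * Age = log(0.4 / 0.6),
# so its slope in this plot is -b_BP / b_Age
log(0.4 / 0.6)                           # log-odds at the 0.4 threshold, about -0.41
coef(back_mod)[c("Systolic_BP", "Age")]  # assumed coefficient names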
# Use the trained model to predict on the grid template
lda_grid <- predict(lda_refined, template)
# Determine prediction classifications using the 0.4 threshold
lda_grid_class <- ifelse(lda_grid$posterior[,"Yes"] > 0.4, "Yes", "No")
# Plot the decision boundary
template %>% as_tibble() %>%
add_column(Class = lda_grid_class) %>%
ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
geom_point(size = 1) +
scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))
Figure 5: LDA Model Decision Boundary
LDA produces a similar linear boundary but with a slightly different slope, as it assumes equal covariance across classes, resulting in more aggressive classification of at-risk patients (over 60 years old and systolic BP above 120).
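The slope of this boundary can be checked directly, as MASS::lda stores the discriminant coefficients in the scaling element of the fitted object:
# The relative weights on Systolic_BP and Age set the slope of the LDA boundary
lda_refined$scaling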
# Use the trained model to predict on the grid template
qda_grid <- predict(qda_refined, template)
# Determine prediction classifications using the 0.4 threshold
qda_grid_class <- ifelse(qda_grid$posterior[,"Yes"] > 0.4, "Yes", "No")
# Plot the decision boundary
template %>% as_tibble() %>%
add_column(Class = qda_grid_class) %>%
ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
geom_point(size = 1) +
scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))
Figure 6: QDA Model Decision Boundary
QDA, by contrast, allows quadratic boundaries, which can capture curved separations. This is evident at higher Systolic_BP combined with lower Age (20-40), where QDA predicts “Yes” in pockets where LDA and logistic regression still predict “No.”
# Identify numeric columns used in train_knn
num_cols <- c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides",
"BMI", "Resting_ECG", "Alcohol_Consumption")
# Identify dummy columns (all others)
dummy_cols <- setdiff(colnames(train_knn), num_cols)
# Create grid for Systolic_BP and Age (OpenAI 2024)
gx1knn <- seq(min(train_df$Systolic_BP), max(train_df$Systolic_BP), by = 1)
gx2knn <- seq(min(train_df$Age), max(train_df$Age), by = 1)
# Order the grid columns Age then Systolic_BP to match the column order of train_knn
gridknn <- expand.grid(Age = gx2knn, Systolic_BP = gx1knn)
# Fix other numeric features at their raw-scale training means (standardised to zero below)
fixed_num <- colMeans(train_df[, num_cols[!(num_cols %in% c("Systolic_BP", "Age"))]])
# Fix dummy features at mode (0 or 1)
fixed_dummy <- apply(train_knn[, dummy_cols], 2, function(x) { round(mean(x)) })
# Create template for prediction
template2 <- cbind(
gridknn,
matrix(rep(fixed_num, each = nrow(gridknn)), ncol = length(fixed_num), byrow = FALSE),
matrix(rep(fixed_dummy, each = nrow(gridknn)), ncol = length(fixed_dummy), byrow = FALSE)
)
colnames(template2) <- colnames(train_knn)
# Standardise numeric columns with the raw training means and standard deviations,
# so the grid sits on the same scale as the standardised training matrix
template2[, num_cols] <- scale(template2[, num_cols],
center = colMeans(train_df[, num_cols]),
scale = apply(train_df[, num_cols], 2, sd))
# Predict with k-NN
knn_grid <- knn(train_knn, template2, train_df$Heart_Attack, k = 1)
# Plot decision boundary
template2 %>%
as_tibble() %>%
add_column(Class = knn_grid) %>%
ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
geom_point(size = 1) +
scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))
Figure 7: k-NN Model Decision Boundary
The k-NN boundary is highly flawed, classifying the entire grid as at-risk. This arises when the grid is standardised with the statistics of the already-standardised training matrix (means near 0, standard deviations near 1), which leaves Systolic_BP and Age on their raw scales and far from every training point; the standardisation above uses the raw training statistics to avoid this. In addition, k = 1 is inherently unstable, since each prediction relies on a single training observation.
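For comparison, the grid could be re-predicted with a larger k (a sketch, not part of the original analysis); a majority vote over nine neighbours smooths the decision surface considerably:
# Sketch: majority vote over 9 neighbours instead of a single nearest point
set.seed(35602287)
knn_grid_k9 <- knn(train_knn, template2, train_df$Heart_Attack, k = 9)
# Plot as above, substituting knn_grid_k9 for knn_grid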
# Use the trained model to predict on the grid template
tree_grid <- predict(rpart_pruned, template, type="class")
# Plot the decision boundary
template %>% as_tibble() %>%
add_column(Class = tree_grid) %>%
ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
geom_point(size = 1) +
scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))
Figure 8: Decision Tree Model Decision Boundary
The decision tree displays a stepwise, axis-aligned boundary, reflecting its hierarchical rule-based splits: patients with systolic BP above roughly 127 and age above roughly 60 are categorised as heart-attack risks.
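The split thresholds behind these steps can be read directly from the pruned tree object:
# Print the pruned tree; the Systolic_BP and Age split values correspond to the
# step edges visible in the boundary plot
print(rpart_pruned)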
Linear models like logistic regression and LDA offer stable, generalisable boundaries but may underfit subtle non-linear relationships. QDA can capture these non-linearities, but at the cost of increased variance, which is undesirable when the consequence is a missed heart-attack risk. Decision trees provide clear, interpretable rules but sacrifice flexibility and may miss deeper relationships between the risk factors. k-NN generalises poorly to data outside the training set.
Overall, for hospital risk assessment prioritising both interpretability and accurate prediction, logistic regression or LDA is recommended, as both provide consistent, clinically sensible boundaries. The remaining models are too unreliable for decisions with such high clinical stakes.
The analysis confirms Systolic_BP and Age as the most influential predictors of heart-attack risk, supported by both the statistical tests and the visualised decision boundaries. The logistic model provides a smooth linear boundary with balanced performance (sensitivity 0.71, specificity 0.58, accuracy 0.64, AUC 0.69), making it reliable and interpretable for clinical decision-making. The LDA model increases sensitivity to 0.77 but lowers specificity to 0.50 (accuracy 0.63, AUC 0.69), reflecting slightly more aggressive identification of at-risk patients, which supports early intervention. QDA captures non-linear patterns, achieving sensitivity 0.71, specificity 0.53, accuracy 0.61, and AUC 0.67, but only moderate overall performance. The decision tree prioritises sensitivity (0.90) at the expense of specificity (0.42; accuracy 0.64, AUC 0.66), while k-NN with k = 1 is too unstable to support clinical decisions.
For clinical implementation, the stable LDA model is recommended: it improves sensitivity while maintaining reasonable specificity and interpretability, making it the most suitable model for hospital use so that patients at risk of heart attack are identified and appropriate intervention plans can be promptly implemented.