LEO5314 - Predictive Modelling of Heart Attack Risk in Hospital Patients

1 Data Preparation

heart_data <- read.csv("heart_data.csv")
# Create data subset to ensure equal representation of heart attacks and non-heart attacks
# Sum number of non-heart attacks
attack_no <- sum(heart_data$Heart_Attack == "No")
# Filter to all non-heart attacks
heart_data_no <- filter(heart_data, Heart_Attack == "No")
# Filter to all heart attacks and then random sample to equal non-heart attack number
set.seed(35602287)
heart_data_yes <- heart_data %>% filter(Heart_Attack == "Yes") %>% sample_n(attack_no)
# Combine the yes and no datasets
heart_data <- rbind(heart_data_no, heart_data_yes)

This report explores the correlation between various cardiovascular risk factors and utilises predictive modelling techniques to ascertain the likelihood of heart attacks in hospital patients.

1.1 Data Set Structure

A subset of the heart data set (Monash Online, 2025) will be analysed in this report, as the original dataset is highly skewed towards patients who suffered heart attacks. A such, a subset has been selected containing all non-heart attack observations and a random, equal selection of heart attack observations.

The data subset contains 222 observations across 21 variables. Each observation contains the recordings of cardiovascular risk factor values for individual patients, and whether that patient experienced a heart attack.

The variables within the heart data set can be observed in Table 1 below:

# Use Kable to create a table of the variable names and their type
# Remove row names so that each variable name does not double as a row name
knitr::kable(data.frame(
  Variable = names(heart_data),
  Type = sapply(heart_data, class),
  row.names = NULL), caption = "Heart Data Set Variables")

Table 1: Heart Data Set Variables
Variable	Type
Heart_Attack	character
Gender	character
Family_History	character
Diabetes	character
Obesity	character
Smoking_Status	character
Physical_Inactivity	character
Age	integer
Alcohol_Consumption	integer
Systolic_BP	integer
Total_Cholesterol	integer
HDL_Cholesterol	integer
LDL_Cholesterol	integer
Triglycerides	integer
Fasting_Glucose	integer
BMI	numeric
Waist_Circumference	integer
Heart_Rate	integer
Resting_ECG	integer
Physical_Activity_Minutes	integer
High_BP	character

Table 1 highlights that 13 of the variables have numerical values and 7 have categorical values for analysis and modelling purposes. Each risk factor, or variable, captures different demographic, diagnostic or lifestyle information which may contribute to the patient’s likelihood of experiencing a heart attack. The response variable indicates whether the patient experienced a heart attack.

1.2 Data Set Summary

The summary statistics for each variable can be observed in Table 2 below:

# Convert character variables to factors so that frequencies can be calculated (OpenAI 2024)
heart_data[] <- lapply(heart_data, function(x) {if (is.character(x)) as.factor(x) else x})
# Use Kable to create a summary table of the heart data
knitr::kable(summary(heart_data), caption = "Heart Data Summary Statistics", digits = 2)

Table 2: Heart Data Summary Statistics
Heart_Attack	Gender	Family_History	Diabetes	Obesity	Smoking_Status	Physical_Inactivity	Age	Alcohol_Consumption	Systolic_BP	Total_Cholesterol	HDL_Cholesterol	LDL_Cholesterol	Triglycerides	Fasting_Glucose	BMI	Waist_Circumference	Heart_Rate	Resting_ECG	Physical_Activity_Minutes	High_BP
No :111	Female:109	No :155	No :189	No :160	Non-smoker:167	No :129	Min. :20.00	Min. : -8.00	Min. : 91.0	Min. :120	Min. :16.00	Min. : 45.0	Min. : 26.0	Min. : 51.00	Min. :15.8	Min. : 62.00	Min. : 51.00	Min. :0.0000	Min. :-45.0	No :120
Yes:111	Male :113	Yes: 67	Yes: 33	Yes: 62	Smoker : 55	Yes: 93	1st Qu.:30.00	1st Qu.: 60.00	1st Qu.:116.2	1st Qu.:172	1st Qu.:42.00	1st Qu.:105.0	1st Qu.:123.2	1st Qu.: 89.25	1st Qu.:23.7	1st Qu.: 85.25	1st Qu.: 67.25	1st Qu.:0.0000	1st Qu.: 88.5	Yes:102
NA	NA	NA	NA	NA	NA	NA	Median :43.00	Median : 85.00	Median :128.0	Median :198	Median :48.50	Median :119.0	Median :151.0	Median : 99.00	Median :26.4	Median : 94.00	Median : 74.50	Median :0.0000	Median :123.5	NA
NA	NA	NA	NA	NA	NA	NA	Mean :45.05	Mean : 81.62	Mean :127.5	Mean :195	Mean :48.17	Mean :120.9	Mean :148.6	Mean : 98.49	Mean :26.6	Mean : 94.34	Mean : 75.17	Mean :0.4054	Mean :122.7	NA
NA	NA	NA	NA	NA	NA	NA	3rd Qu.:59.00	3rd Qu.:104.75	3rd Qu.:138.0	3rd Qu.:214	3rd Qu.:53.00	3rd Qu.:138.8	3rd Qu.:173.8	3rd Qu.:109.00	3rd Qu.:29.5	3rd Qu.:103.00	3rd Qu.: 82.00	3rd Qu.:1.0000	3rd Qu.:156.0	NA
NA	NA	NA	NA	NA	NA	NA	Max. :80.00	Max. :150.00	Max. :174.0	Max. :315	Max. :75.00	Max. :189.0	Max. :267.0	Max. :132.00	Max. :37.8	Max. :130.00	Max. :105.00	Max. :2.0000	Max. :276.0	NA

Table 2 observes that within the heart data set, there are an equal number of heart attack and non-heart attack cases for balanced analysis. Demographically, there are a similar number of male and female patients with ages ranging from 20 to 80. The mean age of 45.05 indicates that the group is predominantly middle-aged.

Regarding diagnostic categorical variables, 70% of patients do not have family history of heart attacks, 85% are non-diabetic, 72% are not classified as obese, 75% are non-smokers, 58% are physically active and 54% do not have high blood-pressure suggesting there is a mix of patients with both healthy and “at-risk” recordings.

Alcohol consumption within the sample ranges widely, from -8 grams to 150 grams per week, however the mean is slightly lower than the median, at 81.62 grams indicating that many patients are not drinking at extreme levels. Systolic blood pressure and LDL cholesterol have balanced mean and medians of approximately 127.75mmHg and 120.4mg/DL respectively, suggesting normal distributions ideal for modelling both normal and elevated levels. Triglyceride recordings range from 26mg/DL to 267mg/DL, indicating that bodily fat levels are largely fluctuating between individuals. BMI results show a mean of 26.6kg/m² and median of 26.4kg/m² among the sample, indicating that most patients are overweight (National Heart, Lung, and Blood Institute, 2025). The mean of resting ECG test scores is 0.405 with a median of 0, indicating that most patients recorded normal to mild results.

In this report, predictive models will be developed to classify whether a patient may be at risk of a heart attack based on the values of the most important risk factors. The variables in this data set considered most relevant for predicting the likelihood of a heart attack include:

Gender and Age - The risk of heart attack is known to be greater in males than females at younger ages, with the average age of heart attacks in men being 65, whilst in women it is 72 (Harvard Health Publishing, 2016).
Diabetes - Can increase the risk of heart attack due to damaged blood vessels and nerves around the heart caused from high blood glucose (National Institute of Diabetes and Digestive and Kidney Diseases, 2021). This is especially important when predicting heart attack risk at younger ages.
Smoking_Status and Alcohol_Consumption - Tobacco and alcohol consumption in excess are known to cause increased blood pressure and triglycerides leading to higher heart attack risk (Mukamal, 2006).
Systolic_BP - Heightened systolic blood pressure is typical as people age due to arteries stiffening, plaque building up and heightening of heart and blood vessel diseases, causing greater pressure on artery walls during heartbeats and thus increasing heart attack risk (American Heart Association, 2025).
LDL_Cholesterol - Levels over 100mg/dL increase heart attack risk as this contributes to plaque build up in the arteries (Cleveland Clinic, 2024).
Triglycerides - Inflated levels, 100mg/DL in males and 200mg/dL in females, are associated with increased heart attack risk (Aberra et al., 2020) due to higher fat content in blood leading to artery wall thickening.
BMI - Patients who have a body mass index considered overweight (above 25.0) have a significantly greater risk of suffering a heart attack at an earlier age than those with normal BMI (National Heart, Lung, and Blood Institute, 2025), with 69% of adults who have heart, stroke or vascular disease being classified as overweight or obese (Australian Institute of Health and Welfare, 2024).
Resting_ECG - Studies suggest that electrocardiograms can be more effective at predicting heart attack risk than traditional methods (Villines, 2023) due to their ability to detect abnormalities such as heart blockages, damage to chambers and arrhythmia. These warning signs are crucial in predicting future heart attacks in patients.

# Use student ID to set random seed
set.seed(35602287)
# Calculate number of heart data observations for splitting
size <- nrow(heart_data)

# Create the training index for 70% of observations
train_index <- sample(1:size, size * 0.7)
# Create training data set for 70% of observations
train_df <- heart_data[train_index, ]
# Create test data set for remaining 30% of observations
test_df  <- heart_data[-train_index, ]

2 Predictive Models

2.1 Logistic Model

The logistic regression model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.

# Build logistic model using specified variables
log_mod <- glm(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data=train_df, family = binomial)

# Use the generated model to predict each of the observations in the training set
log_mod_prob <- predict(log_mod, train_df, type = "response")

# Convert prediction values into Yes or No with a 0.5 threshold
log_mod_pred <- ifelse(log_mod_prob > 0.5, "Yes", "No")

With a 0.5 threshold, the logistic model predicts 78 heart attacks and 77 non-heart attacks, which reflects the 50/50 split in the original dataset.

# Generate the confusion matrix
cm <- table(Predicted = log_mod_pred, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics
roc_curve <- roc(train_df$Heart_Attack, log_mod_prob)

# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm

##          Actual
## Predicted No Yes
##       No  54  23
##       Yes 21  57

plot(roc_curve, col="pink", main = "ROC Curve - Logistic Model")

auc(roc_curve)

## Area under the curve: 0.7938

cat("Sensitivity =", sensitivity, "\n")

## Sensitivity = 0.7125

cat("Specificity =", specificity, "\n")

## Specificity = 0.72

cat("Misclassification Error =", misclassification, "\n")

## Misclassification Error = 0.283871

The logistic model has reasonable predictive ability with an AUC of 0.79 from the ROC curve and balanced sensitivity (0.71) and specificity (0.72). The misclassification error is also relatively low at 0.28, highlighting that the model successfully discriminates between at-risk and non-risk patients. For model tuning and clinical relevance, the threshold is reduced to 0.4 to prioritise sensitivity and ensure that the likelihood of missing a high-risk patient is minimised.

# Convert prediction values into Yes or No with a 0.4 threshold
log_mod_pred2 <- ifelse(log_mod_prob > 0.4, "Yes", "No")

# Generate the confusion matrix
cm2 <- table(Predicted = log_mod_pred2, Actual = train_df$Heart_Attack)

# Calculate error rates and metrics
sensitivity2 <- cm2[2,2]/(cm2[2,2]+cm2[1,2])
specificity2 <- cm2[1,1]/(cm2[1,1]+cm2[2,1])
misclassification2 <- (cm2[2,1] + cm2[1,2])/nrow(train_df)
cm2

##          Actual
## Predicted No Yes
##       No  41  11
##       Yes 34  69

cat("Sensitivity (0.4 threshold) =", sensitivity2, "\n")

## Sensitivity (0.4 threshold) = 0.8625

cat("Specificity (0.4 threshold) =", specificity2, "\n")

## Specificity (0.4 threshold) = 0.5466667

cat("Misclassification Error (0.4 threshold) =", misclassification2, "\n")

## Misclassification Error (0.4 threshold) = 0.2903226

At the reduced threshold, sensitivity increases to 0.86 with specificity reducing to 0.55 and the misclassification error rising slightly to 0.29. This sacrifice in specificity and misclassification is clinically justified due to the importance of detecting actual heart attack patients over misdiagnosing non-heart attack patients.

Both forward and backward stepwise variable selection is applied to determine whether the variables incorporated in the model are all significant predictors for heart attack risk.

# Conduct forward stepwise variable selection
heart_forward <- step(log_mod, direction = "forward")

## Start:  AIC=192.51
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + 
##     Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG

# Conduct backward stepwise variable selection
heart_backward <- step(log_mod, direction = "backward")

## Start:  AIC=192.51
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + 
##     Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG
## 
##                       Df Deviance    AIC
## - Resting_ECG          1   170.78 190.78
## - LDL_Cholesterol      1   170.84 190.84
## - Gender               1   171.08 191.08
## - Smoking_Status       1   171.47 191.47
## - Alcohol_Consumption  1   171.47 191.47
## <none>                     170.51 192.51
## - Triglycerides        1   173.28 193.28
## - BMI                  1   174.81 194.81
## - Diabetes             1   176.82 196.82
## - Systolic_BP          1   183.40 203.40
## - Age                  1   188.49 208.49
## 
## Step:  AIC=190.78
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + 
##     Systolic_BP + LDL_Cholesterol + Triglycerides + BMI
## 
##                       Df Deviance    AIC
## - LDL_Cholesterol      1   171.10 189.10
## - Gender               1   171.38 189.38
## - Smoking_Status       1   171.72 189.72
## - Alcohol_Consumption  1   171.73 189.73
## <none>                     170.78 190.78
## - Triglycerides        1   173.56 191.56
## - BMI                  1   175.05 193.05
## - Diabetes             1   176.94 194.94
## - Systolic_BP          1   183.75 201.75
## - Age                  1   189.01 207.01
## 
## Step:  AIC=189.1
## Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + 
##     Systolic_BP + Triglycerides + BMI
## 
##                       Df Deviance    AIC
## - Gender               1   171.91 187.91
## - Smoking_Status       1   171.92 187.92
## - Alcohol_Consumption  1   172.04 188.04
## <none>                     171.10 189.10
## - Triglycerides        1   173.96 189.96
## - BMI                  1   175.41 191.41
## - Diabetes             1   177.08 193.08
## - Systolic_BP          1   184.85 200.85
## - Age                  1   189.26 205.26
## 
## Step:  AIC=187.91
## Heart_Attack ~ Age + Diabetes + Smoking_Status + Alcohol_Consumption + 
##     Systolic_BP + Triglycerides + BMI
## 
##                       Df Deviance    AIC
## - Smoking_Status       1   172.77 186.77
## - Alcohol_Consumption  1   172.86 186.86
## <none>                     171.91 187.91
## - Triglycerides        1   175.16 189.16
## - BMI                  1   176.48 190.48
## - Diabetes             1   178.31 192.31
## - Systolic_BP          1   186.43 200.43
## - Age                  1   190.12 204.12
## 
## Step:  AIC=186.77
## Heart_Attack ~ Age + Diabetes + Alcohol_Consumption + Systolic_BP + 
##     Triglycerides + BMI
## 
##                       Df Deviance    AIC
## - Alcohol_Consumption  1   173.67 185.67
## <none>                     172.77 186.77
## - Triglycerides        1   176.52 188.52
## - BMI                  1   177.54 189.54
## - Diabetes             1   179.09 191.09
## - Systolic_BP          1   186.93 198.93
## - Age                  1   191.64 203.64
## 
## Step:  AIC=185.67
## Heart_Attack ~ Age + Diabetes + Systolic_BP + Triglycerides + 
##     BMI
## 
##                 Df Deviance    AIC
## <none>               173.67 185.67
## - Triglycerides  1   177.37 187.37
## - BMI            1   178.70 188.70
## - Diabetes       1   179.46 189.46
## - Systolic_BP    1   187.63 197.63
## - Age            1   191.92 201.92

summary(heart_forward)

## 
## Call:
## glm(formula = Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + 
##     Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + 
##     BMI + Resting_ECG, family = binomial, data = train_df)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -12.656867   2.800500  -4.520  6.2e-06 ***
## GenderMale             0.289108   0.384243   0.752 0.451805    
## Age                    0.049712   0.012781   3.889 0.000100 ***
## DiabetesYes            1.301108   0.537003   2.423 0.015397 *  
## Smoking_StatusSmoker   0.429878   0.442048   0.972 0.330817    
## Alcohol_Consumption   -0.005700   0.005841  -0.976 0.329140    
## Systolic_BP            0.045972   0.013769   3.339 0.000841 ***
## LDL_Cholesterol        0.004906   0.008581   0.572 0.567498    
## Triglycerides          0.009351   0.005667   1.650 0.098911 .  
## BMI                    0.102535   0.050918   2.014 0.044036 *  
## Resting_ECG           -0.145819   0.282608  -0.516 0.605870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 214.71  on 154  degrees of freedom
## Residual deviance: 170.51  on 144  degrees of freedom
## AIC: 192.51
## 
## Number of Fisher Scoring iterations: 4

summary(heart_backward)

## 
## Call:
## glm(formula = Heart_Attack ~ Age + Diabetes + Systolic_BP + Triglycerides + 
##     BMI, family = binomial, data = train_df)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -12.796540   2.667800  -4.797 1.61e-06 ***
## Age             0.048645   0.012390   3.926 8.63e-05 ***
## DiabetesYes     1.210555   0.520703   2.325 0.020080 *  
## Systolic_BP     0.047211   0.013569   3.479 0.000503 ***
## Triglycerides   0.010597   0.005581   1.899 0.057581 .  
## BMI             0.109158   0.050293   2.170 0.029972 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 214.71  on 154  degrees of freedom
## Residual deviance: 173.67  on 149  degrees of freedom
## AIC: 185.67
## 
## Number of Fisher Scoring iterations: 4

Both forward and backward selection identify that Age, Diabetes, Systolic_BP and BMI are statistically significant, with p-values < 0.05. Forward selection stopped at the full model, retaining all the original variables, retaining the original AIC at 192.51, indicating potential overfitting by keeping weaker predictors.

Backward selection retained Age, Diabetes, Systolic_BP, Triglycerides and BMI and dropped the variables with lower predictive significance. The AIC is lower at 185.67, reducing noise and overfitting, however this may sacrifice model sensitivity leading to more at-risk patients being missed.

# Create the backward selection variables model
back_mod <- glm(Heart_Attack ~ Age + Diabetes + Systolic_BP + BMI + Triglycerides, data=train_df, family = binomial)

# Use the new model to predict each of the observations in the training set
back_mod_prob <- predict(back_mod, train_df, type = "response")

# Convert prediction values into Yes or No with a 0.4 threshold
back_mod_pred <- ifelse(back_mod_prob > 0.4, "Yes", "No")

# Generate the confusion matrix
cm_back <- table(Predicted = back_mod_pred, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics
roc_curve_back <- roc(train_df$Heart_Attack, back_mod_prob)

# Calculate error rates and metrics
sensitivity_back <- cm_back[2,2]/(cm_back[2,2]+cm_back[1,2])
specificity_back <- cm_back[1,1]/(cm_back[1,1]+cm_back[2,1])
misclassification_back <- (cm_back[2,1] + cm_back[1,2])/nrow(train_df)
cm_back

##          Actual
## Predicted No Yes
##       No  43  10
##       Yes 32  70

plot(roc_curve_back, col="pink", main = "ROC Curve - Logistic Backward Step Model")

auc(roc_curve_back)

## Area under the curve: 0.7842

cat("Sensitivity (Backward Step) =", sensitivity_back, "\n")

## Sensitivity (Backward Step) = 0.875

cat("Specificity (Backward Step) =", specificity_back, "\n")

## Specificity (Backward Step) = 0.5733333

cat("Misclassification Error (Backward Step) =", misclassification_back, "\n")

## Misclassification Error (Backward Step) = 0.2709677

The revised backward model returns a similar AUC at 0.78 compared to the original model at 0.79. Sensitivity is increased at 0.88 in the backward model (compared to 0.86 full model 0.4 threshold), specificity increased to 0.57 (compared to 0.55) and misclassification is also lower at 0.27 (compared to 0.29).

The backward model also predicted the less false negatives (11) than the full model, meaning its performance in accurately detecting at-risk patients is greater. Clinically, this means less at-risk patients are missed from being flagged for further interventions in clinic to prevent heart attacks.

The revised backward model with 0.4 threshold will be included in further evaluation and comparison.

2.2 LDA Model

The linear discriminant analysis model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.

# Build lda model using specified variables
lda_mod <- lda(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data=train_df)

# Use the generated model to predict each of the observations in the training set
lda_mod_prob <- predict(lda_mod, train_df)

# Extract class predictions with default 0.5 threshold
lda_mod_pred <- ifelse(lda_mod_prob$posterior[,"Yes"] > 0.5, "Yes", "No")

With the default 0.5 threshold, the LDA model predicts 77 heart attacks and 78 non-heart attacks, which reflects the 50/50 split in the original dataset.

# Generate the confusion matrix
cm <- table(Predicted = lda_mod_pred, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, lda_mod_prob$posterior[, "Yes"])

# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm

##          Actual
## Predicted No Yes
##       No  54  24
##       Yes 21  56

plot(roc_curve, col="pink", main = "ROC Curve - LDA Model")

auc(roc_curve)

## Area under the curve: 0.7942

cat("Sensitivity =", sensitivity, "\n")

## Sensitivity = 0.7

cat("Specificity =", specificity, "\n")

## Specificity = 0.72

cat("Misclassification Error =", misclassification, "\n")

## Misclassification Error = 0.2903226

The LDA model has reasonable predictive ability with an AUC of 0.79 from the ROC curve and balanced sensitivity (0.7) and specificity (0.72). The misclassification error is also relatively low at 0.29, showing evident discrimination between patients. To achieve clinical relevance and reduce false negative classifications of at-risk patients, the threshold is tuned to 0.4.

# Manually update to 0.4 threshold for classification of probabilities
lda_mod_pred2 <- ifelse(lda_mod_prob$posterior[,"Yes"] > 0.4, "Yes", "No")

# Generate the confusion matrix
cm2 <- table(Predicted = lda_mod_pred2, Actual = train_df$Heart_Attack)

# Calculate error rates and metrics
sensitivity2 <- cm2[2,2]/(cm2[2,2]+cm2[1,2])
specificity2 <- cm2[1,1]/(cm2[1,1]+cm2[2,1])
misclassification2 <- (cm2[2,1] + cm2[1,2])/nrow(train_df)
cm2

##          Actual
## Predicted No Yes
##       No  42  11
##       Yes 33  69

cat("Sensitivity (0.4 threshold) =", sensitivity2, "\n")

## Sensitivity (0.4 threshold) = 0.8625

cat("Specificity (0.4 threshold) =", specificity2, "\n")

## Specificity (0.4 threshold) = 0.56

cat("Misclassification Error (0.4 threshold) =", misclassification2, "\n")

## Misclassification Error (0.4 threshold) = 0.283871

The 0.4 threshold returns increased sensitivity of 0.86, reduced specificity of 0.56 and a slightly lower misclassification error of 0.28. Due to the severe clinical implications of failing to detect at-risk patients, the tuned threshold is more appropriate to maximise sensitivity.

K-fold cross-validation is applied to the model to assess how effectively it generalises to unseen patient data and ensure it is not overfitting by using more variables than required in predictions, thus misdiagnosing healthy patients and neglecting intervention for at-risk patients.

# Create function for repeated k-fold cross-validation (Geeks for Geeks 2025)
cross_validate_lda <- function(data, target, k) {
  set.seed(35602287)  # For reproducibility
  
  # Define the trainControl object including calculations for ROC sensitivity and specificity
  train_control <- trainControl(method = "repeatedcv", number = k, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
  
  # Train the lda regression model with cross-validation
  model <- train(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG,
                 data = data,
                 method = "lda",
                 metric = "ROC",
                 trControl = train_control)
  
  return(model)
}

# Perform 10-Fold Cross-Validation
cv_model <- cross_validate_lda(train_df, "Heart_Attack", 10)

# Display the cross-validation results
print(cv_model)

## Linear Discriminant Analysis 
## 
## 155 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ... 
## Resampling results:
## 
##   ROC        Sens       Spec 
##   0.7385417  0.6410714  0.675

print(cv_model$resample)

##          ROC      Sens  Spec    Resample
## 1  0.5781250 0.6250000 0.500 Fold01.Rep1
## 2  0.9218750 0.8750000 0.750 Fold02.Rep1
## 3  0.5000000 0.3750000 0.375 Fold03.Rep1
## 4  0.8571429 0.7142857 0.750 Fold04.Rep1
## 5  0.7343750 0.6250000 0.625 Fold05.Rep1
## 6  0.7343750 0.6250000 0.625 Fold06.Rep1
## 7  0.7321429 0.7142857 0.500 Fold07.Rep1
## 8  0.8571429 0.5714286 0.875 Fold08.Rep1
## 9  0.6071429 0.4285714 0.750 Fold09.Rep1
## 10 0.8035714 0.7142857 0.875 Fold10.Rep1
## 11 0.8750000 0.7500000 0.875 Fold01.Rep2
## 12 0.7343750 0.5000000 0.750 Fold02.Rep2
## 13 0.6875000 0.6250000 0.625 Fold03.Rep2
## 14 0.9285714 0.8571429 0.750 Fold04.Rep2
## 15 0.6875000 0.7500000 0.500 Fold05.Rep2
## 16 0.6071429 0.5714286 0.500 Fold06.Rep2
## 17 0.7678571 0.7142857 0.875 Fold07.Rep2
## 18 0.5178571 0.5714286 0.375 Fold08.Rep2
## 19 0.8928571 0.8571429 0.750 Fold09.Rep2
## 20 0.6875000 0.3750000 0.750 Fold10.Rep2
## 21 0.8906250 0.8750000 0.625 Fold01.Rep3
## 22 0.6093750 0.3750000 0.750 Fold02.Rep3
## 23 0.5625000 0.6250000 0.250 Fold03.Rep3
## 24 0.5178571 0.4285714 0.750 Fold04.Rep3
## 25 0.7857143 0.5714286 0.750 Fold05.Rep3
## 26 0.8125000 0.6250000 1.000 Fold06.Rep3
## 27 0.8750000 0.7142857 0.750 Fold07.Rep3
## 28 0.6406250 0.7500000 0.375 Fold08.Rep3
## 29 0.8392857 0.7142857 0.750 Fold09.Rep3
## 30 0.9107143 0.7142857 0.875 Fold10.Rep3

Cross-validation confirms the model achieves mean 0.74 AUC, 0.64 sensitivity and 0.68 specificity. While the performance metrics are moderate, variability across folds indicates that some predictors may not be normally distributed or have equal variances, or are statistically weak in predictive power and sabotaging the model.

# Calculate the importance of predictors
vimp <- varImp(cv_model)
knitr::kable(vimp$importance, caption = "LDA Model Variable Importance", digits = 2)

Table 3: LDA Model Variable Importance
	No	Yes
Gender	32.62	32.62
Age	78.17	78.17
Diabetes	23.83	23.83
Smoking_Status	14.79	14.79
Alcohol_Consumption	9.03	9.03
Systolic_BP	100.00	100.00
LDL_Cholesterol	15.97	15.97
Triglycerides	43.36	43.36
BMI	34.28	34.28
Resting_ECG	0.00	0.00

From Table 3 above, Systolic_BP is indicated as the most important predictor (scaled to 100), followed by Age (78.17), Triglycerides (43.36), BMI (34.28) and Gender (32.62), with Smoking_Status, LDL_Cholesterol, Alcohol_Consumption and Resting_ECG not contributing substantially to the model.

A simplified model is generated excluding the variables identified as less significant for predicting heart attack risk.

# Tune model removing insignificant variables (OpenAI 2024)
lda_refined <- lda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)

# Use the generated model to predict each of the observations in the training set
lda_ref_prob <- predict(lda_refined, train_df)

# Extract class predictions with 0.4 threshold
lda_ref_pred <- ifelse(lda_ref_prob$posterior[,"Yes"] > 0.4, "Yes", "No")

# Generate the confusion matrix
cm <- table(Predicted = lda_ref_pred, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, lda_ref_prob$posterior[, "Yes"])

# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm

##          Actual
## Predicted No Yes
##       No  43  12
##       Yes 32  68

plot(roc_curve, col="pink", main = "ROC Curve - Refined LDA Model")

auc(roc_curve)

## Area under the curve: 0.7925

cat("Sensitivity (0.4 threshold) =", sensitivity, "\n")

## Sensitivity (0.4 threshold) = 0.85

cat("Specificity (0.4 threshold) =", specificity, "\n")

## Specificity (0.4 threshold) = 0.5733333

cat("Misclassification Error (0.4 threshold) =", misclassification, "\n")

## Misclassification Error (0.4 threshold) = 0.283871

# Conduct repeated cross validation on refined simpler model (OpenAI 2024)
set.seed(35602287)
cv_model_subset <- train(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI,
  data = train_df,
  method = "lda",
  metric = "ROC",
  trControl = trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 3, 
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
)

print(cv_model_subset)

## Linear Discriminant Analysis 
## 
## 155 samples
##   6 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ... 
## Resampling results:
## 
##   ROC        Sens      Spec     
##   0.7664435  0.685119  0.7208333

The simplified LDA model demonstrates similar moderate-strong performance across metrics, maintaining AUC of 0.79, sensitivity of 0.85, specificity of 0.57 and error rate of 0.27. Under cross-validation, the simplified model returns a higher average AUC of 0.77, sensitivity of 0.69 and specificity of 0.72, conveying this model generalises better to unseen patients due to the removal of weak heart-attack risk variables. However, the simplification lead to 12 at-risk patients being misdiagnosed as non-risk, compared to the full model which only missed 11 at-risk patients.

Despite the small increase in misdiagnoses of at-risk patients, the importance of generalisation and reliability of the model is crucial for new patients being accurately categorised. Thus, the simplified model with 0.4 threshold will be included in further evaluation and comparison.

2.3 QDA Model

The quadratic discriminant analysis model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.

# Build qda model using specified variables
qda_mod <- qda(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data=train_df)

# Use the generated model to predict each of the observations in the training set
qda_mod_prob <- predict(qda_mod, train_df)

# Extract class predictions with default 0.5 threshold
qda_mod_pred <- ifelse(qda_mod_prob$posterior[,"Yes"] > 0.5, "Yes", "No")

With the default 0.5 threshold, the QDA model predicts 75 heart attacks and 80 non-heart attacks, which is slightly unbalanced compared to the original dataset.

# Generate the confusion matrix
cm <- table(Predicted = qda_mod_pred, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, qda_mod_prob$posterior[, "Yes"])

# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm

##          Actual
## Predicted No Yes
##       No  60  20
##       Yes 15  60

plot(roc_curve, col="pink", main = "ROC Curve - QDA Model")

auc(roc_curve)

## Area under the curve: 0.853

cat("Sensitivity =", sensitivity, "\n")

## Sensitivity = 0.75

cat("Specificity =", specificity, "\n")

## Specificity = 0.8

cat("Misclassification Error =", misclassification, "\n")

## Misclassification Error = 0.2258065

The QDA model has strong predictive ability with an AUC of 0.85 from the ROC curve and balanced sensitivity (0.75) and specificity (0.8). The misclassification error is also relaively low at 0.23, showing moderate discrimination between patients. The threshold is tuned to 0.4 to test if sensitivity can be further improved to detect a greater proportion of at-risk patients.

# Manually update to 0.4 threshold for classification of probabilities
qda_mod_pred2 <- ifelse(qda_mod_prob$posterior[,"Yes"] > 0.4, "Yes", "No")

# Generate the confusion matrix
cm2 <- table(Predicted = qda_mod_pred2, Actual = train_df$Heart_Attack)

# Calculate error rates and metrics
sensitivity2 <- cm2[2,2]/(cm2[2,2]+cm2[1,2])
specificity2 <- cm2[1,1]/(cm2[1,1]+cm2[2,1])
misclassification2 <- (cm2[2,1] + cm2[1,2])/nrow(train_df)
cm2

##          Actual
## Predicted No Yes
##       No  52  13
##       Yes 23  67

cat("Sensitivity (0.4 threshold) =", sensitivity2, "\n")

## Sensitivity (0.4 threshold) = 0.8375

cat("Specificity (0.4 threshold) =", specificity2, "\n")

## Specificity (0.4 threshold) = 0.6933333

cat("Misclassification Error (0.4 threshold) =", misclassification2, "\n")

## Misclassification Error (0.4 threshold) = 0.2322581

The reduced threshold returns increased sensitivity of 0.84 at the expense of decreased specificity of 0.69 and a greater misclassification error of 0.23. The sacrifice of reduced specificity at 0.4 threshold is justified to ensure greater sensitivity in detecting at-risk patients accurately and commencing clinical interventions.

K-fold cross-validation is applied to the model to confirm generalisation for unseen data and ensure that the model’s strength is maintained for new patients being assessed.

# Create function for repeated k-fold cross-validation (Geeks for Geeks 2025)
cross_validate_qda <- function(data, target, k) {
  set.seed(35602287)  # For reproducibility
  
  # Define the trainControl object including calculations for ROC sensitivity and specificity
  train_control <- trainControl(method = "repeatedcv", number = k, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
  
  # Train the lda regression model with cross-validation
  model <- train(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG,
                 data = data,
                 method = "qda",
                 metric = "ROC",
                 trControl = train_control)
  
  return(model)
}

# Perform 10-Fold Cross-Validation
cv_model <- cross_validate_qda(train_df, "Heart_Attack", 10)

# Display the cross-validation results
print(cv_model)

## Quadratic Discriminant Analysis 
## 
## 155 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ... 
## Resampling results:
## 
##   ROC        Sens       Spec  
##   0.6166667  0.5416667  0.5625

print(cv_model$resample)

##          ROC      Sens  Spec    Resample
## 1  0.4531250 0.3750000 0.625 Fold01.Rep1
## 2  0.6250000 0.7500000 0.375 Fold02.Rep1
## 3  0.5625000 0.5000000 0.375 Fold03.Rep1
## 4  0.7142857 0.7142857 0.625 Fold04.Rep1
## 5  0.4531250 0.6250000 0.250 Fold05.Rep1
## 6  0.5937500 0.5000000 0.625 Fold06.Rep1
## 7  0.7142857 0.5714286 0.375 Fold07.Rep1
## 8  0.6250000 0.1428571 0.750 Fold08.Rep1
## 9  0.6428571 0.4285714 0.625 Fold09.Rep1
## 10 0.8928571 0.7142857 0.875 Fold10.Rep1
## 11 0.8437500 0.7500000 0.625 Fold01.Rep2
## 12 0.5312500 0.5000000 0.625 Fold02.Rep2
## 13 0.5781250 0.5000000 0.500 Fold03.Rep2
## 14 0.7678571 0.7142857 0.500 Fold04.Rep2
## 15 0.6718750 0.7500000 0.375 Fold05.Rep2
## 16 0.4642857 0.5714286 0.500 Fold06.Rep2
## 17 0.6071429 0.4285714 0.750 Fold07.Rep2
## 18 0.4464286 0.4285714 0.625 Fold08.Rep2
## 19 0.8571429 0.7142857 0.875 Fold09.Rep2
## 20 0.5468750 0.2500000 0.625 Fold10.Rep2
## 21 0.7031250 0.7500000 0.250 Fold01.Rep3
## 22 0.5937500 0.5000000 0.500 Fold02.Rep3
## 23 0.5000000 0.7500000 0.000 Fold03.Rep3
## 24 0.5535714 0.4285714 0.625 Fold04.Rep3
## 25 0.6250000 0.5714286 0.625 Fold05.Rep3
## 26 0.5312500 0.3750000 1.000 Fold06.Rep3
## 27 0.6428571 0.4285714 0.500 Fold07.Rep3
## 28 0.4375000 0.3750000 0.625 Fold08.Rep3
## 29 0.6607143 0.5714286 0.625 Fold09.Rep3
## 30 0.6607143 0.5714286 0.625 Fold10.Rep3

Cross-validation confirms the model achieves mean 0.62 AUC, 0.54 sensitivity and 0.56 specificity. The performance metrics are much weaker than the training model, suggesting the model does not perform well on new patient data and insignificant variables cause inconsistencies in diagnoses across patients.

Using the variable importance levels in Table 3, a simplified model is generated excluding insignificant predictors.

# Tune model removing insignificant variables (OpenAI 2024)
qda_refined <- qda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)

# Use the generated model to predict each of the observations in the training set
qda_ref_prob <- predict(qda_refined, train_df)

# Extract class predictions with 0.4 threshold
qda_ref_pred <- ifelse(qda_ref_prob$posterior[,"Yes"] > 0.4, "Yes", "No")

# Generate the confusion matrix
cm <- table(Predicted = qda_ref_pred, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, qda_ref_prob$posterior[, "Yes"])

# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm

##          Actual
## Predicted No Yes
##       No  48  13
##       Yes 27  67

plot(roc_curve, col="pink", main = "ROC Curve - Refined QDA Model")

auc(roc_curve)

## Area under the curve: 0.8158

cat("Sensitivity (0.4 threshold) =", sensitivity, "\n")

## Sensitivity (0.4 threshold) = 0.8375

cat("Specificity (0.4 threshold) =", specificity, "\n")

## Specificity (0.4 threshold) = 0.64

cat("Misclassification Error (0.4 threshold) =", misclassification, "\n")

## Misclassification Error (0.4 threshold) = 0.2580645

# Conduct repeated cross validation on refined simpler model (OpenAI 2024)
set.seed(35602287)
cv_model_subset <- train(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI,
  data = train_df,
  method = "qda",
  metric = "ROC",
  trControl = trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 3, 
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
)

print(cv_model_subset)

## Quadratic Discriminant Analysis 
## 
## 155 samples
##   6 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7029018  0.6071429  0.6166667

The simplified QDA model demonstrates slightly decreased performance across all metrics compared to the full model, with AUC of 0.82, balanced sensitivity of 0.84, decreased specificity of 0.64 and increased error rate of 0.26. Under cross-validation, the simplified model returns a higher average AUC of 0.7, sensitivity at 0.61 and specificity of 0.62, suggesting this model is much stronger in minimising fitting for insignificant risk-factors. The simplification did not alter the volume of at-risk patients being misdiagnosed as non-risk.

To prioritise clinical safety and ability to generalise the model to new unseen patient data accurately, the simplified model with 0.4 threshold will be included in further evaluation and comparison.

2.4 k-NN Model

The k-Nearest Neighbours model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.

# Convert categorical variables to dummy variables for processing (OpenAI 2024)
dummy <- dummyVars(~ ., data = train_df[, c("Gender", "Diabetes", "Smoking_Status")])
dummy <- predict(dummy, newdata = train_df[, c("Gender", "Diabetes", "Smoking_Status")])

# Standardise numeric variables to equalise weightings
num <- scale(train_df[, c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides", "BMI","Resting_ECG", "Alcohol_Consumption")])

# Combine scaled numeric and dummy variables, ensuring output is a dataframe
train_knn <- data.frame(cbind(num, dummy))

# Train knn model using k=3
set.seed(35602287)
knn_mod <- knn(train_knn, train_knn, train_df$Heart_Attack, k = 3)

# Generate the confusion matrix
cm <- table(Predicted = knn_mod, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, as.numeric(knn_mod))

# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm

##          Actual
## Predicted No Yes
##       No  60  20
##       Yes 15  60

plot(roc_curve, col="pink", main = "ROC Curve - k-NN Model")

auc(roc_curve)

## Area under the curve: 0.775

cat("Sensitivity (k=3) =", sensitivity, "\n")

## Sensitivity (k=3) = 0.75

cat("Specificity (k=3) =", specificity, "\n")

## Specificity (k=3) = 0.8

cat("Misclassification Error (k=3) =", misclassification, "\n")

## Misclassification Error (k=3) = 0.2258065

Using k=3, the k-NN model displays moderate discriminatory ability between at-risk and non-risk patients, with an AUC of 0.78, relatively balanced sensitivity of 0.75 and specificity of 0.8 and an error rate of 0.23. Whilst performance metrics demonstrate that the model with k=3 is generally efficient in detecting at-risk patients so they can be put forward for appropriate interventions, testing is conducted to identify the optimal value for k.

# Create loop to find optimal k (OpenAI 2024)
set.seed(35602287)

# Set empty lists for k value and sensitivity results to append whilst looping
k_values <- c()
sens_list <- c()

# Iterate through k values of 1-20
for (k in 1:20) {
  # Train the model for that value of k
  knn_mod <- knn(train_knn, train_knn, train_df$Heart_Attack, k = k)
  # Generate the confusion matrix
  cm <- table(Predicted = knn_mod, Actual = train_df$Heart_Attack)
  # Calculate sensitivity
  sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
  
  # Append k values and sensitivity calculations to lists for plotting
  k_values <- append(k_values, k)
  sens_list <- append(sens_list, sensitivity)
}

plot(k_values, sens_list, pch = 15, col = "Pink")

Figure 1: Sensitivity by k-value

In Figure 1 above, it is clear that k=1 produces a perfect sensitivity rate, ensuring that at-risk patients are consistently identified and do not go undetected for treatment. However, cross-validation is applied to confirm the generalisation of the k=1 model compared to the second strongest k=2 value.

# Conduct cross validation on k=1 (OpenAI 2024)
set.seed(35602287)

knn_cv <- train(
  Heart_Attack ~ ., 
  data = mutate(train_knn, Heart_Attack = train_df$Heart_Attack),
  method = "knn",
  metric = "ROC",
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary),
  tuneGrid = data.frame(k = 1) 
)

print(knn_cv)

## k-Nearest Neighbors 
## 
## 155 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 139, 139, 139, 140, 139, 139, ... 
## Resampling results:
## 
##   ROC        Sens       Spec  
##   0.5410714  0.5696429  0.5125
## 
## Tuning parameter 'k' was held constant at a value of 1

print(knn_cv$resample)

##          ROC      Sens  Spec    Resample
## 1  0.5000000 0.6250000 0.375 Fold01.Rep1
## 2  0.4375000 0.5000000 0.375 Fold02.Rep1
## 3  0.5625000 0.6250000 0.500 Fold03.Rep1
## 4  0.5446429 0.7142857 0.375 Fold04.Rep1
## 5  0.5625000 0.6250000 0.500 Fold05.Rep1
## 6  0.5625000 0.5000000 0.625 Fold06.Rep1
## 7  0.5357143 0.5714286 0.500 Fold07.Rep1
## 8  0.6696429 0.7142857 0.625 Fold08.Rep1
## 9  0.3928571 0.2857143 0.500 Fold09.Rep1
## 10 0.7321429 0.7142857 0.750 Fold10.Rep1
## 11 0.5625000 0.5000000 0.625 Fold01.Rep2
## 12 0.5000000 0.7500000 0.250 Fold02.Rep2
## 13 0.7500000 0.6250000 0.875 Fold03.Rep2
## 14 0.6696429 0.7142857 0.625 Fold04.Rep2
## 15 0.3750000 0.6250000 0.125 Fold05.Rep2
## 16 0.4017857 0.4285714 0.375 Fold06.Rep2
## 17 0.6696429 0.7142857 0.625 Fold07.Rep2
## 18 0.4732143 0.5714286 0.375 Fold08.Rep2
## 19 0.7232143 0.5714286 0.875 Fold09.Rep2
## 20 0.4375000 0.2500000 0.625 Fold10.Rep2
## 21 0.5625000 0.7500000 0.375 Fold01.Rep3
## 22 0.5625000 0.6250000 0.500 Fold02.Rep3
## 23 0.3750000 0.3750000 0.375 Fold03.Rep3
## 24 0.4642857 0.4285714 0.500 Fold04.Rep3
## 25 0.6607143 0.5714286 0.750 Fold05.Rep3
## 26 0.4375000 0.5000000 0.375 Fold06.Rep3
## 27 0.5267857 0.4285714 0.625 Fold07.Rep3
## 28 0.5000000 0.5000000 0.500 Fold08.Rep3
## 29 0.5357143 0.5714286 0.500 Fold09.Rep3
## 30 0.5446429 0.7142857 0.375 Fold10.Rep3

# Conduct cross validation on k=2 (OpenAI 2024)
knn_cv <- train(
  Heart_Attack ~ ., 
  data = mutate(train_knn, Heart_Attack = train_df$Heart_Attack),
  method = "knn",
  metric = "ROC",
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary),
  tuneGrid = data.frame(k = 2) 
)

print(knn_cv)

## k-Nearest Neighbors 
## 
## 155 samples
##  13 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 139, 140, 140, 140, 139, 139, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.5999628  0.5404762  0.5208333
## 
## Tuning parameter 'k' was held constant at a value of 2

print(knn_cv$resample)

##          ROC      Sens  Spec    Resample
## 1  0.6093750 0.5000000 0.375 Fold01.Rep1
## 2  0.7946429 1.0000000 0.625 Fold02.Rep1
## 3  0.4553571 0.4285714 0.625 Fold03.Rep1
## 4  0.7589286 0.4285714 0.750 Fold04.Rep1
## 5  0.5546875 0.5000000 0.500 Fold05.Rep1
## 6  0.8125000 0.6250000 0.625 Fold06.Rep1
## 7  0.4609375 0.5000000 0.375 Fold07.Rep1
## 8  0.5892857 0.5714286 0.375 Fold08.Rep1
## 9  0.3035714 0.2857143 0.250 Fold09.Rep1
## 10 0.6562500 0.6250000 0.500 Fold10.Rep1
## 11 0.6428571 0.5714286 0.625 Fold01.Rep2
## 12 0.6428571 0.5714286 0.500 Fold02.Rep2
## 13 0.4910714 0.5714286 0.625 Fold03.Rep2
## 14 0.6406250 0.6250000 0.875 Fold04.Rep2
## 15 0.6562500 0.7500000 0.500 Fold05.Rep2
## 16 0.7578125 0.6250000 0.500 Fold06.Rep2
## 17 0.7031250 0.6250000 0.625 Fold07.Rep2
## 18 0.5000000 0.4285714 0.625 Fold08.Rep2
## 19 0.2968750 0.3750000 0.500 Fold09.Rep2
## 20 0.7589286 0.2857143 0.500 Fold10.Rep2
## 21 0.5390625 0.5000000 0.375 Fold01.Rep3
## 22 0.8750000 0.6250000 0.750 Fold02.Rep3
## 23 0.6517857 0.8571429 0.500 Fold03.Rep3
## 24 0.5000000 0.2857143 0.375 Fold04.Rep3
## 25 0.5000000 0.4285714 0.375 Fold05.Rep3
## 26 0.6517857 0.2857143 0.500 Fold06.Rep3
## 27 0.6875000 0.7142857 0.250 Fold07.Rep3
## 28 0.5468750 0.5000000 0.750 Fold08.Rep3
## 29 0.4609375 0.3750000 0.375 Fold09.Rep3
## 30 0.5000000 0.7500000 0.500 Fold10.Rep3

Cross-validation shows lower variability across folds for k=1 with moderate metrics (AUC 0.54, sensitivity 0.57 and specificity 0.51), whereas folds are more highly variable for k=2 with similar metrics (AUC 0.6, sensitivity 0.54 and specificity 0.52). The low variability in fold metrics indicates that the k=1 model performance is stable with the variables included, and all provide some level of statistical significance.

The k=1 model will be included in further evaluation and comparison.

2.5 Decision Tree Model

The decision tree model is trained using the relevant risk factor variables for heart attack identified in Section 1.2.

# Build the initial decision tree model using selected variables
big_tree <- rpart(Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + BMI + Resting_ECG, data = train_df)
# Plot the decision tree
rpart.plot(big_tree, box.palette = "Purples")

Figure 2: Decision Tree for Heart Attack Risk

# Calculate CP
printcp(big_tree)

## 
## Classification tree:
## rpart(formula = Heart_Attack ~ Gender + Age + Diabetes + Smoking_Status + 
##     Alcohol_Consumption + Systolic_BP + LDL_Cholesterol + Triglycerides + 
##     BMI + Resting_ECG, data = train_df)
## 
## Variables actually used in tree construction:
## [1] Age                 Alcohol_Consumption Smoking_Status     
## [4] Systolic_BP         Triglycerides      
## 
## Root node error: 75/155 = 0.48387
## 
## n= 155 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.306667      0   1.00000 1.13333 0.082610
## 2 0.120000      1   0.69333 0.78667 0.080600
## 3 0.036667      2   0.57333 0.61333 0.075834
## 4 0.026667      6   0.42667 0.72000 0.079092
## 5 0.020000      7   0.40000 0.78667 0.080600
## 6 0.010000      9   0.36000 0.84000 0.081534

In Figure 2 above, there are a total of 19 nodes generated with 10 terminal nodes and 9 splits in the tree. The model automatically selected Systolic_BP, Age, Alcohol_Consumption, Triglycerides and Smoking_Status as the core predictors for heart attack.

The optimal complexity parameter with the lowest xerror is identified as 0.036667 to further prune the tree.

# Prune the tree 
rpart_pruned <- prune(big_tree,cp=0.036667) 
rpart.plot(rpart_pruned, box.palette = "Purples")

Figure 3: Pruned Decision Tree for Heart Attack Risk

In Figure 3 above, there are a total of 5 nodes generated with 3 terminal nodes and 2 splits in the tree based on the Systolic_BP < 127 and Age < 60 rules.

This pruned model simplifies predictions by initially checking if the patient meets the Systolic_BP < 127 rule. The end node in the middle indicates that patients in this node initially met the Systolic_BP < 127 rule, however did not meet the next tier rule of Age < 60. Therefore, patients in this group are predicted to suffer a heart attack, with the probability of heart attack being 0.76 and the probability of no heart attack is 0.24. 11% of the patients in the training sample reached this leaf node.

# Generate predictions
tree_pred <- predict(rpart_pruned,train_df,type = 'class') 

# Generate the confusion matrix
cm <- table(Predicted = tree_pred, Actual = train_df$Heart_Attack)

# Analyse the ROC Curve and AUC metrics (OpenAI 2024)
roc_curve <- roc(train_df$Heart_Attack, as.numeric(tree_pred))

# Calculate error rates and metrics
sensitivity <- cm[2,2]/(cm[2,2]+cm[1,2])
specificity <- cm[1,1]/(cm[1,1]+cm[2,1])
misclassification <- (cm[2,1] + cm[1,2])/nrow(train_df)
cm

##          Actual
## Predicted No Yes
##       No  44  12
##       Yes 31  68

plot(roc_curve, col="pink", main = "ROC Curve - Decision Tree Model")

auc(roc_curve)

## Area under the curve: 0.7183

cat("Sensitivity =", sensitivity, "\n")

## Sensitivity = 0.85

cat("Specificity =", specificity, "\n")

## Specificity = 0.5866667

cat("Misclassification Error =", misclassification, "\n")

## Misclassification Error = 0.2774194

The pruned tree demonstrates moderate predictive ability, returning 0.72 AUC, 0.85 sensitivity, 0.59 specificity and 0.28 miscassification. The model favours accurate prediction of at-risk patients at the expense of misdiagnosing non-risk cases. This is preferred, as the alternative may mean that life-threatening diagnoses are missed and clinical interventions are not implemented in time.

Following the optimal complexity parameter, the pruned model will be included in further evaluation and comparison.

3 Model Evaluation

Each predictive model is evaluated based on performance across both the train and test datasets.

3.1 Logistic Model

# Create the backward step logistic model again
back_mod <- glm(Heart_Attack ~ Age + Diabetes + Systolic_BP + BMI + Triglycerides, data=train_df, family = binomial)

# Use the generated model to predict each of the observations in the training & test sets
log_mod_prob_train <- predict(back_mod, train_df, type = "response")

log_mod_prob_test <- predict(back_mod, test_df, type = "response")

# Convert prediction values into Yes or No with a 0.4 threshold
log_mod_pred_train <- ifelse(log_mod_prob_train > 0.4, "Yes", "No")

log_mod_pred_test <- ifelse(log_mod_prob_test > 0.4, "Yes", "No")

# Generate the confusion matrices
cm_train <- table(Predicted = log_mod_pred_train, Actual = train_df$Heart_Attack)

cm_test <- table(Predicted = log_mod_pred_test, Actual = test_df$Heart_Attack)

# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, log_mod_prob_train)

roc_curve_test <- roc(test_df$Heart_Attack, log_mod_prob_test)

Train Dataset Performance

# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train

##          Actual
## Predicted No Yes
##       No  43  10
##       Yes 32  70

plot(roc_curve_train, col="pink", main = "ROC Curve - Logistic Model Train Data")

auc(roc_curve_train)

## Area under the curve: 0.7842

cat("Sensitivity (train) =", sensitivity_train, "\n")

## Sensitivity (train) = 0.875

cat("Specificity (train) =", specificity_train, "\n")

## Specificity (train) = 0.5733333

cat("Accuracy (train) =", accuracy_train, "\n")

## Accuracy (train) = 0.7290323

cat("Misclassification Error (train) =", misclassification_train, "\n")

## Misclassification Error (train) = 0.2709677

Test Dataset Performance

# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test

##          Actual
## Predicted No Yes
##       No  21   9
##       Yes 15  22

plot(roc_curve_test, col="pink", main = "ROC Curve - Logistic Model Test Data")

auc(roc_curve_test)

## Area under the curve: 0.6944

cat("Sensitivity (test) =", sensitivity_test, "\n")

## Sensitivity (test) = 0.7096774

cat("Specificity (test) =", specificity_test, "\n")

## Specificity (test) = 0.5833333

cat("Accuracy (test) =", accuracy_test, "\n")

## Accuracy (test) = 0.641791

cat("Misclassification Error (test) =", misclassification_test, "\n")

## Misclassification Error (test) = 0.358209

The logistic regression model demonstrated moderate discriminative ability, with test set sensitivity of 0.71 and specificity of 0.58, achieving an overall accuracy of 0.64 and AUC of 0.69. This indicates the model correctly identifies a majority of patients at risk but has a tendency to misclassify some non-risk patients, reflecting a balanced but imperfect trade-off between sensitivity and specificity.

3.2 LDA Model

# Create the simplified lda model again
lda_refined <- lda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)

# Use the generated model to predict each of the observations in the training & test sets
lda_mod_prob_train <- predict(lda_refined, train_df)

lda_mod_prob_test <- predict(lda_refined, test_df)

# Convert prediction values into Yes or No with a 0.4 threshold
lda_mod_pred_train <- ifelse(lda_mod_prob_train$posterior[,"Yes"] > 0.4, "Yes", "No")

lda_mod_pred_test <- ifelse(lda_mod_prob_test$posterior[,"Yes"] > 0.4, "Yes", "No")

# Generate the confusion matrices
cm_train <- table(Predicted = lda_mod_pred_train, Actual = train_df$Heart_Attack)

cm_test <- table(Predicted = lda_mod_pred_test, Actual = test_df$Heart_Attack)

# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, lda_mod_prob_train$posterior[,"Yes"])

roc_curve_test <- roc(test_df$Heart_Attack, lda_mod_prob_test$posterior[,"Yes"])

Train Dataset Performance

# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train

##          Actual
## Predicted No Yes
##       No  43  12
##       Yes 32  68

plot(roc_curve_train, col="pink", main = "ROC Curve - LDA Model Train Data")

auc(roc_curve_train)

## Area under the curve: 0.7925

cat("Sensitivity (train) =", sensitivity_train, "\n")

## Sensitivity (train) = 0.85

cat("Specificity (train) =", specificity_train, "\n")

## Specificity (train) = 0.5733333

cat("Accuracy (train) =", accuracy_train, "\n")

## Accuracy (train) = 0.716129

cat("Misclassification Error (train) =", misclassification_train, "\n")

## Misclassification Error (train) = 0.283871

Test Dataset Performance

# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test

##          Actual
## Predicted No Yes
##       No  18   7
##       Yes 18  24

plot(roc_curve_test, col="pink", main = "ROC Curve - LDA Model Test Data")

auc(roc_curve_test)

## Area under the curve: 0.6873

cat("Sensitivity (test) =", sensitivity_test, "\n")

## Sensitivity (test) = 0.7741935

cat("Specificity (test) =", specificity_test, "\n")

## Specificity (test) = 0.5

cat("Accuracy (test) =", accuracy_test, "\n")

## Accuracy (test) = 0.6268657

cat("Misclassification Error (test) =", misclassification_test, "\n")

## Misclassification Error (test) = 0.3731343

The LDA model yielded slightly higher sensitivity on the test set (0.77) but lower specificity (0.50), resulting in an accuracy of 0.63 and AUC of 0.69. Its performance suggests that while LDA is more aggressive at identifying potential heart attack cases, it sacrifices specificity, potentially leading to more false positives. This trade-off may be acceptable due to the hospital prioritising early detection and intervention.

3.3 QDA Model

# Create the refined qda model again
qda_refined <- qda(Heart_Attack ~ Gender + Age + Diabetes + Systolic_BP + Triglycerides + BMI, data = train_df)

# Use the generated model to predict each of the observations in the training & test sets
qda_mod_prob_train <- predict(qda_refined, train_df)

qda_mod_prob_test <- predict(qda_refined, test_df)

# Convert prediction values into Yes or No with a 0.4 threshold
qda_mod_pred_train <- ifelse(qda_mod_prob_train$posterior[,"Yes"] > 0.4, "Yes", "No")

qda_mod_pred_test <- ifelse(qda_mod_prob_test$posterior[,"Yes"] > 0.4, "Yes", "No")

# Generate the confusion matrices
cm_train <- table(Predicted = qda_mod_pred_train, Actual = train_df$Heart_Attack)

cm_test <- table(Predicted = qda_mod_pred_test, Actual = test_df$Heart_Attack)

# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, qda_mod_prob_train$posterior[,"Yes"])

roc_curve_test <- roc(test_df$Heart_Attack, qda_mod_prob_test$posterior[,"Yes"])

Train Dataset Performance

# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train

##          Actual
## Predicted No Yes
##       No  48  13
##       Yes 27  67

plot(roc_curve_train, col="pink", main = "ROC Curve - QDA Model Train Data")

auc(roc_curve_train)

## Area under the curve: 0.8158

cat("Sensitivity (train) =", sensitivity_train, "\n")

## Sensitivity (train) = 0.8375

cat("Specificity (train) =", specificity_train, "\n")

## Specificity (train) = 0.64

cat("Accuracy (train) =", accuracy_train, "\n")

## Accuracy (train) = 0.7419355

cat("Misclassification Error (train) =", misclassification_train, "\n")

## Misclassification Error (train) = 0.2580645

Test Dataset Performance

# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test

##          Actual
## Predicted No Yes
##       No  19   9
##       Yes 17  22

plot(roc_curve_test, col="pink", main = "ROC Curve - QDA Model Test Data")

auc(roc_curve_test)

## Area under the curve: 0.6676

cat("Sensitivity (test) =", sensitivity_test, "\n")

## Sensitivity (test) = 0.7096774

cat("Specificity (test) =", specificity_test, "\n")

## Specificity (test) = 0.5277778

cat("Accuracy (test) =", accuracy_test, "\n")

## Accuracy (test) = 0.6119403

cat("Misclassification Error (test) =", misclassification_test, "\n")

## Misclassification Error (test) = 0.3880597

The QDA model produced comparable test sensitivity (0.71) but marginally improved specificity (0.53) relative to LDA, with overall accuracy of 0.61 and AUC of 0.67. While QDA captures non-linear relationships better than LDA, in this dataset the improvement in specificity is minor, and overall predictive accuracy remains moderate.

3.4 k-NN Model

# Convert categorical variables to dummy variables for test dataframe
dummy <- dummyVars(~ ., data = test_df[, c("Gender", "Diabetes", "Smoking_Status")])
dummy <- predict(dummy, newdata = test_df[, c("Gender", "Diabetes", "Smoking_Status")])

# Standardise numeric variables to equalise weightings
num <- scale(test_df[, c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides", "BMI","Resting_ECG", "Alcohol_Consumption")])

# Combine scaled numeric and dummy variables, ensuring output is a dataframe
test_knn <- data.frame(cbind(num, dummy))

# Create the knn model and predictions with k=1 again
set.seed(35602287)
knn_mod_train <- knn(train_knn, train_knn, train_df$Heart_Attack, k = 1)

knn_mod_test <- knn(train_knn, test_knn, train_df$Heart_Attack, k = 1)

# Generate the confusion matrices
cm_train <- table(Predicted = knn_mod_train, Actual = train_df$Heart_Attack)

cm_test <- table(Predicted = knn_mod_test, Actual = test_df$Heart_Attack)

# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, as.numeric(knn_mod_train))

roc_curve_test <- roc(test_df$Heart_Attack, as.numeric(knn_mod_test))

Train Dataset Performance

# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train

##          Actual
## Predicted No Yes
##       No  75   0
##       Yes  0  80

plot(roc_curve_train, col="pink", main = "ROC Curve - k-NN Model Train Data")

auc(roc_curve_train)

## Area under the curve: 1

cat("Sensitivity (train) =", sensitivity_train, "\n")

## Sensitivity (train) = 1

cat("Specificity (train) =", specificity_train, "\n")

## Specificity (train) = 1

cat("Accuracy (train) =", accuracy_train, "\n")

## Accuracy (train) = 1

cat("Misclassification Error (train) =", misclassification_train, "\n")

## Misclassification Error (train) = 0

Test Dataset Performance

# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test

##          Actual
## Predicted No Yes
##       No  20  12
##       Yes 16  19

plot(roc_curve_test, col="pink", main = "ROC Curve - k-NN Model Test Data")

auc(roc_curve_test)

## Area under the curve: 0.5842

cat("Sensitivity (test) =", sensitivity_test, "\n")

## Sensitivity (test) = 0.6129032

cat("Specificity (test) =", specificity_test, "\n")

## Specificity (test) = 0.5555556

cat("Accuracy (test) =", accuracy_test, "\n")

## Accuracy (test) = 0.5820896

cat("Misclassification Error (test) =", misclassification_test, "\n")

## Misclassification Error (test) = 0.4179104

The k-NN model achieved perfect performance on the training data (accuracy and AUC of 1), but its test set performance declined markedly (accuracy 0.58, sensitivity 0.61, specificity 0.56, AUC 0.58). This instability limits its reliability in practice despite strong training performance as the k=1 parameter is limiting with new patient data.

3.5 Decision Tree Model

# Create the pruned tree model predictions again
tree_pred_train <- predict(rpart_pruned,train_df,type = 'class') 

tree_pred_test <- predict(rpart_pruned,test_df,type = 'class') 

# Generate the confusion matrices
cm_train <- table(Predicted = tree_pred_train, Actual = train_df$Heart_Attack)

cm_test <- table(Predicted = tree_pred_test, Actual = test_df$Heart_Attack)

# Create the ROC Curves
roc_curve_train <- roc(train_df$Heart_Attack, as.numeric(tree_pred_train))

roc_curve_test <- roc(test_df$Heart_Attack, as.numeric(tree_pred_test))

Train Dataset Performance

# Calculate train metrics
sensitivity_train <- cm_train[2,2]/(cm_train[2,2]+cm_train[1,2])
specificity_train <- cm_train[1,1]/(cm_train[1,1]+cm_train[2,1])
accuracy_train <- (cm_train[1,1] + cm_train[2,2])/nrow(train_df)
misclassification_train <- (cm_train[2,1] + cm_train[1,2])/nrow(train_df)
cm_train

##          Actual
## Predicted No Yes
##       No  44  12
##       Yes 31  68

plot(roc_curve_train, col="pink", main = "ROC Curve - Decision Tree Model Train Data")

auc(roc_curve_train)

## Area under the curve: 0.7183

cat("Sensitivity (train) =", sensitivity_train, "\n")

## Sensitivity (train) = 0.85

cat("Specificity (train) =", specificity_train, "\n")

## Specificity (train) = 0.5866667

cat("Accuracy (train) =", accuracy_train, "\n")

## Accuracy (train) = 0.7225806

cat("Misclassification Error (train) =", misclassification_train, "\n")

## Misclassification Error (train) = 0.2774194

Test Dataset Performance

# Calculate test metrics
sensitivity_test <- cm_test[2,2]/(cm_test[2,2]+cm_test[1,2])
specificity_test <- cm_test[1,1]/(cm_test[1,1]+cm_test[2,1])
accuracy_test <- (cm_test[1,1] + cm_test[2,2])/nrow(test_df)
misclassification_test <- (cm_test[2,1] + cm_test[1,2])/nrow(test_df)
cm_test

##          Actual
## Predicted No Yes
##       No  15   3
##       Yes 21  28

plot(roc_curve_test, col="pink", main = "ROC Curve - Decision Tree Model Test Data")

auc(roc_curve_test)

## Area under the curve: 0.6599

cat("Sensitivity (test) =", sensitivity_test, "\n")

## Sensitivity (test) = 0.9032258

cat("Specificity (test) =", specificity_test, "\n")

## Specificity (test) = 0.4166667

cat("Accuracy (test) =", accuracy_test, "\n")

## Accuracy (test) = 0.641791

cat("Misclassification Error (test) =", misclassification_test, "\n")

## Misclassification Error (test) = 0.358209

The decision tree model demonstrated high sensitivity (0.90) but low specificity (0.42) on the test set, yielding accuracy of 0.64 and AUC of 0.66. The tree’s high sensitivity makes it effective for detecting high-risk patients, but the trade-off is a higher false-positive rate, which could burden clinical resources with unnecessary interventions and time-wasting activities.

3.6 Overall Evaluation

# Create a data frame with test set metrics
model_metrics <- tibble(
  Model = c("Logistic Regression", "LDA", "QDA", "k-NN (k=1)", "Decision Tree"),
  Sensitivity = c(0.710, 0.774, 0.710, 0.613, 0.903),
  Specificity = c(0.583, 0.500, 0.528, 0.556, 0.417),
  Accuracy = c(0.642, 0.627, 0.612, 0.582, 0.642),
  AUC = c(0.694, 0.687, 0.668, 0.584, 0.660)
)

# Round numeric values for better readability
model_metrics <- model_metrics %>% 
  mutate(across(Sensitivity:Accuracy, ~ round(., 3)),
         AUC = round(AUC, 3))

# Display table
kable(model_metrics, caption = "Test Set Performance Metrics Across Predictive Models")

Table 4: Test Set Performance Metrics Across Predictive Models
Model	Sensitivity	Specificity	Accuracy	AUC
Logistic Regression	0.710	0.583	0.642	0.694
LDA	0.774	0.500	0.627	0.687
QDA	0.710	0.528	0.612	0.668
k-NN (k=1)	0.613	0.556	0.582	0.584
Decision Tree	0.903	0.417	0.642	0.660

Considering the hospital’s priority to identify at-risk patients early while maintaining reasonable reliability, the LDA model provides the best balance of sensitivity and specificity, with consistent performance across train and test datasets. While the decision tree offers higher sensitivity, its reduced specificity may lead to over-identification. The logistic model is stable, however provides less aggressive sensitivity leading to more missed risk patients than the LDA model. Therefore, LDA is recommended as the primary predictive model for practical implementation, offering interpretable coefficients, stable performance, and actionable insights for clinical decision-making.

4 Decision Boundaries Visualisation

Decision boundary graphs for each model assist hospital employees to succinctly categorise new patients’ heart-attack risk visually, based on their individual recordings for the two most important numerical variables.

Table 3 indicates that Systolic_BP and Age are the two highest weighted variables in LDA and QDA models, aligning with the findings in Section 2.1 where these variables produced the most significant p-values in the logistic model. The decision tree only includes rules regarding Systolic_BP and Age, so these are evidently the most important factors for that model as well.

To clarify importance levels across all variables generally, a t-test is applied.

# Split out all numeric variables
num_vars <- train_df[, c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides", "BMI","Resting_ECG", "Alcohol_Consumption")]
# Run a t-test to generate p-values
pvals <- apply(num_vars, 2, function(x) t.test(x ~ train_df$Heart_Attack)$p.value)
# Sort ascending so most significant value is first
sort(pvals)

##         Systolic_BP                 Age                 BMI       Triglycerides 
##        0.0001482591        0.0008846272        0.0714552313        0.0763155803 
##     LDL_Cholesterol         Resting_ECG Alcohol_Consumption 
##        0.3056919644        0.5504942133        0.6015558653

The t-test confirms that Systolic_BP and Age produce the lowest p-values, validating their inclusion in the decision boundaries as the two most statistically significant risk-factors for predicting heart attack cases.

# Create sequence for range of values for Systolic_BP and Age
gx1 <- seq(91, 174, by = 1)
gx2 <- seq(20, 80, by = 0.7)
# Create the grid with the sequenced ranges and update column names
grid <- expand.grid(gx1, gx2)
colnames(grid) <- c('Systolic_BP', 'Age')
# Replicate a single row from the training data to add all other predictors into the grid
# Avoids errors from trained models including more variables than are in the grid
template <- train_df[rep(1, nrow(grid)), ]
# Reset the values for Systolic_BP and Age to the sequenced ranges
template$Systolic_BP <- grid$Systolic_BP
template$Age <- grid$Age

4.1 Logistic Model

# Use the trained model to predict on the grid template
log_grid <- predict(back_mod, template)
# Determine prediction classifications using the 0.4 threshold
log_grid_class <- ifelse(log_grid > 0.4, "Yes", "No")
# Plot the decision boundary
template %>% as_tibble() %>%
  add_column(Class = log_grid_class) %>%
  ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
    geom_point(size = 1) +
    scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))

Figure 4: Logistic Model Decision Boundary

The logistic regression boundary exhibits a smooth, linear separation between Systolic_BP and Age, reflecting that older patients with higher BP readings are at greater risk (over 50 year old and over 130 BP).

4.2 LDA Model

# Use the trained model to predict on the grid template
lda_grid <- predict(lda_refined, template)
# Determine prediction classifications using the 0.4 threshold
lda_grid_class <- ifelse(lda_grid$posterior[,"Yes"] > 0.4, "Yes", "No")
# Plot the decision boundary
template %>% as_tibble() %>%
  add_column(Class = lda_grid_class) %>%
  ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
    geom_point(size = 1) +
    scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))

Figure 5: LDA Model Decision Boundary

LDA produces a similar linear boundary but with a slightly different slope, as it assumes equal covariance across classes, resulting in more aggressive classification of at-risk patients (over 60 years old ond over 120 BP).

4.3 QDA Model

# Use the trained model to predict on the grid template
qda_grid <- predict(qda_refined, template)
# Determine prediction classifications using the 0.4 threshold
qda_grid_class <- ifelse(qda_grid$posterior[,"Yes"] > 0.4, "Yes", "No")
# Plot the decision boundary
template %>% as_tibble() %>%
  add_column(Class = qda_grid_class) %>%
  ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
    geom_point(size = 1) +
    scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))

Figure 6: QDA Model Decision Boundary

QDA, by contrast, allows for quadratic boundaries, which can captures curved boundary lines. This is evident in areas of higher Systolic_BP combined with lower Age (20-40), where QDA predicts “Yes” in pockets where LDA and logistic regression still predict “No.”

4.4 k-NN Model

# Identify numeric columns used in train_knn
num_cols <- c("Age", "Systolic_BP", "LDL_Cholesterol", "Triglycerides",
              "BMI", "Resting_ECG", "Alcohol_Consumption") 

# Identify dummy columns (all others)
dummy_cols <- setdiff(colnames(train_knn), num_cols)

# Create grid for Systolic_BP and Age (OpenAI 2024)
gx1knn <- seq(min(train_df$Systolic_BP), max(train_df$Systolic_BP), by = 1)
gx2knn <- seq(min(train_df$Age), max(train_df$Age), by = 1)
gridknn <- expand.grid(Systolic_BP = gx1knn, Age = gx2knn)

# Fix other numeric features at mean
fixed_num <- colMeans(train_knn[, num_cols[!(num_cols %in% c("Systolic_BP","Age"))]])

# Fix dummy features at mode (0 or 1)
fixed_dummy <- apply(train_knn[, dummy_cols], 2, function(x) { round(mean(x)) })

# Create template for prediction
template2 <- cbind(
  gridknn,
  matrix(rep(fixed_num, each = nrow(gridknn)), ncol = length(fixed_num), byrow = FALSE),
  matrix(rep(fixed_dummy, each = nrow(gridknn)), ncol = length(fixed_dummy), byrow = FALSE)
)
colnames(template2) <- colnames(train_knn)

# Scale numeric columns only
template2[, num_cols] <- scale(template2[, num_cols],
                               center = colMeans(train_knn[, num_cols]),
                               scale  = apply(train_knn[, num_cols], 2, sd))

# Predict with k-NN
knn_grid <- knn(train_knn, template2, train_df$Heart_Attack, k = 1)

# Plot decision boundary
template2 %>%
  as_tibble() %>%
  add_column(Class = knn_grid) %>%
  ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
    geom_point(size = 1) +
    scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))

Figure 7: k-NN Model Decision Boundary

The k-NN boundary is highly flawed capturing all predictions as at-risk patients of heart attacks. This due to instability in coding using k=1 and may misclassify due to only relying on a single datapoint.

4.5 Decision Tree Model

# Use the trained model to predict on the grid template
tree_grid <- predict(rpart_pruned, template, type="class")
# Plot the decision boundary
template %>% as_tibble() %>%
  add_column(Class = tree_grid) %>%
  ggplot(aes(x = Systolic_BP, y = Age, color = Class)) +
    geom_point(size = 1) +
    scale_color_manual(values = c("No" = "#FFC0CB", "Yes" = "#FF69B4"))

Figure 8: QDA Model Decision Boundary

The decision tree displays a stepwise, axis-aligned boundary, reflecting its hierarchical rule-based splits. It clearly separates the step boundary for all patients > 127 BP and > 60 being categorised as heart attack risks.

Linear models like logistic regression and LDA offer stable, generalisable boundaries but may underfit subtle non-linear relationships. QDA can capture these non-linearities but at the cost of increased variance, which is not desirable when the consequence is missing a heart attack risk. Decision trees provide clear, interpretable rules but sacrifice flexibility and deeper hidden relationships between the heart attack variables. k-NN is limited in generalisation to data outside of the training set.

Overall, for hospital risk assessment prioritising both interpretability and accurate prediction, logistic regression or LDA are recommended, providing consistent, clinically sensible boundaries. Other models are too unreliable for making life or death decisions.

5 Summary and Recommendation

The analysis confirms Systolic_BP and Age as the most influential predictors of heart-attack risk, supported by both statistical tests and the visualised decision boundaries. The logistic model provides a smooth linear boundary with balanced performance (sensitivity 0.71, specificity 0.58, accuracy 0.64, AUC 0.69), making it reliable and interpretable for clinical decision-making for patients to feel at ease about confirming their risk of heart attack. The LDA model increases sensitivity to 0.77 but lowers specificity to 0.50 (accuracy 0.63, AUC 0.69), reflecting a slightly more aggressive identification of at-risk patients, which is a crucial positive in ensuring early intervention is achieved for patients at-risk. QDA captures non-linear patterns, achieving sensitivity 0.71, specificity 0.53, accuracy 0.61, and AUC 0.67, but with moderate overall performance. The decision tree prioritises sensitivity (0.90) but has reduced specificity (0.42, accuracy 0.64, AUC 0.66), while k-NN is limited and unreliable in making health decisions which save lives.

For clinical implementation, the LDA model is stable is recommended to improve sensitivity whilst maintaining reasonable specificity, and interpretability, making it the most suitable model for hospital use so that all patients at-risk of heart attack are accurately diagnosed and appropriate intervention plans can be promptly implemented.

References

Aberra, T., Peterson, E. D., Pagidipati, N. J., Mulder, H., Wojdyla, D. M., Philip, S., Granowitz, C., & Navar, A. M. (2020). The association between triglycerides and incident cardiovascular disease: What is “optimal”? Journal of Clinical Lipidology, 14(4), 438–447. https://doi.org/10.1016/j.jacl.2020.04.009

Allaire, J., Xie, Y., Dervieux, C., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., Wickham, H., Cheng, J., Chang, W., & Iannone, R. (2024). Rmarkdown: Dynamic documents for r. https://github.com/rstudio/rmarkdown

American Heart Association. (2025). Understanding blood pressure readings. https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings

Auguie, B. (2017). gridExtra: Miscellaneous functions for "grid" graphics. https://CRAN.R-project.org/package=gridExtra

Australian Institute of Health and Welfare. (2024). Heart, stroke and vascular disease: Australian facts. https://www.aihw.gov.au/reports/heart-stroke-vascular-diseases/hsvd-facts/contents/risk-factors/people-with-heart-stroke-and-vascular-disease

Cleveland Clinic. (2024). LDL cholesterol. https://my.clevelandclinic.org/health/articles/24391-ldl-cholesterol

Geeks for Geeks. (2025). Cross validation function for logistic regression in r. https://www.geeksforgeeks.org/machine-learning/cross-validation-function-for-logistic-regression-in-r/

Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. https://www.jstatsoft.org/v40/i03/

Harvard Health Publishing. (2016). Understanding the heart attack gender gap. In Harvard Medical School. https://www.health.harvard.edu/blog/understanding-heart-attack-gender-gap-201604159495

Kuhn, & Max. (2008). Building predictive models in r using the caret package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05

Monash Online. (2025). Heart data [dataset]. https://learning.monash.edu/mod/page/view.php?id=4579268

Mukamal, K. J. (2006). The effects of smoking and drinking on cardiovascular disease and risk factors. Alcohol Research & Health : The Journal of the National Institute on Alcohol Abuse and Alcoholism, 29(3), 199–202. https://pmc.ncbi.nlm.nih.gov/articles/PMC6527044/

Müller, K., & Wickham, H. (2025). Tibble: Simple data frames. https://tibble.tidyverse.org/

National Heart, Lung, and Blood Institute. (2025). Calculate your BMI. https://www.nhlbi.nih.gov/calculate-your-bmi

National Institute of Diabetes and Digestive and Kidney Diseases. (2021). Diabetes, heart disease, & stroke. https://www.niddk.nih.gov/health-information/diabetes/overview/preventing-problems/heart-disease-stroke

OpenAI. (2024). ChatGPT (GPT-4-turbo, may 2024 version). https://chat.openai.com/

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for r and s+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.

Sievert, C. (2020). Interactive web-based data visualization with r, plotly, and shiny. Chapman; Hall/CRC. https://plotly-r.com

Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., & Despouy, P. (2025). Plotly: Create interactive web graphics via plotly.js. https://plotly-r.com

Spinu, V., Grolemund, G., & Wickham, H. (2024). Lubridate: Make dealing with dates a little easier. https://lubridate.tidyverse.org

Therneau, T., & Atkinson, B. (2023). Rpart: Recursive partitioning and regression trees. https://CRAN.R-project.org/package=rpart

Villines, Z. (2023). Can an EKG predict a person’s risk of having a heart attack? In Medical News Today. https://www.medicalnewstoday.com/articles/ekg-and-heart-attack-risk

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

Wickham, H. (2023a). Forcats: Tools for working with categorical variables (factors). https://forcats.tidyverse.org/

Wickham, H. (2023b). Stringr: Simple, consistent wrappers for common string operations. https://stringr.tidyverse.org

Wickham, H. (2023c). Tidyverse: Easily install and load the tidyverse. https://tidyverse.tidyverse.org

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., & Bryan, J. (2025). Readxl: Read excel files. https://readxl.tidyverse.org

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., Dunnington, D., & van den Brand, T. (2025). ggplot2: Create elegant data visualisations using the grammar of graphics. https://ggplot2.tidyverse.org

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. https://dplyr.tidyverse.org

Wickham, H., & Grolemund, G. (2017). R for data science. https://r4ds.had.co.nz/index.html

Wickham, H., & Henry, L. (2025). Purrr: Functional programming tools. https://purrr.tidyverse.org/

Wickham, H., Hester, J., & Bryan, J. (2024). Readr: Read rectangular text data. https://readr.tidyverse.org

Wickham, H., Vaughan, D., & Girlich, M. (2024). Tidyr: Tidy messy data. https://tidyr.tidyverse.org

Xie, Y. (2014). Knitr: A comprehensive tool for reproducible research in R. In V. Stodden, F. Leisch, & R. D. Peng (Eds.), Implementing reproducible computational research. Chapman; Hall/CRC.

Xie, Y. (2015). Dynamic documents with R and knitr (2nd ed.). Chapman; Hall/CRC. https://yihui.org/knitr/

Xie, Y. (2025). Knitr: A general-purpose package for dynamic report generation in r. https://yihui.org/knitr/

Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown

Xie, Y., Dervieux, C., & Riederer, E. (2020). R markdown cookbook. Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook

Zhu, H. (2024). kableExtra: Construct complex table with kable and pipe syntax. http://haozhu233.github.io/kableExtra/

LEO5314 - Predictive Modelling of Heart Attack Risk in Hospital Patients

Megan O’Rorke 35602287

2025-09-18

1 Data Preparation

1.1 Data Set Structure

1.2 Data Set Summary

2 Predictive Models

2.1 Logistic Model

2.2 LDA Model

2.3 QDA Model

2.4 k-NN Model

2.5 Decision Tree Model

3 Model Evaluation

3.1 Logistic Model

3.2 LDA Model

3.3 QDA Model

3.4 k-NN Model

3.5 Decision Tree Model

3.6 Overall Evaluation

4 Decision Boundaries Visualisation

4.1 Logistic Model

4.2 LDA Model

4.3 QDA Model

4.4 k-NN Model

4.5 Decision Tree Model

5 Summary and Recommendation

References