Introduction

This dataset contains information from a cardiovascular disease study. It includes various risk factors for coronary heart disease (CHD) and CHD outcomes of the patients. The dataset is divided into two parts: a training dataset and a testing dataset. There are 16 variables in total, we use “TenYearCHD” as the target variable. Our goal is to predict the binary response variable TenYearCHD as best as possible. We compared four XGBoost models under 5-fold cross-validation: (1) complete-case analysis (dropping rows with missing values), (2) median imputation for numeric variables with an explicit “Missing” category for categorical variables plus log transforms, (3) a feature-selected model using only the top predictors from model (2), and (4) using class-weights to handle imbalanced data.

XGBoost

XGBoost a very popular model and a version of gradient-boosted decision trees. It is an ensemble of many small decision trees where it starts with a simple prediction then repeatedly fits new trees to the residuals of the current model. XGBoost’s popularity comes from its ability to handle non-linear effects and interactions automatically, work well with different types of data, and most importantly, usually give state-of-the-art performance on tabular data. There are very few statistical assumptions, but practical assumptions include: train/test data from same distribution, independent observations, and labels are mostly correct.

library(tidyverse)
library(caret)
library(xgboost)
xgboost::xgb.set.config(verbosity = 0)

## [1] TRUE

library(pROC)
library(dplyr)
library(knitr)
library(kableExtra)
library(scales)
library(forcats)

# Load training and testing data

load("chd_train.RData")
load("chd_test.RData")

# Copy to data frame
train_df <- chd_train
test_df  <- chd_test

Preprocessing

We convert TenYearCHD to a “No/Yes” factor and male, currentSmoker, etc. as binary predictors. We handle the missing values later.

# Factor TenYearCHD
train_df$TenYearCHD <- factor(train_df$TenYearCHD, 
                              levels = c(0, 1),
                              labels = c("No", "Yes"))
test_df$TenYearCHD  <- factor(test_df$TenYearCHD, 
                              levels = c(0, 1), 
                              labels = c("No", "Yes"))

# Binary predictors to factors
bin_vars <- c("male","education",
              "currentSmoker","BPMeds", 
              "prevalentStroke","prevalentHyp",
              "diabetes")

train_df[bin_vars] <- lapply(train_df[bin_vars], factor)
test_df[bin_vars]  <- lapply(test_df[bin_vars], factor)

# Check missing
colSums(is.na(train_df))

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0              80               0              20 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##              41               0               0               0              34 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0              16               1             312 
##      TenYearCHD 
##               0

colSums(is.na(test_df))

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0              25               0               9 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##              12               0               0               0              16 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0               3               0              76 
##      TenYearCHD 
##               0

# look at probability of TenYearCHD by the categorical variables
prob_by_cat <- function(df, var) {
                  df %>%
                  filter(!is.na(.data[[var]])) %>%
                  group_by(level = .data[[var]]) %>%
                  summarize(
                  n = n(),
                  P_CHD = mean(TenYearCHD == "Yes"),
                  .groups = "drop"
                  ) %>%
                  mutate(var = var) %>%
                  relocate(var, level)
}

cat_vars <- bin_vars

prob_cat_list <- lapply(cat_vars, function(v) prob_by_cat(train_df, v))
prob_cat <- bind_rows(prob_cat_list)

prob_cat_table <- prob_cat %>%
  arrange(var, desc(P_CHD)) %>%              
  mutate(
    P_CHD = scales::percent(P_CHD, accuracy = 0.1)  # convert to %
  ) %>%
  rename(
    Variable = var,
    Level    = level,
    `Count (n)` = n,
    `P(TenYearCHD = Yes)` = P_CHD
  )

kable(prob_cat_table,
      caption = "Probability of TenYearCHD by categorical predictors",
      booktabs = TRUE,
      align = "llrc") %>%
  kable_styling(full_width = FALSE)

Probability of TenYearCHD by categorical predictors
Variable	Level	Count (n)	P(TenYearCHD = Yes)
BPMeds	1	94	31.9%
BPMeds	0	3260	14.7%
currentSmoker	1	1660	16.5%
currentSmoker	0	1735	13.9%
diabetes	1	91	34.1%
diabetes	0	3304	14.7%
education	0	2407	15.7%
education	1	908	13.7%
male	1	1470	19.2%
male	0	1925	12.2%
prevalentHyp	1	1034	25.0%
prevalentHyp	0	2361	10.9%
prevalentStroke	1	22	45.5%
prevalentStroke	0	3373	15.0%

Probability of TenYearCHD by categorical risk variables

For categorical variables, we group by each level. We use n (number of people in the category) and P_CHD (proportion of those with TenYearCHD as “Yes”). Mathematically it is P(TenYearCHD = ‘Yes’ | person in category).

# Look at probability of TenYearCHD by numerical variables
prob_by_num <- function(df, var, n_bins = 5) {
  x <- df[[var]]
  brks <- quantile(
                x,
                probs = seq(0, 1, length.out = n_bins + 1),
                na.rm = TRUE,
                names = FALSE
              )
  brks <- unique(brks)
  
  cut_x <- cut(x, breaks = brks, include.lowest = TRUE)
  
  df %>%
  mutate(bin = cut_x) %>%
  filter(!is.na(bin)) %>%
  group_by(bin) %>%
  summarize(n    = n(),
            P_CHD = mean(TenYearCHD == "Yes"),
            .groups = "drop"
            ) %>%
  mutate(var = var,
         bin = as.character(bin)) %>%
  relocate(var, bin)
  }

num_vars <- setdiff(names(train_df),
c(cat_vars, "TenYearCHD"))

prob_num_list <- lapply(num_vars, function(v) prob_by_num(train_df, v))
prob_num <- bind_rows(prob_num_list)

prob_num_table <- prob_num %>%
  arrange(var, bin) %>%
  mutate(
    P_CHD = scales::percent(P_CHD, accuracy = 0.1)
  ) %>%
  rename(
    Variable = var,
    `Value bin` = bin,
    `Count (n)` = n,
    `P(TenYearCHD = Yes)` = P_CHD
  )

kable(
  prob_num_table,
  caption = "Probability of TenYearCHD by binned numeric predictors",
  booktabs = TRUE,
  align = "llrc"
) %>%
  kable_styling(full_width = FALSE)

Probability of TenYearCHD by binned numeric predictors
Variable	Value bin	Count (n)	P(TenYearCHD = Yes)
BMI	(22.5,24.5]	665	12.3%
BMI	(24.5,26.3]	675	14.5%
BMI	(26.3,28.7]	676	16.3%
BMI	(28.7,56.8]	676	20.1%
BMI	[15.5,22.5]	687	11.9%
age	(41,46]	693	8.9%
age	(46,52]	697	17.9%
age	(52,58]	647	19.0%
age	(58,70]	635	26.0%
age	[33,41]	723	5.7%
cigsPerDay	(20,70]	375	21.3%
cigsPerDay	(9,20]	883	17.0%
cigsPerDay	[0,9]	2117	13.4%
diaBP	(73,80]	815	12.4%
diaBP	(80,85]	590	14.2%
diaBP	(85,92]	629	15.3%
diaBP	(92,142]	631	25.2%
diaBP	[48,73]	730	10.4%
glucose	(70,75]	558	15.2%
glucose	(75,81]	607	14.7%
glucose	(81,89]	612	13.6%
glucose	(89,394]	594	20.4%
glucose	[40,70]	712	13.9%
heartRate	(66,72]	690	14.3%
heartRate	(72,77]	627	14.7%
heartRate	(77,85]	726	14.5%
heartRate	(85,143]	608	16.6%
heartRate	[44,66]	743	15.9%
sysBP	(114,124]	717	10.2%
sysBP	(124,133]	654	13.6%
sysBP	(133,148]	659	16.7%
sysBP	(148,295]	676	28.1%
sysBP	[83.5,114]	689	7.8%
totChol	(200,223]	633	15.2%
totChol	(223,245]	693	14.0%
totChol	(245,272]	671	16.7%
totChol	(272,696]	650	19.8%
totChol	[107,200]	714	10.5%

Probability of TenYearCHD by numerical risk variables

We compute quantile based bins. For each bin we have n (number of people in that bin) and P_CHD (proportion of those people who are positive for TenYearCHD). Similarly, it can be expressed as P(TenYearCHD = ‘Yes’ | person in bin). We also do univariate logistic regression to examine the relationship between each variable and TenYearCHD. From the plots we can see that there is a monotone and roughly linear relationship between each risk factor and TenYearCHD.

for (var in num_vars) {
  # Fit uni. logistic regression
  form <- as.formula(paste("TenYearCHD", "~", var))
  fit  <- glm(form, data = train_df, family = binomial)

  # grid of vars
  x_seq <- seq(min(train_df[[var]], na.rm = TRUE),
               max(train_df[[var]], na.rm = TRUE),
               length.out = 100)

  newdat <- data.frame(x = x_seq)
  names(newdat) <- var

  newdat$pred_prob <- predict(fit, newdata = newdat, type = "response")

  plt <- ggplot(newdat, aes(x = .data[[var]], y = pred_prob)) +
    geom_line() +
    labs(
      title = paste0("Logistic regression: TenYearCHD=Yes vs ", var),
      x = var,
      y = "Predicted TenYearCHD"
    ) +
    theme_minimal()
  print(plt)
}

age_prob <- prob_num %>%
  filter(var == "age") %>%
  mutate(bin = factor(bin, levels = unique(bin)))  # keep current order

ggplot(age_prob,
       aes(x = bin, y = P_CHD, group = 1)) + 
  geom_point() + 
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Age bins", 
       y = "P(TenYearCHD = Yes)",
       title = "10-year CHD risk by age (binned, train data)")

ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,  # AUC, Sens, Spec
  savePredictions = "final"
)

# XGBoost Hyperparam grid

xgb_grid <- expand.grid(
  nrounds = c(200, 400, 600),
  max_depth = c(3, 4, 5),
  eta = c(0.001, 0.01, 0.05),
  gamma = c(0, 1),
  colsample_bytree = c(0.7, 0.9),
  min_child_weight = c(1, 5),
  subsample = c(0.7, 0.9)
)

train_drop <- train_df %>% drop_na()
test_drop  <- test_df  %>% drop_na()

sapply(train_drop, function(x) sum(is.na(x)))

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0               0               0               0 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##               0               0               0               0               0 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0               0               0               0 
##      TenYearCHD 
##               0

sapply(test_drop,  function(x) sum(is.na(x)))

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0               0               0               0 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##               0               0               0               0               0 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0               0               0               0 
##      TenYearCHD 
##               0

full_formula_drop <- TenYearCHD ~ .

xgb_model_drop <- train(
  full_formula_drop,
  data = train_drop,
  method = "xgbTree",
  trControl = ctrl,
  preProcess = NULL,
  tuneGrid = xgb_grid,
  metric = "ROC",
  verbose = FALSE
)

xgb_model_drop$bestTune

##     nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 152     400         3 0.01     0              0.7                5       0.7

xgb_importance_drop <- varImp(xgb_model_drop, scale = FALSE)
xgb_importance_drop

## xgbTree variable importance
## 
##                   Overall
## age              0.271352
## sysBP            0.170131
## glucose          0.101348
## diaBP            0.089765
## totChol          0.072597
## male1            0.072472
## cigsPerDay       0.070348
## BMI              0.053043
## prevalentHyp1    0.043559
## heartRate        0.039586
## BPMeds1          0.006161
## currentSmoker1   0.005436
## diabetes1        0.003002
## education1       0.001201
## prevalentStroke1 0.000000

Impute Missing Data

Missing numeric predictors were imputed using median values. XGBoost trees typically do not need normalizations or log transforms; however, log transforms were used to reduce right skew and stabilize the effect of extreme values for totChol, glucose, and cigsPerDay.

# Imputed data

train_imp <- train_df
test_imp  <- test_df

# Create "Missing" category

train_imp[bin_vars] <- lapply(train_imp[bin_vars], 
                              function(x) {
                                x <- factor(x)
                                fct_explicit_na(x, na_level = "Missing")
                                }
                              )
test_imp[bin_vars] <- lapply(test_imp[bin_vars], 
                             function(x) {
                               x <- factor(x)
                               fct_explicit_na(x, na_level = "Missing")
                               }
                             )

# Log transform numeric

train_imp <- train_imp %>%
  mutate(log_totChol   = log1p(totChol),
         log_glucose   = log1p(glucose),
         log_cigsPerDay = log1p(cigsPerDay)
         )
test_imp <- test_imp %>%
  mutate(log_totChol   = log1p(totChol),
         log_glucose   = log1p(glucose),
         log_cigsPerDay = log1p(cigsPerDay)
         )

# Median impute missing numeric

num_vars_imp <- names(train_imp)[sapply(train_imp, is.numeric)]
num_vars_imp <- setdiff(num_vars_imp, "TenYearCHD")

pre_impute <- preProcess(
  train_imp[, num_vars_imp],
  method = "medianImpute"
)

train_num_imp <- predict(pre_impute, train_imp[, num_vars_imp])
test_num_imp  <- predict(pre_impute, test_imp[, num_vars_imp])

train_imp[, num_vars_imp] <- train_num_imp
test_imp[,  num_vars_imp] <- test_num_imp

# Double check for missing values
sapply(train_imp, function(x) sum(is.na(x)))

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0               0               0               0 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##               0               0               0               0               0 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0               0               0               0 
##      TenYearCHD     log_totChol     log_glucose  log_cigsPerDay 
##               0               0               0               0

sapply(test_imp,  function(x) sum(is.na(x)))

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0               0               0               0 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##               0               0               0               0               0 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0               0               0               0 
##      TenYearCHD     log_totChol     log_glucose  log_cigsPerDay 
##               0               0               0               0

full_formula_imp <- TenYearCHD ~ .

xgb_model_imp <- train(
  full_formula_imp,
  data = train_imp,
  method = "xgbTree",
  trControl = ctrl,
  preProcess = NULL,
  tuneGrid = xgb_grid,
  metric = "ROC",
  verbose = FALSE
  )

xgb_model_imp$bestTune

##     nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 188     400         3 0.01     1              0.9                5       0.7

xgb_importance_imp <- varImp(xgb_model_imp, scale = FALSE)
xgb_importance_imp

## xgbTree variable importance
## 
##                   Overall
## age              0.269206
## sysBP            0.170146
## glucose          0.090918
## cigsPerDay       0.085309
## diaBP            0.081057
## totChol          0.074386
## male1            0.063019
## BMI              0.047483
## prevalentHyp1    0.045286
## heartRate        0.034182
## log_glucose      0.012609
## log_cigsPerDay   0.011042
## BPMeds1          0.005668
## log_totChol      0.004032
## currentSmoker1   0.002999
## education1       0.001539
## diabetes1        0.001120
## educationMissing 0.000000
## BPMedsMissing    0.000000
## prevalentStroke1 0.000000

plot(xgb_importance_imp, top = 15,
main = "XGBoost (imputed, all features) importance")

# number of top predictors to use
k_top <- 4  

top_vars_raw <- rownames(xgb_importance_imp$importance) %>% head(k_top)
top_vars_raw

## [1] "age"        "sysBP"      "glucose"    "cigsPerDay"

orig_names <- names(train_imp)

map_back <- function(v) {
  if (v %in% orig_names) 
    return(v)
  base <- sub("[0-9]+$", "", v)  # drop numbers at end 
  if (base %in% orig_names) 
    return(base)
  return(NA_character_)
}

top_vars_mapped <- sapply(top_vars_raw, map_back)
top_vars <- unique(na.omit(top_vars_mapped))
top_vars

## [1] "age"        "sysBP"      "glucose"    "cigsPerDay"

fs_formula_imp <- as.formula(paste("TenYearCHD ~", paste(top_vars, collapse = " + ")))
fs_formula_imp

## TenYearCHD ~ age + sysBP + glucose + cigsPerDay

xgb_model_imp_fs <- train(
  fs_formula_imp,
  data = train_imp,
  method = "xgbTree",
  trControl = ctrl,
  preProcess = NULL,
  tuneGrid = xgb_grid,
  metric = "ROC",
  verbose = FALSE
)
xgb_model_imp_fs$bestTune

##     nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 157     200         3 0.01     0              0.9                1       0.7

xgb_importance_imp_fs <- varImp(xgb_model_imp_fs, scale = FALSE)
xgb_importance_imp_fs

## xgbTree variable importance
## 
##            Overall
## age         0.3665
## sysBP       0.3286
## glucose     0.1644
## cigsPerDay  0.1404

plot(xgb_importance_imp_fs, top = length(top_vars),
main = "XGBoost (imputed, feature-selected) importance")

# positive-class weight (# negatives / # positives)

pos <- sum(train_imp$TenYearCHD == "Yes")
neg <- sum(train_imp$TenYearCHD == "No")
pos_weight <- neg / pos
pos_weight

## [1] 5.579457

xgb_model_imp_w <- train(
  full_formula_imp,
  data = train_imp,
  method = "xgbTree",
  trControl = ctrl,
  preProcess = NULL,
  tuneGrid = xgb_grid,
  metric = "ROC",
  verbose = FALSE,
  scale_pos_weight = pos_weight   # weighted classes
)
xgb_model_imp_w$bestTune

##     nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 164     400         3 0.01     0              0.9                5       0.7

xgb_importance_imp_w <- varImp(xgb_model_imp_w, scale = FALSE)
xgb_importance_imp_w

## xgbTree variable importance
## 
##                    Overall
## age              0.2504253
## sysBP            0.1493489
## glucose          0.1087882
## diaBP            0.1074715
## cigsPerDay       0.0913876
## male1            0.0699121
## totChol          0.0679617
## prevalentHyp1    0.0474935
## BMI              0.0413467
## heartRate        0.0268458
## log_glucose      0.0100374
## log_cigsPerDay   0.0091353
## log_totChol      0.0071290
## BPMeds1          0.0055232
## prevalentStroke1 0.0032164
## diabetes1        0.0018981
## currentSmoker1   0.0018427
## education1       0.0002366
## educationMissing 0.0000000
## BPMedsMissing    0.0000000

Methods

Exploratory analysis

We summarized CHD risk by category for the binary predictors. For example, males had higher observed CHD risk than females (19% vs. 12%), and individuals with prevalent hypertension had more than double the risk compared with those without hypertension (25% vs. 11%). Prior stroke was rare but associated with extremely high risk (46%).

For numeric predictors, we binned each variable into approximate quintiles and computed the proportion of TenYearCHD = “Yes” within each bin. Age showed a strong monotonic relationship with risk: the CHD rate increased from around 6% in the youngest group (ages 33–41) to roughly 26% in the oldest group (58–70). Similar upward trends were observed for systolic blood pressure, glucose, and total cholesterol.

These patterns align with the established cardiovascular risk factors and motivate the use of non-linear models such as gradient boosted trees, which can capture interactions and non-linear effects without manual tuning

Modeling

We used the caret::train interface with method = “xgbTree” to fit XGBoost models. Since there is significant class imbalance, performance was evaluated using 5-fold cross-validation with the ROC metric (twoClassSummary), and we tuned over a fairly wide hyperparameter grid:

nrounds: {200, 400, 600}
- How many trees to fit in the ensemble
max_depth: {3, 4, 5}
- max depth of each individual tree
eta: {0.001, 0.01, 0.05}
- learning rate (shrinks how much each new tree can change the predictions)
gamma: {0, 1}
- minimum loss reduction to make a split
colsample_bytree: {0.7, 0.9}
- fraction of features randomly sampled for each tree
min_child_weight: {1, 5}
- minimum total weight needed in a leaf
subsample: {0.7, 0.9}
- fraction of training observations used to fit each tree

In order words, nrounds / max_depth control model size/complexity while eta, gamma, min_child_weight, subsample, colsample_bytree are tools that trade off fit versus overfitting.

We considered four main models:
Model 1 – XGBoost using all predictors and dropping missing values

Model 2 – XGBoost with imputed predictors using all features

Model 3 – XGBoost using imputed predictors with feature selection from Model 2
- We selected the top 4 features: age, sysBP, glucose, cigsPerDay
- Feature selection using XGBoost variable importance ranks predictors by how much they reduce the training loss across all trees in the ensemble.

Model 4 - XGBoost using imputed predictors and weighted classes

evaluate_model <- function(model, test_data, 
                           model_name = "model", 
                           threshold = 0.2) {
  # ensure threshold is scalar
  
  threshold <- as.numeric(threshold[1])
  probs <- predict(model, newdata = test_data, type = "prob")[, "Yes"]
  preds <- ifelse(probs >= threshold, "Yes", "No")
  preds <- factor(preds, levels = c("No", "Yes"))
  
  cm <- confusionMatrix(preds, test_data$TenYearCHD, positive = "Yes")
  
  roc_obj <- roc(response = test_data$TenYearCHD,
  predictor = probs,
  levels = c("No", "Yes"))
  auc_val <- auc(roc_obj)
  
  cat("\n==============================\n")
  cat("Model:", model_name, "| Threshold:", threshold, "\n")
  cat("==============================\n")
  print(cm)
  cat("\nAUC:", as.numeric(auc_val), "\n")
  
  list(probs = probs,
       preds = preds,
       cm    = cm,
       roc   = roc_obj,
       auc   = auc_val,
       threshold = threshold
       )
  }

# 1) XGBoost with dropped missing values
eval_xgb_drop <- evaluate_model(xgb_model_drop, test_drop, "XGBoost (drop NAs)")

## 
## ==============================
## Model: XGBoost (drop NAs) | Threshold: 0.2 
## ==============================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  474  52
##        Yes 141  58
##                                        
##                Accuracy : 0.7338       
##                  95% CI : (0.7, 0.7656)
##     No Information Rate : 0.8483       
##     P-Value [Acc > NIR] : 1            
##                                        
##                   Kappa : 0.2237       
##                                        
##  Mcnemar's Test P-Value : 2.383e-10    
##                                        
##             Sensitivity : 0.5273       
##             Specificity : 0.7707       
##          Pos Pred Value : 0.2915       
##          Neg Pred Value : 0.9011       
##              Prevalence : 0.1517       
##          Detection Rate : 0.0800       
##    Detection Prevalence : 0.2745       
##       Balanced Accuracy : 0.6490       
##                                        
##        'Positive' Class : Yes          
##                                        
## 
## AUC: 0.7224686

# 2) XGBoost with imputed data
eval_xgb_imp <- evaluate_model(xgb_model_imp, test_imp, "XGBoost (imputed, all features)")

## 
## ==============================
## Model: XGBoost (imputed, all features) | Threshold: 0.2 
## ==============================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  557  59
##        Yes 160  69
##                                           
##                Accuracy : 0.7408          
##                  95% CI : (0.7099, 0.7701)
##     No Information Rate : 0.8485          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2386          
##                                           
##  Mcnemar's Test P-Value : 1.405e-11       
##                                           
##             Sensitivity : 0.53906         
##             Specificity : 0.77685         
##          Pos Pred Value : 0.30131         
##          Neg Pred Value : 0.90422         
##              Prevalence : 0.15148         
##          Detection Rate : 0.08166         
##    Detection Prevalence : 0.27101         
##       Balanced Accuracy : 0.65796         
##                                           
##        'Positive' Class : Yes             
##                                           
## 
## AUC: 0.731226

# 3) XGBoost with imputed data and feature selection
eval_xgb_imp_fs <- evaluate_model(xgb_model_imp_fs, test_imp, "XGBoost (imputed, feature-selected)")

## 
## ==============================
## Model: XGBoost (imputed, feature-selected) | Threshold: 0.2 
## ==============================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  453  35
##        Yes 264  93
##                                           
##                Accuracy : 0.6462          
##                  95% CI : (0.6129, 0.6784)
##     No Information Rate : 0.8485          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2066          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7266          
##             Specificity : 0.6318          
##          Pos Pred Value : 0.2605          
##          Neg Pred Value : 0.9283          
##              Prevalence : 0.1515          
##          Detection Rate : 0.1101          
##    Detection Prevalence : 0.4225          
##       Balanced Accuracy : 0.6792          
##                                           
##        'Positive' Class : Yes             
##                                           
## 
## AUC: 0.7355191

# 4) XGBoost with imputed data and class weights
eval_xgb_imp_w <- evaluate_model(xgb_model_imp_w, test_imp, "XGBoost (imputed, weighted)")

## 
## ==============================
## Model: XGBoost (imputed, weighted) | Threshold: 0.2 
## ==============================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  715 124
##        Yes   2   4
##                                           
##                Accuracy : 0.8509          
##                  95% CI : (0.8251, 0.8742)
##     No Information Rate : 0.8485          
##     P-Value [Acc > NIR] : 0.4471          
##                                           
##                   Kappa : 0.0468          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.031250        
##             Specificity : 0.997211        
##          Pos Pred Value : 0.666667        
##          Neg Pred Value : 0.852205        
##              Prevalence : 0.151479        
##          Detection Rate : 0.004734        
##    Detection Prevalence : 0.007101        
##       Balanced Accuracy : 0.514230        
##                                           
##        'Positive' Class : Yes             
##                                           
## 
## AUC: 0.7350124

auc_compare <- data.frame(
  Model = c("XGB - drop NAs",
  "XGB - imputed (full)",
  "XGB - imputed (FS)",
  "XGB - imputed (weighted, 0.5)"),
  AUC = c(as.numeric(eval_xgb_drop$auc),
  as.numeric(eval_xgb_imp$auc),
  as.numeric(eval_xgb_imp_fs$auc),
  as.numeric(eval_xgb_imp_w$auc))
  ) %>% arrange(desc(AUC))

auc_compare

##                           Model       AUC
## 1            XGB - imputed (FS) 0.7355191
## 2 XGB - imputed (weighted, 0.5) 0.7350124
## 3          XGB - imputed (full) 0.7312260
## 4                XGB - drop NAs 0.7224686

plot(eval_xgb_imp$roc,    
     legacy.axes = TRUE, lwd = 2,
     main = "ROC curves (test set, imputed data)",
     col = "black")

plot(eval_xgb_imp_fs$roc, 
     add = TRUE, lwd = 2, lty = 2,
     col = "red")

plot(eval_xgb_imp_w$roc,  
     add = TRUE, lwd = 2, lty = 3,
     col = "blue")

legend("bottomright",
       legend = c("XGB imputed (full)",
                  "XGB imputed (FS)",
                  "XGB imputed (weighted)"),
       lty = c(1, 2, 3),
       lwd = 2,
       col = c("black", "red", "blue"))

Summary and Conclusion

In general, probability of having 10-year risk of CHD increases with an increase in each risk variable. As expected, 10-year CHD risk increased monotonically with age. In the youngest age group (33–41), the observed CHD rate was about 6%, compared with around 26% in the oldest group (58–70). Males, individuals with prevalent hypertension, and those with prior stroke or diabetes had substantially higher observed CHD risk.

Our XGBoost (imputed, feature-selected) model achieved the best AUC of about 0.74 on the test set, indicating moderate discriminative ability. Using a threshold of 0.2, the model obtained sensitivity of around 0.73 and specificity around 0.63 (balanced accuracy = 0.68). Thus, it correctly identifies most patients who will develop CHD, at the cost of a relatively low positive predictive value (around 0.26).

At the same time, the imputed, feature-selected model achieved similar test AUC (slightly higher) to the full-feature models, while using far fewer predictors. This suggests that most of the predictive signal is captured by age, systolic blood pressure, glucose, and cigarettes per day, and additional variables add little beyond noise.

Additionally, all imputed data models performed better than simply dropping all missing values, demonstrating that imputing missing values or adding a “Missing” category can improve performance.

Practically, the model provides a moderate fit to the data. It is useful as a risk stratification or screening tool to rule out low-risk individuals, but it is not accurate enough to be used as a stand-alone diagnostic classifier.

HDS 5300 Final Project

Philmore Koung

2025-11-28