This dataset contains information from a cardiovascular disease study. It includes various risk factors for coronary heart disease (CHD) and CHD outcomes of the patients. The dataset is divided into two parts: a training dataset and a testing dataset. There are 16 variables in total, we use “TenYearCHD” as the target variable. Our goal is to predict the binary response variable TenYearCHD as best as possible. We compared four XGBoost models under 5-fold cross-validation: (1) complete-case analysis (dropping rows with missing values), (2) median imputation for numeric variables with an explicit “Missing” category for categorical variables plus log transforms, (3) a feature-selected model using only the top predictors from model (2), and (4) using class-weights to handle imbalanced data.
XGBoost a very popular model and a version of gradient-boosted decision trees. It is an ensemble of many small decision trees where it starts with a simple prediction then repeatedly fits new trees to the residuals of the current model. XGBoost’s popularity comes from its ability to handle non-linear effects and interactions automatically, work well with different types of data, and most importantly, usually give state-of-the-art performance on tabular data. There are very few statistical assumptions, but practical assumptions include: train/test data from same distribution, independent observations, and labels are mostly correct.
library(tidyverse)
library(caret)
library(xgboost)
xgboost::xgb.set.config(verbosity = 0)
## [1] TRUE
library(pROC)
library(dplyr)
library(knitr)
library(kableExtra)
library(scales)
library(forcats)
# Load training and testing data
load("chd_train.RData")
load("chd_test.RData")
# Copy to data frame
train_df <- chd_train
test_df <- chd_test
We convert TenYearCHD to a “No/Yes” factor and male, currentSmoker, etc. as binary predictors. We handle the missing values later.
# Factor TenYearCHD
train_df$TenYearCHD <- factor(train_df$TenYearCHD,
levels = c(0, 1),
labels = c("No", "Yes"))
test_df$TenYearCHD <- factor(test_df$TenYearCHD,
levels = c(0, 1),
labels = c("No", "Yes"))
# Binary predictors to factors
bin_vars <- c("male","education",
"currentSmoker","BPMeds",
"prevalentStroke","prevalentHyp",
"diabetes")
train_df[bin_vars] <- lapply(train_df[bin_vars], factor)
test_df[bin_vars] <- lapply(test_df[bin_vars], factor)
# Check missing
colSums(is.na(train_df))
## male age education currentSmoker cigsPerDay
## 0 0 80 0 20
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## 41 0 0 0 34
## sysBP diaBP BMI heartRate glucose
## 0 0 16 1 312
## TenYearCHD
## 0
colSums(is.na(test_df))
## male age education currentSmoker cigsPerDay
## 0 0 25 0 9
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## 12 0 0 0 16
## sysBP diaBP BMI heartRate glucose
## 0 0 3 0 76
## TenYearCHD
## 0
# look at probability of TenYearCHD by the categorical variables
prob_by_cat <- function(df, var) {
df %>%
filter(!is.na(.data[[var]])) %>%
group_by(level = .data[[var]]) %>%
summarize(
n = n(),
P_CHD = mean(TenYearCHD == "Yes"),
.groups = "drop"
) %>%
mutate(var = var) %>%
relocate(var, level)
}
cat_vars <- bin_vars
prob_cat_list <- lapply(cat_vars, function(v) prob_by_cat(train_df, v))
prob_cat <- bind_rows(prob_cat_list)
prob_cat_table <- prob_cat %>%
arrange(var, desc(P_CHD)) %>%
mutate(
P_CHD = scales::percent(P_CHD, accuracy = 0.1) # convert to %
) %>%
rename(
Variable = var,
Level = level,
`Count (n)` = n,
`P(TenYearCHD = Yes)` = P_CHD
)
kable(prob_cat_table,
caption = "Probability of TenYearCHD by categorical predictors",
booktabs = TRUE,
align = "llrc") %>%
kable_styling(full_width = FALSE)
| Variable | Level | Count (n) | P(TenYearCHD = Yes) |
|---|---|---|---|
| BPMeds | 1 | 94 | 31.9% |
| BPMeds | 0 | 3260 | 14.7% |
| currentSmoker | 1 | 1660 | 16.5% |
| currentSmoker | 0 | 1735 | 13.9% |
| diabetes | 1 | 91 | 34.1% |
| diabetes | 0 | 3304 | 14.7% |
| education | 0 | 2407 | 15.7% |
| education | 1 | 908 | 13.7% |
| male | 1 | 1470 | 19.2% |
| male | 0 | 1925 | 12.2% |
| prevalentHyp | 1 | 1034 | 25.0% |
| prevalentHyp | 0 | 2361 | 10.9% |
| prevalentStroke | 1 | 22 | 45.5% |
| prevalentStroke | 0 | 3373 | 15.0% |
For categorical variables, we group by each level. We use n (number of people in the category) and P_CHD (proportion of those with TenYearCHD as “Yes”). Mathematically it is P(TenYearCHD = ‘Yes’ | person in category).
# Look at probability of TenYearCHD by numerical variables
prob_by_num <- function(df, var, n_bins = 5) {
x <- df[[var]]
brks <- quantile(
x,
probs = seq(0, 1, length.out = n_bins + 1),
na.rm = TRUE,
names = FALSE
)
brks <- unique(brks)
cut_x <- cut(x, breaks = brks, include.lowest = TRUE)
df %>%
mutate(bin = cut_x) %>%
filter(!is.na(bin)) %>%
group_by(bin) %>%
summarize(n = n(),
P_CHD = mean(TenYearCHD == "Yes"),
.groups = "drop"
) %>%
mutate(var = var,
bin = as.character(bin)) %>%
relocate(var, bin)
}
num_vars <- setdiff(names(train_df),
c(cat_vars, "TenYearCHD"))
prob_num_list <- lapply(num_vars, function(v) prob_by_num(train_df, v))
prob_num <- bind_rows(prob_num_list)
prob_num_table <- prob_num %>%
arrange(var, bin) %>%
mutate(
P_CHD = scales::percent(P_CHD, accuracy = 0.1)
) %>%
rename(
Variable = var,
`Value bin` = bin,
`Count (n)` = n,
`P(TenYearCHD = Yes)` = P_CHD
)
kable(
prob_num_table,
caption = "Probability of TenYearCHD by binned numeric predictors",
booktabs = TRUE,
align = "llrc"
) %>%
kable_styling(full_width = FALSE)
| Variable | Value bin | Count (n) | P(TenYearCHD = Yes) |
|---|---|---|---|
| BMI | (22.5,24.5] | 665 | 12.3% |
| BMI | (24.5,26.3] | 675 | 14.5% |
| BMI | (26.3,28.7] | 676 | 16.3% |
| BMI | (28.7,56.8] | 676 | 20.1% |
| BMI | [15.5,22.5] | 687 | 11.9% |
| age | (41,46] | 693 | 8.9% |
| age | (46,52] | 697 | 17.9% |
| age | (52,58] | 647 | 19.0% |
| age | (58,70] | 635 | 26.0% |
| age | [33,41] | 723 | 5.7% |
| cigsPerDay | (20,70] | 375 | 21.3% |
| cigsPerDay | (9,20] | 883 | 17.0% |
| cigsPerDay | [0,9] | 2117 | 13.4% |
| diaBP | (73,80] | 815 | 12.4% |
| diaBP | (80,85] | 590 | 14.2% |
| diaBP | (85,92] | 629 | 15.3% |
| diaBP | (92,142] | 631 | 25.2% |
| diaBP | [48,73] | 730 | 10.4% |
| glucose | (70,75] | 558 | 15.2% |
| glucose | (75,81] | 607 | 14.7% |
| glucose | (81,89] | 612 | 13.6% |
| glucose | (89,394] | 594 | 20.4% |
| glucose | [40,70] | 712 | 13.9% |
| heartRate | (66,72] | 690 | 14.3% |
| heartRate | (72,77] | 627 | 14.7% |
| heartRate | (77,85] | 726 | 14.5% |
| heartRate | (85,143] | 608 | 16.6% |
| heartRate | [44,66] | 743 | 15.9% |
| sysBP | (114,124] | 717 | 10.2% |
| sysBP | (124,133] | 654 | 13.6% |
| sysBP | (133,148] | 659 | 16.7% |
| sysBP | (148,295] | 676 | 28.1% |
| sysBP | [83.5,114] | 689 | 7.8% |
| totChol | (200,223] | 633 | 15.2% |
| totChol | (223,245] | 693 | 14.0% |
| totChol | (245,272] | 671 | 16.7% |
| totChol | (272,696] | 650 | 19.8% |
| totChol | [107,200] | 714 | 10.5% |
We compute quantile based bins. For each bin we have n (number of people in that bin) and P_CHD (proportion of those people who are positive for TenYearCHD). Similarly, it can be expressed as P(TenYearCHD = ‘Yes’ | person in bin). We also do univariate logistic regression to examine the relationship between each variable and TenYearCHD. From the plots we can see that there is a monotone and roughly linear relationship between each risk factor and TenYearCHD.
for (var in num_vars) {
# Fit uni. logistic regression
form <- as.formula(paste("TenYearCHD", "~", var))
fit <- glm(form, data = train_df, family = binomial)
# grid of vars
x_seq <- seq(min(train_df[[var]], na.rm = TRUE),
max(train_df[[var]], na.rm = TRUE),
length.out = 100)
newdat <- data.frame(x = x_seq)
names(newdat) <- var
newdat$pred_prob <- predict(fit, newdata = newdat, type = "response")
plt <- ggplot(newdat, aes(x = .data[[var]], y = pred_prob)) +
geom_line() +
labs(
title = paste0("Logistic regression: TenYearCHD=Yes vs ", var),
x = var,
y = "Predicted TenYearCHD"
) +
theme_minimal()
print(plt)
}
age_prob <- prob_num %>%
filter(var == "age") %>%
mutate(bin = factor(bin, levels = unique(bin))) # keep current order
ggplot(age_prob,
aes(x = bin, y = P_CHD, group = 1)) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Age bins",
y = "P(TenYearCHD = Yes)",
title = "10-year CHD risk by age (binned, train data)")
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary, # AUC, Sens, Spec
savePredictions = "final"
)
# XGBoost Hyperparam grid
xgb_grid <- expand.grid(
nrounds = c(200, 400, 600),
max_depth = c(3, 4, 5),
eta = c(0.001, 0.01, 0.05),
gamma = c(0, 1),
colsample_bytree = c(0.7, 0.9),
min_child_weight = c(1, 5),
subsample = c(0.7, 0.9)
)
train_drop <- train_df %>% drop_na()
test_drop <- test_df %>% drop_na()
sapply(train_drop, function(x) sum(is.na(x)))
## male age education currentSmoker cigsPerDay
## 0 0 0 0 0
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## 0 0 0 0 0
## sysBP diaBP BMI heartRate glucose
## 0 0 0 0 0
## TenYearCHD
## 0
sapply(test_drop, function(x) sum(is.na(x)))
## male age education currentSmoker cigsPerDay
## 0 0 0 0 0
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## 0 0 0 0 0
## sysBP diaBP BMI heartRate glucose
## 0 0 0 0 0
## TenYearCHD
## 0
full_formula_drop <- TenYearCHD ~ .
xgb_model_drop <- train(
full_formula_drop,
data = train_drop,
method = "xgbTree",
trControl = ctrl,
preProcess = NULL,
tuneGrid = xgb_grid,
metric = "ROC",
verbose = FALSE
)
xgb_model_drop$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 152 400 3 0.01 0 0.7 5 0.7
xgb_importance_drop <- varImp(xgb_model_drop, scale = FALSE)
xgb_importance_drop
## xgbTree variable importance
##
## Overall
## age 0.271352
## sysBP 0.170131
## glucose 0.101348
## diaBP 0.089765
## totChol 0.072597
## male1 0.072472
## cigsPerDay 0.070348
## BMI 0.053043
## prevalentHyp1 0.043559
## heartRate 0.039586
## BPMeds1 0.006161
## currentSmoker1 0.005436
## diabetes1 0.003002
## education1 0.001201
## prevalentStroke1 0.000000
Missing numeric predictors were imputed using median values. XGBoost trees typically do not need normalizations or log transforms; however, log transforms were used to reduce right skew and stabilize the effect of extreme values for totChol, glucose, and cigsPerDay.
# Imputed data
train_imp <- train_df
test_imp <- test_df
# Create "Missing" category
train_imp[bin_vars] <- lapply(train_imp[bin_vars],
function(x) {
x <- factor(x)
fct_explicit_na(x, na_level = "Missing")
}
)
test_imp[bin_vars] <- lapply(test_imp[bin_vars],
function(x) {
x <- factor(x)
fct_explicit_na(x, na_level = "Missing")
}
)
# Log transform numeric
train_imp <- train_imp %>%
mutate(log_totChol = log1p(totChol),
log_glucose = log1p(glucose),
log_cigsPerDay = log1p(cigsPerDay)
)
test_imp <- test_imp %>%
mutate(log_totChol = log1p(totChol),
log_glucose = log1p(glucose),
log_cigsPerDay = log1p(cigsPerDay)
)
# Median impute missing numeric
num_vars_imp <- names(train_imp)[sapply(train_imp, is.numeric)]
num_vars_imp <- setdiff(num_vars_imp, "TenYearCHD")
pre_impute <- preProcess(
train_imp[, num_vars_imp],
method = "medianImpute"
)
train_num_imp <- predict(pre_impute, train_imp[, num_vars_imp])
test_num_imp <- predict(pre_impute, test_imp[, num_vars_imp])
train_imp[, num_vars_imp] <- train_num_imp
test_imp[, num_vars_imp] <- test_num_imp
# Double check for missing values
sapply(train_imp, function(x) sum(is.na(x)))
## male age education currentSmoker cigsPerDay
## 0 0 0 0 0
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## 0 0 0 0 0
## sysBP diaBP BMI heartRate glucose
## 0 0 0 0 0
## TenYearCHD log_totChol log_glucose log_cigsPerDay
## 0 0 0 0
sapply(test_imp, function(x) sum(is.na(x)))
## male age education currentSmoker cigsPerDay
## 0 0 0 0 0
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## 0 0 0 0 0
## sysBP diaBP BMI heartRate glucose
## 0 0 0 0 0
## TenYearCHD log_totChol log_glucose log_cigsPerDay
## 0 0 0 0
full_formula_imp <- TenYearCHD ~ .
xgb_model_imp <- train(
full_formula_imp,
data = train_imp,
method = "xgbTree",
trControl = ctrl,
preProcess = NULL,
tuneGrid = xgb_grid,
metric = "ROC",
verbose = FALSE
)
xgb_model_imp$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 188 400 3 0.01 1 0.9 5 0.7
xgb_importance_imp <- varImp(xgb_model_imp, scale = FALSE)
xgb_importance_imp
## xgbTree variable importance
##
## Overall
## age 0.269206
## sysBP 0.170146
## glucose 0.090918
## cigsPerDay 0.085309
## diaBP 0.081057
## totChol 0.074386
## male1 0.063019
## BMI 0.047483
## prevalentHyp1 0.045286
## heartRate 0.034182
## log_glucose 0.012609
## log_cigsPerDay 0.011042
## BPMeds1 0.005668
## log_totChol 0.004032
## currentSmoker1 0.002999
## education1 0.001539
## diabetes1 0.001120
## educationMissing 0.000000
## BPMedsMissing 0.000000
## prevalentStroke1 0.000000
plot(xgb_importance_imp, top = 15,
main = "XGBoost (imputed, all features) importance")
# number of top predictors to use
k_top <- 4
top_vars_raw <- rownames(xgb_importance_imp$importance) %>% head(k_top)
top_vars_raw
## [1] "age" "sysBP" "glucose" "cigsPerDay"
orig_names <- names(train_imp)
map_back <- function(v) {
if (v %in% orig_names)
return(v)
base <- sub("[0-9]+$", "", v) # drop numbers at end
if (base %in% orig_names)
return(base)
return(NA_character_)
}
top_vars_mapped <- sapply(top_vars_raw, map_back)
top_vars <- unique(na.omit(top_vars_mapped))
top_vars
## [1] "age" "sysBP" "glucose" "cigsPerDay"
fs_formula_imp <- as.formula(paste("TenYearCHD ~", paste(top_vars, collapse = " + ")))
fs_formula_imp
## TenYearCHD ~ age + sysBP + glucose + cigsPerDay
xgb_model_imp_fs <- train(
fs_formula_imp,
data = train_imp,
method = "xgbTree",
trControl = ctrl,
preProcess = NULL,
tuneGrid = xgb_grid,
metric = "ROC",
verbose = FALSE
)
xgb_model_imp_fs$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 157 200 3 0.01 0 0.9 1 0.7
xgb_importance_imp_fs <- varImp(xgb_model_imp_fs, scale = FALSE)
xgb_importance_imp_fs
## xgbTree variable importance
##
## Overall
## age 0.3665
## sysBP 0.3286
## glucose 0.1644
## cigsPerDay 0.1404
plot(xgb_importance_imp_fs, top = length(top_vars),
main = "XGBoost (imputed, feature-selected) importance")
# positive-class weight (# negatives / # positives)
pos <- sum(train_imp$TenYearCHD == "Yes")
neg <- sum(train_imp$TenYearCHD == "No")
pos_weight <- neg / pos
pos_weight
## [1] 5.579457
xgb_model_imp_w <- train(
full_formula_imp,
data = train_imp,
method = "xgbTree",
trControl = ctrl,
preProcess = NULL,
tuneGrid = xgb_grid,
metric = "ROC",
verbose = FALSE,
scale_pos_weight = pos_weight # weighted classes
)
xgb_model_imp_w$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 164 400 3 0.01 0 0.9 5 0.7
xgb_importance_imp_w <- varImp(xgb_model_imp_w, scale = FALSE)
xgb_importance_imp_w
## xgbTree variable importance
##
## Overall
## age 0.2504253
## sysBP 0.1493489
## glucose 0.1087882
## diaBP 0.1074715
## cigsPerDay 0.0913876
## male1 0.0699121
## totChol 0.0679617
## prevalentHyp1 0.0474935
## BMI 0.0413467
## heartRate 0.0268458
## log_glucose 0.0100374
## log_cigsPerDay 0.0091353
## log_totChol 0.0071290
## BPMeds1 0.0055232
## prevalentStroke1 0.0032164
## diabetes1 0.0018981
## currentSmoker1 0.0018427
## education1 0.0002366
## educationMissing 0.0000000
## BPMedsMissing 0.0000000
We summarized CHD risk by category for the binary predictors. For example, males had higher observed CHD risk than females (19% vs. 12%), and individuals with prevalent hypertension had more than double the risk compared with those without hypertension (25% vs. 11%). Prior stroke was rare but associated with extremely high risk (46%).
For numeric predictors, we binned each variable into approximate quintiles and computed the proportion of TenYearCHD = “Yes” within each bin. Age showed a strong monotonic relationship with risk: the CHD rate increased from around 6% in the youngest group (ages 33–41) to roughly 26% in the oldest group (58–70). Similar upward trends were observed for systolic blood pressure, glucose, and total cholesterol.
These patterns align with the established cardiovascular risk factors and motivate the use of non-linear models such as gradient boosted trees, which can capture interactions and non-linear effects without manual tuning
We used the caret::train interface with method = “xgbTree” to fit XGBoost models. Since there is significant class imbalance, performance was evaluated using 5-fold cross-validation with the ROC metric (twoClassSummary), and we tuned over a fairly wide hyperparameter grid:
In order words, nrounds / max_depth control model size/complexity while eta, gamma, min_child_weight, subsample, colsample_bytree are tools that trade off fit versus overfitting.
We considered four main models:
Model 1 – XGBoost using all
predictors and dropping missing values
Model 2 – XGBoost with imputed predictors using all features
Model 3 – XGBoost using imputed predictors with feature selection
from Model 2
- We selected the top 4 features: age, sysBP, glucose,
cigsPerDay
- Feature selection using XGBoost variable importance
ranks predictors by how much they reduce the training loss across all
trees in the ensemble.
Model 4 - XGBoost using imputed predictors and weighted classes
evaluate_model <- function(model, test_data,
model_name = "model",
threshold = 0.2) {
# ensure threshold is scalar
threshold <- as.numeric(threshold[1])
probs <- predict(model, newdata = test_data, type = "prob")[, "Yes"]
preds <- ifelse(probs >= threshold, "Yes", "No")
preds <- factor(preds, levels = c("No", "Yes"))
cm <- confusionMatrix(preds, test_data$TenYearCHD, positive = "Yes")
roc_obj <- roc(response = test_data$TenYearCHD,
predictor = probs,
levels = c("No", "Yes"))
auc_val <- auc(roc_obj)
cat("\n==============================\n")
cat("Model:", model_name, "| Threshold:", threshold, "\n")
cat("==============================\n")
print(cm)
cat("\nAUC:", as.numeric(auc_val), "\n")
list(probs = probs,
preds = preds,
cm = cm,
roc = roc_obj,
auc = auc_val,
threshold = threshold
)
}
# 1) XGBoost with dropped missing values
eval_xgb_drop <- evaluate_model(xgb_model_drop, test_drop, "XGBoost (drop NAs)")
##
## ==============================
## Model: XGBoost (drop NAs) | Threshold: 0.2
## ==============================
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 474 52
## Yes 141 58
##
## Accuracy : 0.7338
## 95% CI : (0.7, 0.7656)
## No Information Rate : 0.8483
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2237
##
## Mcnemar's Test P-Value : 2.383e-10
##
## Sensitivity : 0.5273
## Specificity : 0.7707
## Pos Pred Value : 0.2915
## Neg Pred Value : 0.9011
## Prevalence : 0.1517
## Detection Rate : 0.0800
## Detection Prevalence : 0.2745
## Balanced Accuracy : 0.6490
##
## 'Positive' Class : Yes
##
##
## AUC: 0.7224686
# 2) XGBoost with imputed data
eval_xgb_imp <- evaluate_model(xgb_model_imp, test_imp, "XGBoost (imputed, all features)")
##
## ==============================
## Model: XGBoost (imputed, all features) | Threshold: 0.2
## ==============================
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 557 59
## Yes 160 69
##
## Accuracy : 0.7408
## 95% CI : (0.7099, 0.7701)
## No Information Rate : 0.8485
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2386
##
## Mcnemar's Test P-Value : 1.405e-11
##
## Sensitivity : 0.53906
## Specificity : 0.77685
## Pos Pred Value : 0.30131
## Neg Pred Value : 0.90422
## Prevalence : 0.15148
## Detection Rate : 0.08166
## Detection Prevalence : 0.27101
## Balanced Accuracy : 0.65796
##
## 'Positive' Class : Yes
##
##
## AUC: 0.731226
# 3) XGBoost with imputed data and feature selection
eval_xgb_imp_fs <- evaluate_model(xgb_model_imp_fs, test_imp, "XGBoost (imputed, feature-selected)")
##
## ==============================
## Model: XGBoost (imputed, feature-selected) | Threshold: 0.2
## ==============================
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 453 35
## Yes 264 93
##
## Accuracy : 0.6462
## 95% CI : (0.6129, 0.6784)
## No Information Rate : 0.8485
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2066
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7266
## Specificity : 0.6318
## Pos Pred Value : 0.2605
## Neg Pred Value : 0.9283
## Prevalence : 0.1515
## Detection Rate : 0.1101
## Detection Prevalence : 0.4225
## Balanced Accuracy : 0.6792
##
## 'Positive' Class : Yes
##
##
## AUC: 0.7355191
# 4) XGBoost with imputed data and class weights
eval_xgb_imp_w <- evaluate_model(xgb_model_imp_w, test_imp, "XGBoost (imputed, weighted)")
##
## ==============================
## Model: XGBoost (imputed, weighted) | Threshold: 0.2
## ==============================
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 715 124
## Yes 2 4
##
## Accuracy : 0.8509
## 95% CI : (0.8251, 0.8742)
## No Information Rate : 0.8485
## P-Value [Acc > NIR] : 0.4471
##
## Kappa : 0.0468
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.031250
## Specificity : 0.997211
## Pos Pred Value : 0.666667
## Neg Pred Value : 0.852205
## Prevalence : 0.151479
## Detection Rate : 0.004734
## Detection Prevalence : 0.007101
## Balanced Accuracy : 0.514230
##
## 'Positive' Class : Yes
##
##
## AUC: 0.7350124
auc_compare <- data.frame(
Model = c("XGB - drop NAs",
"XGB - imputed (full)",
"XGB - imputed (FS)",
"XGB - imputed (weighted, 0.5)"),
AUC = c(as.numeric(eval_xgb_drop$auc),
as.numeric(eval_xgb_imp$auc),
as.numeric(eval_xgb_imp_fs$auc),
as.numeric(eval_xgb_imp_w$auc))
) %>% arrange(desc(AUC))
auc_compare
## Model AUC
## 1 XGB - imputed (FS) 0.7355191
## 2 XGB - imputed (weighted, 0.5) 0.7350124
## 3 XGB - imputed (full) 0.7312260
## 4 XGB - drop NAs 0.7224686
plot(eval_xgb_imp$roc,
legacy.axes = TRUE, lwd = 2,
main = "ROC curves (test set, imputed data)",
col = "black")
plot(eval_xgb_imp_fs$roc,
add = TRUE, lwd = 2, lty = 2,
col = "red")
plot(eval_xgb_imp_w$roc,
add = TRUE, lwd = 2, lty = 3,
col = "blue")
legend("bottomright",
legend = c("XGB imputed (full)",
"XGB imputed (FS)",
"XGB imputed (weighted)"),
lty = c(1, 2, 3),
lwd = 2,
col = c("black", "red", "blue"))
In general, probability of having 10-year risk of CHD increases with an increase in each risk variable. As expected, 10-year CHD risk increased monotonically with age. In the youngest age group (33–41), the observed CHD rate was about 6%, compared with around 26% in the oldest group (58–70). Males, individuals with prevalent hypertension, and those with prior stroke or diabetes had substantially higher observed CHD risk.
Our XGBoost (imputed, feature-selected) model achieved the best AUC of about 0.74 on the test set, indicating moderate discriminative ability. Using a threshold of 0.2, the model obtained sensitivity of around 0.73 and specificity around 0.63 (balanced accuracy = 0.68). Thus, it correctly identifies most patients who will develop CHD, at the cost of a relatively low positive predictive value (around 0.26).
At the same time, the imputed, feature-selected model achieved similar test AUC (slightly higher) to the full-feature models, while using far fewer predictors. This suggests that most of the predictive signal is captured by age, systolic blood pressure, glucose, and cigarettes per day, and additional variables add little beyond noise.
Additionally, all imputed data models performed better than simply dropping all missing values, demonstrating that imputing missing values or adding a “Missing” category can improve performance.
Practically, the model provides a moderate fit to the data. It is useful as a risk stratification or screening tool to rule out low-risk individuals, but it is not accurate enough to be used as a stand-alone diagnostic classifier.