Predicting Heart Disease from Routine Clinical Measurements: A Comparison of Classical and Tree-Based Methods
Statistics Final Project
Which routine clinical measurements most strongly predict the presence of heart disease, and how do classical statistical models (logistic regression, LDA, GAM) compare with modern tree-based methods (CART, random forest) on this prediction task?
Cardiovascular disease is the leading cause of death worldwide. Cardiologists routinely measure a small set of variables during a standard workup — blood pressure, cholesterol, ECG findings, peak heart rate during a stress test, ST depression on exercise. A natural and practical question is: how well do those routine measurements, taken together, predict whether a patient actually has angiographic heart disease, and which of them carry the most information?
This question is interesting both substantively (which measurements matter most clinically?) and methodologically (do the modern tree-based methods we learned at the end of the course actually outperform the classical models from the early chapters on a real, structured dataset?).
Cleveland Heart Disease Dataset from the UCI Machine Learning Repository, originally collected by Detrano et al. (1989) at the Cleveland Clinic Foundation.
processed.cleveland.data from the UCI server (read by the Rmd automatically)
ca (number of major vessels) or thal (thallium stress test result).
| Variable | Type | Description |
|---|---|---|
| age | numeric | Age in years |
| sex | binary | 1 = male, 0 = female |
| cp | categorical (4 levels) | Chest pain type: typical / atypical / non-anginal / asymptomatic |
| trestbps | numeric | Resting blood pressure (mm Hg) |
| chol | numeric | Serum cholesterol (mg/dl) |
| fbs | binary | Fasting blood sugar > 120 mg/dl |
| restecg | categorical (3 levels) | Resting ECG result |
| thalach | numeric | Maximum heart rate achieved during exercise |
| exang | binary | Exercise-induced angina |
| oldpeak | numeric | ST depression induced by exercise relative to rest |
| slope | categorical (3 levels) | Slope of the peak exercise ST segment |
| ca | integer 0–3 | Number of major vessels colored by fluoroscopy |
| thal | categorical (3 levels) | Thallium stress test: normal / fixed defect / reversible defect |
| target | binary | Outcome. Heart disease present (1) or absent (0) |
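
As a minimal sketch of how the variable table maps onto the raw file: processed.cleveland.data is a headerless comma-separated file that codes missing values as `?`, and its last column is a 0-4 score that is commonly collapsed to a binary outcome. The helper name `load_cleveland`, the choice to drop incomplete rows, and the binarization rule are assumptions for illustration, not necessarily what the accompanying Rmd does.

```r
# Hypothetical helper: read a Cleveland-format file and tidy it.
# Column names follow the variable table; `num` is the raw 0-4 outcome.
load_cleveland <- function(path) {
  cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
  raw <- read.csv(path, header = FALSE, col.names = cols, na.strings = "?")
  raw <- raw[complete.cases(raw), ]       # assumption: drop rows with any missing value
  raw$target <- as.integer(raw$num > 0)   # assumption: any nonzero score counts as disease
  raw$num <- NULL
  for (v in c("cp", "restecg", "slope", "thal")) raw[[v]] <- factor(raw[[v]])
  raw
}

# Tiny demo with three illustrative rows in the same layout
# (the second row has a missing `ca`, so it is dropped).
demo_file <- tempfile(fileext = ".data")
writeLines(c("63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0",
             "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2",
             "41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,1"),
           demo_file)
heart_demo <- load_cleveland(demo_file)
str(heart_demo)
```

Passing the path as an argument keeps the sketch usable with either a local copy or a URL, since `read.csv` accepts both.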
set.seed(2026) at the top of the analysis for full reproducibility.
The project handout required three method families. I picked at least one model from each, plus a baseline for the classical group.
| Family (per rubric) | Model | R function | Why |
|---|---|---|---|
| Classification | Logistic regression | glm(family = binomial) | The course’s workhorse classifier; interpretable baseline. |
| Classification | Linear discriminant analysis | MASS::lda | Alternative classical classifier with different assumptions (multivariate normal predictors within class). Useful contrast to logistic. |
| Nonlinear regression | Generalized additive model (GAM) | mgcv::gam with s() smooths on continuous predictors | Direct test of whether the continuous predictors enter linearly or nonlinearly. Smoothing parameters selected by REML. |
| Tree-based | CART | rpart with cost-complexity pruning | Interpretable single tree. Useful but high-variance benchmark. |
| Tree-based | Random forest (500 trees) | randomForest, mtry tuned by CV | Ensemble method; reduces tree variance; gives permutation-style variable importance for free. |
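
The table mentions cost-complexity pruning for CART. A common way to do this with rpart is to grow a deliberately large tree and prune back to the complexity parameter with the lowest cross-validated error. The sketch below uses simulated data and hypothetical names (`sim`, `full_tree`, `n_leaves`); the minimum-xerror rule is an assumption, since the report does not say which pruning rule it applied.

```r
library(rpart)

set.seed(1)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Simulated outcome driven by x1 and x2 only; x3 is pure noise.
sim$y <- factor(rbinom(n, 1, plogis(1.5 * sim$x1 - 1.0 * sim$x2)))

# Grow an intentionally overgrown tree (cp = 0 removes the complexity penalty).
full_tree <- rpart(y ~ ., data = sim, method = "class",
                   control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# cptable has one row per candidate subtree; "xerror" is the 10-fold CV error.
cp_tab  <- full_tree$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = best_cp)

# Compare tree sizes before and after pruning.
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(full = n_leaves(full_tree), pruned = n_leaves(pruned))
```

A more conservative alternative is the one-standard-error rule, which picks the smallest tree whose xerror is within one xstd of the minimum.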
To answer the nonlinearity sub-question rigorously, I compared the GAM against the nested linear logistic model (same predictors, no smooths) using a likelihood ratio test:
anova(fit_logit_nested, fit_gam, test = "LRT")
This is the cleanest way to ask “do the smooth terms add anything statistically meaningful?”
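
To make explicit what such a comparison computes, here is a sketch on simulated data that builds the test by hand: the drop in deviance is referred to a chi-squared distribution whose degrees of freedom are the extra effective degrees of freedom used by the smooth. The data and names (`sim`, `fit_lin`, `lrt_stat`) are hypothetical, and p-values from comparing penalized fits are approximate, so this is an illustration rather than the report's exact computation.

```r
library(mgcv)

set.seed(1)
n <- 500
sim <- data.frame(x = runif(n, -3, 3))
# Simulated outcome with a clearly nonlinear effect on the logit scale.
sim$y <- rbinom(n, 1, plogis(2 * sin(sim$x)))

fit_lin <- glm(y ~ x,    family = binomial, data = sim)
fit_gam <- gam(y ~ s(x), family = binomial, data = sim, method = "REML")

# Deviance drop and the (effective) degrees of freedom spent by the smooth.
lrt_stat <- deviance(fit_lin) - deviance(fit_gam)
lrt_df   <- df.residual(fit_lin) - df.residual(fit_gam)
lrt_p    <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

c(chisq = lrt_stat, df = lrt_df, p = lrt_p)
```

Because the smooth's degrees of freedom are estimated, `lrt_df` is generally not an integer.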
| Model | CV AUC |
|---|---|
| LDA | 0.899 |
| Logistic Regression | 0.888 |
| Random Forest | 0.886 |
| GAM | 0.861 |
| CART | 0.785 |
LDA edges out the others on CV; the linear models and random forest are essentially tied. CART trails substantially.
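
For readers who want to see how a cross-validated AUC of this kind can be computed, the following is a minimal base-R sketch for the logistic regression case. The helper names (`auc_rank`, `cv_auc_glm`), the simulated data, and the choice of 5 folds are assumptions; the number of folds used in the report is not stated here.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic); ties get average ranks.
auc_rank <- function(prob, y) {
  r  <- rank(prob)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# k-fold cross-validated AUC for a logistic regression (hypothetical helper).
cv_auc_glm <- function(formula, data, y_name, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  aucs <- vapply(seq_len(k), function(i) {
    fit  <- glm(formula, family = binomial, data = data[fold != i, ])
    prob <- predict(fit, newdata = data[fold == i, ], type = "response")
    auc_rank(prob, data[[y_name]][fold == i])
  }, numeric(1))
  mean(aucs)
}

set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.2 * sim$x1 - 0.8 * sim$x2))

auc_rank(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))  # 3 of 4 pairs ordered correctly: 0.75
cv_res <- cv_auc_glm(y ~ x1 + x2, sim, "y", k = 5)
cv_res
```

The same loop works for the other models by swapping the fit and predict calls, as long as the prediction is a probability or score for the disease class.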
| Model | AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|
| Random Forest | 0.956 | 0.881 | 0.926 | 0.844 |
| Logistic Regression | 0.955 | 0.898 | 0.963 | 0.844 |
| GAM | 0.953 | 0.915 | 0.963 | 0.875 |
| LDA | 0.950 | 0.932 | 0.963 | 0.906 |
| CART | 0.861 | 0.763 | 0.852 | 0.688 |
Four of the five models cluster at AUC ≈ 0.95. The single CART tree clearly lags. The random forest’s lead over logistic regression is 0.001, which is well within the noise of a 59-patient test set.
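
To put a rough number on "the noise of a 59-patient test set", one option is the Hanley-McNeil approximation to the standard error of an AUC. The class counts used below (27 diseased, 32 non-diseased) are not stated in the report; they are inferred from the table, where sensitivities of 0.926 and 0.963 match 25/27 and 26/27 and specificities of 0.844 and 0.906 match 27/32 and 29/32. Treat both the counts and the approximation as assumptions.

```r
# Hanley-McNeil (1982) approximation to the standard error of an AUC.
# n_pos / n_neg: number of diseased / non-diseased patients in the test set.
auc_se <- function(auc, n_pos, n_neg) {
  q1 <- auc / (2 - auc)
  q2 <- 2 * auc^2 / (1 + auc)
  sqrt((auc * (1 - auc) +
        (n_pos - 1) * (q1 - auc^2) +
        (n_neg - 1) * (q2 - auc^2)) / (n_pos * n_neg))
}

# Assumed class counts, inferred from the sensitivity/specificity columns.
se_rf <- auc_se(0.956, n_pos = 27, n_neg = 32)
round(se_rf, 3)            # roughly 0.03
(0.956 - 0.955) / se_rf    # the 0.001 gap as a fraction of one standard error
```

A standard error near 0.03 is far larger than the 0.001 gap. Since all models are scored on the same patients, a paired comparison such as DeLong's test (available as `pROC::roc.test`) would be the more appropriate formal check.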
| Rank | Variable | Importance |
|---|---|---|
| 1 | oldpeak (ST depression on exercise) | 100.0 |
| 2 | thal (thallium stress test) | 96.3 |
| 3 | cp (chest pain type) | 93.0 |
| 4 | thalach (max heart rate) | 69.8 |
| 5 | ca (major vessels on fluoroscopy) | 67.4 |
| 6 | exang (exercise-induced angina) | 60.3 |
| 7 | slope (ST segment slope) | 57.6 |
| 8 | sex | 42.2 |
| 9 | age | 25.6 |
| 10 | trestbps (resting BP) | 18.7 |
The top five are consistent with what logistic regression coefficients (by Wald z-statistic) and the CART tree splits also identify as most informative.
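
The idea behind a permutation-style importance score can be sketched without any forest at all: shuffle one predictor, see how much a performance metric drops, and rescale so the top variable scores 100. The version below is a model-agnostic stand-in applied to a logistic regression on simulated data, not the random forest's own out-of-bag computation; the names (`perm_importance`, `sim`) and the use of AUC as the metric are assumptions.

```r
# Rank-based AUC, used here as the performance metric.
auc_rank <- function(prob, y) {
  r <- rank(prob); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Permutation importance: shuffle one predictor at a time, record the
# average AUC drop, then rescale so the largest drop equals 100.
perm_importance <- function(fit, data, y_name, vars, n_rep = 20) {
  base_auc <- auc_rank(predict(fit, newdata = data, type = "response"),
                       data[[y_name]])
  drops <- vapply(vars, function(v) {
    mean(replicate(n_rep, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])
      base_auc - auc_rank(predict(fit, newdata = shuffled, type = "response"),
                          data[[y_name]])
    }))
  }, numeric(1))
  sort(100 * drops / max(drops), decreasing = TRUE)
}

set.seed(1)
n <- 400
sim <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, 1, plogis(2 * sim$strong + 0.5 * sim$weak))
fit <- glm(y ~ strong + weak + noise, family = binomial, data = sim)

imp <- perm_importance(fit, sim, "y", c("strong", "weak", "noise"))
imp
```

Scores for pure-noise predictors can come out slightly negative, which simply means shuffling them did not hurt the AUC.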
| Statistic | Value |
|---|---|
| Likelihood ratio χ² | 4.54 |
| Effective extra degrees of freedom | ~2.7 |
| p-value | 0.078 |
This is not significant at the conventional 5% level.
The GAM’s effective degrees of freedom for individual smooths confirm this:
| Predictor | EDF |
|---|---|
| age | 1.0 |
| trestbps | 1.0 |
| chol | 2.7 |
| thalach | 1.0 |
| oldpeak | 1.0 |
EDF ≈ 1 means the smooth collapsed to a straight line. Only
chol shows a mild bend, and even it is not individually
significant.
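
Per-smooth EDF values like those in the table can be read off a fitted mgcv model's summary. The sketch below uses simulated data with one truly linear and one strongly curved effect, so the contrast between an EDF near 1 and a larger EDF is visible; the variable names (`x_lin`, `x_curve`) are hypothetical and the model is not the report's.

```r
library(mgcv)

set.seed(1)
n <- 600
sim <- data.frame(x_lin = rnorm(n), x_curve = runif(n, -3, 3))
# x_lin acts linearly on the logit; x_curve acts through a sine wave.
sim$y <- rbinom(n, 1, plogis(0.8 * sim$x_lin + 2 * sin(2 * sim$x_curve)))

fit <- gam(y ~ s(x_lin) + s(x_curve), family = binomial,
           data = sim, method = "REML")

# One row per smooth: effective df, reference df, test statistic, p-value.
smooth_tab <- summary(fit)$s.table
edf <- smooth_tab[, "edf"]
round(edf, 2)
```

The p-value column of the same table is where a statement such as "not individually significant" for a single smooth would typically come from.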
The five most informative routine measurements, which agree across logistic regression, the CART splits, and random forest importance, are:
1. oldpeak: ST-segment depression induced by exercise.
2. thal: thallium stress test result.
3. cp: chest pain type. (In this referred-clinic population, the “asymptomatic” group carries the highest disease rate, because patients are referred for stress testing precisely when other findings raise suspicion.)
4. thalach: maximum heart rate achieved during the stress test. Diseased patients plateau lower.
5. ca: number of major vessels colored on fluoroscopy.
Every one of these is, biologically, a direct measurement of the heart under stress or of its blood supply. The variables a non-specialist might expect to dominate (age, resting blood pressure, cholesterol) are present in the model but rank toward the bottom of the importance list. Conditional on the stress-test measurements, they add relatively little.
Do the continuous predictors enter nonlinearly? No, not detectably on this dataset. The likelihood ratio test gives p = 0.078, and the GAM’s smooths for age, blood pressure, peak heart rate, and ST depression all collapse to EDF = 1 (straight lines). Only cholesterol shows a mild bend, and it is not significant.
This is itself an honest, non-trivial finding. Going in, I expected
nonlinearity in oldpeak (perhaps a threshold effect) or
thalach (perhaps a saturating effect). The data don’t
support that with 297 patients. A plain logistic regression is
essentially as good as the GAM. Nonlinear effects might emerge
on a larger dataset, but I cannot claim them here.
Do the tree-based methods outperform the classical models? Marginally, and not significantly. Random forest tops the test-set AUC table at 0.956, but logistic regression (0.955), GAM (0.953), and LDA (0.950) are essentially tied. The single CART tree at 0.861 is clearly worse, so bagging really does help here, but the ensemble’s edge over the linear models is well within the noise of a 59-patient test set.
The practical takeaway: for structured, low-dimensional clinical data, a well-fit logistic regression is hard to beat. The random forest’s main contribution is interpretive (the variable importance plot) rather than predictive.
mtry. A more aggressive random forest tune (or nested CV) might shift performance, though probably not by much.
Every result in this document is produced by the accompanying R Markdown file (heart_disease_analysis.Rmd) running on the canonical UCI Cleveland data with set.seed(2026).
To reproduce:
install.packages(c(
  "tidyverse", "caret", "MASS", "mgcv",
  "rpart", "rpart.plot", "randomForest",
  "pROC", "corrplot", "knitr", "rmarkdown"
))
rmarkdown::render("heart_disease_analysis.Rmd")
The Rmd pulls the data directly from the UCI repository, so no local file is required.