Project Report

Predicting Heart Disease from Routine Clinical Measurements: A Comparison of Classical and Tree-Based Methods

Statistics Final Project


1. Research Question

Which routine clinical measurements most strongly predict the presence of heart disease, and how do classical statistical models (logistic regression, LDA, GAM) compare with modern tree-based methods (CART, random forest) on this prediction task?

Why this question matters

Cardiovascular disease is the leading cause of death worldwide. Cardiologists routinely measure a small set of variables during a standard workup — blood pressure, cholesterol, ECG findings, peak heart rate during a stress test, ST depression on exercise. A natural and practical question is: how well do those routine measurements, taken together, predict whether a patient actually has angiographic heart disease, and which of them carry the most information?

This question is interesting both substantively (which measurements matter most clinically?) and methodologically (do the modern tree-based methods we learned at the end of the course actually outperform the classical models from the early chapters on a real, structured dataset?).

Two specific sub-questions

  1. Variable importance. Of the 13 routine clinical predictors, which carry the most predictive signal for heart disease, and is that ranking consistent across very different model families?
  2. Method comparison. Does the added flexibility of a generalized additive model (smooth nonlinear terms) or a random forest (nonparametric ensemble) translate into meaningfully better predictive performance than a plain logistic regression on this dataset?

2. Data

Source

Cleveland Heart Disease Dataset from the UCI Machine Learning Repository, originally collected by Detrano et al. (1989) at the Cleveland Clinic Foundation.

Sample

  • 303 patients in the original file.
  • 6 patients dropped for missing values on ca (number of major vessels) or thal (thallium stress test result).
  • 297 patients used in the analysis.
  • Outcome prevalence: 46.1% heart disease present, 53.9% absent — well-balanced classes.

Variables

Variable   Type                     Description
age        numeric                  Age in years
sex        binary                   1 = male, 0 = female
cp         categorical (4 levels)   Chest pain type: typical / atypical / non-anginal / asymptomatic
trestbps   numeric                  Resting blood pressure (mm Hg)
chol       numeric                  Serum cholesterol (mg/dl)
fbs        binary                   Fasting blood sugar > 120 mg/dl
restecg    categorical (3 levels)   Resting ECG result
thalach    numeric                  Maximum heart rate achieved during exercise
exang      binary                   Exercise-induced angina
oldpeak    numeric                  ST depression induced by exercise relative to rest
slope      categorical (3 levels)   Slope of the peak exercise ST segment
ca         integer 0–3              Number of major vessels colored by fluoroscopy
thal       categorical (3 levels)   Thallium stress test: normal / fixed defect / reversible defect
target     binary                   Outcome. Heart disease present (1) or absent (0)
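
The cleaning steps described under Sample (dropping patients with missing ca or thal, and coding the outcome as binary) can be sketched roughly as follows. This is an illustration rather than the code in the accompanying Rmd: the function name clean_cleveland, the raw outcome column name num, the "?" missing-value marker, and the toy data frame are all assumptions.

```r
# Sketch: drop incomplete rows and collapse the 0-4 severity score to a
# binary outcome. The column name `num` and "?" as the missing-value code
# are assumptions about the raw file, not details stated in this report.
clean_cleveland <- function(raw) {
  # Recode the missing-value marker to NA, column by column.
  raw[] <- lapply(raw, function(col) {
    col[col %in% "?"] <- NA
    col
  })
  raw <- raw[complete.cases(raw), , drop = FALSE]
  raw$ca     <- as.integer(as.numeric(raw$ca))
  raw$thal   <- factor(raw$thal)
  raw$target <- as.integer(raw$num > 0)   # any disease (1-4) vs. none (0)
  raw$num    <- NULL
  raw
}

# Toy input with made-up values, for illustration only.
toy <- data.frame(
  age  = c(63, 67, 41, 56),
  ca   = c("0.0", "3.0", "?", "1.0"),
  thal = c("6.0", "3.0", "3.0", "?"),
  num  = c(0, 2, 0, 1),
  stringsAsFactors = FALSE
)
cleaned <- clean_cleveland(toy)
print(cleaned)
```

In the toy input, the third and fourth rows each have one missing field, so only the first two rows survive.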

3. Methods

Train / test protocol

  • 80 / 20 split stratified on the outcome: 238 patients for training, 59 held out for the final test.
  • 10-fold cross-validation within the training set for model tuning and selection.
  • The test set was touched exactly once, at the very end, to produce the final comparison.
  • Random seed: set.seed(2026) at the top of the analysis for full reproducibility.
  • Primary metric: AUC (area under the ROC curve). Secondary metrics: accuracy, sensitivity, specificity.
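
A stratified split of this kind can be sketched in base R as below. The helper name stratified_train_index is hypothetical, and the outcome vector is simulated with the class counts implied by the Sample section (46.1% of 297 is about 137 diseased patients). The accompanying Rmd may well use a caret helper instead, so this sketch would not reproduce its exact partition.

```r
# Sketch: stratified 80/20 split plus 10-fold assignment in base R.
# The outcome is simulated with the class counts implied by the report
# (137 diseased, 160 non-diseased); it is not the real data.
set.seed(2026)
y <- sample(rep(c(1L, 0L), times = c(137, 160)))

# Sample a fixed proportion of row indices within each outcome class.
stratified_train_index <- function(y, prop = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_train_index(y)
test_idx  <- setdiff(seq_along(y), train_idx)

# Assign each training row to one of 10 folds, balanced within class.
fold_id <- integer(length(train_idx))
for (cls in unique(y[train_idx])) {
  in_cls <- which(y[train_idx] == cls)
  fold_id[in_cls] <- sample(rep_len(1:10, length(in_cls)))
}

length(train_idx)   # 238
length(test_idx)    # 59
```

Rounding 80% within each class gives 110 + 128 = 238 training rows, which matches the split sizes reported above.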

Models fit

The project handout required three method families. I picked at least one model from each, plus a baseline for the classical group.

Family (per rubric)   Model                             R function                                           Why
Classification        Logistic regression               glm(family = binomial)                               The course’s workhorse classifier; interpretable baseline.
Classification        Linear discriminant analysis      MASS::lda                                            Alternative classical classifier with different assumptions (multivariate normal predictors within class). Useful contrast to logistic.
Nonlinear regression  Generalized additive model (GAM)  mgcv::gam with s() smooths on continuous predictors  Direct test of whether the continuous predictors enter linearly or nonlinearly. Smoothing parameters selected by REML.
Tree-based            CART                              rpart with cost-complexity pruning                   Interpretable single tree. Useful but high-variance benchmark.
Tree-based            Random forest (500 trees)         randomForest, mtry tuned by CV                       Ensemble method; reduces tree variance; gives permutation-style variable importance for free.
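
To make the table concrete, the following sketch fits four of the five model types on simulated data and extracts a predicted probability from each. The data, coefficients, and object names are invented for illustration, and the predictor set is deliberately tiny. The random forest is left out only because the randomForest package does not ship with R; it would be fit analogously with randomForest::randomForest(), using ntree = 500 and a CV-tuned mtry.

```r
# Sketch: fitting logistic regression, LDA, a GAM and a CART tree on
# simulated data (not the Cleveland data). MASS, mgcv and rpart ship
# with standard R installations.
library(MASS)    # lda()
library(mgcv)    # gam()
library(rpart)   # rpart()

set.seed(2026)
n <- 300
sim <- data.frame(
  oldpeak = rexp(n, rate = 1),
  thalach = rnorm(n, mean = 150, sd = 20),
  sex     = factor(rbinom(n, 1, 0.6))
)
lin_pred <- -0.5 + 0.9 * sim$oldpeak - 0.03 * (sim$thalach - 150) +
  0.8 * (sim$sex == "1")
sim$target <- rbinom(n, 1, plogis(lin_pred))   # numeric 0/1 outcome

fit_logit <- glm(target ~ oldpeak + thalach + sex, data = sim,
                 family = binomial)
fit_lda   <- lda(factor(target) ~ oldpeak + thalach + sex, data = sim)
fit_gam   <- gam(target ~ s(oldpeak) + s(thalach) + sex, data = sim,
                 family = binomial, method = "REML")
fit_cart  <- rpart(factor(target) ~ oldpeak + thalach + sex, data = sim,
                   method = "class")

# Predicted probability of the positive class from each model.
p_hat <- data.frame(
  logit = as.numeric(predict(fit_logit, newdata = sim, type = "response")),
  lda   = predict(fit_lda, newdata = sim)$posterior[, "1"],
  gam   = as.numeric(predict(fit_gam, newdata = sim, type = "response")),
  cart  = predict(fit_cart, newdata = sim, type = "prob")[, "1"]
)
print(head(round(p_hat, 3)))
```

Putting every model's output on the same probability scale is what allows a single AUC comparison across families.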

Inferential test for nonlinearity

To answer the nonlinearity sub-question rigorously, I compared the GAM against the nested linear logistic model (same predictors, no smooths) using a likelihood ratio test:

anova(fit_logit_nested, fit_gam, test = "LRT")

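This directly asks “do the smooth terms add anything statistically meaningful?” Because the smooths are penalized, the χ² reference distribution is only approximate, so the p-value is best read as a guide rather than an exact figure.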
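
A self-contained version of this comparison, on simulated data, might look like the sketch below. The data and object names are invented: x1 is given a linear effect and x2 a curved one so that the output shows both cases. Both models are fit with mgcv::gam here, which is a choice made for this sketch and not necessarily how fit_logit_nested was fit.

```r
# Sketch: likelihood ratio comparison of a linear logistic model against
# a GAM with smooth terms, on simulated data (not the Cleveland data).
library(mgcv)

set.seed(2026)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
# x1 enters linearly, x2 nonlinearly (an assumption of this simulation).
y  <- rbinom(n, 1, plogis(-0.2 + 1.0 * x1 + 1.5 * sin(2 * x2)))
sim <- data.frame(y, x1, x2)

# Fit both models with gam() so that anova() uses the gam method.
fit_linear <- gam(y ~ x1 + x2, data = sim, family = binomial,
                  method = "REML")
fit_smooth <- gam(y ~ s(x1) + s(x2), data = sim, family = binomial,
                  method = "REML")

lrt <- anova(fit_linear, fit_smooth, test = "Chisq")
print(lrt)

# Effective degrees of freedom per smooth; values near 1 mean the
# penalty has shrunk that smooth to (almost) a straight line.
edf <- summary(fit_smooth)$edf
names(edf) <- c("s(x1)", "s(x2)")
print(round(edf, 2))
```

The degrees of freedom in the anova table are differences in effective degrees of freedom, which is why they are generally not whole numbers.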


4. Results

4.1 Cross-validated AUC on training set

Model                 CV AUC
LDA                   0.899
Logistic Regression   0.888
Random Forest         0.886
GAM                   0.861
CART                  0.785

LDA has the highest CV AUC, but LDA, logistic regression, and random forest are essentially tied (0.899 to 0.886, a spread of 0.013). The GAM is slightly behind at 0.861, and CART trails substantially at 0.785.

4.2 Held-out test set (n = 59)

Model                 AUC     Accuracy   Sensitivity   Specificity
Random Forest         0.956   0.881      0.926         0.844
Logistic Regression   0.955   0.898      0.963         0.844
GAM                   0.953   0.915      0.963         0.875
LDA                   0.950   0.932      0.963         0.906
CART                  0.861   0.763      0.852         0.688

Four of the five models cluster at AUC ≈ 0.95. The single CART tree clearly lags. The random forest’s lead over logistic regression is 0.001, which is well within the noise of a 59-patient test set.
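
For reference, the four metrics in the table can be computed from predicted probabilities along the following lines. This is a base R sketch with hypothetical helper names (auc_rank, classification_metrics) and made-up toy scores. The 0.5 cutoff is an assumption, since the report does not state which threshold was used for accuracy, sensitivity, and specificity.

```r
# Sketch: AUC, accuracy, sensitivity and specificity from predicted
# probabilities, in base R. The 0.5 cutoff is an assumption.
auc_rank <- function(y, p) {
  # Mann-Whitney form: probability that a random positive case scores
  # higher than a random negative case (ties receive average ranks).
  r     <- rank(p)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

classification_metrics <- function(y, p, cutoff = 0.5) {
  pred <- as.integer(p >= cutoff)
  tp <- sum(pred == 1 & y == 1)
  fn <- sum(pred == 0 & y == 1)
  tn <- sum(pred == 0 & y == 0)
  fp <- sum(pred == 1 & y == 0)
  c(auc         = auc_rank(y, p),
    accuracy    = (tp + tn) / length(y),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Toy example with made-up scores: 8 of the 9 positive/negative pairs
# are ordered correctly, so the AUC is 8/9.
y <- c(1, 1, 1, 0, 0, 0)
p <- c(0.9, 0.8, 0.4, 0.5, 0.2, 0.1)
metrics <- classification_metrics(y, p)
print(round(metrics, 3))
```

The reported rates are consistent with 27 diseased and 32 non-diseased patients in the test set (for example, 26/27 ≈ 0.963 and 27/32 ≈ 0.844). On that reading, a single reclassified patient moves sensitivity by about 0.037, which is another way of seeing how coarse these estimates are.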

4.3 Variable importance (random forest, top 10, scaled to max = 100)

Rank   Variable                              Importance
1      oldpeak (ST depression on exercise)   100.0
2      thal (thallium stress test)           96.3
3      cp (chest pain type)                  93.0
4      thalach (max heart rate)              69.8
5      ca (major vessels on fluoroscopy)     67.4
6      exang (exercise-induced angina)       60.3
7      slope (ST segment slope)              57.6
8      sex                                   42.2
9      age                                   25.6
10     trestbps (resting BP)                 18.7

The top five are consistent with what logistic regression coefficients (by Wald z-statistic) and the CART tree splits also identify as most informative.

4.4 Test for nonlinearity (GAM vs. nested logistic)

Statistic                            Value
Likelihood ratio χ²                  4.54
Effective extra degrees of freedom   ~1.7
p-value                              0.078

This is not significant at the conventional 5% level.

The GAM’s effective degrees of freedom for individual smooths confirm this:

Predictor   EDF
age         1.0
trestbps    1.0
chol        2.7
thalach     1.0
oldpeak     1.0

EDF ≈ 1 means the smooth collapsed to a straight line. Only chol shows a mild bend, and even it is not individually significant.
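
As a rough consistency check, the p-value can be recomputed from the likelihood ratio statistic and the EDF table. The sketch assumes that each smooth replaces exactly one linear coefficient, so the extra flexibility of the GAM is the sum of the EDFs minus the number of smooths, which comes to about 1.7 here. The variable names are illustrative.

```r
# Sketch: recompute the likelihood ratio p-value from the reported
# statistic and the per-smooth effective degrees of freedom.
edf <- c(age = 1.0, trestbps = 1.0, chol = 2.7, thalach = 1.0, oldpeak = 1.0)

# Each smooth replaces a single linear term, so the extra degrees of
# freedom relative to the linear model are sum(EDF) - number of smooths.
extra_df <- sum(edf) - length(edf)   # 6.7 - 5 = 1.7

lr_stat <- 4.54
p_value <- pchisq(lr_stat, df = extra_df, lower.tail = FALSE)
round(p_value, 3)                    # about 0.078
```

With about 1.7 extra degrees of freedom, the statistic of 4.54 reproduces the reported p-value, and essentially all of that extra flexibility comes from the chol smooth.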


5. Interpretation

5.1 What predicts heart disease?

The five most informative routine measurements — and they agree across logistic regression, the CART splits, and random forest importance — are:

  1. oldpeak — ST-segment depression induced by exercise.
  2. thal — thallium stress test result.
  3. cp — chest pain type. (In this referred-clinic population, the “asymptomatic” group carries the highest disease rate, because patients are referred for stress testing precisely when other findings raise suspicion.)
  4. thalach — maximum heart rate achieved during the stress test. Diseased patients plateau lower.
  5. ca — number of major vessels colored on fluoroscopy.

Four of these five are direct measurements of the heart under stress or of its blood supply; the fifth, chest pain type, is a symptom report rather than a measurement. The variables a non-specialist might expect to dominate (age, resting blood pressure, cholesterol) are present in the model but rank toward the bottom of the importance list. Conditional on the stress-test measurements, they add relatively little.

5.2 Does nonlinearity exist in these predictors?

No, not detectably on this dataset. The likelihood ratio test gives p = 0.078, and the GAM’s smooths for age, blood pressure, peak heart rate, and ST depression all collapse to EDF = 1 (straight lines). Only cholesterol shows a mild bend, and it is not significant.

This null result is itself informative. Going in, I expected nonlinearity in oldpeak (perhaps a threshold effect) or thalach (perhaps a saturating effect), but the data do not support either with 297 patients. A plain logistic regression is essentially as good as the GAM. Nonlinear effects might emerge on a larger dataset, but they cannot be claimed here.

5.3 Do tree-based methods beat the classical models?

Marginally, and not significantly. Random forest tops the test-set AUC table at 0.956, but logistic regression (0.955), GAM (0.953), and LDA (0.950) are essentially tied. The single CART tree at 0.861 is clearly worse — bagging really does help here — but the ensemble’s edge over the linear models is well within the noise of a 59-patient test set.

The practical takeaway: for structured, low-dimensional clinical data, a well-fit logistic regression is hard to beat. The random forest’s main contribution is interpretive (the variable importance plot) rather than predictive.

5.4 Three headline conclusions

  1. The most predictive routine measurements are ST depression on exercise, thallium stress test, chest pain type, peak heart rate, and major vessels on fluoroscopy. All but chest pain type are direct measurements of the heart under load or of its blood supply.
  2. All four “good” models (logistic, LDA, GAM, RF) achieve test-set AUC ≈ 0.95. The single CART tree is the only clear loser; everything else is tied within noise.
  3. We tested for nonlinearity and did not find it (LRT p = 0.078). On this dataset, logistic regression suffices — a useful negative result.

6. Limitations

  • Small test set (n = 59). Differences in AUC smaller than ~0.03 are not meaningful at this sample size.
  • Single hospital, late 1980s data. Generalization to modern, more diverse populations is not guaranteed.
  • No hyperparameter sweep beyond mtry. A more aggressive random forest tune (or nested CV) might shift performance, though probably not by much.
  • Discrimination only, not calibration. All models were compared on AUC, not on how well-calibrated their predicted probabilities are. For clinical deployment, calibration would matter at least as much.
  • Binary outcome. The original Detrano data have a 0–4 severity scale; I collapsed to binary (any disease vs. none) for comparability with the standard literature, at the cost of throwing away ordinal information.
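
The ~0.03 figure in the first bullet can be motivated with the Hanley-McNeil (1982) approximation to the standard error of an AUC, sketched below. The helper name auc_se_hanley is hypothetical. The class counts (27 diseased, 32 non-diseased) are inferred from the reported sensitivities and specificities rather than stated in the report.

```r
# Sketch: Hanley-McNeil approximation to the standard error of an AUC.
# The class counts are inferred from the results table (27 diseased,
# 32 non-diseased); they are an assumption, not a reported number.
auc_se_hanley <- function(auc, n_pos, n_neg) {
  q1 <- auc / (2 - auc)
  q2 <- 2 * auc^2 / (1 + auc)
  sqrt((auc * (1 - auc) +
        (n_pos - 1) * (q1 - auc^2) +
        (n_neg - 1) * (q2 - auc^2)) / (n_pos * n_neg))
}

se <- auc_se_hanley(auc = 0.955, n_pos = 27, n_neg = 32)
round(se, 3)   # about 0.029
```

This is the standard error of a single AUC. The models are scored on the same 59 patients, so their AUCs are correlated, and a paired comparison such as DeLong's test (available as pROC::roc.test) would be the more appropriate way to test a difference between two models.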

7. Reproducibility

Every result in this document is produced by the accompanying R Markdown file (heart_disease_analysis.Rmd) running on the canonical UCI Cleveland data with set.seed(2026).

To reproduce:

install.packages(c(
  "tidyverse", "caret", "MASS", "mgcv",
  "rpart", "rpart.plot", "randomForest",
  "pROC", "corrplot", "knitr"
))
rmarkdown::render("heart_disease_analysis.Rmd")

The Rmd pulls the data directly from the UCI repository, so no local file is required.


8. References

  • Detrano, R. et al. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304–310.
  • Dua, D. and Graff, C. (2019). UCI Machine Learning Repository: Heart Disease Data Set. University of California, Irvine. https://archive.ics.uci.edu/ml/datasets/heart+disease
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman and Hall/CRC.