The goal of this project is to predict whether a patient has heart disease using clinical measurements from the Cleveland heart disease dataset. This is a binary classification problem because the target variable is coded as 0 for no disease and 1 for disease.
A useful model in this setting could help support early risk screening, prioritize follow up testing, and improve clinical decision support. This dataset is small, so the project should be treated as a proof of concept rather than a production ready medical model.
library(tidyverse)
library(caret)
library(randomForest)
library(nnet)
library(pROC)
library(knitr)
df <- read.csv("heart_disease_cleveland.csv", na.strings = c("", "NA", "?"))
The dataset contains 303 rows and 14 columns. There are 13 predictor variables and 1 target variable.
dim(df)
## [1] 303 14
str(df)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : num 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : num 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : num 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : num 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : num 0 3 2 0 0 0 2 0 1 0 ...
## $ thal : num 6 3 7 3 3 3 3 3 7 7 ...
## $ target : int 0 1 1 0 0 0 1 0 1 1 ...
head(df)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3
## target
## 1 0
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
For EDA, I keep the raw data. For modeling, I convert the target to a factor and handle missing values with median imputation. I kept the workflow intentionally simple so the code runs quickly and is easy to review.
missing_tbl <- data.frame(
variable = names(df),
missing_n = sapply(df, function(x) sum(is.na(x))),
missing_pct = round(sapply(df, function(x) mean(is.na(x)) * 100), 2)
)
kable(missing_tbl, caption = "Missing values by variable")
| variable | missing_n | missing_pct | |
|---|---|---|---|
| age | age | 0 | 0.00 |
| sex | sex | 0 | 0.00 |
| cp | cp | 0 | 0.00 |
| trestbps | trestbps | 0 | 0.00 |
| chol | chol | 0 | 0.00 |
| fbs | fbs | 0 | 0.00 |
| restecg | restecg | 0 | 0.00 |
| thalach | thalach | 0 | 0.00 |
| exang | exang | 0 | 0.00 |
| oldpeak | oldpeak | 0 | 0.00 |
| slope | slope | 0 | 0.00 |
| ca | ca | 4 | 1.32 |
| thal | thal | 2 | 0.66 |
| target | target | 0 | 0.00 |
Only ca and thal contain missing values,
and the missingness is very small. ca has 4 missing values,
or about 1.32 percent of the data, and thal has 2 missing
values, or about 0.66 percent. Since the dataset is already small,
imputation is a better choice than dropping rows.
model_df <- df %>%
mutate(target = factor(ifelse(target == 1, "yes", "no"), levels = c("yes", "no")))
split_index <- createDataPartition(model_df$target, p = 0.80, list = FALSE)
train <- model_df[split_index, ]
test <- model_df[-split_index, ]
pp_impute <- preProcess(train %>% select(-target), method = "medianImpute")
x_train <- predict(pp_impute, train %>% select(-target))
x_test <- predict(pp_impute, test %>% select(-target))
train_rf <- data.frame(x_train, target = train$target)
test_rf <- data.frame(x_test, target = test$target)
pp_scale <- preProcess(x_train, method = c("center", "scale"))
train_nn_x <- predict(pp_scale, x_train)
test_nn_x <- predict(pp_scale, x_test)
train_nn <- data.frame(train_nn_x, target = train$target)
test_nn <- data.frame(test_nn_x, target = test$target)
class_balance <- model_df %>%
count(target) %>%
mutate(pct = round(n / sum(n) * 100, 1))
kable(class_balance, caption = "Target distribution")
| target | n | pct |
|---|---|---|
| yes | 139 | 45.9 |
| no | 164 | 54.1 |
The classes are fairly balanced, with slightly more patients without disease than with disease. That is helpful because severe class imbalance is not a major concern here.
I focused on the main continuous variables: age, resting blood pressure, cholesterol, maximum heart rate, and ST depression.
num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
num_summary <- data.frame(
variable = num_vars,
mean = sapply(df[num_vars], function(x) round(mean(x, na.rm = TRUE), 2)),
median = sapply(df[num_vars], function(x) round(median(x, na.rm = TRUE), 2)),
sd = sapply(df[num_vars], function(x) round(sd(x, na.rm = TRUE), 2)),
min = sapply(df[num_vars], function(x) round(min(x, na.rm = TRUE), 2)),
max = sapply(df[num_vars], function(x) round(max(x, na.rm = TRUE), 2))
)
kable(num_summary, caption = "Summary statistics for continuous variables")
| variable | mean | median | sd | min | max | |
|---|---|---|---|---|---|---|
| age | age | 54.44 | 56.0 | 9.04 | 29 | 77.0 |
| trestbps | trestbps | 131.69 | 130.0 | 17.60 | 94 | 200.0 |
| chol | chol | 246.69 | 241.0 | 51.78 | 126 | 564.0 |
| thalach | thalach | 149.61 | 153.0 | 22.88 | 71 | 202.0 |
| oldpeak | oldpeak | 1.04 | 0.8 | 1.16 | 0 | 6.2 |
The center of the data looks reasonable for a heart disease
population. Age is centered in the mid 50s. Resting blood pressure is
centered around 130. Cholesterol has the widest spread, while
oldpeak is more right skewed.
df %>%
select(all_of(num_vars)) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 20, fill = "steelblue", color = "white") +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(title = "Distributions of continuous variables", x = NULL, y = "Count")
Most variables look reasonable, but cholesterol and
oldpeak show some right skew. Maximum heart rate is more
spread out but still fairly smooth.
outlier_count <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
sum(x < lower | x > upper, na.rm = TRUE)
}
outlier_tbl <- data.frame(
variable = num_vars,
outliers = sapply(df[num_vars], outlier_count)
)
kable(outlier_tbl, caption = "Outlier counts using the 1.5 x IQR rule")
| variable | outliers | |
|---|---|---|
| age | age | 0 |
| trestbps | trestbps | 9 |
| chol | chol | 5 |
| thalach | thalach | 1 |
| oldpeak | oldpeak | 5 |
There are some outliers in resting blood pressure, cholesterol, and
oldpeak. I did not remove them because the dataset is small
and these values may be clinically meaningful rather than data entry
errors.
df %>%
select(all_of(num_vars)) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot(fill = "tomato", alpha = 0.7) +
coord_flip() +
theme_minimal() +
labs(title = "Boxplots for continuous variables", x = NULL, y = NULL)
corr_mat <- cor(df, use = "pairwise.complete.obs")
target_corr <- data.frame(
variable = rownames(corr_mat),
correlation_with_target = round(corr_mat[, "target"], 3)
) %>%
filter(variable != "target") %>%
arrange(desc(abs(correlation_with_target)))
kable(target_corr, caption = "Correlation of predictors with target")
| variable | correlation_with_target | |
|---|---|---|
| thal | thal | 0.526 |
| ca | ca | 0.460 |
| exang | exang | 0.432 |
| oldpeak | oldpeak | 0.425 |
| thalach | thalach | -0.417 |
| cp | cp | 0.414 |
| slope | slope | 0.339 |
| sex | sex | 0.277 |
| age | age | 0.223 |
| restecg | restecg | 0.169 |
| trestbps | trestbps | 0.151 |
| chol | chol | 0.085 |
| fbs | fbs | 0.025 |
The strongest positive relationships with heart disease are
thal, ca, exang,
oldpeak, and cp. The strongest negative
relationship is thalach, which suggests that lower maximum
heart rate is associated with higher disease risk.
There is no sign of extreme multicollinearity. The strongest
correlation among predictors is between oldpeak and
slope, which is moderate rather than severe.
eda_df <- df %>%
mutate(
target = factor(target, levels = c(0, 1), labels = c("No Disease", "Disease")),
sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
exang = factor(exang, levels = c(0, 1), labels = c("No", "Yes")),
cp = factor(cp)
)
ggplot(eda_df, aes(x = thalach, y = oldpeak, color = target)) +
geom_point(size = 2, alpha = 0.8) +
theme_minimal() +
labs(
title = "Maximum heart rate vs oldpeak by target class",
x = "Maximum heart rate",
y = "Oldpeak",
color = "Target"
)
Patients with disease tend to appear more often at lower
thalach values and higher oldpeak values.
ggplot(eda_df, aes(x = cp, fill = target)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal() +
labs(
title = "Heart disease rate by chest pain type",
x = "Chest pain type",
y = "Percent",
fill = "Target"
)
Chest pain type 4 has a much higher disease share than the other chest pain categories.
eda_df %>%
select(sex, cp, exang, target) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_bar(fill = "darkcyan") +
facet_wrap(~ variable, scales = "free") +
coord_flip() +
theme_minimal() +
labs(title = "Distributions of selected categorical variables", x = NULL, y = "Count")
The sample is mostly male, chest pain type 4 is the most common category, and most patients do not report exercise induced angina.
Key patterns from the exploratory analysis:
ca
and thal.thal,
ca, exang, oldpeak,
cp, and thalach.oldpeak, higher ca,
and lower thalach appear more likely to have heart
disease.Because this is a medical classification problem, I used AUC as the main evaluation metric. AUC is useful because it evaluates how well the model separates the two classes across many classification thresholds. I also report accuracy, precision, recall, and F1 score to give a fuller view of performance.
get_metrics <- function(actual, predicted, prob_yes) {
cm <- confusionMatrix(predicted, actual, positive = "yes")
precision <- unname(cm$byClass["Pos Pred Value"])
recall <- unname(cm$byClass["Sensitivity"])
f1 <- ifelse(
is.na(precision) | is.na(recall) | (precision + recall == 0),
NA,
2 * precision * recall / (precision + recall)
)
data.frame(
accuracy = round(unname(cm$overall["Accuracy"]), 3),
precision = round(precision, 3),
recall = round(recall, 3),
f1 = round(f1, 3),
auc = round(as.numeric(roc(actual, prob_yes, levels = c("no", "yes"), quiet = TRUE)$auc), 3)
)
}
The two algorithms selected for experimentation are:
Random forest is a strong choice for tabular data and can capture nonlinear patterns with little preprocessing. Neural networks are more flexible, but they usually need more data and more tuning. Since this dataset is small, it is useful to compare them directly.
Objective / hypothesis: A basic random forest with standard settings should provide a strong baseline on this dataset.
What changes: ntree = 200 and
mtry = sqrt(p)
set.seed(622)
rf_exp1 <- randomForest(
target ~ .,
data = train_rf,
ntree = 200,
mtry = floor(sqrt(ncol(train_rf) - 1)),
importance = TRUE
)
rf1_prob <- predict(rf_exp1, test_rf, type = "prob")[, "yes"]
rf1_pred <- predict(rf_exp1, test_rf, type = "response")
rf1_metrics <- get_metrics(test_rf$target, rf1_pred, rf1_prob)
kable(rf1_metrics, caption = "Random Forest Experiment 1 results")
| accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|
| 0.864 | 0.852 | 0.852 | 0.852 | 0.933 |
Objective / hypothesis: Increasing the number of
trees and slightly changing mtry may improve stability and
AUC.
What changes: ntree = 500,
mtry = 4, and nodesize = 3
set.seed(622)
rf_exp2 <- randomForest(
target ~ .,
data = train_rf,
ntree = 500,
mtry = 4,
nodesize = 3,
importance = TRUE
)
rf2_prob <- predict(rf_exp2, test_rf, type = "prob")[, "yes"]
rf2_pred <- predict(rf_exp2, test_rf, type = "response")
rf2_metrics <- get_metrics(test_rf$target, rf2_pred, rf2_prob)
kable(rf2_metrics, caption = "Random Forest Experiment 2 results")
| accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|
| 0.864 | 0.852 | 0.852 | 0.852 | 0.937 |
rf_importance <- data.frame(
variable = rownames(importance(rf_exp2)),
importance = importance(rf_exp2)[, "MeanDecreaseGini"]
) %>%
arrange(desc(importance))
kable(head(rf_importance, 10), digits = 2, caption = "Top 10 random forest importance values")
| variable | importance | |
|---|---|---|
| cp | cp | 15.58 |
| ca | ca | 14.90 |
| thal | thal | 14.64 |
| oldpeak | oldpeak | 13.81 |
| thalach | thalach | 11.02 |
| trestbps | trestbps | 8.38 |
| chol | chol | 8.20 |
| age | age | 8.12 |
| slope | slope | 4.57 |
| exang | exang | 4.28 |
The most important variables indeed line up with the EDA findings.
Based on the EDA, I expected variables such as ca,
thal, cp, oldpeak, and
thalach to appear near the top.
For the neural network models, I used the same imputed training data but added centering and scaling because neural networks are more sensitive to variable scale.
nn_ctrl <- trainControl(
method = "none",
classProbs = TRUE,
summaryFunction = twoClassSummary
)
Objective / hypothesis: A smaller network with stronger regularization should provide a stable baseline and reduce overfitting.
What changes: size = 3,
decay = 0.10
set.seed(622)
nn_exp1 <- train(
target ~ .,
data = train_nn,
method = "nnet",
trControl = nn_ctrl,
tuneGrid = data.frame(size = 3, decay = 0.10),
metric = "ROC",
maxit = 300,
trace = FALSE
)
nn1_prob <- predict(nn_exp1, test_nn, type = "prob")[, "yes"]
nn1_pred <- predict(nn_exp1, test_nn, type = "raw")
nn1_metrics <- get_metrics(test_nn$target, nn1_pred, nn1_prob)
kable(nn1_metrics, caption = "Neural Network Experiment 1 results")
| accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|
| 0.831 | 0.793 | 0.852 | 0.821 | 0.888 |
Objective / hypothesis: A slightly larger hidden layer may capture more nonlinear structure and improve performance.
What changes: size = 5,
decay = 0.01
set.seed(622)
nn_exp2 <- train(
target ~ .,
data = train_nn,
method = "nnet",
trControl = nn_ctrl,
tuneGrid = data.frame(size = 5, decay = 0.01),
metric = "ROC",
maxit = 300,
trace = FALSE
)
nn2_prob <- predict(nn_exp2, test_nn, type = "prob")[, "yes"]
nn2_pred <- predict(nn_exp2, test_nn, type = "raw")
nn2_metrics <- get_metrics(test_nn$target, nn2_pred, nn2_prob)
kable(nn2_metrics, caption = "Neural Network Experiment 2 results")
| accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|
| 0.864 | 0.828 | 0.889 | 0.857 | 0.918 |
results_tbl <- bind_rows(
data.frame(
algorithm = "Random Forest",
experiment = "RF 1",
objective = "Baseline forest with standard settings",
rf1_metrics
),
data.frame(
algorithm = "Random Forest",
experiment = "RF 2",
objective = "Larger forest with adjusted mtry and nodesize",
rf2_metrics
),
data.frame(
algorithm = "Neural Network",
experiment = "NN 1",
objective = "Smaller network with stronger regularization",
nn1_metrics
),
data.frame(
algorithm = "Neural Network",
experiment = "NN 2",
objective = "Larger network with less regularization",
nn2_metrics
)
)
kable(results_tbl, caption = "Comparison of all experiments")
| algorithm | experiment | objective | accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|---|---|---|
| Random Forest | RF 1 | Baseline forest with standard settings | 0.864 | 0.852 | 0.852 | 0.852 | 0.933 |
| Random Forest | RF 2 | Larger forest with adjusted mtry and nodesize | 0.864 | 0.852 | 0.852 | 0.852 | 0.937 |
| Neural Network | NN 1 | Smaller network with stronger regularization | 0.831 | 0.793 | 0.852 | 0.821 | 0.888 |
| Neural Network | NN 2 | Larger network with less regularization | 0.864 | 0.828 | 0.889 | 0.857 | 0.918 |
Based on this type of dataset, I would recommend random forest if the goal is the most accurate and stable results with minimal tuning. Random forest usually performs very well on small tabular datasets like this one and is less sensitive to preprocessing choices than a neural network.
After running the experiments, the best model is the RF2 based on the highest AUC and supported by the other metrics.
My reasons are:
That said, the neural network is still useful to test because it provides a good comparison and shows whether added model flexibility improves performance.
This project used the Cleveland heart disease dataset to predict
whether a patient has heart disease. The EDA showed that missing values
were minimal, the target classes were reasonably balanced, and the
strongest signals appeared to come from variables such as
thal, ca, exang,
oldpeak, cp, and thalach.
I tested two random forest models and two neural network models. The final recommendation is the RF2 model with the strongest AUC and the best balance of accuracy, precision, recall, and F1 score. On a dataset like this, random forest is usually the safer final choice because it is strong on tabular data, easier to tune, and easier to explain than a neural network.
From a business and healthcare perspective, a model like this could be used as a screening support tool to help identify higher risk patients earlier. It could improve prioritization for additional testing and support more efficient resource allocation. However, because the dataset is small and medical decisions are high stakes, the model should only be used as a support tool and not as a replacement for clinical judgment.