Problem Statement

The goal of this project is to predict whether a patient has heart disease using clinical measurements from the Cleveland heart disease dataset. This is a binary classification problem because the target variable is coded as 0 for no disease and 1 for disease.

A useful model in this setting could help support early risk screening, prioritize follow up testing, and improve clinical decision support. This dataset is small, so the project should be treated as a proof of concept rather than a production ready medical model.

Load Packages and Data

library(tidyverse)
library(caret)
library(randomForest)
library(nnet)
library(pROC)
library(knitr)

df <- read.csv("heart_disease_cleveland.csv", na.strings = c("", "NA", "?"))

Data Structure and Preparation

The dataset contains 303 rows and 14 columns. There are 13 predictor variables and 1 target variable.

dim(df)
## [1] 303  14
str(df)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : num  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : num  6 3 7 3 3 3 3 3 7 7 ...
##  $ target  : int  0 1 1 0 0 0 1 0 1 1 ...
head(df)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  1      145  233   1       2     150     0     2.3     3  0    6
## 2  67   1  4      160  286   0       2     108     1     1.5     2  3    3
## 3  67   1  4      120  229   0       2     129     1     2.6     2  2    7
## 4  37   1  3      130  250   0       0     187     0     3.5     3  0    3
## 5  41   0  2      130  204   0       2     172     0     1.4     1  0    3
## 6  56   1  2      120  236   0       0     178     0     0.8     1  0    3
##   target
## 1      0
## 2      1
## 3      1
## 4      0
## 5      0
## 6      0

For EDA, I keep the raw data. For modeling, I convert the target to a factor and handle missing values with median imputation. I kept the workflow intentionally simple so the code runs quickly and is easy to review.

missing_tbl <- data.frame(
  variable = names(df),
  missing_n = sapply(df, function(x) sum(is.na(x))),
  missing_pct = round(sapply(df, function(x) mean(is.na(x)) * 100), 2)
)

kable(missing_tbl, caption = "Missing values by variable")
Missing values by variable
variable missing_n missing_pct
age age 0 0.00
sex sex 0 0.00
cp cp 0 0.00
trestbps trestbps 0 0.00
chol chol 0 0.00
fbs fbs 0 0.00
restecg restecg 0 0.00
thalach thalach 0 0.00
exang exang 0 0.00
oldpeak oldpeak 0 0.00
slope slope 0 0.00
ca ca 4 1.32
thal thal 2 0.66
target target 0 0.00

Only ca and thal contain missing values, and the missingness is very small. ca has 4 missing values, or about 1.32 percent of the data, and thal has 2 missing values, or about 0.66 percent. Since the dataset is already small, imputation is a better choice than dropping rows.

model_df <- df %>%
  mutate(target = factor(ifelse(target == 1, "yes", "no"), levels = c("yes", "no")))

split_index <- createDataPartition(model_df$target, p = 0.80, list = FALSE)

train <- model_df[split_index, ]
test  <- model_df[-split_index, ]

pp_impute <- preProcess(train %>% select(-target), method = "medianImpute")

x_train <- predict(pp_impute, train %>% select(-target))
x_test  <- predict(pp_impute, test %>% select(-target))

train_rf <- data.frame(x_train, target = train$target)
test_rf  <- data.frame(x_test, target = test$target)

pp_scale <- preProcess(x_train, method = c("center", "scale"))

train_nn_x <- predict(pp_scale, x_train)
test_nn_x  <- predict(pp_scale, x_test)

train_nn <- data.frame(train_nn_x, target = train$target)
test_nn  <- data.frame(test_nn_x, target = test$target)

Exploratory Data Analysis

Class balance

class_balance <- model_df %>%
  count(target) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

kable(class_balance, caption = "Target distribution")
Target distribution
target n pct
yes 139 45.9
no 164 54.1

The classes are fairly balanced, with slightly more patients without disease than with disease. That is helpful because severe class imbalance is not a major concern here.

Central tendency and spread

I focused on the main continuous variables: age, resting blood pressure, cholesterol, maximum heart rate, and ST depression.

num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")

num_summary <- data.frame(
  variable = num_vars,
  mean = sapply(df[num_vars], function(x) round(mean(x, na.rm = TRUE), 2)),
  median = sapply(df[num_vars], function(x) round(median(x, na.rm = TRUE), 2)),
  sd = sapply(df[num_vars], function(x) round(sd(x, na.rm = TRUE), 2)),
  min = sapply(df[num_vars], function(x) round(min(x, na.rm = TRUE), 2)),
  max = sapply(df[num_vars], function(x) round(max(x, na.rm = TRUE), 2))
)

kable(num_summary, caption = "Summary statistics for continuous variables")
Summary statistics for continuous variables
variable mean median sd min max
age age 54.44 56.0 9.04 29 77.0
trestbps trestbps 131.69 130.0 17.60 94 200.0
chol chol 246.69 241.0 51.78 126 564.0
thalach thalach 149.61 153.0 22.88 71 202.0
oldpeak oldpeak 1.04 0.8 1.16 0 6.2

The center of the data looks reasonable for a heart disease population. Age is centered in the mid 50s. Resting blood pressure is centered around 130. Cholesterol has the widest spread, while oldpeak is more right skewed.

Overall distributions

df %>%
  select(all_of(num_vars)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distributions of continuous variables", x = NULL, y = "Count")

Most variables look reasonable, but cholesterol and oldpeak show some right skew. Maximum heart rate is more spread out but still fairly smooth.

Outliers

outlier_count <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  sum(x < lower | x > upper, na.rm = TRUE)
}

outlier_tbl <- data.frame(
  variable = num_vars,
  outliers = sapply(df[num_vars], outlier_count)
)

kable(outlier_tbl, caption = "Outlier counts using the 1.5 x IQR rule")
Outlier counts using the 1.5 x IQR rule
variable outliers
age age 0
trestbps trestbps 9
chol chol 5
thalach thalach 1
oldpeak oldpeak 5

There are some outliers in resting blood pressure, cholesterol, and oldpeak. I did not remove them because the dataset is small and these values may be clinically meaningful rather than data entry errors.

df %>%
  select(all_of(num_vars)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(fill = "tomato", alpha = 0.7) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Boxplots for continuous variables", x = NULL, y = NULL)

Correlation analysis

corr_mat <- cor(df, use = "pairwise.complete.obs")

target_corr <- data.frame(
  variable = rownames(corr_mat),
  correlation_with_target = round(corr_mat[, "target"], 3)
) %>%
  filter(variable != "target") %>%
  arrange(desc(abs(correlation_with_target)))

kable(target_corr, caption = "Correlation of predictors with target")
Correlation of predictors with target
variable correlation_with_target
thal thal 0.526
ca ca 0.460
exang exang 0.432
oldpeak oldpeak 0.425
thalach thalach -0.417
cp cp 0.414
slope slope 0.339
sex sex 0.277
age age 0.223
restecg restecg 0.169
trestbps trestbps 0.151
chol chol 0.085
fbs fbs 0.025

The strongest positive relationships with heart disease are thal, ca, exang, oldpeak, and cp. The strongest negative relationship is thalach, which suggests that lower maximum heart rate is associated with higher disease risk.

There is no sign of extreme multicollinearity. The strongest correlation among predictors is between oldpeak and slope, which is moderate rather than severe.

Relationships between variables

eda_df <- df %>%
  mutate(
    target = factor(target, levels = c(0, 1), labels = c("No Disease", "Disease")),
    sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
    exang = factor(exang, levels = c(0, 1), labels = c("No", "Yes")),
    cp = factor(cp)
  )
ggplot(eda_df, aes(x = thalach, y = oldpeak, color = target)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_minimal() +
  labs(
    title = "Maximum heart rate vs oldpeak by target class",
    x = "Maximum heart rate",
    y = "Oldpeak",
    color = "Target"
  )

Patients with disease tend to appear more often at lower thalach values and higher oldpeak values.

ggplot(eda_df, aes(x = cp, fill = target)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(
    title = "Heart disease rate by chest pain type",
    x = "Chest pain type",
    y = "Percent",
    fill = "Target"
  )

Chest pain type 4 has a much higher disease share than the other chest pain categories.

Categorical variable distributions

eda_df %>%
  select(sex, cp, exang, target) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar(fill = "darkcyan") +
  facet_wrap(~ variable, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Distributions of selected categorical variables", x = NULL, y = "Count")

The sample is mostly male, chest pain type 4 is the most common category, and most patients do not report exercise induced angina.

EDA summary

Key patterns from the exploratory analysis:

  1. Missing values are minimal and concentrated only in ca and thal.
  2. The target classes are fairly balanced.
  3. The strongest predictors appear to be thal, ca, exang, oldpeak, cp, and thalach.
  4. Patients with higher oldpeak, higher ca, and lower thalach appear more likely to have heart disease.
  5. This is clearly a classification problem, not a regression problem, because the outcome is binary.

Evaluation Metric

Because this is a medical classification problem, I used AUC as the main evaluation metric. AUC is useful because it evaluates how well the model separates the two classes across many classification thresholds. I also report accuracy, precision, recall, and F1 score to give a fuller view of performance.

get_metrics <- function(actual, predicted, prob_yes) {
  cm <- confusionMatrix(predicted, actual, positive = "yes")
  precision <- unname(cm$byClass["Pos Pred Value"])
  recall <- unname(cm$byClass["Sensitivity"])
  f1 <- ifelse(
    is.na(precision) | is.na(recall) | (precision + recall == 0),
    NA,
    2 * precision * recall / (precision + recall)
  )

  data.frame(
    accuracy = round(unname(cm$overall["Accuracy"]), 3),
    precision = round(precision, 3),
    recall = round(recall, 3),
    f1 = round(f1, 3),
    auc = round(as.numeric(roc(actual, prob_yes, levels = c("no", "yes"), quiet = TRUE)$auc), 3)
  )
}

Algorithm Selection

The two algorithms selected for experimentation are:

Random forest is a strong choice for tabular data and can capture nonlinear patterns with little preprocessing. Neural networks are more flexible, but they usually need more data and more tuning. Since this dataset is small, it is useful to compare them directly.

Random Forest Experiments

Experiment 1: Baseline random forest

Objective / hypothesis: A basic random forest with standard settings should provide a strong baseline on this dataset.

What changes: ntree = 200 and mtry = sqrt(p)

set.seed(622)
rf_exp1 <- randomForest(
  target ~ .,
  data = train_rf,
  ntree = 200,
  mtry = floor(sqrt(ncol(train_rf) - 1)),
  importance = TRUE
)

rf1_prob <- predict(rf_exp1, test_rf, type = "prob")[, "yes"]
rf1_pred <- predict(rf_exp1, test_rf, type = "response")
rf1_metrics <- get_metrics(test_rf$target, rf1_pred, rf1_prob)

kable(rf1_metrics, caption = "Random Forest Experiment 1 results")
Random Forest Experiment 1 results
accuracy precision recall f1 auc
0.864 0.852 0.852 0.852 0.933

Random forest variable importance

rf_importance <- data.frame(
  variable = rownames(importance(rf_exp2)),
  importance = importance(rf_exp2)[, "MeanDecreaseGini"]
) %>%
  arrange(desc(importance))

kable(head(rf_importance, 10), digits = 2, caption = "Top 10 random forest importance values")
Top 10 random forest importance values
variable importance
cp cp 15.58
ca ca 14.90
thal thal 14.64
oldpeak oldpeak 13.81
thalach thalach 11.02
trestbps trestbps 8.38
chol chol 8.20
age age 8.12
slope slope 4.57
exang exang 4.28

The most important variables indeed line up with the EDA findings. Based on the EDA, I expected variables such as ca, thal, cp, oldpeak, and thalach to appear near the top.

Neural Network Experiments

For the neural network models, I used the same imputed training data but added centering and scaling because neural networks are more sensitive to variable scale.

nn_ctrl <- trainControl(
  method = "none",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

Experiment 1: Smaller neural network

Objective / hypothesis: A smaller network with stronger regularization should provide a stable baseline and reduce overfitting.

What changes: size = 3, decay = 0.10

set.seed(622)
nn_exp1 <- train(
  target ~ .,
  data = train_nn,
  method = "nnet",
  trControl = nn_ctrl,
  tuneGrid = data.frame(size = 3, decay = 0.10),
  metric = "ROC",
  maxit = 300,
  trace = FALSE
)

nn1_prob <- predict(nn_exp1, test_nn, type = "prob")[, "yes"]
nn1_pred <- predict(nn_exp1, test_nn, type = "raw")
nn1_metrics <- get_metrics(test_nn$target, nn1_pred, nn1_prob)

kable(nn1_metrics, caption = "Neural Network Experiment 1 results")
Neural Network Experiment 1 results
accuracy precision recall f1 auc
0.831 0.793 0.852 0.821 0.888

Experiment 2: Larger neural network

Objective / hypothesis: A slightly larger hidden layer may capture more nonlinear structure and improve performance.

What changes: size = 5, decay = 0.01

set.seed(622)
nn_exp2 <- train(
  target ~ .,
  data = train_nn,
  method = "nnet",
  trControl = nn_ctrl,
  tuneGrid = data.frame(size = 5, decay = 0.01),
  metric = "ROC",
  maxit = 300,
  trace = FALSE
)

nn2_prob <- predict(nn_exp2, test_nn, type = "prob")[, "yes"]
nn2_pred <- predict(nn_exp2, test_nn, type = "raw")
nn2_metrics <- get_metrics(test_nn$target, nn2_pred, nn2_prob)

kable(nn2_metrics, caption = "Neural Network Experiment 2 results")
Neural Network Experiment 2 results
accuracy precision recall f1 auc
0.864 0.828 0.889 0.857 0.918

Experiment Comparison

results_tbl <- bind_rows(
  data.frame(
    algorithm = "Random Forest",
    experiment = "RF 1",
    objective = "Baseline forest with standard settings",
    rf1_metrics
  ),
  data.frame(
    algorithm = "Random Forest",
    experiment = "RF 2",
    objective = "Larger forest with adjusted mtry and nodesize",
    rf2_metrics
  ),
  data.frame(
    algorithm = "Neural Network",
    experiment = "NN 1",
    objective = "Smaller network with stronger regularization",
    nn1_metrics
  ),
  data.frame(
    algorithm = "Neural Network",
    experiment = "NN 2",
    objective = "Larger network with less regularization",
    nn2_metrics
  )
)

kable(results_tbl, caption = "Comparison of all experiments")
Comparison of all experiments
algorithm experiment objective accuracy precision recall f1 auc
Random Forest RF 1 Baseline forest with standard settings 0.864 0.852 0.852 0.852 0.933
Random Forest RF 2 Larger forest with adjusted mtry and nodesize 0.864 0.852 0.852 0.852 0.937
Neural Network NN 1 Smaller network with stronger regularization 0.831 0.793 0.852 0.821 0.888
Neural Network NN 2 Larger network with less regularization 0.864 0.828 0.889 0.857 0.918

Answers to the Main Questions

Conclusion and Business Impact

This project used the Cleveland heart disease dataset to predict whether a patient has heart disease. The EDA showed that missing values were minimal, the target classes were reasonably balanced, and the strongest signals appeared to come from variables such as thal, ca, exang, oldpeak, cp, and thalach.

I tested two random forest models and two neural network models. The final recommendation is the RF2 model with the strongest AUC and the best balance of accuracy, precision, recall, and F1 score. On a dataset like this, random forest is usually the safer final choice because it is strong on tabular data, easier to tune, and easier to explain than a neural network.

From a business and healthcare perspective, a model like this could be used as a screening support tool to help identify higher risk patients earlier. It could improve prioritization for additional testing and support more efficient resource allocation. However, because the dataset is small and medical decisions are high stakes, the model should only be used as a support tool and not as a replacement for clinical judgment.