Assignment 4 - Heart Disease Classification Capstone

Problem Statement

The goal of this project is to predict whether a patient has heart disease using clinical measurements from the Cleveland heart disease dataset. This is a binary classification problem because the target variable is coded as 0 for no disease and 1 for disease.

A useful model in this setting could help support early risk screening, prioritize follow up testing, and improve clinical decision support. This dataset is small, so the project should be treated as a proof of concept rather than a production ready medical model.

Load Packages and Data

library(tidyverse)
library(caret)
library(randomForest)
library(nnet)
library(pROC)
library(knitr)

df <- read.csv("heart_disease_cleveland.csv", na.strings = c("", "NA", "?"))

Data Structure and Preparation

The dataset contains 303 rows and 14 columns. There are 13 predictor variables and 1 target variable.

dim(df)

## [1] 303  14

str(df)

## 'data.frame':    303 obs. of  14 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : num  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : num  6 3 7 3 3 3 3 3 7 7 ...
##  $ target  : int  0 1 1 0 0 0 1 0 1 1 ...

head(df)

##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  1      145  233   1       2     150     0     2.3     3  0    6
## 2  67   1  4      160  286   0       2     108     1     1.5     2  3    3
## 3  67   1  4      120  229   0       2     129     1     2.6     2  2    7
## 4  37   1  3      130  250   0       0     187     0     3.5     3  0    3
## 5  41   0  2      130  204   0       2     172     0     1.4     1  0    3
## 6  56   1  2      120  236   0       0     178     0     0.8     1  0    3
##   target
## 1      0
## 2      1
## 3      1
## 4      0
## 5      0
## 6      0

For EDA, I keep the raw data. For modeling, I convert the target to a factor and handle missing values with median imputation. I kept the workflow intentionally simple so the code runs quickly and is easy to review.

missing_tbl <- data.frame(
  variable = names(df),
  missing_n = sapply(df, function(x) sum(is.na(x))),
  missing_pct = round(sapply(df, function(x) mean(is.na(x)) * 100), 2)
)

kable(missing_tbl, caption = "Missing values by variable")

Missing values by variable
	variable	missing_n	missing_pct
age	age	0	0.00
sex	sex	0	0.00
cp	cp	0	0.00
trestbps	trestbps	0	0.00
chol	chol	0	0.00
fbs	fbs	0	0.00
restecg	restecg	0	0.00
thalach	thalach	0	0.00
exang	exang	0	0.00
oldpeak	oldpeak	0	0.00
slope	slope	0	0.00
ca	ca	4	1.32
thal	thal	2	0.66
target	target	0	0.00

Only ca and thal contain missing values, and the missingness is very small. ca has 4 missing values, or about 1.32 percent of the data, and thal has 2 missing values, or about 0.66 percent. Since the dataset is already small, imputation is a better choice than dropping rows.

model_df <- df %>%
  mutate(target = factor(ifelse(target == 1, "yes", "no"), levels = c("yes", "no")))

split_index <- createDataPartition(model_df$target, p = 0.80, list = FALSE)

train <- model_df[split_index, ]
test  <- model_df[-split_index, ]

pp_impute <- preProcess(train %>% select(-target), method = "medianImpute")

x_train <- predict(pp_impute, train %>% select(-target))
x_test  <- predict(pp_impute, test %>% select(-target))

train_rf <- data.frame(x_train, target = train$target)
test_rf  <- data.frame(x_test, target = test$target)

pp_scale <- preProcess(x_train, method = c("center", "scale"))

train_nn_x <- predict(pp_scale, x_train)
test_nn_x  <- predict(pp_scale, x_test)

train_nn <- data.frame(train_nn_x, target = train$target)
test_nn  <- data.frame(test_nn_x, target = test$target)

Exploratory Data Analysis

Class balance

class_balance <- model_df %>%
  count(target) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

kable(class_balance, caption = "Target distribution")

Target distribution
target	n	pct
yes	139	45.9
no	164	54.1

The classes are fairly balanced, with slightly more patients without disease than with disease. That is helpful because severe class imbalance is not a major concern here.

Central tendency and spread

I focused on the main continuous variables: age, resting blood pressure, cholesterol, maximum heart rate, and ST depression.

num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")

num_summary <- data.frame(
  variable = num_vars,
  mean = sapply(df[num_vars], function(x) round(mean(x, na.rm = TRUE), 2)),
  median = sapply(df[num_vars], function(x) round(median(x, na.rm = TRUE), 2)),
  sd = sapply(df[num_vars], function(x) round(sd(x, na.rm = TRUE), 2)),
  min = sapply(df[num_vars], function(x) round(min(x, na.rm = TRUE), 2)),
  max = sapply(df[num_vars], function(x) round(max(x, na.rm = TRUE), 2))
)

kable(num_summary, caption = "Summary statistics for continuous variables")

Summary statistics for continuous variables
	variable	mean	median	sd	min	max
age	age	54.44	56.0	9.04	29	77.0
trestbps	trestbps	131.69	130.0	17.60	94	200.0
chol	chol	246.69	241.0	51.78	126	564.0
thalach	thalach	149.61	153.0	22.88	71	202.0
oldpeak	oldpeak	1.04	0.8	1.16	0	6.2

The center of the data looks reasonable for a heart disease population. Age is centered in the mid 50s. Resting blood pressure is centered around 130. Cholesterol has the widest spread, while oldpeak is more right skewed.

Overall distributions

df %>%
  select(all_of(num_vars)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distributions of continuous variables", x = NULL, y = "Count")

Most variables look reasonable, but cholesterol and oldpeak show some right skew. Maximum heart rate is more spread out but still fairly smooth.

Outliers

outlier_count <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  sum(x < lower | x > upper, na.rm = TRUE)
}

outlier_tbl <- data.frame(
  variable = num_vars,
  outliers = sapply(df[num_vars], outlier_count)
)

kable(outlier_tbl, caption = "Outlier counts using the 1.5 x IQR rule")

Outlier counts using the 1.5 x IQR rule
	variable	outliers
age	age	0
trestbps	trestbps	9
chol	chol	5
thalach	thalach	1
oldpeak	oldpeak	5

There are some outliers in resting blood pressure, cholesterol, and oldpeak. I did not remove them because the dataset is small and these values may be clinically meaningful rather than data entry errors.

df %>%
  select(all_of(num_vars)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(fill = "tomato", alpha = 0.7) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Boxplots for continuous variables", x = NULL, y = NULL)

Correlation analysis

corr_mat <- cor(df, use = "pairwise.complete.obs")

target_corr <- data.frame(
  variable = rownames(corr_mat),
  correlation_with_target = round(corr_mat[, "target"], 3)
) %>%
  filter(variable != "target") %>%
  arrange(desc(abs(correlation_with_target)))

kable(target_corr, caption = "Correlation of predictors with target")

Correlation of predictors with target
	variable	correlation_with_target
thal	thal	0.526
ca	ca	0.460
exang	exang	0.432
oldpeak	oldpeak	0.425
thalach	thalach	-0.417
cp	cp	0.414
slope	slope	0.339
sex	sex	0.277
age	age	0.223
restecg	restecg	0.169
trestbps	trestbps	0.151
chol	chol	0.085
fbs	fbs	0.025

The strongest positive relationships with heart disease are thal, ca, exang, oldpeak, and cp. The strongest negative relationship is thalach, which suggests that lower maximum heart rate is associated with higher disease risk.

There is no sign of extreme multicollinearity. The strongest correlation among predictors is between oldpeak and slope, which is moderate rather than severe.

Relationships between variables

eda_df <- df %>%
  mutate(
    target = factor(target, levels = c(0, 1), labels = c("No Disease", "Disease")),
    sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
    exang = factor(exang, levels = c(0, 1), labels = c("No", "Yes")),
    cp = factor(cp)
  )

ggplot(eda_df, aes(x = thalach, y = oldpeak, color = target)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_minimal() +
  labs(
    title = "Maximum heart rate vs oldpeak by target class",
    x = "Maximum heart rate",
    y = "Oldpeak",
    color = "Target"
  )

Patients with disease tend to appear more often at lower thalach values and higher oldpeak values.

ggplot(eda_df, aes(x = cp, fill = target)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(
    title = "Heart disease rate by chest pain type",
    x = "Chest pain type",
    y = "Percent",
    fill = "Target"
  )

Chest pain type 4 has a much higher disease share than the other chest pain categories.

Categorical variable distributions

eda_df %>%
  select(sex, cp, exang, target) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar(fill = "darkcyan") +
  facet_wrap(~ variable, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Distributions of selected categorical variables", x = NULL, y = "Count")

The sample is mostly male, chest pain type 4 is the most common category, and most patients do not report exercise induced angina.

EDA summary

Key patterns from the exploratory analysis:

Missing values are minimal and concentrated only in ca and thal.
The target classes are fairly balanced.
The strongest predictors appear to be thal, ca, exang, oldpeak, cp, and thalach.
Patients with higher oldpeak, higher ca, and lower thalach appear more likely to have heart disease.
This is clearly a classification problem, not a regression problem, because the outcome is binary.

Evaluation Metric

Because this is a medical classification problem, I used AUC as the main evaluation metric. AUC is useful because it evaluates how well the model separates the two classes across many classification thresholds. I also report accuracy, precision, recall, and F1 score to give a fuller view of performance.

get_metrics <- function(actual, predicted, prob_yes) {
  cm <- confusionMatrix(predicted, actual, positive = "yes")
  precision <- unname(cm$byClass["Pos Pred Value"])
  recall <- unname(cm$byClass["Sensitivity"])
  f1 <- ifelse(
    is.na(precision) | is.na(recall) | (precision + recall == 0),
    NA,
    2 * precision * recall / (precision + recall)
  )

  data.frame(
    accuracy = round(unname(cm$overall["Accuracy"]), 3),
    precision = round(precision, 3),
    recall = round(recall, 3),
    f1 = round(f1, 3),
    auc = round(as.numeric(roc(actual, prob_yes, levels = c("no", "yes"), quiet = TRUE)$auc), 3)
  )
}

Algorithm Selection

The two algorithms selected for experimentation are:

Random Forest
Neural Network

Random forest is a strong choice for tabular data and can capture nonlinear patterns with little preprocessing. Neural networks are more flexible, but they usually need more data and more tuning. Since this dataset is small, it is useful to compare them directly.

Random Forest Experiments

Experiment 1: Baseline random forest

Objective / hypothesis: A basic random forest with standard settings should provide a strong baseline on this dataset.

What changes: ntree = 200 and mtry = sqrt(p)

set.seed(622)
rf_exp1 <- randomForest(
  target ~ .,
  data = train_rf,
  ntree = 200,
  mtry = floor(sqrt(ncol(train_rf) - 1)),
  importance = TRUE
)

rf1_prob <- predict(rf_exp1, test_rf, type = "prob")[, "yes"]
rf1_pred <- predict(rf_exp1, test_rf, type = "response")
rf1_metrics <- get_metrics(test_rf$target, rf1_pred, rf1_prob)

kable(rf1_metrics, caption = "Random Forest Experiment 1 results")

Random Forest Experiment 1 results
accuracy	precision	recall	f1	auc
0.864	0.852	0.852	0.852	0.933

Experiment 2: Larger forest with a slightly different feature search

Objective / hypothesis: Increasing the number of trees and slightly changing mtry may improve stability and AUC.

What changes: ntree = 500, mtry = 4, and nodesize = 3

set.seed(622)
rf_exp2 <- randomForest(
  target ~ .,
  data = train_rf,
  ntree = 500,
  mtry = 4,
  nodesize = 3,
  importance = TRUE
)

rf2_prob <- predict(rf_exp2, test_rf, type = "prob")[, "yes"]
rf2_pred <- predict(rf_exp2, test_rf, type = "response")
rf2_metrics <- get_metrics(test_rf$target, rf2_pred, rf2_prob)

kable(rf2_metrics, caption = "Random Forest Experiment 2 results")

Random Forest Experiment 2 results
accuracy	precision	recall	f1	auc
0.864	0.852	0.852	0.852	0.937

Random forest variable importance

rf_importance <- data.frame(
  variable = rownames(importance(rf_exp2)),
  importance = importance(rf_exp2)[, "MeanDecreaseGini"]
) %>%
  arrange(desc(importance))

kable(head(rf_importance, 10), digits = 2, caption = "Top 10 random forest importance values")

Top 10 random forest importance values
	variable	importance
cp	cp	15.58
ca	ca	14.90
thal	thal	14.64
oldpeak	oldpeak	13.81
thalach	thalach	11.02
trestbps	trestbps	8.38
chol	chol	8.20
age	age	8.12
slope	slope	4.57
exang	exang	4.28

The most important variables indeed line up with the EDA findings. Based on the EDA, I expected variables such as ca, thal, cp, oldpeak, and thalach to appear near the top.

Neural Network Experiments

For the neural network models, I used the same imputed training data but added centering and scaling because neural networks are more sensitive to variable scale.

nn_ctrl <- trainControl(
  method = "none",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

Experiment 1: Smaller neural network

Objective / hypothesis: A smaller network with stronger regularization should provide a stable baseline and reduce overfitting.

What changes: size = 3, decay = 0.10

set.seed(622)
nn_exp1 <- train(
  target ~ .,
  data = train_nn,
  method = "nnet",
  trControl = nn_ctrl,
  tuneGrid = data.frame(size = 3, decay = 0.10),
  metric = "ROC",
  maxit = 300,
  trace = FALSE
)

nn1_prob <- predict(nn_exp1, test_nn, type = "prob")[, "yes"]
nn1_pred <- predict(nn_exp1, test_nn, type = "raw")
nn1_metrics <- get_metrics(test_nn$target, nn1_pred, nn1_prob)

kable(nn1_metrics, caption = "Neural Network Experiment 1 results")

Neural Network Experiment 1 results
accuracy	precision	recall	f1	auc
0.831	0.793	0.852	0.821	0.888

Experiment 2: Larger neural network

Objective / hypothesis: A slightly larger hidden layer may capture more nonlinear structure and improve performance.

What changes: size = 5, decay = 0.01

set.seed(622)
nn_exp2 <- train(
  target ~ .,
  data = train_nn,
  method = "nnet",
  trControl = nn_ctrl,
  tuneGrid = data.frame(size = 5, decay = 0.01),
  metric = "ROC",
  maxit = 300,
  trace = FALSE
)

nn2_prob <- predict(nn_exp2, test_nn, type = "prob")[, "yes"]
nn2_pred <- predict(nn_exp2, test_nn, type = "raw")
nn2_metrics <- get_metrics(test_nn$target, nn2_pred, nn2_prob)

kable(nn2_metrics, caption = "Neural Network Experiment 2 results")

Neural Network Experiment 2 results
accuracy	precision	recall	f1	auc
0.864	0.828	0.889	0.857	0.918

Experiment Comparison

results_tbl <- bind_rows(
  data.frame(
    algorithm = "Random Forest",
    experiment = "RF 1",
    objective = "Baseline forest with standard settings",
    rf1_metrics
  ),
  data.frame(
    algorithm = "Random Forest",
    experiment = "RF 2",
    objective = "Larger forest with adjusted mtry and nodesize",
    rf2_metrics
  ),
  data.frame(
    algorithm = "Neural Network",
    experiment = "NN 1",
    objective = "Smaller network with stronger regularization",
    nn1_metrics
  ),
  data.frame(
    algorithm = "Neural Network",
    experiment = "NN 2",
    objective = "Larger network with less regularization",
    nn2_metrics
  )
)

kable(results_tbl, caption = "Comparison of all experiments")

Comparison of all experiments
algorithm	experiment	objective	accuracy	precision	recall	f1	auc
Random Forest	RF 1	Baseline forest with standard settings	0.864	0.852	0.852	0.852	0.933
Random Forest	RF 2	Larger forest with adjusted mtry and nodesize	0.864	0.852	0.852	0.852	0.937
Neural Network	NN 1	Smaller network with stronger regularization	0.831	0.793	0.852	0.821	0.888
Neural Network	NN 2	Larger network with less regularization	0.864	0.828	0.889	0.857	0.918

Answers to the Main Questions

Which algorithm is recommended to get more accurate results?

Based on this type of dataset, I would recommend random forest if the goal is the most accurate and stable results with minimal tuning. Random forest usually performs very well on small tabular datasets like this one and is less sensitive to preprocessing choices than a neural network.

After running the experiments, the best model is the RF2 based on the highest AUC and supported by the other metrics.

My reasons are:

The dataset is small
The variables are structured and tabular
Random forest handles nonlinear relationships well
Random forest usually needs less tuning than a neural network
Neural networks often perform best when the dataset is much larger

That said, the neural network is still useful to test because it provides a good comparison and shows whether added model flexibility improves performance.

Conclusion and Business Impact

This project used the Cleveland heart disease dataset to predict whether a patient has heart disease. The EDA showed that missing values were minimal, the target classes were reasonably balanced, and the strongest signals appeared to come from variables such as thal, ca, exang, oldpeak, cp, and thalach.

I tested two random forest models and two neural network models. The final recommendation is the RF2 model with the strongest AUC and the best balance of accuracy, precision, recall, and F1 score. On a dataset like this, random forest is usually the safer final choice because it is strong on tabular data, easier to tune, and easier to explain than a neural network.

From a business and healthcare perspective, a model like this could be used as a screening support tool to help identify higher risk patients earlier. It could improve prioritization for additional testing and support more efficient resource allocation. However, because the dataset is small and medical decisions are high stakes, the model should only be used as a support tool and not as a replacement for clinical judgment.