Using Random Forest and Logistic Model to Predict Heart Disease
In this project, I will use two machine learning models, Random Forest and Logistic Regression, to predict heart disease. This is a simple model where I use all available variables to make the prediction. I won’t delve too deeply into the problem beyond the estimation part. In reality, establishing causal inference requires more subject expertise and looking beyond the data for analytics. However, the goal today is to demonstrate and practice running these two machine learning models.
Importing library in R
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.3.3
library(cowplot)
Warning: package 'cowplot' was built under R version 4.3.3
library(randomForest)
Warning: package 'randomForest' was built under R version 4.3.3
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
library(tidymodels)
Warning: package 'tidymodels' was built under R version 4.3.3
I used this approach from Josh Starmer’s StatQuest YouTube video. Check out his machine learning videos on YouTube; they are the best source for learning it.
colnames(data) <-c("age","sex",# 0 = female, 1 = male"cp", # chest pain # 1 = typical angina, # 2 = atypical angina, # 3 = non-anginal pain, # 4 = asymptomatic"trestbps", # resting blood pressure (in mm Hg)"chol", # serum cholestoral in mg/dl"fbs", # fasting blood sugar if less than 120 mg/dl, 1 = TRUE, 0 = FALSE"restecg", # resting electrocardiographic results# 1 = normal# 2 = having ST-T wave abnormality# 3 = showing probable or definite left ventricular hypertrophy"thalach", # maximum heart rate achieved"exang", # exercise induced angina, 1 = yes, 0 = no"oldpeak", # ST depression induced by exercise relative to rest"slope", # the slope of the peak exercise ST segment # 1 = upsloping # 2 = flat # 3 = downsloping "ca", # number of major vessels (0-3) colored by fluoroscopy"thal", # this is short of thalium heart scan# 3 = normal (no cold spots)# 6 = fixed defect (cold spots during rest and exercise)# 7 = reversible defect (when cold spots only appear during exercise)"hd"# (the predicted attribute) - diagnosis of heart disease # 0 if less than or equal to 50% diameter narrowing# 1 if greater than 50% diameter narrowing)str(data)
data[data =="?"] <-NA#some values are ? which we will make NAdata[data$sex ==0,]$sex <-"F"data[data$sex ==1,]$sex <-"M"data$sex <-as.factor(data$sex)#changing int variables as factordata$cp <-as.factor(data$cp)data$fbs <-as.factor(data$fbs)data$restecg <-as.factor(data$restecg)data$exang <-as.factor(data$exang)data$slope <-as.factor(data$slope)data$ca <-as.integer(data$ca) # since this column had "?"s in it (whichdata$ca <-as.factor(data$ca) # ...then convert the integers to factor levelsdata$thal <-as.integer(data$thal) # "thal" also had "?"s in it.data$thal <-as.factor(data$thal)## This next line replaces 0 and 1 with "Healthy" and "Unhealthy"data$hd <-ifelse(test=data$hd ==0, yes="Healthy", no="Unhealthy")data$hd <-as.factor(data$hd) # Now convert to a factor
Dividing the data into training and test sets, I used initial_split from the tidymodels library in R. It’s a really useful library with built-in modeling functions for machine learning. However, for this project, I will use the base glm function for the logistic model and the randomForest package for the Random Forest model. I still need to learn more about tidymodels myself.
set.seed(2059)df <- data %>%drop_na()split_data <-initial_split(df,strata = hd)train <-training(split_data)test <-testing(split_data)
Baseline Accuracy
table(df$hd)
Healthy Unhealthy
160 137
print(160/297)
[1] 0.5387205
print(137/297)
[1] 0.4612795
Out of 297 people, 160 are healthy and 137 are unhealthy. By just guessing “healthy,” we can be correct 53% of the time, and by guessing “unhealthy,” we can be correct 46% of the time. Therefore, 53% is our baseline accuracy, which we will try to beat with our model.
Using Random Forest for Prediction
Using Random Forest: Note that I have used a loop to find the best tree and iteration values to fine-tune my model. Tuning is important to achieve better models. However, since Random Forest uses bagging and bootstrap techniques, the values on your device, mine, and any other may differ without the same seed. The larger the dataset, the more consistent the results will likely be.
set.seed(2059)ntree_values <-seq(100,1000,by=100)tree <-NAiter <-NAoptimal <-Inffor(i in1:10){for(j inseq_along(ntree_values)){ temp <-randomForest(hd~.,data=train, mtry=i,ntree=ntree_values[j]) optimal_temp <- temp$err.rate[nrow(temp$err.rate),1]if (optimal_temp < optimal){ tree <- ntree_values[j] iter <- i optimal = optimal_temp } }}paste('Best Tree is',tree)
[1] "Best Tree is 400"
paste('Best ntry is',iter)
[1] "Best ntry is 4"
The output of random forest using the best parameters is as follows.
Call:
randomForest(formula = hd ~ ., data = train, iter = iter, ntree = tree)
Type of random forest: classification
Number of trees: 400
No. of variables tried at each split: 3
OOB estimate of error rate: 13.96%
Confusion matrix:
Healthy Unhealthy class.error
Healthy 105 15 0.1250000
Unhealthy 16 86 0.1568627
Apart from that, let’s examine which variable is the most important in predicting heart disease. Again, this analysis is based entirely on data and the model and should not be considered an expert opinion.
importance(rf_model)
MeanDecreaseGini
age 10.3472424
sex 2.9684356
cp 13.2679669
trestbps 6.9378937
chol 8.5695532
fbs 0.8137009
restecg 1.6310451
thalach 13.6785569
exang 5.6381812
oldpeak 11.7820945
slope 4.2490423
ca 15.4895335
thal 13.3258250
varImpPlot(rf_model)
‘ca’, which is number of major vessels (0-3) colored by fluoroscopy, seems to be the most important variables, followed by ‘cp’ and ‘thalach’ .
Now let me make predictions with the logistic model. This model will predict the probability of heart disease. The value of any probability lies between 0 and 1. I need to carefully choose a threshold that provides the best accuracy for the model.
However, I can see that many cases of heart disease are being wrongly identified as healthy. In this kind of scenario, maximizing accuracy may not be the best approach. Instead, we can check specificity and sensitivity to get the true positive rate or false negative rate.
Here, I will try a threshold of 0.5. The ROC curve above showed a nice trade-off, so this should be a good starting point. My goal is to create a model that does not predict unhealthy cases as healthy. I am more lenient with false positives, as predicting a healthy person as unhealthy is not as severe an issue.
The accuracy using a threshold of 0.5 goes down slightly, but we make significantly fewer errors in identifying unhealthy patients. Sometimes, accuracy may not be the best metric, and it depends on the context of what we want to achieve. Diagnosing a healthy person as unhealthy is less severe compared to not diagnosing an unhealthy person and potentially letting them suffer or die. Therefore, it is better to reduce the latter error even if it comes with a trade-off in overall accuracy.
So here, I used two models: Random Forest and Logistic Regression. Both demonstrated strong accuracy. I hope you can replicate these methods in your future work. Thanks for going through my post.