Random Forest and Logistic Models

Using Random Forest and Logistic Model to Predict Heart Disease

In this project, I will use two machine learning models, Random Forest and Logistic Regression, to predict heart disease. This is a simple model where I use all available variables to make the prediction. I won’t delve too deeply into the problem beyond the estimation part. In reality, establishing causal inference requires more subject expertise and looking beyond the data for analytics. However, the goal today is to demonstrate and practice running these two machine learning models.

Importing library in R

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.3.3

library(cowplot)

Warning: package 'cowplot' was built under R version 4.3.3

library(randomForest)

Warning: package 'randomForest' was built under R version 4.3.3

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

    margin

library(tidymodels)

Warning: package 'tidymodels' was built under R version 4.3.3

── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ broom        1.0.5     ✔ rsample      1.2.1
✔ dials        1.2.1     ✔ tibble       3.2.1
✔ dplyr        1.1.4     ✔ tidyr        1.3.1
✔ infer        1.0.7     ✔ tune         1.2.1
✔ modeldata    1.4.0     ✔ workflows    1.1.4
✔ parsnip      1.2.1     ✔ workflowsets 1.1.0
✔ purrr        1.0.2     ✔ yardstick    1.3.1
✔ recipes      1.1.0

Warning: package 'dials' was built under R version 4.3.3

Warning: package 'infer' was built under R version 4.3.3

Warning: package 'modeldata' was built under R version 4.3.3

Warning: package 'parsnip' was built under R version 4.3.3

Warning: package 'recipes' was built under R version 4.3.3

Warning: package 'rsample' was built under R version 4.3.3

Warning: package 'tune' was built under R version 4.3.3

Warning: package 'workflows' was built under R version 4.3.3

Warning: package 'workflowsets' was built under R version 4.3.3

Warning: package 'yardstick' was built under R version 4.3.3

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ dplyr::combine()       masks randomForest::combine()
✖ purrr::discard()       masks scales::discard()
✖ dplyr::filter()        masks stats::filter()
✖ dplyr::lag()           masks stats::lag()
✖ randomForest::margin() masks ggplot2::margin()
✖ recipes::step()        masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

library(ROCR)

Warning: package 'ROCR' was built under R version 4.3.3

Using UCI heart disease data set

I will be using UCI hear disease data base. It’s clean and ready to use.

rm(list=ls())
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header=FALSE)

Cleaning the data

I used this approach from Josh Starmer’s StatQuest YouTube video. Check out his machine learning videos on YouTube; they are the best source for learning it.

colnames(data) <- c(
  "age",
  "sex",# 0 = female, 1 = male
  "cp", # chest pain 
  # 1 = typical angina, 
  # 2 = atypical angina, 
  # 3 = non-anginal pain, 
  # 4 = asymptomatic
  "trestbps", # resting blood pressure (in mm Hg)
  "chol", # serum cholestoral in mg/dl
  "fbs",  # fasting blood sugar if less than 120 mg/dl, 1 = TRUE, 0 = FALSE
  "restecg", # resting electrocardiographic results
  # 1 = normal
  # 2 = having ST-T wave abnormality
  # 3 = showing probable or definite left ventricular hypertrophy
  "thalach", # maximum heart rate achieved
  "exang",   # exercise induced angina, 1 = yes, 0 = no
  "oldpeak", # ST depression induced by exercise relative to rest
  "slope", # the slope of the peak exercise ST segment 
  # 1 = upsloping 
  # 2 = flat 
  # 3 = downsloping 
  "ca", # number of major vessels (0-3) colored by fluoroscopy
  "thal", # this is short of thalium heart scan
  # 3 = normal (no cold spots)
  # 6 = fixed defect (cold spots during rest and exercise)
  # 7 = reversible defect (when cold spots only appear during exercise)
  "hd" # (the predicted attribute) - diagnosis of heart disease 
  # 0 if less than or equal to 50% diameter narrowing
  # 1 if greater than 50% diameter narrowing
)


str(data)

'data.frame':   303 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
 $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
 $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : chr  "0.0" "3.0" "2.0" "0.0" ...
 $ thal    : chr  "6.0" "3.0" "7.0" "3.0" ...
 $ hd      : int  0 2 1 0 0 0 3 0 2 1 ...

data[data == "?"] <- NA #some values are ? which we will make NA


data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
data$sex <- as.factor(data$sex)

#changing int variables as factor
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)

data$ca <- as.integer(data$ca) # since this column had "?"s in it (which
data$ca <- as.factor(data$ca)  # ...then convert the integers to factor levels

data$thal <- as.integer(data$thal) # "thal" also had "?"s in it.
data$thal <- as.factor(data$thal)

## This next line replaces 0 and 1 with "Healthy" and "Unhealthy"
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd) # Now convert to a factor

Data sample

head(data,10)

   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
1   63   M  1      145  233   1       2     150     0     2.3     3  0    6
2   67   M  4      160  286   0       2     108     1     1.5     2  3    3
3   67   M  4      120  229   0       2     129     1     2.6     2  2    7
4   37   M  3      130  250   0       0     187     0     3.5     3  0    3
5   41   F  2      130  204   0       2     172     0     1.4     1  0    3
6   56   M  2      120  236   0       0     178     0     0.8     1  0    3
7   62   F  4      140  268   0       2     160     0     3.6     3  2    3
8   57   F  4      120  354   0       0     163     1     0.6     1  0    3
9   63   M  4      130  254   0       2     147     0     1.4     2  1    7
10  53   M  4      140  203   1       2     155     1     3.1     3  0    7
          hd
1    Healthy
2  Unhealthy
3  Unhealthy
4    Healthy
5    Healthy
6    Healthy
7  Unhealthy
8    Healthy
9  Unhealthy
10 Unhealthy

Training and Testing set

Dividing the data into training and test sets, I used initial_split from the tidymodels library in R. It’s a really useful library with built-in modeling functions for machine learning. However, for this project, I will use the base glm function for the logistic model and the randomForest package for the Random Forest model. I still need to learn more about tidymodels myself.

set.seed(2059)
df <- data %>% drop_na()
split_data <- initial_split(df,strata = hd)
train <- training(split_data)
test <- testing(split_data)

Baseline Accuracy

table(df$hd)


  Healthy Unhealthy 
      160       137

print(160/297)

[1] 0.5387205

print(137/297)

[1] 0.4612795

Out of 297 people, 160 are healthy and 137 are unhealthy. By just guessing “healthy,” we can be correct 53% of the time, and by guessing “unhealthy,” we can be correct 46% of the time. Therefore, 53% is our baseline accuracy, which we will try to beat with our model.

Using Random Forest for Prediction

Using Random Forest: Note that I have used a loop to find the best tree and iteration values to fine-tune my model. Tuning is important to achieve better models. However, since Random Forest uses bagging and bootstrap techniques, the values on your device, mine, and any other may differ without the same seed. The larger the dataset, the more consistent the results will likely be.

set.seed(2059)
ntree_values <- seq(100,1000,by=100)
tree <- NA
iter <- NA
optimal <-Inf

for(i in 1:10){
  for(j in seq_along(ntree_values)){
  temp <- randomForest(hd~.,data=train, mtry=i,ntree=ntree_values[j])
  
   optimal_temp <- temp$err.rate[nrow(temp$err.rate),1]
   if (optimal_temp < optimal){
     tree <- ntree_values[j]
     iter <- i
     optimal = optimal_temp
   }
  }
}

paste('Best Tree is',tree)

[1] "Best Tree is 400"

paste('Best ntry is',iter)

[1] "Best ntry is 4"

The output of random forest using the best parameters is as follows.

rf_model <- randomForest(hd~.,data=train,iter=iter,ntree=tree)
rf_model


Call:
 randomForest(formula = hd ~ ., data = train, iter = iter, ntree = tree) 
               Type of random forest: classification
                     Number of trees: 400
No. of variables tried at each split: 3

        OOB estimate of  error rate: 13.96%
Confusion matrix:
          Healthy Unhealthy class.error
Healthy       105        15   0.1250000
Unhealthy      16        86   0.1568627

head(rf_model$err.rate)

           OOB   Healthy Unhealthy
[1,] 0.2469136 0.2439024 0.2500000
[2,] 0.2031250 0.1756757 0.2407407
[3,] 0.2125000 0.1978022 0.2318841
[4,] 0.2457143 0.2395833 0.2531646
[5,] 0.2340426 0.1844660 0.2941176
[6,] 0.2222222 0.1834862 0.2696629

While the tuning is already done, I am still going to plot the results to see the performance of the random forest after each tree is made.

error_data <- as.data.frame(rf_model$err.rate)

error_data <- error_data %>% pivot_longer(
  cols = c(OOB, Healthy, Unhealthy),
  names_to = 'Type',
  values_to = 'Value'
)
error_data$Tree <- rep(seq(1,tree),3)
head(error_data)

# A tibble: 6 × 3
  Type      Value  Tree
  <chr>     <dbl> <int>
1 OOB       0.247     1
2 Healthy   0.244     2
3 Unhealthy 0.25      3
4 OOB       0.203     4
5 Healthy   0.176     5
6 Unhealthy 0.241     6

rf <- randomForest(hd~.,data=train,iter=iter,ntree=1000)

error_df <- as.data.frame(rf$err.rate)

error_df <- error_df %>% pivot_longer(
  cols = c(OOB, Healthy, Unhealthy),
  names_to = 'Type',
  values_to = 'Value'
)
error_df$Tree <- rep(seq(1,1000),3)
ggplot(data = error_df, aes(x = Tree, y = Value)) +
  geom_line(aes(color = Type)) + theme_classic()

Apart from that, let’s examine which variable is the most important in predicting heart disease. Again, this analysis is based entirely on data and the model and should not be considered an expert opinion.

importance(rf_model)

         MeanDecreaseGini
age            10.3472424
sex             2.9684356
cp             13.2679669
trestbps        6.9378937
chol            8.5695532
fbs             0.8137009
restecg         1.6310451
thalach        13.6785569
exang           5.6381812
oldpeak        11.7820945
slope           4.2490423
ca             15.4895335
thal           13.3258250

varImpPlot(rf_model)

‘ca’, which is number of major vessels (0-3) colored by fluoroscopy, seems to be the most important variables, followed by ‘cp’ and ‘thalach’ .

rf_predictions <- predict(rf_model,test,type="response")

table(rf_predictions,test$hd)

              
rf_predictions Healthy Unhealthy
     Healthy        35        12
     Unhealthy       5        23

The accuracy of this model is given as below.

confusionMatrix <- table(rf_predictions,test$hd)
sum(diag(confusionMatrix)) / sum(confusionMatrix)

[1] 0.7733333

Logistic Regression

Now I will use logistic regression to build the model and make predictions. The logistic regression model is available in base R.

lr <- glm(data=train, hd~., family = binomial)
summary(lr)


Call:
glm(formula = hd ~ ., family = binomial, data = train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.006e+01  3.723e+00  -2.701  0.00692 ** 
age         -9.080e-03  2.907e-02  -0.312  0.75477    
sexM         2.141e+00  7.053e-01   3.036  0.00240 ** 
cp2          1.409e+00  9.825e-01   1.434  0.15144    
cp3          4.743e-03  8.068e-01   0.006  0.99531    
cp4          2.824e+00  8.622e-01   3.275  0.00106 ** 
trestbps     3.281e-02  1.499e-02   2.189  0.02861 *  
chol         9.827e-03  4.695e-03   2.093  0.03634 *  
fbs1        -6.201e-01  6.943e-01  -0.893  0.37180    
restecg1     1.336e+01  1.643e+03   0.008  0.99351    
restecg2     1.148e-01  4.984e-01   0.230  0.81787    
thalach     -1.695e-02  1.394e-02  -1.216  0.22411    
exang1       3.462e-01  5.404e-01   0.641  0.52171    
oldpeak      4.228e-01  2.939e-01   1.439  0.15025    
slope2       1.702e+00  6.233e-01   2.731  0.00632 ** 
slope3       1.688e+00  1.311e+00   1.288  0.19791    
ca1          2.837e+00  6.821e-01   4.160 3.19e-05 ***
ca2          3.800e+00  9.600e-01   3.958 7.56e-05 ***
ca3          1.379e+00  9.974e-01   1.383  0.16667    
thal6       -1.804e-01  1.025e+00  -0.176  0.86034    
thal7        1.452e+00  5.824e-01   2.493  0.01267 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 306.30  on 221  degrees of freedom
Residual deviance: 124.78  on 201  degrees of freedom
AIC: 166.78

Number of Fisher Scoring iterations: 15

Now let me make predictions with the logistic model. This model will predict the probability of heart disease. The value of any probability lies between 0 and 1. I need to carefully choose a threshold that provides the best accuracy for the model.

lr_predictions <-predict(lr,test,type = 'response')
ROCRpred <- prediction(lr_predictions,test$hd)
ROCRperf <- performance(ROCRpred, "tpr",'fpr')
plot(ROCRperf,colorize=T,print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.5,2))

threshold <- seq(0.1,1,0.1)
a<-c()
t=c()
for(i in threshold){
  lr_pred <- ifelse(lr_predictions <=i,"Pred_Healthy","Pred_Unhealthy")
  confusionmatrix <- table(lr_pred, test$hd)
  acc <- sum(diag(confusionmatrix)) / sum(confusionmatrix)
  a <- c(a,acc)
  t<- c(t,i)
}
print(a)

 [1] 0.7066667 0.7866667 0.7866667 0.7733333 0.8000000 0.7866667 0.8133333
 [8] 0.8000000 0.7733333 0.5333333

print(t)

 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

print(max(a))

[1] 0.8133333

The best accuracy is when the threshold is 0.7, given the threshold increase at the interval of 0.1.

lr_pred <- ifelse(lr_predictions <=0.7,"Pred_Healthy","Pred_Unhealthy")
confusionmatrix <- table(lr_pred, test$hd)
confusionmatrix

                
lr_pred          Healthy Unhealthy
  Pred_Healthy        36        10
  Pred_Unhealthy       4        25

However, I can see that many cases of heart disease are being wrongly identified as healthy. In this kind of scenario, maximizing accuracy may not be the best approach. Instead, we can check specificity and sensitivity to get the true positive rate or false negative rate.

Here, I will try a threshold of 0.5. The ROC curve above showed a nice trade-off, so this should be a good starting point. My goal is to create a model that does not predict unhealthy cases as healthy. I am more lenient with false positives, as predicting a healthy person as unhealthy is not as severe an issue.

lr_pred <- ifelse(lr_predictions <=0.5,"Pred_Healthy","Pred_Unhealthy")
confusionmatrix <- table(lr_pred, test$hd)
confusionmatrix

                
lr_pred          Healthy Unhealthy
  Pred_Healthy        32         7
  Pred_Unhealthy       8        28

The accuracy using a threshold of 0.5 goes down slightly, but we make significantly fewer errors in identifying unhealthy patients. Sometimes, accuracy may not be the best metric, and it depends on the context of what we want to achieve. Diagnosing a healthy person as unhealthy is less severe compared to not diagnosing an unhealthy person and potentially letting them suffer or die. Therefore, it is better to reduce the latter error even if it comes with a trade-off in overall accuracy.

So here, I used two models: Random Forest and Logistic Regression. Both demonstrated strong accuracy. I hope you can replicate these methods in your future work. Thanks for going through my post.