Background

This is the heart disease data set from the UCI machine learning repository. http://archive.ics.uci.edu/ml/index.php. For detailed information: https://archive-beta.ics.uci.edu/ml/datasets/heart+disease. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0). The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

Here I will use these variables to build a random forest model, to predict if he/ she have heart disease.

(Note: cleaning and train/test split is similar to the one in logistics regression)

Load Library

library(tidyverse)
library(ggplot2)
library(caret)
library(randomForest)

Load data

raw_data=read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",header=FALSE)

# Only 14 attributes used:
# 1. #3 (age) 
# 2. #4 (sex) # 0 = female, 1 = male
# 3. #9 (cp) # chest pain: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
# 4. #10 (trestbps) # resting blood pressure (in mm Hg)
# 5. #12 (chol) # serum cholesterol in mg/dl
# 6. #16 (fbs) # fasting blood sugar if less than 120 mg/dl, 1 = TRUE, 0 = FALSE
# 7. #19 (restecg) # resting electrocardiograph results Value 0: normal -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
# 8. #32 (thalach) # maximum heart rate achieved
# 9. #38 (exang) # exercise induced angina (1 = yes; 0 = no)
# 10. #40 (oldpeak) # ST depression induced by exercise relative to rest
# 11. #41 (slope) # the slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping
# 12. #44 (ca) # number of major vessels (0-3) colored by fluoroscopy
# 13. #51 (thal) # this is short of thalium heart scan: 3 = normal (no cold spots), 6 = fixed defect (cold spots during rest and exercise), 7 = reversible defect (when cold spots only appear during exercise)
# 14. #58 (num) (the predicted attribute) # diagnosis of heart disease: 0 if less than or equal to 50% diameter narrowing, 1 if greater than 50% diameter narrowing
summary(raw_data)
##        V1              V2               V3              V4       
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :3.000   Median :130.0  
##  Mean   :54.44   Mean   :0.6799   Mean   :3.158   Mean   :131.7  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
##        V5              V6               V7               V8       
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :241.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.7   Mean   :0.1485   Mean   :0.9901   Mean   :149.6  
##  3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##        V9              V10            V11            V12           
##  Min.   :0.0000   Min.   :0.00   Min.   :1.000   Length:303        
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   Class :character  
##  Median :0.0000   Median :0.80   Median :2.000   Mode  :character  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.601                     
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000                     
##  Max.   :1.0000   Max.   :6.20   Max.   :3.000                     
##      V13                 V14        
##  Length:303         Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000  
##                     Mean   :0.9373  
##                     3rd Qu.:2.0000  
##                     Max.   :4.0000
str(raw_data)
## 'data.frame':    303 obs. of  14 variables:
##  $ V1 : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ V2 : num  1 1 1 1 0 1 0 0 1 1 ...
##  $ V3 : num  1 4 4 3 2 2 4 4 4 4 ...
##  $ V4 : num  145 160 120 130 130 120 140 120 130 140 ...
##  $ V5 : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ V6 : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ V7 : num  2 2 2 0 2 0 2 0 2 2 ...
##  $ V8 : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ V9 : num  0 1 1 0 0 0 0 1 0 1 ...
##  $ V10: num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ V11: num  3 2 2 3 1 1 3 1 2 3 ...
##  $ V12: chr  "0.0" "3.0" "2.0" "0.0" ...
##  $ V13: chr  "6.0" "3.0" "7.0" "3.0" ...
##  $ V14: int  0 2 1 0 0 0 3 0 2 1 ...

Cleaning data

# Rename column
names(raw_data)=c("age","sex","cp","trestbps","chol","fbs", "restecg","thalach", "exang","oldpeak", "slope", "ca","thal","heart_disease")
# Adjust column type
raw_data$sex=as.factor(raw_data$sex)
raw_data$cp=as.factor(raw_data$cp)
raw_data$fbs=as.factor(raw_data$fbs)
raw_data$restecg=as.factor(raw_data$restecg)
raw_data$exang=as.factor(raw_data$exang)
raw_data$slope=as.factor(raw_data$slope)

# value =0 is considered to be no heart disease (0), value >0 is considered to have heart disease (1).
raw_data$heart_disease=as.factor(if_else(raw_data$heart_disease==0,"0","1"))
str(raw_data)
## 'data.frame':    303 obs. of  14 variables:
##  $ age          : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex          : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp           : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps     : num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol         : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs          : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg      : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach      : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang        : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak      : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope        : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca           : chr  "0.0" "3.0" "2.0" "0.0" ...
##  $ thal         : chr  "6.0" "3.0" "7.0" "3.0" ...
##  $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
# "ca" and "thal" contains "?" value. Here we check how many of them.
raw_data %>% filter(ca=="?"|thal=="?")
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
## 1  53   0  3      128  216   0       2     115     0     0.0     1 0.0    ?
## 2  52   1  3      138  223   0       0     169     0     0.0     1   ?  3.0
## 3  43   1  4      132  247   1       2     143     1     0.1     2   ?  7.0
## 4  52   1  4      128  204   1       0     156     1     1.0     2 0.0    ?
## 5  58   1  2      125  220   0       0     144     0     0.4     2   ?  7.0
## 6  38   1  3      138  175   0       0     173     0     0.0     1   ?  3.0
##   heart_disease
## 1             0
## 2             0
## 3             1
## 4             1
## 5             0
## 6             0
# There are 6 rows contain "?" in ca and thal columns. These "?" may be considered as NA. As the number is small compare to our dataset size, they can be dropped from analysis.
raw_data[raw_data=="?"]=NA
raw_data=drop_na(raw_data)
raw_data$ca=as.factor(as.integer(raw_data$ca))
raw_data$thal=as.factor(as.integer(raw_data$thal))
str(raw_data)
## 'data.frame':    297 obs. of  14 variables:
##  $ age          : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex          : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp           : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps     : num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol         : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs          : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg      : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach      : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang        : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak      : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope        : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca           : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
##  $ thal         : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
##  $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...

View distribution

data=raw_data
xtabs(~ heart_disease + sex, data=data)
##              sex
## heart_disease   0   1
##             0  71  89
##             1  25 112
xtabs(~ heart_disease + cp, data=data)
##              cp
## heart_disease   1   2   3   4
##             0  16  40  65  39
##             1   7   9  18 103
xtabs(~ heart_disease + fbs, data=data)
##              fbs
## heart_disease   0   1
##             0 137  23
##             1 117  20
xtabs(~ heart_disease + restecg, data=data)
##              restecg
## heart_disease  0  1  2
##             0 92  1 67
##             1 55  3 79
xtabs(~ heart_disease + exang, data=data)
##              exang
## heart_disease   0   1
##             0 137  23
##             1  63  74
xtabs(~ heart_disease + slope, data=data)
##              slope
## heart_disease   1   2   3
##             0 103  48   9
##             1  36  89  12
xtabs(~ heart_disease + ca, data=data)
##              ca
## heart_disease   0   1   2   3
##             0 129  21   7   3
##             1  45  44  31  17
xtabs(~ heart_disease + thal, data=data)
##              thal
## heart_disease   3   6   7
##             0 127   6  27
##             1  37  12  88

The distribution look normal and no missing field for specific variables.

Split the clean data-set into training and testing

train_test_split=sample(nrow(data),nrow(data)*0.7,replace=FALSE)
train_data=data[train_test_split,]
test_data=data[-train_test_split,]

Build a Random Forest Model

(Note: proximity=True mean we want randomForest to return the proximity matrix)

set.seed(111)
rfm=randomForest(heart_disease~.,data=train_data,proximity=TRUE)
rfm
## 
## Call:
##  randomForest(formula = heart_disease ~ ., data = train_data,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.36%
## Confusion matrix:
##     0  1 class.error
## 0 100 18   0.1525424
## 1  20 69   0.2247191

(Note1: There are 13 variables. By default, randomForest() will set mtry = sqrt(13) = 3.6 rounded down = 3) (Note2: Also by default, randomForest() generates 500 trees) Here we find the OOB (Out-Of-Bag) error rate = 19.32%. This means 80.68% of sample is correctly classified by the Random Forest. To break it down, 14.29% of non-heart-disease person are mis-classified, and 25.26% of heart-disease person are mis-classified

Make a plot about the error rate pattern in these 500 trees.

rfm_df=data.frame(tree_no=1:nrow(rfm$err.rate),oob=rfm$err.rate[,"OOB"],no_heart_disease=rfm$err.rate[,"0"],heart_disease=rfm$err.rate[,"1"])
rfm_df_long=rfm_df %>% pivot_longer(-tree_no,names_to="type",values_to="error_rate")
ggplot(rfm_df_long)+geom_line(aes(x=tree_no,y=error_rate*100,group=type,color=type))+labs(title="Error rate against tree numbers")+xlab("Tree Number")+ylab("Error Rate (%)")+scale_color_manual(values=c("red","orange","green"))+ theme_classic()

Optimization by tree numbers

In general the error rate will drops as more trees included in the models. Here we can try a higher trees number, e.g. 2000 trees, to see if the error rate drops significantly.

set.seed(111)
rfm_2000=randomForest(heart_disease~.,data=train_data,proximity=TRUE,ntree=2000)
rfm_2000
## 
## Call:
##  randomForest(formula = heart_disease ~ ., data = train_data,      proximity = TRUE, ntree = 2000) 
##                Type of random forest: classification
##                      Number of trees: 2000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.36%
## Confusion matrix:
##     0  1 class.error
## 0 101 17   0.1440678
## 1  21 68   0.2359551

The OOB error rate is 20.77% which is similar to 500-trees result. The class error rate of no-heart-disease and heart-disease are 16.96% and 25.26% respectively. They are also no improvement compared to original 500-trees model. Therefore we can continue to use the original 500-trees model.

Optimization by tree nodes number

Here we can try a different trees nodes number, e.g. from 1 to 10, to see which one has the lowest OOB error rate.

node_no=c(1:10)
oob_by_node=c()
for (i in node_no){
  set.seed(111)
  rfm_node=randomForest(heart_disease~.,data=train_data,proximity=TRUE,mtry=i)
  oob_by_node[i]=rfm_node$err.rate[500,"OOB"]
}
oob_by_node
##  [1] 0.1787440 0.1932367 0.1835749 0.1835749 0.1690821 0.1690821 0.1787440
##  [8] 0.1835749 0.1835749 0.1932367

The result of using 1 to 10 nodes is quite uniform and no significant improvement observed. Therefore we can continue to use the default 3 nodes.

Result Verification with the Test Data

Finally, we check the random forest model by using the test data. This is used to counter check the model’s prediction with actual result.

test_data_pred=predict(rfm,test_data)
test_data$pred=as.factor(test_data_pred)
confusionMatrix(test_data$heart_disease,test_data$pred,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39  3
##          1 13 35
##                                           
##                Accuracy : 0.8222          
##                  95% CI : (0.7274, 0.8948)
##     No Information Rate : 0.5778          
##     P-Value [Acc > NIR] : 7.172e-07       
##                                           
##                   Kappa : 0.6481          
##                                           
##  Mcnemar's Test P-Value : 0.02445         
##                                           
##             Sensitivity : 0.9211          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.7292          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.4222          
##          Detection Rate : 0.3889          
##    Detection Prevalence : 0.5333          
##       Balanced Accuracy : 0.8355          
##                                           
##        'Positive' Class : 1               
## 

The confusion matrix show an accuracy rate of 88% among 90 samples in test data. Sensitivity rate is 92% and Specificity rate is 85%. These values are so high. It suggests that the Random Forest Model provide a good prediction for patient heart disease.