This is the heart disease data set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php; detailed information: https://archive-beta.ics.uci.edu/ml/datasets/heart+disease). The database contains 76 attributes, but all published experiments refer to a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0). The names and social security numbers of the patients were removed from the database and replaced with dummy values.
Here I will use these variables to build a random forest model to predict whether a patient has heart disease.
(Note: the cleaning and train/test split are similar to those in the logistic regression analysis.)
library(tidyverse)
library(ggplot2)
library(caret)
library(randomForest)
raw_data=read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",header=FALSE)
# Only 14 attributes used:
# 1. #3 (age)
# 2. #4 (sex) # 0 = female, 1 = male
# 3. #9 (cp) # chest pain: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
# 4. #10 (trestbps) # resting blood pressure (in mm Hg)
# 5. #12 (chol) # serum cholesterol in mg/dl
# 6. #16 (fbs) # fasting blood sugar > 120 mg/dl: 1 = true, 0 = false
# 7. #19 (restecg) # resting electrocardiographic results: 0 = normal; 1 = ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2 = probable or definite left ventricular hypertrophy by Estes' criteria
# 8. #32 (thalach) # maximum heart rate achieved
# 9. #38 (exang) # exercise induced angina (1 = yes; 0 = no)
# 10. #40 (oldpeak) # ST depression induced by exercise relative to rest
# 11. #41 (slope) # the slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping
# 12. #44 (ca) # number of major vessels (0-3) colored by fluoroscopy
# 13. #51 (thal) # short for thallium heart scan: 3 = normal (no cold spots), 6 = fixed defect (cold spots during rest and exercise), 7 = reversible defect (cold spots only appear during exercise)
# 14. #58 (num) (the predicted attribute) # diagnosis of heart disease: 0 if less than or equal to 50% diameter narrowing, 1 if greater than 50% diameter narrowing
summary(raw_data)
## V1 V2 V3 V4
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.44 Mean :0.6799 Mean :3.158 Mean :131.7
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## V5 V6 V7 V8
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :241.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.7 Mean :0.1485 Mean :0.9901 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## V9 V10 V11 V12
## Min. :0.0000 Min. :0.00 Min. :1.000 Length:303
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 Class :character
## Median :0.0000 Median :0.80 Median :2.000 Mode :character
## Mean :0.3267 Mean :1.04 Mean :1.601
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :1.0000 Max. :6.20 Max. :3.000
## V13 V14
## Length:303 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.9373
## 3rd Qu.:2.0000
## Max. :4.0000
str(raw_data)
## 'data.frame': 303 obs. of 14 variables:
## $ V1 : num 63 67 67 37 41 56 62 57 63 53 ...
## $ V2 : num 1 1 1 1 0 1 0 0 1 1 ...
## $ V3 : num 1 4 4 3 2 2 4 4 4 4 ...
## $ V4 : num 145 160 120 130 130 120 140 120 130 140 ...
## $ V5 : num 233 286 229 250 204 236 268 354 254 203 ...
## $ V6 : num 1 0 0 0 0 0 0 0 0 1 ...
## $ V7 : num 2 2 2 0 2 0 2 0 2 2 ...
## $ V8 : num 150 108 129 187 172 178 160 163 147 155 ...
## $ V9 : num 0 1 1 0 0 0 0 1 0 1 ...
## $ V10: num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ V11: num 3 2 2 3 1 1 3 1 2 3 ...
## $ V12: chr "0.0" "3.0" "2.0" "0.0" ...
## $ V13: chr "6.0" "3.0" "7.0" "3.0" ...
## $ V14: int 0 2 1 0 0 0 3 0 2 1 ...
# Rename columns
names(raw_data)=c("age","sex","cp","trestbps","chol","fbs", "restecg","thalach", "exang","oldpeak", "slope", "ca","thal","heart_disease")
# Adjust column type
raw_data$sex=as.factor(raw_data$sex)
raw_data$cp=as.factor(raw_data$cp)
raw_data$fbs=as.factor(raw_data$fbs)
raw_data$restecg=as.factor(raw_data$restecg)
raw_data$exang=as.factor(raw_data$exang)
raw_data$slope=as.factor(raw_data$slope)
# A value of 0 is treated as no heart disease ("0"); any value > 0 is treated as heart disease ("1").
raw_data$heart_disease=as.factor(if_else(raw_data$heart_disease==0,"0","1"))
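As an aside, the per-column conversions above can be written more compactly with dplyr's across(); a minimal sketch, where categorical_cols is a helper vector introduced here for illustration:
# Convert all categorical predictors to factors in one call (same result as above)
categorical_cols=c("sex","cp","fbs","restecg","exang","slope")
raw_data=raw_data %>% mutate(across(all_of(categorical_cols),as.factor))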
str(raw_data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps : num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
# "ca" and "thal" contain "?" values. Here we check how many rows are affected.
raw_data %>% filter(ca=="?"|thal=="?")
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 53 0 3 128 216 0 2 115 0 0.0 1 0.0 ?
## 2 52 1 3 138 223 0 0 169 0 0.0 1 ? 3.0
## 3 43 1 4 132 247 1 2 143 1 0.1 2 ? 7.0
## 4 52 1 4 128 204 1 0 156 1 1.0 2 0.0 ?
## 5 58 1 2 125 220 0 0 144 0 0.4 2 ? 7.0
## 6 38 1 3 138 175 0 0 173 0 0.0 1 ? 3.0
## heart_disease
## 1 0
## 2 0
## 3 1
## 4 1
## 5 0
## 6 0
# There are 6 rows containing "?" in the ca and thal columns. These "?" values can be treated as NA. As 6 rows is small compared to the size of our data set, they can be dropped from the analysis.
raw_data[raw_data=="?"]=NA
raw_data=drop_na(raw_data)
raw_data$ca=as.factor(as.integer(raw_data$ca))
raw_data$thal=as.factor(as.integer(raw_data$thal))
str(raw_data)
## 'data.frame': 297 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps : num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
## $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
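Alternatively, the "?" placeholders could have been handled at read time via read.csv()'s na.strings argument, in which case ca and thal are parsed as numeric directly; a minimal sketch (not run here):
# Treat "?" as NA while reading, then drop the 6 incomplete rows as above
raw_data=read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
  header=FALSE,na.strings="?")
raw_data=drop_na(raw_data)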
data=raw_data
xtabs(~ heart_disease + sex, data=data)
## sex
## heart_disease 0 1
## 0 71 89
## 1 25 112
xtabs(~ heart_disease + cp, data=data)
## cp
## heart_disease 1 2 3 4
## 0 16 40 65 39
## 1 7 9 18 103
xtabs(~ heart_disease + fbs, data=data)
## fbs
## heart_disease 0 1
## 0 137 23
## 1 117 20
xtabs(~ heart_disease + restecg, data=data)
## restecg
## heart_disease 0 1 2
## 0 92 1 67
## 1 55 3 79
xtabs(~ heart_disease + exang, data=data)
## exang
## heart_disease 0 1
## 0 137 23
## 1 63 74
xtabs(~ heart_disease + slope, data=data)
## slope
## heart_disease 1 2 3
## 0 103 48 9
## 1 36 89 12
xtabs(~ heart_disease + ca, data=data)
## ca
## heart_disease 0 1 2 3
## 0 129 21 7 3
## 1 45 44 31 17
xtabs(~ heart_disease + thal, data=data)
## thal
## heart_disease 3 6 7
## 0 127 6 27
## 1 37 12 88
The cross-tabulations look reasonable: every level of each categorical variable appears in both outcome classes, and there are no remaining missing values, so all of these variables can be kept in the model.
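The repeated xtabs() calls above can also be generated in a single pass over the categorical predictors; a minimal sketch producing the same tables:
# One outcome-by-predictor cross-tabulation per categorical column
factor_predictors=c("sex","cp","fbs","restecg","exang","slope","ca","thal")
lapply(factor_predictors,function(v) xtabs(reformulate(v,"heart_disease"),data=data))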
# 70/30 train/test split: sample row indices without replacement
train_test_split=sample(nrow(data),nrow(data)*0.7,replace=FALSE)
train_data=data[train_test_split,]
test_data=data[-train_test_split,]
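(Note: the split above is neither seeded nor stratified by class. caret's createDataPartition() is a common alternative that fixes both; a minimal sketch, which would of course yield a different split from the results shown below:)
# Reproducible, stratified 70/30 split preserving the heart_disease class balance
set.seed(111)
in_train=createDataPartition(data$heart_disease,p=0.7,list=FALSE)
train_data=data[in_train,]
test_data=data[-in_train,]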
(Note: proximity=TRUE means we want randomForest to return the proximity matrix.)
set.seed(111)
rfm=randomForest(heart_disease~.,data=train_data,proximity=TRUE)
rfm
##
## Call:
## randomForest(formula = heart_disease ~ ., data = train_data, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.36%
## Confusion matrix:
## 0 1 class.error
## 0 100 18 0.1525424
## 1 20 69 0.2247191
(Note 1: There are 13 predictor variables. By default, randomForest() sets mtry = floor(sqrt(13)) = 3 for classification.) (Note 2: Also by default, randomForest() grows 500 trees.) Here the OOB (out-of-bag) error rate is 18.36%, meaning 81.64% of the samples are correctly classified by the random forest. Breaking it down, 15.25% of non-heart-disease patients are misclassified, and 22.47% of heart-disease patients are misclassified.
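For classification, randomForest()'s default mtry is floor(sqrt(p)) with p predictors, which we can verify directly:
# 13 predictors (every column except heart_disease) -> default mtry = 3
floor(sqrt(ncol(train_data)-1))
## [1] 3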
rfm_df=data.frame(tree_no=1:nrow(rfm$err.rate),
                  oob=rfm$err.rate[,"OOB"],
                  no_heart_disease=rfm$err.rate[,"0"],
                  heart_disease=rfm$err.rate[,"1"])
rfm_df_long=rfm_df %>% pivot_longer(-tree_no,names_to="type",values_to="error_rate")
ggplot(rfm_df_long)+
  geom_line(aes(x=tree_no,y=error_rate*100,group=type,color=type))+
  labs(title="Error rate against number of trees")+
  xlab("Number of Trees")+ylab("Error Rate (%)")+
  scale_color_manual(values=c("red","orange","green"))+
  theme_classic()
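Since proximity=TRUE was requested, the proximity matrix is also available; randomForest's MDSplot() runs classical multidimensional scaling on 1 - proximity and plots the first two dimensions. A minimal sketch:
# 2-D MDS projection of the proximity matrix, points colored by class label
MDSplot(rfm,train_data$heart_disease,k=2,main="MDS of proximity matrix")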
As the error-rate plot above shows, the error rate generally drops as more trees are included in the model. Here we can try a larger number of trees, e.g. 2000, to see if the error rate drops significantly.
set.seed(111)
rfm_2000=randomForest(heart_disease~.,data=train_data,proximity=TRUE,ntree=2000)
rfm_2000
##
## Call:
## randomForest(formula = heart_disease ~ ., data = train_data, proximity = TRUE, ntree = 2000)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.36%
## Confusion matrix:
## 0 1 class.error
## 0 101 17 0.1440678
## 1 21 68 0.2359551
The OOB error rate is 18.36%, identical to the 500-tree result. The class error rates for no-heart-disease and heart-disease are 14.41% and 23.60% respectively, essentially no improvement over the original 500-tree model either. Therefore we continue with the original 500-tree model.
Here we can try different values of mtry (the number of variables tried at each split), e.g. from 1 to 10, to see which one gives the lowest OOB error rate.
mtry_values=c(1:10)
oob_by_mtry=c()
for (i in mtry_values){
  set.seed(111)
  # fit a 500-tree forest for each candidate mtry and record its final OOB error
  rfm_mtry=randomForest(heart_disease~.,data=train_data,proximity=TRUE,mtry=i)
  oob_by_mtry[i]=rfm_mtry$err.rate[500,"OOB"]
}
oob_by_mtry
## [1] 0.1787440 0.1932367 0.1835749 0.1835749 0.1690821 0.1690821 0.1787440
## [8] 0.1835749 0.1835749 0.1932367
The OOB error rates for mtry = 1 to 10 are fairly uniform, with no significant improvement over the default. Therefore we continue to use the default mtry = 3.
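randomForest also ships a tuneRF() helper that searches mtry automatically based on OOB error; a minimal sketch (column 14 of train_data is the outcome):
# Grow/shrink mtry by stepFactor for as long as OOB error improves by `improve`
set.seed(111)
tuneRF(train_data[,-14],train_data$heart_disease,
       ntreeTry=500,stepFactor=1.5,improve=0.01)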
Finally, we evaluate the random forest model on the test data, cross-checking the model's predictions against the actual outcomes.
test_data_pred=predict(rfm,test_data)
test_data$pred=as.factor(test_data_pred)
# Note: caret's confusionMatrix(data, reference) expects the predictions first;
# here the actual labels are passed first, so the table's rows are the actual
# classes and the statistics below are computed with that orientation.
confusionMatrix(test_data$heart_disease,test_data$pred,positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39 3
## 1 13 35
##
## Accuracy : 0.8222
## 95% CI : (0.7274, 0.8948)
## No Information Rate : 0.5778
## P-Value [Acc > NIR] : 7.172e-07
##
## Kappa : 0.6481
##
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.9211
## Specificity : 0.7500
## Pos Pred Value : 0.7292
## Neg Pred Value : 0.9286
## Prevalence : 0.4222
## Detection Rate : 0.3889
## Detection Prevalence : 0.5333
## Balanced Accuracy : 0.8355
##
## 'Positive' Class : 1
##
The confusion matrix shows an accuracy of 82.22% on the 90 samples in the test data, with a reported sensitivity of 92.11% and specificity of 75.00% (subject to the argument-order caveat noted above). These values are fairly high, suggesting that the random forest model provides a good prediction of heart disease for a patient.
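As a final look at what drives the model, the fitted forest's variable importance can be plotted with randomForest's varImpPlot(); a minimal sketch using the 500-tree model fitted above:
# Mean decrease in Gini impurity for each predictor in the fitted forest
varImpPlot(rfm,main="Variable importance")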