This is the heart disease data set from the UCI machine learning repository. http://archive.ics.uci.edu/ml/index.php. For detailed information: https://archive-beta.ics.uci.edu/ml/datasets/heart+disease. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0). The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.
Here I will use these variables to build a logistics regression model, to predict if he/ she have heart disease.
library(tidyverse)
library(ggplot2)
library(caret)
library(pROC)
raw_data=read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",header=FALSE)
# Only 14 attributes used:
# 1. #3 (age)
# 2. #4 (sex) # 0 = female, 1 = male
# 3. #9 (cp) # chest pain: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
# 4. #10 (trestbps) # resting blood pressure (in mm Hg)
# 5. #12 (chol) # serum cholesterol in mg/dl
# 6. #16 (fbs) # fasting blood sugar if less than 120 mg/dl, 1 = TRUE, 0 = FALSE
# 7. #19 (restecg) # resting electrocardiograph results Value 0: normal -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
# 8. #32 (thalach) # maximum heart rate achieved
# 9. #38 (exang) # exercise induced angina (1 = yes; 0 = no)
# 10. #40 (oldpeak) # ST depression induced by exercise relative to rest
# 11. #41 (slope) # the slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping
# 12. #44 (ca) # number of major vessels (0-3) colored by fluoroscopy
# 13. #51 (thal) # this is short of thalium heart scan: 3 = normal (no cold spots), 6 = fixed defect (cold spots during rest and exercise), 7 = reversible defect (when cold spots only appear during exercise)
# 14. #58 (num) (the predicted attribute) # diagnosis of heart disease: 0 if less than or equal to 50% diameter narrowing, 1 if greater than 50% diameter narrowing
summary(raw_data)
## V1 V2 V3 V4
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.44 Mean :0.6799 Mean :3.158 Mean :131.7
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## V5 V6 V7 V8
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :241.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.7 Mean :0.1485 Mean :0.9901 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## V9 V10 V11 V12
## Min. :0.0000 Min. :0.00 Min. :1.000 Length:303
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 Class :character
## Median :0.0000 Median :0.80 Median :2.000 Mode :character
## Mean :0.3267 Mean :1.04 Mean :1.601
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :1.0000 Max. :6.20 Max. :3.000
## V13 V14
## Length:303 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.9373
## 3rd Qu.:2.0000
## Max. :4.0000
str(raw_data)
## 'data.frame': 303 obs. of 14 variables:
## $ V1 : num 63 67 67 37 41 56 62 57 63 53 ...
## $ V2 : num 1 1 1 1 0 1 0 0 1 1 ...
## $ V3 : num 1 4 4 3 2 2 4 4 4 4 ...
## $ V4 : num 145 160 120 130 130 120 140 120 130 140 ...
## $ V5 : num 233 286 229 250 204 236 268 354 254 203 ...
## $ V6 : num 1 0 0 0 0 0 0 0 0 1 ...
## $ V7 : num 2 2 2 0 2 0 2 0 2 2 ...
## $ V8 : num 150 108 129 187 172 178 160 163 147 155 ...
## $ V9 : num 0 1 1 0 0 0 0 1 0 1 ...
## $ V10: num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ V11: num 3 2 2 3 1 1 3 1 2 3 ...
## $ V12: chr "0.0" "3.0" "2.0" "0.0" ...
## $ V13: chr "6.0" "3.0" "7.0" "3.0" ...
## $ V14: int 0 2 1 0 0 0 3 0 2 1 ...
# Rename column
names(raw_data)=c("age","sex","cp","trestbps","chol","fbs", "restecg","thalach", "exang","oldpeak", "slope", "ca","thal","heart_disease")
# Adjust column type
raw_data$sex=as.factor(raw_data$sex)
raw_data$cp=as.factor(raw_data$cp)
raw_data$fbs=as.factor(raw_data$fbs)
raw_data$restecg=as.factor(raw_data$restecg)
raw_data$exang=as.factor(raw_data$exang)
raw_data$slope=as.factor(raw_data$slope)
# value =0 is considered to be no heart disease (0), value >0 is considered to have heart disease (1).
raw_data$heart_disease=as.factor(if_else(raw_data$heart_disease==0,"0","1"))
str(raw_data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps : num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
# "ca" and "thal" contains "?" value. Here we check how many of them.
raw_data %>% filter(ca=="?"|thal=="?")
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 53 0 3 128 216 0 2 115 0 0.0 1 0.0 ?
## 2 52 1 3 138 223 0 0 169 0 0.0 1 ? 3.0
## 3 43 1 4 132 247 1 2 143 1 0.1 2 ? 7.0
## 4 52 1 4 128 204 1 0 156 1 1.0 2 0.0 ?
## 5 58 1 2 125 220 0 0 144 0 0.4 2 ? 7.0
## 6 38 1 3 138 175 0 0 173 0 0.0 1 ? 3.0
## heart_disease
## 1 0
## 2 0
## 3 1
## 4 1
## 5 0
## 6 0
# There are 6 rows contain "?" in ca and thal columns. These "?" may be considered as NA. As the number is small compare to our dataset size, they can be dropped from analysis.
raw_data[raw_data=="?"]=NA
raw_data=drop_na(raw_data)
raw_data$ca=as.factor(as.integer(raw_data$ca))
raw_data$thal=as.factor(as.integer(raw_data$thal))
str(raw_data)
## 'data.frame': 297 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps : num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
## $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
data=raw_data
xtabs(~ heart_disease + sex, data=data)
## sex
## heart_disease 0 1
## 0 71 89
## 1 25 112
xtabs(~ heart_disease + cp, data=data)
## cp
## heart_disease 1 2 3 4
## 0 16 40 65 39
## 1 7 9 18 103
xtabs(~ heart_disease + fbs, data=data)
## fbs
## heart_disease 0 1
## 0 137 23
## 1 117 20
xtabs(~ heart_disease + restecg, data=data)
## restecg
## heart_disease 0 1 2
## 0 92 1 67
## 1 55 3 79
xtabs(~ heart_disease + exang, data=data)
## exang
## heart_disease 0 1
## 0 137 23
## 1 63 74
xtabs(~ heart_disease + slope, data=data)
## slope
## heart_disease 1 2 3
## 0 103 48 9
## 1 36 89 12
xtabs(~ heart_disease + ca, data=data)
## ca
## heart_disease 0 1 2 3
## 0 129 21 7 3
## 1 45 44 31 17
xtabs(~ heart_disease + thal, data=data)
## thal
## heart_disease 3 6 7
## 0 127 6 27
## 1 37 12 88
The distribution look normal and no missing field for specific variables.
train_test_split=sample(nrow(data),nrow(data)*0.7,replace=FALSE)
train_data=data[train_test_split,]
test_data=data[-train_test_split,]
lrm=glm(heart_disease~.,data=train_data,family="binomial")
summary(lrm)
##
## Call:
## glm(formula = heart_disease ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.40710 -0.45629 -0.07388 0.32519 2.87960
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.183e+00 3.740e+00 -1.920 0.054807 .
## age -6.991e-03 3.142e-02 -0.222 0.823925
## sex1 1.631e+00 6.699e-01 2.435 0.014874 *
## cp2 2.491e+00 1.089e+00 2.287 0.022187 *
## cp3 1.342e+00 9.194e-01 1.459 0.144479
## cp4 2.832e+00 9.353e-01 3.027 0.002466 **
## trestbps 1.577e-02 1.488e-02 1.060 0.289326
## chol 5.120e-03 5.819e-03 0.880 0.378906
## fbs1 -6.767e-01 7.834e-01 -0.864 0.387703
## restecg1 1.344e+01 1.272e+03 0.011 0.991568
## restecg2 5.627e-04 4.830e-01 0.001 0.999070
## thalach -1.671e-02 1.395e-02 -1.197 0.231144
## exang1 6.924e-01 5.480e-01 1.264 0.206351
## oldpeak 5.712e-01 2.853e-01 2.002 0.045260 *
## slope2 1.672e+00 5.871e-01 2.848 0.004399 **
## slope3 6.282e-01 1.048e+00 0.599 0.548856
## ca1 2.523e+00 6.862e-01 3.676 0.000237 ***
## ca2 3.261e+00 9.433e-01 3.457 0.000547 ***
## ca3 3.013e+00 1.299e+00 2.320 0.020348 *
## thal6 -2.237e-01 9.098e-01 -0.246 0.805749
## thal7 1.890e+00 5.458e-01 3.463 0.000535 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 286.38 on 206 degrees of freedom
## Residual deviance: 126.47 on 186 degrees of freedom
## AIC: 168.47
##
## Number of Fisher Scoring iterations: 15
Some interesting result found: sex1 (male) is more probable to heart disease. Number of major vessels colored by fluoroscopy is significantly important in heart disease.
roc(train_data$heart_disease,lrm$fitted.values, plot=TRUE,percent=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = train_data$heart_disease, predictor = lrm$fitted.values, percent = TRUE, plot = TRUE)
##
## Data: lrm$fitted.values in 109 controls (train_data$heart_disease 0) < 98 cases (train_data$heart_disease 1).
## Area under the curve: 94.2%
The curve show us a significant difference from random classifier (wild guess).And the AOC is 93.93% which is very good.
# Make a data frame about the train model
lrm_train_df=data.frame(pred_prob=lrm$fitted.values,actual=train_data$heart_disease)
lrm_train_df=lrm_train_df %>% arrange(pred_prob)
lrm_train_df$order=1:nrow(lrm_train_df)
# Here we consider probability < 0.5 as no heart disease, while probability >=0.5 as having heart disease.
lrm_train_df$pred_result=ifelse(lrm_train_df$pred_prob>=0.5,"Heart Disease","No Heart Disease")
lrm_train_df$actual_result=ifelse(lrm_train_df$actual==1,"Heart Disease","No Heart Disease")
ggplot(lrm_train_df)+geom_point(aes(x=order,y=pred_prob,color=actual_result))+ylab("Predicted Probaility")+labs(title="Logistic Regression Model for Heart Disease - Trained Dataset")+scale_color_manual(values=c("red","blue"))+theme_classic()
From the graph,red dots (representing heart disease) are shown on the top while blue dots (representing no heart disease) are shown on the bottom. Their segmentation is quite clear overall. Therefore this logistic regression model is quite sufficient to distinguish between have/have no heart disease.
# make predictions using hold out data (test)
test_data=test_data %>% mutate(actual=ifelse(heart_disease==1,"Heart Disease","No Heart Disease"))
test_data_pred=predict(lrm,newdata=test_data, type="response")
lrm_test_df=data.frame(pred_prob=test_data_pred,actual=test_data$actual)
lrm_test_df=lrm_test_df %>% arrange(pred_prob)
lrm_test_df$order=1:nrow(lrm_test_df)
# Same as above in train data, we consider the predicted test data's probability < 0.5 as no heart disease, while probability >=0.5 as having heart disease.
lrm_test_df$pred_result=ifelse(lrm_test_df$pred_prob>=0.5,"Heart Disease","No Heart Disease")
ggplot(lrm_test_df)+geom_point(aes(x=order,y=pred_prob,color=actual))+ylab("Predicted Probaility")+labs(title="Logistic Regression Model for Heart Disease - Test Dataset")+scale_color_manual(values=c("red","blue"))+theme_classic()
From the graph,red dots (representing heart disease) are shown on the top while blue dots (representing no heart disease) are shown on the bottom. Their segmentation is very clear which imply a good logistic regression model in predicting heart disease. Therefore, the test dataset is distinguished easily between have/have no heart disease.
# To quantify our test result, we make a confusion matrix.
lrm_test_cm=confusionMatrix(as.factor(lrm_test_df$actual),as.factor(lrm_test_df$pred_result),positive ="Heart Disease")
lrm_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Heart Disease No Heart Disease
## Heart Disease 31 8
## No Heart Disease 6 45
##
## Accuracy : 0.8444
## 95% CI : (0.7528, 0.9123)
## No Information Rate : 0.5889
## P-Value [Acc > NIR] : 1.588e-07
##
## Kappa : 0.6813
##
## Mcnemar's Test P-Value : 0.7893
##
## Sensitivity : 0.8378
## Specificity : 0.8491
## Pos Pred Value : 0.7949
## Neg Pred Value : 0.8824
## Prevalence : 0.4111
## Detection Rate : 0.3444
## Detection Prevalence : 0.4333
## Balanced Accuracy : 0.8434
##
## 'Positive' Class : Heart Disease
##
Out of 90 samples in test dataset, True Positive=36, True Negative=42, False Positive=4, False Negative=8. The overall accuracy is 87%. This demonstrated a good prediction from the logistic regression model.
The model built from Logistics Regression is quite high in accuracy. Both train and test data demonstrated a satisfactory result in heart disease prediction.