Background

This is the heart disease data set from the UCI machine learning repository. http://archive.ics.uci.edu/ml/index.php. For detailed information: https://archive-beta.ics.uci.edu/ml/datasets/heart+disease. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0). The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

Here I will use these variables to build a logistics regression model, to predict if he/ she have heart disease.

Load Library

library(tidyverse)
library(ggplot2)
library(caret)
library(pROC)

Load data

raw_data=read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",header=FALSE)

# Only 14 attributes used:
# 1. #3 (age) 
# 2. #4 (sex) # 0 = female, 1 = male
# 3. #9 (cp) # chest pain: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
# 4. #10 (trestbps) # resting blood pressure (in mm Hg)
# 5. #12 (chol) # serum cholesterol in mg/dl
# 6. #16 (fbs) # fasting blood sugar if less than 120 mg/dl, 1 = TRUE, 0 = FALSE
# 7. #19 (restecg) # resting electrocardiograph results Value 0: normal -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
# 8. #32 (thalach) # maximum heart rate achieved
# 9. #38 (exang) # exercise induced angina (1 = yes; 0 = no)
# 10. #40 (oldpeak) # ST depression induced by exercise relative to rest
# 11. #41 (slope) # the slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping
# 12. #44 (ca) # number of major vessels (0-3) colored by fluoroscopy
# 13. #51 (thal) # this is short of thalium heart scan: 3 = normal (no cold spots), 6 = fixed defect (cold spots during rest and exercise), 7 = reversible defect (when cold spots only appear during exercise)
# 14. #58 (num) (the predicted attribute) # diagnosis of heart disease: 0 if less than or equal to 50% diameter narrowing, 1 if greater than 50% diameter narrowing
summary(raw_data)
##        V1              V2               V3              V4       
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :3.000   Median :130.0  
##  Mean   :54.44   Mean   :0.6799   Mean   :3.158   Mean   :131.7  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
##        V5              V6               V7               V8       
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :241.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.7   Mean   :0.1485   Mean   :0.9901   Mean   :149.6  
##  3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##        V9              V10            V11            V12           
##  Min.   :0.0000   Min.   :0.00   Min.   :1.000   Length:303        
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   Class :character  
##  Median :0.0000   Median :0.80   Median :2.000   Mode  :character  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.601                     
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000                     
##  Max.   :1.0000   Max.   :6.20   Max.   :3.000                     
##      V13                 V14        
##  Length:303         Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000  
##                     Mean   :0.9373  
##                     3rd Qu.:2.0000  
##                     Max.   :4.0000
str(raw_data)
## 'data.frame':    303 obs. of  14 variables:
##  $ V1 : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ V2 : num  1 1 1 1 0 1 0 0 1 1 ...
##  $ V3 : num  1 4 4 3 2 2 4 4 4 4 ...
##  $ V4 : num  145 160 120 130 130 120 140 120 130 140 ...
##  $ V5 : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ V6 : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ V7 : num  2 2 2 0 2 0 2 0 2 2 ...
##  $ V8 : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ V9 : num  0 1 1 0 0 0 0 1 0 1 ...
##  $ V10: num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ V11: num  3 2 2 3 1 1 3 1 2 3 ...
##  $ V12: chr  "0.0" "3.0" "2.0" "0.0" ...
##  $ V13: chr  "6.0" "3.0" "7.0" "3.0" ...
##  $ V14: int  0 2 1 0 0 0 3 0 2 1 ...

Cleaning data

# Rename column
names(raw_data)=c("age","sex","cp","trestbps","chol","fbs", "restecg","thalach", "exang","oldpeak", "slope", "ca","thal","heart_disease")
# Adjust column type
raw_data$sex=as.factor(raw_data$sex)
raw_data$cp=as.factor(raw_data$cp)
raw_data$fbs=as.factor(raw_data$fbs)
raw_data$restecg=as.factor(raw_data$restecg)
raw_data$exang=as.factor(raw_data$exang)
raw_data$slope=as.factor(raw_data$slope)

# value =0 is considered to be no heart disease (0), value >0 is considered to have heart disease (1).
raw_data$heart_disease=as.factor(if_else(raw_data$heart_disease==0,"0","1"))
str(raw_data)
## 'data.frame':    303 obs. of  14 variables:
##  $ age          : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex          : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp           : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps     : num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol         : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs          : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg      : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach      : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang        : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak      : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope        : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca           : chr  "0.0" "3.0" "2.0" "0.0" ...
##  $ thal         : chr  "6.0" "3.0" "7.0" "3.0" ...
##  $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
# "ca" and "thal" contains "?" value. Here we check how many of them.
raw_data %>% filter(ca=="?"|thal=="?")
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
## 1  53   0  3      128  216   0       2     115     0     0.0     1 0.0    ?
## 2  52   1  3      138  223   0       0     169     0     0.0     1   ?  3.0
## 3  43   1  4      132  247   1       2     143     1     0.1     2   ?  7.0
## 4  52   1  4      128  204   1       0     156     1     1.0     2 0.0    ?
## 5  58   1  2      125  220   0       0     144     0     0.4     2   ?  7.0
## 6  38   1  3      138  175   0       0     173     0     0.0     1   ?  3.0
##   heart_disease
## 1             0
## 2             0
## 3             1
## 4             1
## 5             0
## 6             0
# There are 6 rows contain "?" in ca and thal columns. These "?" may be considered as NA. As the number is small compare to our dataset size, they can be dropped from analysis.
raw_data[raw_data=="?"]=NA
raw_data=drop_na(raw_data)
raw_data$ca=as.factor(as.integer(raw_data$ca))
raw_data$thal=as.factor(as.integer(raw_data$thal))
str(raw_data)
## 'data.frame':    297 obs. of  14 variables:
##  $ age          : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex          : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp           : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps     : num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol         : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs          : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg      : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach      : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang        : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak      : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope        : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca           : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
##  $ thal         : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
##  $ heart_disease: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...

View distribution

data=raw_data
xtabs(~ heart_disease + sex, data=data)
##              sex
## heart_disease   0   1
##             0  71  89
##             1  25 112
xtabs(~ heart_disease + cp, data=data)
##              cp
## heart_disease   1   2   3   4
##             0  16  40  65  39
##             1   7   9  18 103
xtabs(~ heart_disease + fbs, data=data)
##              fbs
## heart_disease   0   1
##             0 137  23
##             1 117  20
xtabs(~ heart_disease + restecg, data=data)
##              restecg
## heart_disease  0  1  2
##             0 92  1 67
##             1 55  3 79
xtabs(~ heart_disease + exang, data=data)
##              exang
## heart_disease   0   1
##             0 137  23
##             1  63  74
xtabs(~ heart_disease + slope, data=data)
##              slope
## heart_disease   1   2   3
##             0 103  48   9
##             1  36  89  12
xtabs(~ heart_disease + ca, data=data)
##              ca
## heart_disease   0   1   2   3
##             0 129  21   7   3
##             1  45  44  31  17
xtabs(~ heart_disease + thal, data=data)
##              thal
## heart_disease   3   6   7
##             0 127   6  27
##             1  37  12  88

The distribution look normal and no missing field for specific variables.

Split the clean data-set into training and testing

train_test_split=sample(nrow(data),nrow(data)*0.7,replace=FALSE)
train_data=data[train_test_split,]
test_data=data[-train_test_split,]

Build Logistic Regression Model using train data set

lrm=glm(heart_disease~.,data=train_data,family="binomial")
summary(lrm)
## 
## Call:
## glm(formula = heart_disease ~ ., family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.40710  -0.45629  -0.07388   0.32519   2.87960  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.183e+00  3.740e+00  -1.920 0.054807 .  
## age         -6.991e-03  3.142e-02  -0.222 0.823925    
## sex1         1.631e+00  6.699e-01   2.435 0.014874 *  
## cp2          2.491e+00  1.089e+00   2.287 0.022187 *  
## cp3          1.342e+00  9.194e-01   1.459 0.144479    
## cp4          2.832e+00  9.353e-01   3.027 0.002466 ** 
## trestbps     1.577e-02  1.488e-02   1.060 0.289326    
## chol         5.120e-03  5.819e-03   0.880 0.378906    
## fbs1        -6.767e-01  7.834e-01  -0.864 0.387703    
## restecg1     1.344e+01  1.272e+03   0.011 0.991568    
## restecg2     5.627e-04  4.830e-01   0.001 0.999070    
## thalach     -1.671e-02  1.395e-02  -1.197 0.231144    
## exang1       6.924e-01  5.480e-01   1.264 0.206351    
## oldpeak      5.712e-01  2.853e-01   2.002 0.045260 *  
## slope2       1.672e+00  5.871e-01   2.848 0.004399 ** 
## slope3       6.282e-01  1.048e+00   0.599 0.548856    
## ca1          2.523e+00  6.862e-01   3.676 0.000237 ***
## ca2          3.261e+00  9.433e-01   3.457 0.000547 ***
## ca3          3.013e+00  1.299e+00   2.320 0.020348 *  
## thal6       -2.237e-01  9.098e-01  -0.246 0.805749    
## thal7        1.890e+00  5.458e-01   3.463 0.000535 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 286.38  on 206  degrees of freedom
## Residual deviance: 126.47  on 186  degrees of freedom
## AIC: 168.47
## 
## Number of Fisher Scoring iterations: 15

Some interesting result found: sex1 (male) is more probable to heart disease. Number of major vessels colored by fluoroscopy is significantly important in heart disease.

Make a ROC (Receiver Operating Characteristic) curve about model

roc(train_data$heart_disease,lrm$fitted.values,  plot=TRUE,percent=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = train_data$heart_disease, predictor = lrm$fitted.values,     percent = TRUE, plot = TRUE)
## 
## Data: lrm$fitted.values in 109 controls (train_data$heart_disease 0) < 98 cases (train_data$heart_disease 1).
## Area under the curve: 94.2%

The curve show us a significant difference from random classifier (wild guess).And the AOC is 93.93% which is very good.

Make a visulization about model pattern

# Make a data frame about the train model
lrm_train_df=data.frame(pred_prob=lrm$fitted.values,actual=train_data$heart_disease)
lrm_train_df=lrm_train_df %>% arrange(pred_prob)
lrm_train_df$order=1:nrow(lrm_train_df)
# Here we consider probability < 0.5 as no heart disease, while probability >=0.5 as having heart disease.
lrm_train_df$pred_result=ifelse(lrm_train_df$pred_prob>=0.5,"Heart Disease","No Heart Disease")
lrm_train_df$actual_result=ifelse(lrm_train_df$actual==1,"Heart Disease","No Heart Disease")

ggplot(lrm_train_df)+geom_point(aes(x=order,y=pred_prob,color=actual_result))+ylab("Predicted Probaility")+labs(title="Logistic Regression Model for Heart Disease - Trained Dataset")+scale_color_manual(values=c("red","blue"))+theme_classic()

From the graph,red dots (representing heart disease) are shown on the top while blue dots (representing no heart disease) are shown on the bottom. Their segmentation is quite clear overall. Therefore this logistic regression model is quite sufficient to distinguish between have/have no heart disease.

Apply test model data to the model to check predication result

# make predictions using hold out data (test)
test_data=test_data %>% mutate(actual=ifelse(heart_disease==1,"Heart Disease","No Heart Disease"))
test_data_pred=predict(lrm,newdata=test_data, type="response")

Make a visulization about prediction pattern

lrm_test_df=data.frame(pred_prob=test_data_pred,actual=test_data$actual)
lrm_test_df=lrm_test_df %>% arrange(pred_prob)
lrm_test_df$order=1:nrow(lrm_test_df)
# Same as above in train data, we consider the predicted test data's probability < 0.5 as no heart disease, while probability >=0.5 as having heart disease.
lrm_test_df$pred_result=ifelse(lrm_test_df$pred_prob>=0.5,"Heart Disease","No Heart Disease")

ggplot(lrm_test_df)+geom_point(aes(x=order,y=pred_prob,color=actual))+ylab("Predicted Probaility")+labs(title="Logistic Regression Model for Heart Disease - Test Dataset")+scale_color_manual(values=c("red","blue"))+theme_classic()

From the graph,red dots (representing heart disease) are shown on the top while blue dots (representing no heart disease) are shown on the bottom. Their segmentation is very clear which imply a good logistic regression model in predicting heart disease. Therefore, the test dataset is distinguished easily between have/have no heart disease.

Making a Confusion Matrix for test data

# To quantify our test result, we make a confusion matrix.
lrm_test_cm=confusionMatrix(as.factor(lrm_test_df$actual),as.factor(lrm_test_df$pred_result),positive ="Heart Disease")
lrm_test_cm
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Heart Disease No Heart Disease
##   Heart Disease               31                8
##   No Heart Disease             6               45
##                                           
##                Accuracy : 0.8444          
##                  95% CI : (0.7528, 0.9123)
##     No Information Rate : 0.5889          
##     P-Value [Acc > NIR] : 1.588e-07       
##                                           
##                   Kappa : 0.6813          
##                                           
##  Mcnemar's Test P-Value : 0.7893          
##                                           
##             Sensitivity : 0.8378          
##             Specificity : 0.8491          
##          Pos Pred Value : 0.7949          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.4111          
##          Detection Rate : 0.3444          
##    Detection Prevalence : 0.4333          
##       Balanced Accuracy : 0.8434          
##                                           
##        'Positive' Class : Heart Disease   
## 

Out of 90 samples in test dataset, True Positive=36, True Negative=42, False Positive=4, False Negative=8. The overall accuracy is 87%. This demonstrated a good prediction from the logistic regression model.

Conclusion

The model built from Logistics Regression is quite high in accuracy. Both train and test data demonstrated a satisfactory result in heart disease prediction.