Preparing Data

For this example, I’m going to get a real data set from the U.S. machine learning repository. Specifically, I choosed the heart disease data set.Starting by making a variable called URL and set it to the location of the data. This is how we read the data set into our from the URL. The head function shows us the first six rows of data.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

data <- read.csv(url, header=FALSE)
head(data)
##   V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
## 1 63  1  1 145 233  1  2 150  0 2.3   3 0.0 6.0   0
## 2 67  1  4 160 286  0  2 108  1 1.5   2 3.0 3.0   2
## 3 67  1  4 120 229  0  2 129  1 2.6   2 2.0 7.0   1
## 4 37  1  3 130 250  0  0 187  0 3.5   3 0.0 3.0   0
## 5 41  0  2 130 204  0  2 172  0 1.4   1 0.0 3.0   0
## 6 56  1  2 120 236  0  0 178  0 0.8   1 0.0 3.0   0
glimpse(data)
## Rows: 303
## Columns: 14
## $ V1  <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57,...
## $ V2  <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, ...
## $ V3  <dbl> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1, ...
## $ V4  <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130...
## $ V5  <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256...
## $ V6  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ V7  <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, ...
## $ V8  <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142...
## $ V9  <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ V10 <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6...
## $ V11 <dbl> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2, ...
## $ V12 <chr> "0.0", "3.0", "2.0", "0.0", "0.0", "0.0", "2.0", "0.0", "1.0", ...
## $ V13 <chr> "6.0", "3.0", "7.0", "3.0", "3.0", "3.0", "3.0", "3.0", "7.0", ...
## $ V14 <int> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, ...
table(is.na(data))
## 
## FALSE 
##  4242

Unfortunately, none of the columns are labeled.So we named the columns after the names that were listed on the U.S. website.

colnames(data) <- c(
  "age",
  "sex",
  "cp",
  "trestbps",
  "chol",
  "fbs",
  "restecg",
  "thalach",
  "exang",
  "oldpeak",
  "slope",
  "ca",
  "thal",
  "hd"
)

However, the Sture function, which describes the structure of the data, tell us that some of the columns are messed up. Right now, sex is a number, but it’s supposed to be a factor where zero represents female and one represents male. cp, a.k.a. chest pain, is also supposed to be a factor where levels 1,2,3 represent different types of pain and 4 represents no chest pain. ca and thal are correctly called factors, but one of the levels is question mark (?) when we needed to be NA.So data have to be cleaned first before using

data[data == "?"] <- NA
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"

data$sex <- as.factor(data$sex)
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)

str(data)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : chr  "0.0" "3.0" "2.0" "0.0" ...
##  $ thal    : chr  "6.0" "3.0" "7.0" "3.0" ...
##  $ hd      : int  0 2 1 0 0 0 3 0 2 1 ...
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca) 
data$thal <- as.integer(data$thal)
data$thal <- as.factor(data$thal)

data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd)
str(data)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
##  $ thal    : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
##  $ hd      : Factor w/ 2 levels "Healthy","Unhealthy": 1 2 2 1 1 1 2 1 2 2 ...
nrow(data[is.na(data$ca) | is.na(data$thal),])
## [1] 6
data[is.na(data$ca) | is.na(data$thal),]
##     age sex cp trestbps chol fbs restecg thalach exang oldpeak slope   ca thal
## 88   53   F  3      128  216   0       2     115     0     0.0     1    0 <NA>
## 167  52   M  3      138  223   0       0     169     0     0.0     1 <NA>    3
## 193  43   M  4      132  247   1       2     143     1     0.1     2 <NA>    7
## 267  52   M  4      128  204   1       0     156     1     1.0     2    0 <NA>
## 288  58   M  2      125  220   0       0     144     0     0.4     2 <NA>    7
## 303  38   M  3      138  175   0       0     173     0     0.0     1 <NA>    3
##            hd
## 88    Healthy
## 167   Healthy
## 193 Unhealthy
## 267 Unhealthy
## 288   Healthy
## 303   Healthy
data <- data[!(is.na(data$ca) | is.na(data$thal)),]

Data Splitting

The data is split into train data and test data using the caret package, where we perform random sampling based on hd and generate train data from 70% of the entire dataset and test data from 30% of the entire dataset.

set.seed(100)
trainIndex <- createDataPartition(data$hd, p = .7, 
                                  list = FALSE, 
                                  times = 1)
head(trainIndex)
##      Resample1
## [1,]         2
## [2,]         3
## [3,]         4
## [4,]         5
## [5,]         6
## [6,]         7
train<-data[trainIndex,]
test<-data[-trainIndex,]

##Stepwise Regression

The beginning of Regression by assigning 2 new variables (models) : null_model and full_model were stand for the model with does not have any predictors and all valid predictors. These models are our starting and ending points, stepwise regression all possible pairs of predictors are iterated into the model.We use the function glm(), with the binomial family (logistic regression). Both the null and full models are then used for the stepwise process using the function step() with forward direction.

null_model<-glm(hd~1,data=data,family='binomial')
summary(null_model)
## 
## Call:
## glm(formula = hd ~ 1, family = "binomial", data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.112  -1.112  -1.112   1.244   1.244  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.1552     0.1164  -1.333    0.182
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 409.95  on 296  degrees of freedom
## AIC: 411.95
## 
## Number of Fisher Scoring iterations: 3
full_model<-glm(hd~.,data=data,family='binomial')
summary(full_model)
## 
## Call:
## glm(formula = hd ~ ., family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0490  -0.4847  -0.1213   0.3039   2.9086  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.253978   2.960399  -2.113 0.034640 *  
## age         -0.023508   0.025122  -0.936 0.349402    
## sexM         1.670152   0.552486   3.023 0.002503 ** 
## cp2          1.448396   0.809136   1.790 0.073446 .  
## cp3          0.393353   0.700338   0.562 0.574347    
## cp4          2.373287   0.709094   3.347 0.000817 ***
## trestbps     0.027720   0.011748   2.359 0.018300 *  
## chol         0.004445   0.004091   1.087 0.277253    
## fbs1        -0.574079   0.592539  -0.969 0.332622    
## restecg1     1.000887   2.638393   0.379 0.704424    
## restecg2     0.486408   0.396327   1.227 0.219713    
## thalach     -0.019695   0.011717  -1.681 0.092781 .  
## exang1       0.653306   0.447445   1.460 0.144267    
## oldpeak      0.390679   0.239173   1.633 0.102373    
## slope2       1.302289   0.486197   2.679 0.007395 ** 
## slope3       0.606760   0.939324   0.646 0.518309    
## ca1          2.237444   0.514770   4.346 1.38e-05 ***
## ca2          3.271852   0.785123   4.167 3.08e-05 ***
## ca3          2.188715   0.928644   2.357 0.018428 *  
## thal6       -0.168439   0.810310  -0.208 0.835331    
## thal7        1.433319   0.440567   3.253 0.001141 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 183.10  on 276  degrees of freedom
## AIC: 225.1
## 
## Number of Fisher Scoring iterations: 6
step_model <- step(null_model, 
                   scope = list(lower = null_model,
                                upper = full_model),
                   direction = "forward")
## Start:  AIC=411.95
## hd ~ 1
## 
##            Df Deviance    AIC
## + thal      2   323.39 329.39
## + cp        3   328.75 336.75
## + ca        3   333.93 341.93
## + oldpeak   1   350.48 354.48
## + thalach   1   351.97 355.97
## + exang     1   355.48 359.48
## + slope     2   365.16 371.16
## + sex       1   386.12 390.12
## + age       1   394.25 398.25
## + restecg   2   400.28 406.28
## + trestbps  1   402.88 406.88
## <none>          409.95 411.95
## + chol      1   408.03 412.03
## + fbs       1   409.94 413.94
## 
## Step:  AIC=329.39
## hd ~ thal
## 
##            Df Deviance    AIC
## + ca        3   268.65 280.65
## + cp        3   275.86 287.86
## + thalach   1   288.27 296.27
## + oldpeak   1   294.27 302.27
## + exang     1   295.84 303.84
## + slope     2   300.98 310.98
## + age       1   312.61 320.61
## + restecg   2   311.48 321.48
## + sex       1   320.64 328.64
## + trestbps  1   320.68 328.68
## + chol      1   320.89 328.89
## <none>          323.39 329.39
## + fbs       1   322.99 330.99
## 
## Step:  AIC=280.65
## hd ~ thal + ca
## 
##            Df Deviance    AIC
## + cp        3   230.40 248.40
## + exang     1   245.07 259.07
## + thalach   1   246.15 260.15
## + oldpeak   1   247.64 261.64
## + slope     2   246.33 262.33
## + restecg   2   259.55 275.55
## + sex       1   265.29 279.29
## + fbs       1   265.81 279.81
## + trestbps  1   266.15 280.15
## <none>          268.65 280.65
## + chol      1   267.97 281.97
## + age       1   268.36 282.36
## 
## Step:  AIC=248.4
## hd ~ thal + ca + cp
## 
##            Df Deviance    AIC
## + oldpeak   1   213.32 233.32
## + slope     2   212.54 234.54
## + thalach   1   221.44 241.44
## + exang     1   223.23 243.23
## + trestbps  1   226.04 246.04
## + restecg   2   224.44 246.44
## + sex       1   226.74 246.74
## <none>          230.40 248.40
## + chol      1   229.38 249.38
## + age       1   230.00 250.00
## + fbs       1   230.12 250.12
## 
## Step:  AIC=233.32
## hd ~ thal + ca + cp + oldpeak
## 
##            Df Deviance    AIC
## + slope     2   204.79 228.79
## + thalach   1   209.22 231.22
## + exang     1   209.42 231.42
## + sex       1   209.88 231.88
## + trestbps  1   209.97 231.97
## + restecg   2   208.94 232.94
## <none>          213.32 233.32
## + chol      1   212.72 234.72
## + age       1   213.05 235.05
## + fbs       1   213.17 235.17
## 
## Step:  AIC=228.79
## hd ~ thal + ca + cp + oldpeak + slope
## 
##            Df Deviance    AIC
## + sex       1   198.20 224.20
## + trestbps  1   201.03 227.03
## + exang     1   201.93 227.93
## <none>          204.79 228.79
## + thalach   1   203.34 229.34
## + restecg   2   201.52 229.52
## + chol      1   204.48 230.48
## + fbs       1   204.63 230.63
## + age       1   204.79 230.79
## 
## Step:  AIC=224.2
## hd ~ thal + ca + cp + oldpeak + slope + sex
## 
##            Df Deviance    AIC
## + trestbps  1   192.57 220.57
## + exang     1   194.97 222.97
## + thalach   1   195.92 223.92
## <none>          198.20 224.20
## + chol      1   196.71 224.71
## + restecg   2   195.02 225.02
## + age       1   197.94 225.94
## + fbs       1   197.97 225.97
## 
## Step:  AIC=220.57
## hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps
## 
##           Df Deviance    AIC
## + exang    1   189.76 219.76
## + thalach  1   189.78 219.78
## <none>         192.57 220.57
## + chol     1   191.53 221.53
## + fbs      1   191.85 221.85
## + restecg  2   190.51 222.51
## + age      1   192.56 222.56
## 
## Step:  AIC=219.76
## hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps + exang
## 
##           Df Deviance    AIC
## + thalach  1   187.74 219.74
## <none>         189.76 219.76
## + chol     1   188.74 220.74
## + fbs      1   188.85 220.85
## + age      1   189.75 221.75
## + restecg  2   187.77 221.77
## 
## Step:  AIC=219.74
## hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps + exang + 
##     thalach
## 
##           Df Deviance    AIC
## <none>         187.74 219.74
## + chol     1   186.40 220.40
## + fbs      1   186.88 220.88
## + age      1   187.29 221.29
## + restecg  2   185.76 221.76
summary(step_model)
## 
## Call:
## glm(formula = hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps + 
##     exang + thalach, family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0614  -0.4830  -0.1201   0.3257   2.9129  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.51296    2.45370  -2.654 0.007946 ** 
## thal6       -0.37503    0.78094  -0.480 0.631064    
## thal7        1.37461    0.42845   3.208 0.001335 ** 
## ca1          2.16201    0.49319   4.384 1.17e-05 ***
## ca2          2.94081    0.72291   4.068 4.74e-05 ***
## ca3          2.00426    0.89728   2.234 0.025502 *  
## cp2          1.46339    0.79116   1.850 0.064359 .  
## cp3          0.30696    0.68387   0.449 0.653536    
## cp4          2.44156    0.69208   3.528 0.000419 ***
## oldpeak      0.44279    0.22944   1.930 0.053623 .  
## slope2       1.31410    0.47721   2.754 0.005893 ** 
## slope3       0.56249    0.91149   0.617 0.537160    
## sexM         1.55489    0.52222   2.977 0.002907 ** 
## trestbps     0.02493    0.01070   2.329 0.019861 *  
## exang1       0.63056    0.43943   1.435 0.151300    
## thalach     -0.01493    0.01071  -1.394 0.163450    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 409.95  on 296  degrees of freedom
## Residual deviance: 187.74  on 281  degrees of freedom
## AIC: 219.74
## 
## Number of Fisher Scoring iterations: 6

##Calculation of Probability After the model is created, we use the same dataset as a test dataset and try to predict the probability of hd. After the probabilities are generated, we need to classify the binary events (have or have not) based on the probability. The probability of 0.5 (50%) is commonly used as a benchmark for binary events, i.e., any probability above 50% is considered a “have” and vice versa

test$prob<-predict(step_model,test,type='response')
test$death_pred<-ifelse(test$prob>=0.5,1,0)

After the classification process, we can create the confusion matrix to see how accurately our model can predict the hd The confusion matrix below shows that 78 out of 89 observations are predicted correctly (87% accuracy). We can also see that the model predicts the number of people who have “hd” (43 correct predictions) better than the number of people havent (35 correct predictions).

table(test$death_pred,test$hd)
##    
##     Healthy Unhealthy
##   0      43         6
##   1       5        35

##ROC & AUC

Moreover, generating a ROC curve using the pROC package to see the specificity and sensitivity of the model. We can also generate the area below the curve (AUC) from the ROC curve.

plot(roc(test$hd,test$death_pred),col='red')
## Setting levels: control = Healthy, case = Unhealthy
## Setting direction: controls < cases

auc(roc(test$hd,test$death_pred))
## Setting levels: control = Healthy, case = Unhealthy
## Setting direction: controls < cases
## Area under the curve: 0.8747

##Regression Plot

test %>% 
  arrange(prob) %>% 
  mutate(rank=rank(prob),Event=ifelse(prob>=0.5,'have','have not')) %>% 
  ggplot(aes(rank,prob))+
  geom_point(aes(color=Event))+
  geom_smooth(method = "glm", 
              method.args = list(family = "binomial")) +
  ggtitle('Stepwise Logistic Regression',subtitle='Predicting diagnosis of heart disease')
## `geom_smooth()` using formula 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

To sum up, the correctness and AUC values for this model points out to that the Stepwise Logistic Regression model is a appropriate model for predict diagnosis heart disease