For this example, I’m going to get a real data set from the U.S. machine learning repository. Specifically, I choosed the heart disease data set.Starting by making a variable called URL and set it to the location of the data. This is how we read the data set into our from the URL. The head function shows us the first six rows of data.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header=FALSE)
head(data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
glimpse(data)
## Rows: 303
## Columns: 14
## $ V1 <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57,...
## $ V2 <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, ...
## $ V3 <dbl> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1, ...
## $ V4 <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130...
## $ V5 <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256...
## $ V6 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ V7 <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, ...
## $ V8 <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142...
## $ V9 <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ V10 <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6...
## $ V11 <dbl> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2, ...
## $ V12 <chr> "0.0", "3.0", "2.0", "0.0", "0.0", "0.0", "2.0", "0.0", "1.0", ...
## $ V13 <chr> "6.0", "3.0", "7.0", "3.0", "3.0", "3.0", "3.0", "3.0", "7.0", ...
## $ V14 <int> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, ...
table(is.na(data))
##
## FALSE
## 4242
Unfortunately, none of the columns are labeled.So we named the columns after the names that were listed on the U.S. website.
colnames(data) <- c(
"age",
"sex",
"cp",
"trestbps",
"chol",
"fbs",
"restecg",
"thalach",
"exang",
"oldpeak",
"slope",
"ca",
"thal",
"hd"
)
However, the Sture function, which describes the structure of the data, tell us that some of the columns are messed up. Right now, sex is a number, but it’s supposed to be a factor where zero represents female and one represents male. cp, a.k.a. chest pain, is also supposed to be a factor where levels 1,2,3 represent different types of pain and 4 represents no chest pain. ca and thal are correctly called factors, but one of the levels is question mark (?) when we needed to be NA.So data have to be cleaned first before using
data[data == "?"] <- NA
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
data$sex <- as.factor(data$sex)
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : chr "0.0" "3.0" "2.0" "0.0" ...
## $ thal : chr "6.0" "3.0" "7.0" "3.0" ...
## $ hd : int 0 2 1 0 0 0 3 0 2 1 ...
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)
data$thal <- as.integer(data$thal)
data$thal <- as.factor(data$thal)
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd)
str(data)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
## $ thal : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
## $ hd : Factor w/ 2 levels "Healthy","Unhealthy": 1 2 2 1 1 1 2 1 2 2 ...
nrow(data[is.na(data$ca) | is.na(data$thal),])
## [1] 6
data[is.na(data$ca) | is.na(data$thal),]
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 88 53 F 3 128 216 0 2 115 0 0.0 1 0 <NA>
## 167 52 M 3 138 223 0 0 169 0 0.0 1 <NA> 3
## 193 43 M 4 132 247 1 2 143 1 0.1 2 <NA> 7
## 267 52 M 4 128 204 1 0 156 1 1.0 2 0 <NA>
## 288 58 M 2 125 220 0 0 144 0 0.4 2 <NA> 7
## 303 38 M 3 138 175 0 0 173 0 0.0 1 <NA> 3
## hd
## 88 Healthy
## 167 Healthy
## 193 Unhealthy
## 267 Unhealthy
## 288 Healthy
## 303 Healthy
data <- data[!(is.na(data$ca) | is.na(data$thal)),]
The data is split into train data and test data using the caret package, where we perform random sampling based on hd and generate train data from 70% of the entire dataset and test data from 30% of the entire dataset.
set.seed(100)
trainIndex <- createDataPartition(data$hd, p = .7,
list = FALSE,
times = 1)
head(trainIndex)
## Resample1
## [1,] 2
## [2,] 3
## [3,] 4
## [4,] 5
## [5,] 6
## [6,] 7
train<-data[trainIndex,]
test<-data[-trainIndex,]
##Stepwise Regression
The beginning of Regression by assigning 2 new variables (models) : null_model and full_model were stand for the model with does not have any predictors and all valid predictors. These models are our starting and ending points, stepwise regression all possible pairs of predictors are iterated into the model.We use the function glm(), with the binomial family (logistic regression). Both the null and full models are then used for the stepwise process using the function step() with forward direction.
null_model<-glm(hd~1,data=data,family='binomial')
summary(null_model)
##
## Call:
## glm(formula = hd ~ 1, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.112 -1.112 -1.112 1.244 1.244
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1552 0.1164 -1.333 0.182
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 409.95 on 296 degrees of freedom
## AIC: 411.95
##
## Number of Fisher Scoring iterations: 3
full_model<-glm(hd~.,data=data,family='binomial')
summary(full_model)
##
## Call:
## glm(formula = hd ~ ., family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0490 -0.4847 -0.1213 0.3039 2.9086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.253978 2.960399 -2.113 0.034640 *
## age -0.023508 0.025122 -0.936 0.349402
## sexM 1.670152 0.552486 3.023 0.002503 **
## cp2 1.448396 0.809136 1.790 0.073446 .
## cp3 0.393353 0.700338 0.562 0.574347
## cp4 2.373287 0.709094 3.347 0.000817 ***
## trestbps 0.027720 0.011748 2.359 0.018300 *
## chol 0.004445 0.004091 1.087 0.277253
## fbs1 -0.574079 0.592539 -0.969 0.332622
## restecg1 1.000887 2.638393 0.379 0.704424
## restecg2 0.486408 0.396327 1.227 0.219713
## thalach -0.019695 0.011717 -1.681 0.092781 .
## exang1 0.653306 0.447445 1.460 0.144267
## oldpeak 0.390679 0.239173 1.633 0.102373
## slope2 1.302289 0.486197 2.679 0.007395 **
## slope3 0.606760 0.939324 0.646 0.518309
## ca1 2.237444 0.514770 4.346 1.38e-05 ***
## ca2 3.271852 0.785123 4.167 3.08e-05 ***
## ca3 2.188715 0.928644 2.357 0.018428 *
## thal6 -0.168439 0.810310 -0.208 0.835331
## thal7 1.433319 0.440567 3.253 0.001141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 183.10 on 276 degrees of freedom
## AIC: 225.1
##
## Number of Fisher Scoring iterations: 6
step_model <- step(null_model,
scope = list(lower = null_model,
upper = full_model),
direction = "forward")
## Start: AIC=411.95
## hd ~ 1
##
## Df Deviance AIC
## + thal 2 323.39 329.39
## + cp 3 328.75 336.75
## + ca 3 333.93 341.93
## + oldpeak 1 350.48 354.48
## + thalach 1 351.97 355.97
## + exang 1 355.48 359.48
## + slope 2 365.16 371.16
## + sex 1 386.12 390.12
## + age 1 394.25 398.25
## + restecg 2 400.28 406.28
## + trestbps 1 402.88 406.88
## <none> 409.95 411.95
## + chol 1 408.03 412.03
## + fbs 1 409.94 413.94
##
## Step: AIC=329.39
## hd ~ thal
##
## Df Deviance AIC
## + ca 3 268.65 280.65
## + cp 3 275.86 287.86
## + thalach 1 288.27 296.27
## + oldpeak 1 294.27 302.27
## + exang 1 295.84 303.84
## + slope 2 300.98 310.98
## + age 1 312.61 320.61
## + restecg 2 311.48 321.48
## + sex 1 320.64 328.64
## + trestbps 1 320.68 328.68
## + chol 1 320.89 328.89
## <none> 323.39 329.39
## + fbs 1 322.99 330.99
##
## Step: AIC=280.65
## hd ~ thal + ca
##
## Df Deviance AIC
## + cp 3 230.40 248.40
## + exang 1 245.07 259.07
## + thalach 1 246.15 260.15
## + oldpeak 1 247.64 261.64
## + slope 2 246.33 262.33
## + restecg 2 259.55 275.55
## + sex 1 265.29 279.29
## + fbs 1 265.81 279.81
## + trestbps 1 266.15 280.15
## <none> 268.65 280.65
## + chol 1 267.97 281.97
## + age 1 268.36 282.36
##
## Step: AIC=248.4
## hd ~ thal + ca + cp
##
## Df Deviance AIC
## + oldpeak 1 213.32 233.32
## + slope 2 212.54 234.54
## + thalach 1 221.44 241.44
## + exang 1 223.23 243.23
## + trestbps 1 226.04 246.04
## + restecg 2 224.44 246.44
## + sex 1 226.74 246.74
## <none> 230.40 248.40
## + chol 1 229.38 249.38
## + age 1 230.00 250.00
## + fbs 1 230.12 250.12
##
## Step: AIC=233.32
## hd ~ thal + ca + cp + oldpeak
##
## Df Deviance AIC
## + slope 2 204.79 228.79
## + thalach 1 209.22 231.22
## + exang 1 209.42 231.42
## + sex 1 209.88 231.88
## + trestbps 1 209.97 231.97
## + restecg 2 208.94 232.94
## <none> 213.32 233.32
## + chol 1 212.72 234.72
## + age 1 213.05 235.05
## + fbs 1 213.17 235.17
##
## Step: AIC=228.79
## hd ~ thal + ca + cp + oldpeak + slope
##
## Df Deviance AIC
## + sex 1 198.20 224.20
## + trestbps 1 201.03 227.03
## + exang 1 201.93 227.93
## <none> 204.79 228.79
## + thalach 1 203.34 229.34
## + restecg 2 201.52 229.52
## + chol 1 204.48 230.48
## + fbs 1 204.63 230.63
## + age 1 204.79 230.79
##
## Step: AIC=224.2
## hd ~ thal + ca + cp + oldpeak + slope + sex
##
## Df Deviance AIC
## + trestbps 1 192.57 220.57
## + exang 1 194.97 222.97
## + thalach 1 195.92 223.92
## <none> 198.20 224.20
## + chol 1 196.71 224.71
## + restecg 2 195.02 225.02
## + age 1 197.94 225.94
## + fbs 1 197.97 225.97
##
## Step: AIC=220.57
## hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps
##
## Df Deviance AIC
## + exang 1 189.76 219.76
## + thalach 1 189.78 219.78
## <none> 192.57 220.57
## + chol 1 191.53 221.53
## + fbs 1 191.85 221.85
## + restecg 2 190.51 222.51
## + age 1 192.56 222.56
##
## Step: AIC=219.76
## hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps + exang
##
## Df Deviance AIC
## + thalach 1 187.74 219.74
## <none> 189.76 219.76
## + chol 1 188.74 220.74
## + fbs 1 188.85 220.85
## + age 1 189.75 221.75
## + restecg 2 187.77 221.77
##
## Step: AIC=219.74
## hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps + exang +
## thalach
##
## Df Deviance AIC
## <none> 187.74 219.74
## + chol 1 186.40 220.40
## + fbs 1 186.88 220.88
## + age 1 187.29 221.29
## + restecg 2 185.76 221.76
summary(step_model)
##
## Call:
## glm(formula = hd ~ thal + ca + cp + oldpeak + slope + sex + trestbps +
## exang + thalach, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0614 -0.4830 -0.1201 0.3257 2.9129
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.51296 2.45370 -2.654 0.007946 **
## thal6 -0.37503 0.78094 -0.480 0.631064
## thal7 1.37461 0.42845 3.208 0.001335 **
## ca1 2.16201 0.49319 4.384 1.17e-05 ***
## ca2 2.94081 0.72291 4.068 4.74e-05 ***
## ca3 2.00426 0.89728 2.234 0.025502 *
## cp2 1.46339 0.79116 1.850 0.064359 .
## cp3 0.30696 0.68387 0.449 0.653536
## cp4 2.44156 0.69208 3.528 0.000419 ***
## oldpeak 0.44279 0.22944 1.930 0.053623 .
## slope2 1.31410 0.47721 2.754 0.005893 **
## slope3 0.56249 0.91149 0.617 0.537160
## sexM 1.55489 0.52222 2.977 0.002907 **
## trestbps 0.02493 0.01070 2.329 0.019861 *
## exang1 0.63056 0.43943 1.435 0.151300
## thalach -0.01493 0.01071 -1.394 0.163450
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 409.95 on 296 degrees of freedom
## Residual deviance: 187.74 on 281 degrees of freedom
## AIC: 219.74
##
## Number of Fisher Scoring iterations: 6
##Calculation of Probability After the model is created, we use the same dataset as a test dataset and try to predict the probability of hd. After the probabilities are generated, we need to classify the binary events (have or have not) based on the probability. The probability of 0.5 (50%) is commonly used as a benchmark for binary events, i.e., any probability above 50% is considered a “have” and vice versa
test$prob<-predict(step_model,test,type='response')
test$death_pred<-ifelse(test$prob>=0.5,1,0)
After the classification process, we can create the confusion matrix to see how accurately our model can predict the hd The confusion matrix below shows that 78 out of 89 observations are predicted correctly (87% accuracy). We can also see that the model predicts the number of people who have “hd” (43 correct predictions) better than the number of people havent (35 correct predictions).
table(test$death_pred,test$hd)
##
## Healthy Unhealthy
## 0 43 6
## 1 5 35
##ROC & AUC
Moreover, generating a ROC curve using the pROC package to see the specificity and sensitivity of the model. We can also generate the area below the curve (AUC) from the ROC curve.
plot(roc(test$hd,test$death_pred),col='red')
## Setting levels: control = Healthy, case = Unhealthy
## Setting direction: controls < cases
auc(roc(test$hd,test$death_pred))
## Setting levels: control = Healthy, case = Unhealthy
## Setting direction: controls < cases
## Area under the curve: 0.8747
##Regression Plot
test %>%
arrange(prob) %>%
mutate(rank=rank(prob),Event=ifelse(prob>=0.5,'have','have not')) %>%
ggplot(aes(rank,prob))+
geom_point(aes(color=Event))+
geom_smooth(method = "glm",
method.args = list(family = "binomial")) +
ggtitle('Stepwise Logistic Regression',subtitle='Predicting diagnosis of heart disease')
## `geom_smooth()` using formula 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
To sum up, the correctness and AUC values for this model points out to that the Stepwise Logistic Regression model is a appropriate model for predict diagnosis heart disease