Part 1 Explore the Data

library(readr)
heart <- read_csv("heart.csv")
## View(heart)
str(heart)
## spc_tbl_ [303 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:303] 63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : num [1:303] 1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : num [1:303] 3 2 1 1 0 0 1 1 2 2 ...
##  $ trtbps  : num [1:303] 145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : num [1:303] 233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : num [1:303] 1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : num [1:303] 0 1 0 1 1 1 0 1 1 1 ...
##  $ thalachh: num [1:303] 150 187 172 178 163 148 153 173 162 174 ...
##  $ exng    : num [1:303] 0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num [1:303] 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slp     : num [1:303] 0 0 2 2 2 1 1 2 2 2 ...
##  $ caa     : num [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##  $ thall   : num [1:303] 1 2 2 2 2 1 2 3 3 2 ...
##  $ output  : num [1:303] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_double(),
##   ..   cp = col_double(),
##   ..   trtbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_double(),
##   ..   restecg = col_double(),
##   ..   thalachh = col_double(),
##   ..   exng = col_double(),
##   ..   oldpeak = col_double(),
##   ..   slp = col_double(),
##   ..   caa = col_double(),
##   ..   thall = col_double(),
##   ..   output = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
prop.healthy <- mean(heart$output)
prop.healthy ## proprotion of dataset with heart attack == 0 
## [1] 0.5445545
dim(heart)
## [1] 303  14
colnames(heart)
##  [1] "age"      "sex"      "cp"       "trtbps"   "chol"     "fbs"     
##  [7] "restecg"  "thalachh" "exng"     "oldpeak"  "slp"      "caa"     
## [13] "thall"    "output"

Convert Categorical Numerical Variables to Factors

heart$sex <- as.factor(heart$sex)
heart$cp <- as.factor(heart$cp)
heart$fbs <- as.factor(heart$fbs)
heart$restecg <- as.factor(heart$restecg)
heart$exng <- as.factor(heart$exng)
heart$slp <- as.factor(heart$slp)
heart$thall <- as.factor(heart$thall)

str(heart)
## spc_tbl_ [303 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:303] 63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trtbps  : num [1:303] 145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : num [1:303] 233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalachh: num [1:303] 150 187 172 178 163 148 153 173 162 174 ...
##  $ exng    : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num [1:303] 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slp     : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ caa     : num [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##  $ thall   : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ output  : num [1:303] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_double(),
##   ..   cp = col_double(),
##   ..   trtbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_double(),
##   ..   restecg = col_double(),
##   ..   thalachh = col_double(),
##   ..   exng = col_double(),
##   ..   oldpeak = col_double(),
##   ..   slp = col_double(),
##   ..   caa = col_double(),
##   ..   thall = col_double(),
##   ..   output = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Part 2 Building the Model with base - R commands

set.seed(0368408)
sample_index <- sample(nrow(heart),nrow(heart)*0.80)
heart_train <- heart[sample_index,]
heart_test <- heart[-sample_index,]

heart_glm1 <- glm(output~. , family = binomial, data = heart) 

Step-wise Variable Selection with BIC

n <- dim(heart)[1]
heart.step.BIC <- step(heart_glm1, data = heart_train, k = log(n)) #k=ln(n), BIC
## Start:  AIC=315.97
## output ~ age + sex + cp + trtbps + chol + fbs + restecg + thalachh + 
##     exng + oldpeak + slp + caa + thall
## 
##            Df Deviance    AIC
## - restecg   2   204.11 306.96
## - slp       2   206.23 309.08
## - age       1   201.69 310.26
## - fbs       1   201.79 310.35
## - chol      1   202.92 311.48
## - thall     3   214.80 311.93
## - trtbps    1   204.29 312.85
## - thalachh  1   204.33 312.89
## - exng      1   204.86 313.43
## - oldpeak   1   206.69 315.25
## <none>          201.69 315.97
## - sex       1   210.84 319.40
## - cp        3   223.62 320.75
## - caa       1   219.67 328.23
## 
## Step:  AIC=306.96
## output ~ age + sex + cp + trtbps + chol + fbs + thalachh + exng + 
##     oldpeak + slp + caa + thall
## 
##            Df Deviance    AIC
## - slp       2   209.18 300.60
## - age       1   204.14 301.27
## - fbs       1   204.20 301.33
## - thall     3   216.33 302.04
## - chol      1   206.04 303.18
## - thalachh  1   206.84 303.98
## - trtbps    1   207.01 304.14
## - exng      1   207.17 304.30
## - oldpeak   1   209.23 306.36
## <none>          204.11 306.96
## - cp        3   225.77 311.47
## - sex       1   214.42 311.55
## - caa       1   221.87 319.00
## 
## Step:  AIC=300.6
## output ~ age + sex + cp + trtbps + chol + fbs + thalachh + exng + 
##     oldpeak + caa + thall
## 
##            Df Deviance    AIC
## - fbs       1   209.20 294.90
## - age       1   209.23 294.94
## - chol      1   211.63 297.33
## - trtbps    1   211.82 297.52
## - thall     3   223.37 297.65
## - exng      1   212.86 298.57
## - thalachh  1   214.69 300.39
## <none>          209.18 300.60
## - sex       1   218.29 304.00
## - cp        3   230.22 304.50
## - oldpeak   1   219.49 305.20
## - caa       1   225.52 311.22
## 
## Step:  AIC=294.9
## output ~ age + sex + cp + trtbps + chol + thalachh + exng + oldpeak + 
##     caa + thall
## 
##            Df Deviance    AIC
## - age       1   209.25 289.24
## - chol      1   211.65 291.64
## - trtbps    1   211.82 291.81
## - thall     3   223.47 292.04
## - exng      1   212.86 292.86
## - thalachh  1   214.76 294.75
## <none>          209.20 294.90
## - sex       1   218.29 298.29
## - oldpeak   1   219.56 299.55
## - cp        3   231.16 299.73
## - caa       1   225.57 305.56
## 
## Step:  AIC=289.24
## output ~ sex + cp + trtbps + chol + thalachh + exng + oldpeak + 
##     caa + thall
## 
##            Df Deviance    AIC
## - chol      1   211.92 286.20
## - thall     3   223.57 286.42
## - trtbps    1   212.19 286.47
## - exng      1   212.88 287.16
## <none>          209.25 289.24
## - thalachh  1   216.30 290.58
## - sex       1   218.35 292.63
## - oldpeak   1   219.58 293.86
## - cp        3   231.17 294.02
## - caa       1   226.80 301.08
## 
## Step:  AIC=286.2
## output ~ sex + cp + trtbps + thalachh + exng + oldpeak + caa + 
##     thall
## 
##            Df Deviance    AIC
## - trtbps    1   215.12 283.68
## - exng      1   215.79 284.36
## - thall     3   227.31 284.45
## <none>          211.92 286.20
## - thalachh  1   218.34 286.91
## - sex       1   218.94 287.50
## - cp        3   234.12 291.26
## - oldpeak   1   222.94 291.51
## - caa       1   229.42 297.98
## 
## Step:  AIC=283.68
## output ~ sex + cp + thalachh + exng + oldpeak + caa + thall
## 
##            Df Deviance    AIC
## - exng      1   219.45 282.30
## - thall     3   231.53 282.95
## - thalachh  1   220.39 283.25
## <none>          215.12 283.68
## - sex       1   220.86 283.71
## - cp        3   235.69 287.11
## - oldpeak   1   227.14 289.99
## - caa       1   232.89 295.74
## 
## Step:  AIC=282.3
## output ~ sex + cp + thalachh + oldpeak + caa + thall
## 
##            Df Deviance    AIC
## <none>          219.45 282.30
## - sex       1   225.34 282.48
## - thall     3   238.53 284.24
## - thalachh  1   227.10 284.24
## - oldpeak   1   233.42 290.56
## - caa       1   236.92 294.06
## - cp        3   248.54 294.25
summary(heart.step.BIC) ## call 
## 
## Call:
## glm(formula = output ~ sex + cp + thalachh + oldpeak + caa + 
##     thall, family = binomial, data = heart)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.341160   2.129513  -1.569 0.116653    
## sex1        -1.033325   0.432678  -2.388 0.016931 *  
## cp1          1.240396   0.522946   2.372 0.017695 *  
## cp2          2.039127   0.428742   4.756 1.97e-06 ***
## cp3          1.848414   0.597616   3.093 0.001982 ** 
## thalachh     0.023443   0.008809   2.661 0.007782 ** 
## oldpeak     -0.676539   0.191720  -3.529 0.000417 ***
## caa         -0.732589   0.182095  -4.023 5.74e-05 ***
## thall1       1.338050   1.912754   0.700 0.484214    
## thall2       1.655730   1.806816   0.916 0.359467    
## thall3       0.067853   1.812360   0.037 0.970135    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.64  on 302  degrees of freedom
## Residual deviance: 219.45  on 292  degrees of freedom
## AIC: 241.45
## 
## Number of Fisher Scoring iterations: 5
AIC(heart.step.BIC)
## [1] 241.4458
BIC(heart.step.BIC)
## [1] 282.2969
## Call: 
glm(formula = output ~ sex + cp + thalachh + oldpeak + caa + 
    thall, family = binomial, data = heart)
## 
## Call:  glm(formula = output ~ sex + cp + thalachh + oldpeak + caa + 
##     thall, family = binomial, data = heart)
## 
## Coefficients:
## (Intercept)         sex1          cp1          cp2          cp3     thalachh  
##    -3.34116     -1.03332      1.24040      2.03913      1.84841      0.02344  
##     oldpeak          caa       thall1       thall2       thall3  
##    -0.67654     -0.73259      1.33805      1.65573      0.06785  
## 
## Degrees of Freedom: 302 Total (i.e. Null);  292 Residual
## Null Deviance:       417.6 
## Residual Deviance: 219.4     AIC: 241.4

Find the MSPE of our Model

pred_heart <- predict(heart.step.BIC, heart_test)
MSPE.heart <- mean((heart_test$output - pred_heart)^2)
MSPE.heart
## [1] 5.925563
pred.full <- predict(heart_glm1, heart_test)
MSPE.heart.full <- mean((heart_test$output - pred.full)^2)
MSPE.heart.full
## [1] 7.783089

Part 3 Run the Model with Tidymodels

## install.packages("tidymodels")
library(tidyverse)
library(tidymodels)

## Conduct Stratified Sampling
set.seed(0398408)
split_heart <- initial_split(heart, prop = 0.8, strata = "output")
train_heart <- training(split_heart)
test_heart  <- testing(split_heart)


# original response distribution
table(heart$output) %>% prop.table()
## 
##         0         1 
## 0.4554455 0.5445545
# stratified response distribution for 70% Training
table(train_heart$output) %>% prop.table()
## 
##         0         1 
## 0.4545455 0.5454545
# response distribution for test data, 30%
table(test_heart$output) %>% prop.table()
## 
##         0         1 
## 0.4590164 0.5409836

We’ll use logistic_reg() and fit() from tidymodels here.

  • logistic_reg() defines a generalized linear model for binary outcomes. There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The default is glm.

  • fit() fits a logistic regression model to the training data

library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)
library(parsnip)

train_heart$output <- as.factor(train_heart$output)
## This is a full model
heart_logit <- logistic_reg() %>%
   fit(output ~ sex + cp + thalachh + exng + oldpeak + 
    caa + thall, family = binomial, data = train_heart)

## determine accuracy
library(dplyr)
library(yardstick)

Here we are going to evaluate our model.

  • accuracy() returns the proportion of correct predictions out of the total number of predictions made. It is calculated by dividing the number of correct predictions by the total number of predictions.
heart_logit %>%
  predict(test_heart) %>%
  bind_cols(test_heart %>% select(output)) %>%
  mutate(output = factor(output)) %>%
  accuracy(truth = "output", estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.852
  • This model is performing at about 85.3 % accuracy

Part 4: A Summary of my findings

What do the results tell us:

In the base - R analysis we saw that the MSPE improved from 7. 78 to 5.93 from the full model to the trained model. So using stepwise BIC resulted in more accurate out of sample predictions.

In the tidymodels analysis, we saw that the accuracy level was 0.852459, or 85% accurate.

Therefore, the predictor variables “sex + cp + thalachh + oldpeak + caa + thall” can be used to predict the presence of a heart attack, with 85% accuracy.

What is my preference:

I found that the tidymodels offered a few advantages. The “strat” argument allows us to stratify our sampling, to ensure that proportions of our response variable are distributed in a way that reflects the full sample.

I found that the accuracy() provided a highly interpretable way of understanding how accurate our model might be.

Last, I like that logstic_reg offers several arguments to pass as engine to allow for many different analysis to take place.

About This Data:

The data I referenced, heart.csv, can be found as a sample provided by Kaggle.com https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset