In this tutorial, we’ll be going through a simple example of logistic regression in R. Logistic regression is essentially a way of mathematically predicting a dichotomous outcome variable from one or more predictor variables. At its foundation, it’s much like linear regression, so if you aren’t familiar with linear regression, it may make sense to first check out my tutorial on that topic (I’ll skip over some topics here that I cover in that tutorial). There are some important differences from linear regression, however, several of which we’ll discuss.
What do I mean by a dichotomous outcome variable? This is a variable with exactly two possible values, such as “yes or no”, “positive or negative”, or “true or false”. A prototypical example is whether or not someone is diagnosed with a disease or other health condition. Appropriately, then, we will be predicting heart disease diagnoses in this tutorial.
We will be using a commonly-used dataset for demonstrating statistical models of categorization: the heart disease dataset from the UCI Machine Learning Repository. This repository provides four different heart disease datasets that utilize the same variables. We will specifically be looking at the Hungarian dataset.
Before we get too far ahead of ourselves, however, let’s load in some packages we want to use for our analyses. All packages we want for now can be found in the tidyverse:
#If you don't have the tidyverse installed, first run install.packages("tidyverse")
library(tidyverse)
Now we can load in the dataset. We’ll get it online directly from the UCI Machine Learning Repository.
dat <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data",
col_names = FALSE)
Now that we have the data, let’s take a glimpse at it.
glimpse(dat)
## Rows: 294
## Columns: 14
## $ X1 <dbl> 28, 29, 29, 30, 31, 32, 32, 32, 33, 34, 34, 34, 35, 35, 35, 35, 36~
## $ X2 <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, ~
## $ X3 <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 3, 2, 2, 2, 1, 4, 2, 2, 2, 3, 3, 3, 2, 3, ~
## $ X4 <chr> "130", "120", "140", "170", "100", "105", "110", "125", "120", "13~
## $ X5 <chr> "132", "243", "?", "237", "219", "198", "225", "254", "298", "161"~
## $ X6 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "~
## $ X7 <chr> "2", "0", "0", "1", "1", "0", "0", "0", "0", "0", "1", "0", "1", "~
## $ X8 <chr> "185", "160", "170", "170", "150", "165", "184", "155", "185", "19~
## $ X9 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "~
## $ X10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ~
## $ X11 <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "~
## $ X12 <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "~
## $ X13 <chr> "?", "?", "?", "6", "?", "?", "?", "?", "?", "?", "?", "?", "?", "~
## $ X14 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
Hmm, one big issue is that our variables do not have names. Fortunately, the repository provides variable names on the site. Thus, we’ll add them to the dataset as follows:
colnames(dat) <- c(
"age",
"sex",
"cp",
"trestbps",
"chol",
"fbs",
"restecg",
"thalach",
"exang",
"oldpeak",
"slope",
"ca",
"thal",
"hd"
)
Let’s take another glimpse to make sure this worked.
glimpse(dat)
## Rows: 294
## Columns: 14
## $ age <dbl> 28, 29, 29, 30, 31, 32, 32, 32, 33, 34, 34, 34, 35, 35, 35, 3~
## $ sex <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0~
## $ cp <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 3, 2, 2, 2, 1, 4, 2, 2, 2, 3, 3, 3, 2~
## $ trestbps <chr> "130", "120", "140", "170", "100", "105", "110", "125", "120"~
## $ chol <chr> "132", "243", "?", "237", "219", "198", "225", "254", "298", ~
## $ fbs <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "~
## $ restecg <chr> "2", "0", "0", "1", "1", "0", "0", "0", "0", "0", "1", "0", "~
## $ thalach <chr> "185", "160", "170", "170", "150", "165", "184", "155", "185"~
## $ exang <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "~
## $ oldpeak <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
## $ slope <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "~
## $ ca <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "~
## $ thal <chr> "?", "?", "?", "6", "?", "?", "?", "?", "?", "?", "?", "?", "~
## $ hd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
Excellent! Our variables now have names. There are a few other issues in the dataset that we’d typically want to take care of up front, but for this tutorial I’ll wait to get into those until the relevant point where they’ll become problematic. So stay tuned!
The value from this dataset that we are actually looking to predict is the heart disease variable dat$hd. Currently, this is a numeric variable coded 0 or 1. It makes sense to make this a dichotomous factor with clear labels. We want a 1 value to represent heart disease and a 0 value to represent no heart disease. We can make this change in a few ways. Here’s one way to do so.
dat <- mutate(dat, hd = factor(hd,
levels = c(0, 1),
labels = c("no heart disease",
"heart disease")))
class(dat$hd) #Should indicate the variable is now a factor
## [1] "factor"
table(dat$hd) #Will tell us how many people don't and do have heart disease
##
## no heart disease heart disease
## 188 106
We also want to check if there are any missing values of our dependent variable.
filter(dat, is.na(hd))
## # A tibble: 0 x 14
## # ... with 14 variables: age <dbl>, sex <dbl>, cp <dbl>, trestbps <chr>,
## # chol <chr>, fbs <chr>, restecg <chr>, thalach <chr>, exang <chr>,
## # oldpeak <dbl>, slope <chr>, ca <chr>, thal <chr>, hd <fct>
The above code returns only the rows where hd is missing. It returns zero rows, which indicates there are no missing values. Excellent!
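If you prefer a single number over an empty tibble, counting the missing values directly gives the same answer:
sum(is.na(dat$hd)) #Counts missing values of hd; should return 0, matching the empty result above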
Let’s now split our data into training and test datasets, just like we did in the linear regression tutorial.
set.seed(94) #If you enter this seed, you should get the same results I get on the following random code
dat <- mutate(dat, pID = row_number()) #Gives each row in the original dataset a unique identifier, based on its row number
train <- slice_sample(dat, prop = .7, replace = FALSE) #Takes a randomly-selected 70% of the rows from the original dataset to serve as the training dataset.
test <- filter(dat, !pID %in% train$pID) #Takes all rows from dat with a pID NOT in train to create the new test dataset. In other words, it takes the 30% of rows not included in the train dataset.
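If you want to double-check that the split worked as intended, a couple of quick sanity checks never hurt (this is an optional extra step I like to include):
nrow(train) + nrow(test) == nrow(dat) #Should be TRUE: every row lands in exactly one of the two datasets
nrow(train) / nrow(dat) #Should be roughly 0.7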
Now we can start modeling our data!
Let’s start with a simple potential predictor of heart disease: sex.
Let’s start by checking the sex variable for missing data. We’ll also check the class of the variable.
filter(train, is.na(sex))
## # A tibble: 0 x 15
## # ... with 15 variables: age <dbl>, sex <dbl>, cp <dbl>, trestbps <chr>,
## # chol <chr>, fbs <chr>, restecg <chr>, thalach <chr>, exang <chr>,
## # oldpeak <dbl>, slope <chr>, ca <chr>, thal <chr>, hd <fct>, pID <int>
class(train$sex)
## [1] "numeric"
So good news first: there are no missing values of sex in our training data.
The bad news: sex is currently a numeric variable, and we want it to be a factor, just like our heart disease variable. We can change it in the same way, but we need to be careful! We should make this change in the main data, the training data, and the test data to keep everything consistent. (NOTE: This is why you’d typically want to make these corrections at the beginning of your data analysis. If we’d done this before splitting into train and test data, we’d only need to do it on our main dataset.)
We know from the data documentation on the UCI Repository site that 0 = Female and 1 = Male, so we’ll make that change in each dataset:
dat <- mutate(dat, sex = factor(sex,
levels = c(0, 1),
labels = c("F",
"M")))
train <- mutate(train, sex = factor(sex,
levels = c(0, 1),
labels = c("F",
"M")))
test <- mutate(test, sex = factor(sex,
levels = c(0, 1),
labels = c("F",
"M")))
Now that we have the sex variable factorized, let’s make a simple model predicting heart disease from sex. Here, we’ll be using the glm() function, which stands for “generalized linear model”. (Recall that in linear regression, we used the basic ‘linear model’ function lm().) This glm() call will otherwise be much like our lm() calls from the linear regression tutorial, except we also need to pass a family = "binomial" argument to specify that we are conducting a logistic regression. We can do this like so:
logmodel1 <- glm(hd ~ sex, data = train, family = "binomial")
Now let’s take a look at our model output:
summary(logmodel1)
##
## Call:
## glm(formula = hd ~ sex, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1432 -1.1432 -0.5863 1.2121 1.9214
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6740 0.3632 -4.608 4.06e-06 ***
## sexM 1.5929 0.3988 3.994 6.48e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.23 on 204 degrees of freedom
## Residual deviance: 254.65 on 203 degrees of freedom
## AIC: 258.65
##
## Number of Fisher Scoring iterations: 3
So now we have some good output from our model! But… how do we make sense of it? Let’s take a look.
First, we may notice that we have coefficients, just like we did with linear regression. It makes sense to look at the p-values. We see that the sex parameter has a very low p-value, so a model that accounts for sex predicts heart disease better than one that simply predicts the sample-wide average. But how do we interpret the coefficient estimates?
Much like with linear regression, we can write our model as an equation: -1.67 + 1.59 * ‘patient is male’ (where ‘patient is male’ = 1 when the patient is male and 0 when female). What differs is the left side of the equation, the “y” component. Whereas with linear regression the “y” is simply the predicted outcome variable, here it is the predicted “log odds” of the outcome variable.
Log odds are not in themselves terribly interpretable. So we may want to convert them into a more interpretable metric. Let’s look at an example:
A woman’s log odds of having heart disease are equal to -1.67 + 1.59 * 0, or simply -1.67. We can convert this to “odds” by exponentiating the value:
exp(-1.67)
## [1] 0.1882471
We get a value of approximately 0.19. So the odds that a woman has heart disease, according to this model, are approximately 0.19. What are odds in this context? They are simply the probability of something occurring over the probability of it not occurring (p / (1-p)). When odds are <1, it’s sometimes more interpretable to take the inverse (i.e. the probability of something not occurring over the probability of it occurring).
1/exp(-1.67)
## [1] 5.312168
So from this, we can see that this model predicts that a woman is about 5 times more likely to not have heart disease in this sample than to have it. So once we make some adjustments to our coefficients, the interpretation isn’t too complicated!
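As an aside, if you’d rather think in probabilities than odds, base R’s plogis() converts log odds straight to a predicted probability. This is just an extra illustration and isn’t needed for anything that follows:
plogis(-1.67) #Same as exp(-1.67) / (1 + exp(-1.67)): the model's predicted probability that a woman has heart disease
This works out to roughly 0.16, i.e. the model predicts about a 16% probability that a given woman in this sample has heart disease.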
We can do the same for men. Now the predicted log odds are -1.67 + 1.59 * 1, or simply -0.08. Let’s repeat the calculations we did for women.
exp(-.08)
## [1] 0.9231163
1/exp(-.08)
## [1] 1.083287
So here we see that the odds are approximately 1. This model predicts that men in our sample are about equally likely to have heart disease as to not have it.
Another way to interpret our coefficient is as a log odds ratio. That is, the log odds ratio of men’s risk of heart disease to women’s is approximately 1.59. As before, this becomes more interpretable if we exponentiate it.
exp(1.59)
## [1] 4.903749
exp(coef(logmodel1)[2])
## sexM
## 4.917749
So the odds ratio for heart disease in men vs. women is about 4.9. Practically, this means the model predicts that men in this sample are about 5 times more likely to have heart disease than women.
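If you’d like a sense of the uncertainty around this odds ratio, one extra step is to exponentiate the model’s confidence intervals (output not shown; given the very small p-value, the interval should sit comfortably above 1):
exp(confint(logmodel1)) #Confidence intervals for the coefficients, exponentiated onto the odds-ratio scale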
An attentive reader may note another crucial difference between this output and our linear regression output from our previous tutorial: There’s no R-squared value!
The reason for this is simple: there’s no single, universally agreed-upon way of calculating an R-squared for logistic regression. I won’t go too far into the weeds of explaining this mathematically, but it’s worth at least understanding that it’s mathematically impossible to compute an R-squared using the same method that is used with linear regression.
In lieu of an actual R-squared, many alternative metrics have been proposed. One common metric is McFadden’s pseudo-R-squared. Again, I won’t go too far into the weeds explaining the math behind this metric, but I will link to an excellent video that explains it simply.
For now, suffice it to say that we can easily calculate a pseudo-R-squared from our output. Here’s how:
ll.null <- logmodel1$null.deviance / -2 #Log-likelihood of the null (intercept-only) model
ll.proposed <- logmodel1$deviance / -2 #Log-likelihood of our fitted model
(ll.null - ll.proposed) / ll.null #McFadden's pseudo-R-squared
## [1] 0.07140187
1 - pchisq(2*(ll.proposed - ll.null), df = (length(logmodel1$coefficients)-1))
## [1] 9.644178e-06
The top value above is our pseudo-R-squared, while the second is the p-value for this pseudo-R-squared. The p-value is quite small, so it appears this model outperforms the null model. However, the pseudo-R-squared is not large, so we’re not accounting for much of the variance by simply accounting for sex. Since that’s only one predictor, that’s not surprising.
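Since we’ll be computing this again for later models, it can be handy to wrap the calculation in a small helper function. This is just a convenience I’m sketching here; mcfadden_r2 is my own name for it, not something from a package:
mcfadden_r2 <- function(model) {
  ll.null <- model$null.deviance / -2 #Log-likelihood of the null (intercept-only) model
  ll.proposed <- model$deviance / -2 #Log-likelihood of the fitted model
  pseudo_r2 <- (ll.null - ll.proposed) / ll.null #McFadden's pseudo-R-squared
  p_value <- 1 - pchisq(2 * (ll.proposed - ll.null),
                        df = length(model$coefficients) - 1) #p-value for the improvement over the null model
  c(pseudo_r2 = pseudo_r2, p_value = p_value)
}
mcfadden_r2(logmodel1) #Should match the two values we just calculated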
Let’s now fit a slightly more complicated logistic model. Examining the variables in our dataset, I’ve identified a few potential candidates for good predictors of heart disease. The first is age. It makes sense that older people will have a higher risk of heart disease than younger. We can also include chest pain (the variable “cp”) and resting systolic blood pressure (“trestbps”).
As always, it’s very important to understand the structure of our data, and clean/prepare it if necessary.
We expect age to be a numeric vector. Let’s see if it is:
class(dat$age)
## [1] "numeric"
Looks good! (Foreshadowing warning, though, there’s something else we should have checked here… I’ll come back to that.)
One modification we may want to make is to give our age variable a meaningful zero point to make our model more interpretable. We can make the zero point whatever we want, but it should be sensible. We’ll make the zero point age 50 here, and we’ll name our new variable age_minus_50 so we remember this change:
dat$age_minus_50 <- dat$age - 50
train$age_minus_50 <- train$age - 50
test$age_minus_50 <- test$age - 50
By reading our documentation for this dataset, we know that chest pain should be a factor, where values 1, 2, and 3 represent different types of chest pain (specifically typical angina, atypical angina, and non-anginal pain, respectively), while value 4 represents a lack of chest pain. Let’s take a look at our variable and see if it does reflect this.
class(dat$cp)
## [1] "numeric"
Oh no, it’s a numeric vector. If we leave it that way, our model will treat this variable as a continuous, rather than categorical, measure. This is obviously problematic. We need to fix it.
dat$cp <- factor(dat$cp, labels = c("typical_angina", "atypical_angina", "non-anginal", "asymptomatic"))
train$cp <- factor(train$cp, labels = c("typical_angina", "atypical_angina", "non-anginal", "asymptomatic"))
test$cp <- factor(test$cp, labels = c("typical_angina", "atypical_angina", "non-anginal", "asymptomatic"))
Note that we had to make the change in our full dataset, as well as our train and test datasets. Again, it would have been wise to do this and all other data-cleaning steps earlier in our workflow, but I delayed it until now for teaching purposes.
We may also want to make “asymptomatic” the reference group in our factor, so we will be comparing each type of pain to someone who lacks chest pain. We can do this like so:
dat$cp <- relevel(dat$cp, ref = "asymptomatic")
train$cp <- relevel(train$cp, ref = "asymptomatic")
test$cp <- relevel(test$cp, ref = "asymptomatic")
Let’s check a table of our values to make sure every possible value is represented now:
table(dat$cp, useNA = "always")
##
## asymptomatic typical_angina atypical_angina non-anginal <NA>
## 123 11 106 54 0
Great, every possible value is now represented by a label, and there are no missing values. It is worth noting at this stage that typical angina is less represented than the other levels of the factor. We should keep that in mind when interpreting the model output.
Our blood pressure variable should be a numeric vector. Is it?
class(dat$trestbps)
## [1] "character"
It’s not. It’s treated as a character vector. Strange; let’s take a look at the values of this variable and their frequencies:
table(dat$trestbps)
##
## ? 100 105 106 108 110 112 113 115 118 120 122 124 125 128 130 132 135 136 138
## 1 6 1 1 1 21 3 1 2 2 65 2 2 8 1 54 1 5 1 1
## 140 142 145 150 155 160 170 180 190 200 92 98
## 50 1 5 23 1 20 5 6 1 1 1 1
Looks like every value here is a number, aside from one question mark. We’ll simply tell R to make this vector a numeric one, and the question mark should be converted to a missing value by coercion (the result we want), but we can check this with a table:
dat$trestbps <- as.numeric(dat$trestbps)
## Warning: NAs introduced by coercion
table(dat$trestbps)
##
## 92 98 100 105 106 108 110 112 113 115 118 120 122 124 125 128 130 132 135 136
## 1 1 6 1 1 1 21 3 1 2 2 65 2 2 8 1 54 1 5 1
## 138 140 142 145 150 155 160 170 180 190 200
## 1 50 1 5 23 1 20 5 6 1 1
Good, the question mark was removed as a missing value and the rest of the values are now numbers. (By default, R will simply remove the row with the missing value when running the model. This is ok for now, but always be sure to think carefully about how you treat missing data.)
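If you want to confirm that the question mark became a single NA rather than disappearing entirely, you can also count the missing values directly:
sum(is.na(dat$trestbps)) #Should return 1: the single "?" we saw in the table earlier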
Don’t forget to make sure the train and test datasets are consistent:
train$trestbps <- as.numeric(train$trestbps)
## Warning: NAs introduced by coercion
test$trestbps <- as.numeric(test$trestbps)
Lastly, as previously mentioned, regressions are more interpretable when the variables have a meaningful zero point. A blood pressure of 0 is not terribly meaningful, since no living patient can have that value. There are various ways we could give the variable a meaningful zero point, such as centering around the mean, but we’ll instead center around 120, which is typically regarded as a healthy systolic blood pressure. We’ll name the new variable trestbps_minus_120 so we remember we made this modification.
dat$trestbps_minus_120 <- dat$trestbps - 120
train$trestbps_minus_120 <- train$trestbps - 120
test$trestbps_minus_120 <- test$trestbps - 120
Now we can specify the model:
logmodel2 <- glm(hd ~ sex + age_minus_50 + cp + trestbps_minus_120, data = train, family = "binomial")
and look at the model output:
summary(logmodel2)
##
## Call:
## glm(formula = hd ~ sex + age_minus_50 + cp + trestbps_minus_120,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9088 -0.8054 -0.3420 0.8557 2.3005
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.78348 0.47510 -1.649 0.09913 .
## sexM 1.39526 0.45009 3.100 0.00194 **
## age_minus_50 0.01423 0.02344 0.607 0.54367
## cptypical_angina -0.42403 0.77294 -0.549 0.58328
## cpatypical_angina -2.78902 0.49862 -5.593 2.23e-08 ***
## cpnon-anginal -1.33708 0.43691 -3.060 0.00221 **
## trestbps_minus_120 0.01818 0.01029 1.767 0.07724 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 273.24 on 203 degrees of freedom
## Residual deviance: 201.60 on 197 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 215.6
##
## Number of Fisher Scoring iterations: 5
Let’s think through what these coefficients mean, starting with the intercept. Recall that the intercept is the predicted value at the reference group of every factor variable and at 0 for every numeric variable. So, in this specific case, a 50-year-old woman with no chest pain and a systolic blood pressure of 120 has log odds of having heart disease of roughly -0.78. What is that in odds?
exp(-0.78)
## [1] 0.458406
So the model predicts that such a person’s odds of having heart disease are roughly 0.46; in other words, she is a bit more than twice as likely to not have heart disease as to have it. However, very importantly, the standard error around this estimate is wide and the p-value does not meet the significance threshold, so we cannot be very confident that the log odds are meaningfully different from 0 (or, therefore, that the odds are meaningfully different from 1).
Next, let’s look at the coefficient for “is male”. As in our last model, this coefficient is positive and significant, suggesting that, all else being equal (among the features in our model), men are more likely to have heart disease than women. We can compute the odds ratio:
exp(1.40)
## [1] 4.0552
So, from this model, we can estimate that, among men and women of the same age who have the same type of chest pain and the same blood pressure, men are roughly four times more likely to have heart disease.
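Another way to make the model concrete is to ask it for a predicted probability for a hypothetical patient using predict(). This is just an extra illustration; new_patient is a made-up example row, and I’m not showing the output here:
new_patient <- tibble(sex = factor("M", levels = levels(train$sex)),
                      age_minus_50 = 0, #A 50-year-old man...
                      cp = factor("asymptomatic", levels = levels(train$cp)), #...with no chest pain...
                      trestbps_minus_120 = 0) #...and a systolic blood pressure of 120
predict(logmodel2, newdata = new_patient, type = "response") #Predicted probability of heart disease for this hypothetical patient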
Now let’s look at the age coefficient. Notably, it’s not significant. This may be surprising, as we’d expect older people to be at higher risk of heart disease. But let’s look at the distribution of age:
hist(dat$age)
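If you’d also like a numeric view to go along with the histogram (output not shown), a quick summary works:
summary(dat$age) #Minimum, quartiles, median, mean, and maximum of age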
So our sample is overall on the older side, and we may be missing out on the variance needed to see an age effect. This is an important lesson on why it’s important to understand your data. Even if age is probably a predictor of heart disease in the real world, we may not find it in our model if our data are not a random sample of the real world. Thus, as I alluded to before, we should have made sure we understood the nature of our sample, and the distribution of this variable, early in the process of modeling these data.
Now looking at chest pain, we see that two of the three coefficients are significant, so people with these types of chest pain have significantly different rates of heart disease than do those who are asymptomatic. However, these coefficients are negative! That’s very counter-intuitive: those with chest pain had a lower rate of heart disease than those who lacked it.
So what’s going on here? I honestly can’t say with certainty, but this is another lesson on why it’s very important to understand your data: good modeling is only half the battle. One possible explanation: there’s sampling bias in our dataset. Consider the people represented in this dataset; they needed to go to the doctor to be tested for one reason or another. Those with chest pain may have seen sufficient reason to go to the doctor, irrespective of other symptoms or risk factors. Those without chest pain may have had other reasons to go to the doctor, and those reasons may have been more predictive of heart disease.
So again, this variable serves as an important reminder that a good data scientist should interrogate their data closely to really understand what their models can and cannot tell them.
Finally, we look at blood pressure. We see that an increase in blood pressure is slightly predictive of a higher risk of heart disease, but this coefficient does not reach our standard of significance. We therefore do not have great certainty that this is a real effect, rather than one that was found by random chance.
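To put that blood pressure coefficient on a more familiar scale, we can exponentiate it just as we did for sex (a quick extra calculation on my part):
exp(0.01818) #Roughly 1.018: just under a 2% increase in the odds of heart disease per mmHg above 120
Again, though, the non-significant p-value means we shouldn’t lean on this estimate too heavily.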
We can again compute a pseudo-R-squared to see how well this model fits our data.
ll.null <- logmodel2$null.deviance /-2
ll.proposed <- logmodel2$deviance/-2
(ll.null - ll.proposed) / ll.null
## [1] 0.2621771
1 - pchisq(2*(ll.proposed - ll.null), df = (length(logmodel2$coefficients)-1))
## [1] 1.886269e-13
While this model’s fit is still not great, we can see, first, that it is significant, and second, that it is a substantially better fit than our first model.
We could almost certainly create a better fitting model by including more predictors, but for the sake of keeping things brief, let’s proceed with this model.
Let’s test the performance of our second model. There are a number of ways to assess the performance of a logistic regression. One of the simplest is “accuracy”: the percentage of correct classifications out of total classifications. Let’s begin by computing this on our training dataset. First we append the model’s predicted probabilities, which are stored in the vector logmodel2$fitted.values, to our data frame. We can then generate a predicted outcome from these probabilities. A sensible way to do this (and the one we will use) is to classify any probability >= 50% as “heart disease” and anything < 50% as “no heart disease”. Technically, however, we could put this cutoff wherever we wanted. If we wanted to catch people with a somewhat lower degree of risk, to make sure they can pursue treatment, we could set the cutoff at 20%, for example. Another common route is to test accuracy across a range of cutoffs. But for now we’ll stick with a 50% cutoff.
train <- train %>%
filter(!is.na(trestbps_minus_120)) %>% #This is included because we had one missing value for this variable
mutate(prob = logmodel2$fitted.values, #Appends probabilities to the data frame
pred = case_when(prob >= .5 ~ "heart disease",
prob < .5 ~ "no heart disease"))
Now let’s compute the accuracy:
train %>%
mutate(correct_pred = case_when(pred == hd ~ 1,
pred != hd ~ 0)) %>%
summarize(accuracy = sum(correct_pred)/n())
## # A tibble: 1 x 1
## accuracy
## <dbl>
## 1 0.779
And we see that this model had about 78% accuracy.
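Accuracy compresses a lot of information into one number. If you’d like to see where the errors fall, a quick confusion matrix (an extra step; output not shown) cross-tabulates our predictions against the actual diagnoses:
table(predicted = train$pred, actual = train$hd) #Rows are our predictions, columns are the actual diagnoses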
Does it do as well on the test data? Let’s find out:
test %>%
mutate(prob = predict(logmodel2, newdata = test, type = "response"), # A way of generating probabilities using the predict function
pred = case_when(prob >= .5 ~ "heart disease",
prob < .5 ~ "no heart disease"),
correct_pred = case_when(pred == hd ~ 1,
pred != hd ~ 0)) %>%
summarize(accuracy = sum(correct_pred)/n())
## # A tibble: 1 x 1
## accuracy
## <dbl>
## 1 0.876
Interestingly, our model performed even better (at least as measured by accuracy) on our test data than on our training data! It’s important to note, however, that there are other metrics of model performance we may be interested in. For example, we may want to know the sensitivity of our model (true positives divided by all actual positives) or its specificity (true negatives divided by all actual negatives). For the sake of brevity I won’t walk through these in detail, but it is important to be aware of these different measures of classification performance, and a quick sketch of how you might compute them appears below.
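If you’d like to try computing sensitivity and specificity yourself, here is one way to sketch it with the tools we’ve already used (test_preds is just a name I’m choosing, and I haven’t shown the output):
test_preds <- test %>%
  mutate(prob = predict(logmodel2, newdata = test, type = "response"), #Predicted probabilities, as before
         pred = case_when(prob >= .5 ~ "heart disease",
                          prob < .5 ~ "no heart disease"))
#(If your random split had placed the row with the missing trestbps value in test, you would want to filter it out first.)
test_preds %>%
  summarize(sensitivity = sum(pred == "heart disease" & hd == "heart disease") / sum(hd == "heart disease"), #True positives over all actual positives
            specificity = sum(pred == "no heart disease" & hd == "no heart disease") / sum(hd == "no heart disease")) #True negatives over all actual negatives
Note that moving the 50% cutoff down would raise sensitivity at the cost of specificity, which is exactly the trade-off the cutoff discussion above was hinting at.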