Logistic Regression

Harold Nelson

4/12/2022

Logistic Regression

This model is used for models with a binary outcome. As an example, we will consider the problem of identifying gender based on other variables. We will continue using the cdc2 dataframe.

Load the Data

load("cdc2.Rdata")

Setup

Get the packages we need.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Create the Target

We could base our model on gender, but it is best to create a target which is either 0/1 or True/False. This leaves no doubt about which value we are predicting.

Create a boolean variable is_male. It is true if the gender is “m”. Do a table of is_male and gender.

Solution

cdc2 = cdc2 %>% 
  mutate(is_male = gender == "m")

table(cdc2$gender,cdc2$is_male)
##    
##     FALSE  TRUE
##   m     0  9566
##   f 10431     0

For today’s exercise we will not split the data into train and test. If we were really serious, we would do this.

Model 1

For our first model, we’ll use just the variable height. Create model as a logistic regression model predicting is_male on the basis of height. Do a summary of model.

  1. Use glm() instead of lm()

  2. Include the argument “family = ‘binomial’” in the call.

Solution

model = glm(is_male ~ height, data = cdc2, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = is_male ~ height, family = "binomial", data = cdc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8190  -0.4348  -0.1127   0.3991   5.0201  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -46.211327   0.628564  -73.52   <2e-16 ***
## height        0.685936   0.009343   73.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27684  on 19996  degrees of freedom
## Residual deviance: 14046  on 19995  degrees of freedom
## AIC: 14050
## 
## Number of Fisher Scoring iterations: 6

Predicting

Create predicted probabilities for the observations in cdc2 using model. Include type = “response” to get probabilities as prediction. Call this vector probs.

Solution

probs = predict(model, data = cdc2, type = "response")

Threshold

We now have the probability that each of the observations has gender “m”. To convert the probability that an observation is male, we need a threshold probability, above which we will predict that is_male is true. The sensible solution is to use the mean of is_male. Compute this and use it to get a vector of predictions, preds.

Solution

threshold = mean(cdc2$is_male)
preds = probs > threshold

Accuracy

Now we can compute a value of accuracy by comparing preds with cdc2$is_male. Do this.

Solution

accuracy = mean(preds == cdc2$is_male)
accuracy
## [1] 0.8507776

Independence from Threshold

The value of accuracy depends on both the model and the chosen threshold. The AUC, which depends on the ROC depends on the model alone. Recall the steps to construct this from the Datacamp course.

Solution

ROC = roc(cdc2$is_male,probs)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(ROC, col = "blue")

auc(ROC)
## Area under the curve: 0.9233

Exercise

Put all of the steps above into a single black of code. Then use the code to examine another model, in which smoke100 has been added to the model. What happens to accuracy and AUC.

Solution

model = glm(is_male ~ height + smoke100, data = cdc2, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = is_male ~ height + smoke100, family = "binomial", 
##     data = cdc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7726  -0.4735  -0.1046   0.4322   4.9798  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -46.276340   0.630060 -73.448  < 2e-16 ***
## height        0.684510   0.009353  73.187  < 2e-16 ***
## smoke100      0.335962   0.042734   7.862 3.79e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27684  on 19996  degrees of freedom
## Residual deviance: 13984  on 19994  degrees of freedom
## AIC: 13990
## 
## Number of Fisher Scoring iterations: 6
probs = predict(model, data = cdc2, type = "response")

threshold = mean(cdc2$is_male)
preds = probs > threshold

accuracy = mean(preds == cdc2$is_male)
accuracy
## [1] 0.8456769
ROC = roc(cdc2$is_male,probs)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(ROC, col = "blue")

auc(ROC)
## Area under the curve: 0.9247

Exercise

Try various combinations of variables and see if you can improve on the AUC.