Setup

Get the packages we need. We need the tidyvere and pROC. You will need to install pROC before proceeding.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Create the Target

We could base our model on gender, but it is best to create a target which is either 0/1 or True/False. This leaves no doubt about which value we are predicting.

Create a boolean variable is_male. It is true if the gender is “m”. Do a table of is_male and gender.

Solution

cdc2 = cdc2 %>% 
  mutate(is_male = gender == "m")

table(cdc2$gender,cdc2$is_male)

##    
##     FALSE  TRUE
##   m     0  9566
##   f 10431     0

For today’s exercise we will not split the data into train and test. If we were really serious, we would do this.

Model 1

For our first model, we’ll use just the variable height. Create model1 as a logistic regression model predicting is_male on the basis of height. Do a summary of model1.

Use glm() instead of lm()
Include the argument “family = ‘binomial’” in the call.

Solution

model1 = glm(is_male ~ height, data = cdc2, family = "binomial")
summary(model1)

## 
## Call:
## glm(formula = is_male ~ height, family = "binomial", data = cdc2)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -46.211327   0.628564  -73.52   <2e-16 ***
## height        0.685936   0.009343   73.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27684  on 19996  degrees of freedom
## Residual deviance: 14046  on 19995  degrees of freedom
## AIC: 14050
## 
## Number of Fisher Scoring iterations: 6

Predicting

Create predicted probabilities for the observations in cdc2 using model1. Include type = “response” to get probabilities as prediction. Call this vector probs.

Solution

probs = predict(model1, data = cdc2, type = "response")

Threshold

We now have the probability that each of the observations has gender “m”. To convert the probability that an observation is male to a prediction, we need a threshold probability, above which we will predict that is_male is true. The sensible solution is to use the mean of is_male. Compute this and use it to get a vector of predictions, preds.

Solution

threshold = mean(cdc2$is_male)
preds = probs > threshold

Accuracy

Now we can compute a value of accuracy by comparing preds with cdc2$is_male. Do this.

Solution

accuracy = mean(preds == cdc2$is_male)
accuracy

## [1] 0.8507776

Independence from Threshold

The value of accuracy depends on both the model and the chosen threshold. The AUC, which depends on the ROC depends on the model alone. Recall the steps to construct this from the Datacamp course.

Solution

ROC = roc(cdc2$is_male,probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(ROC, col = "blue")

auc(ROC)

## Area under the curve: 0.9233

Exercise

Put all of the steps above into a single black of code. Then use the code to examine another model, in which smoke100 has been added to the model. What happens to accuracy and AUC.

Solution

model2 = glm(is_male ~ height + smoke100, data = cdc2, family = "binomial")
summary(model2)

## 
## Call:
## glm(formula = is_male ~ height + smoke100, family = "binomial", 
##     data = cdc2)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -46.276340   0.630060 -73.448  < 2e-16 ***
## height        0.684510   0.009353  73.187  < 2e-16 ***
## smoke100      0.335962   0.042734   7.862 3.79e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27684  on 19996  degrees of freedom
## Residual deviance: 13984  on 19994  degrees of freedom
## AIC: 13990
## 
## Number of Fisher Scoring iterations: 6

probs = predict(model2, data = cdc2, type = "response")

threshold = mean(cdc2$is_male)
preds = probs > threshold

accuracy = mean(preds == cdc2$is_male)
accuracy

## [1] 0.8456769

ROC = roc(cdc2$is_male,probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(ROC, col = "blue")

auc(ROC)

## Area under the curve: 0.9247

Logistic Regression

Logistic Regression

Load the Data

Setup

Create the Target

Solution

Model 1

Solution

Predicting

Solution

Threshold

Solution

Accuracy

Solution

Independence from Threshold

Solution

Exercise

Solution

Exercise