Harold Nelson
4/12/2022
This model is used for models with a binary outcome. As an example, we will consider the problem of identifying gender based on other variables. We will continue using the cdc2 dataframe.
Get the packages we need.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
We could base our model on gender, but it is best to create a target which is either 0/1 or True/False. This leaves no doubt about which value we are predicting.
Create a boolean variable is_male. It is true if the gender is “m”. Do a table of is_male and gender.
##
## FALSE TRUE
## m 0 9566
## f 10431 0
For today’s exercise we will not split the data into train and test. If we were really serious, we would do this.
For our first model, we’ll use just the variable height. Create model as a logistic regression model predicting is_male on the basis of height. Do a summary of model.
Use glm() instead of lm()
Include the argument “family = ‘binomial’” in the call.
##
## Call:
## glm(formula = is_male ~ height, family = "binomial", data = cdc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8190 -0.4348 -0.1127 0.3991 5.0201
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.211327 0.628564 -73.52 <2e-16 ***
## height 0.685936 0.009343 73.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27684 on 19996 degrees of freedom
## Residual deviance: 14046 on 19995 degrees of freedom
## AIC: 14050
##
## Number of Fisher Scoring iterations: 6
Create predicted probabilities for the observations in cdc2 using model. Include type = “response” to get probabilities as prediction. Call this vector probs.
We now have the probability that each of the observations has gender “m”. To convert the probability that an observation is male, we need a threshold probability, above which we will predict that is_male is true. The sensible solution is to use the mean of is_male. Compute this and use it to get a vector of predictions, preds.
Now we can compute a value of accuracy by comparing preds with cdc2$is_male. Do this.
The value of accuracy depends on both the model and the chosen threshold. The AUC, which depends on the ROC depends on the model alone. Recall the steps to construct this from the Datacamp course.
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.9233
Put all of the steps above into a single black of code. Then use the code to examine another model, in which smoke100 has been added to the model. What happens to accuracy and AUC.
##
## Call:
## glm(formula = is_male ~ height + smoke100, family = "binomial",
## data = cdc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7726 -0.4735 -0.1046 0.4322 4.9798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.276340 0.630060 -73.448 < 2e-16 ***
## height 0.684510 0.009353 73.187 < 2e-16 ***
## smoke100 0.335962 0.042734 7.862 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27684 on 19996 degrees of freedom
## Residual deviance: 13984 on 19994 degrees of freedom
## AIC: 13990
##
## Number of Fisher Scoring iterations: 6
probs = predict(model, data = cdc2, type = "response")
threshold = mean(cdc2$is_male)
preds = probs > threshold
accuracy = mean(preds == cdc2$is_male)
accuracy
## [1] 0.8456769
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.9247
Try various combinations of variables and see if you can improve on the AUC.