Harold Nelson
3/24/2024
This model is used for situations with a binary outcome. As an example, we will consider the problem of identifying gender based on other variables. We will use the cdc2 dataframe. This is a cleaned and enhanced version of cdc.
Get the packages we need. We need the tidyvere and pROC. You will need to install pROC before proceeding.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
We could base our model on gender, but it is best to create a target which is either 0/1 or True/False. This leaves no doubt about which value we are predicting.
Create a boolean variable is_male. It is true if the gender is “m”. Do a table of is_male and gender.
##
## FALSE TRUE
## m 0 9566
## f 10431 0
For today’s exercise we will not split the data into train and test. If we were really serious, we would do this.
For our first model, we’ll use just the variable height. Create model1 as a logistic regression model predicting is_male on the basis of height. Do a summary of model1.
Use glm() instead of lm()
Include the argument “family = ‘binomial’” in the call.
##
## Call:
## glm(formula = is_male ~ height, family = "binomial", data = cdc2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.211327 0.628564 -73.52 <2e-16 ***
## height 0.685936 0.009343 73.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27684 on 19996 degrees of freedom
## Residual deviance: 14046 on 19995 degrees of freedom
## AIC: 14050
##
## Number of Fisher Scoring iterations: 6
Create predicted probabilities for the observations in cdc2 using model1. Include type = “response” to get probabilities as prediction. Call this vector probs.
We now have the probability that each of the observations has gender “m”. To convert the probability that an observation is male to a prediction, we need a threshold probability, above which we will predict that is_male is true. The sensible solution is to use the mean of is_male. Compute this and use it to get a vector of predictions, preds.
Now we can compute a value of accuracy by comparing preds with cdc2$is_male. Do this.
The value of accuracy depends on both the model and the chosen threshold. The AUC, which depends on the ROC depends on the model alone. Recall the steps to construct this from the Datacamp course.
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.9233
Put all of the steps above into a single black of code. Then use the code to examine another model, in which smoke100 has been added to the model. What happens to accuracy and AUC.
##
## Call:
## glm(formula = is_male ~ height + smoke100, family = "binomial",
## data = cdc2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.276340 0.630060 -73.448 < 2e-16 ***
## height 0.684510 0.009353 73.187 < 2e-16 ***
## smoke100 0.335962 0.042734 7.862 3.79e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27684 on 19996 degrees of freedom
## Residual deviance: 13984 on 19994 degrees of freedom
## AIC: 13990
##
## Number of Fisher Scoring iterations: 6
probs = predict(model2, data = cdc2, type = "response")
threshold = mean(cdc2$is_male)
preds = probs > threshold
accuracy = mean(preds == cdc2$is_male)
accuracy
## [1] 0.8456769
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
## Area under the curve: 0.9247
Try various combinations of variables and see if you can improve on the AUC.