An Example of a Logistic Model

library(usdata)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Use the county_complete data frame from the usdata package. Keep only the counties in Alabama and Washington to make the problem easier, then create a binary variable, WA, that equals 1 for Washington counties and 0 for Alabama counties.

county_complete %>% 
  filter(state %in% c("Washington","Alabama")) %>% 
  mutate(WA = ifelse(state =="Washington",1,0)) -> ccm

# Make sure it worked
table(ccm$WA)
## 
##  0  1 
## 67 39

Build a Model

The model should identify Washington counties from county characteristics in ccm, the filtered copy of county_complete. For the first model, use unemployment_rate_2007 and poverty_2017 as predictors.

WAMod1 = glm(WA ~ unemployment_rate_2007 + poverty_2017,data=ccm,family=binomial)
summary(WAMod1)
## 
## Call:
## glm(formula = WA ~ unemployment_rate_2007 + poverty_2017, family = binomial, 
##     data = ccm)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.1560     1.2675   0.123    0.902    
## unemployment_rate_2007   1.6467     0.3333   4.940 7.82e-07 ***
## poverty_2017            -0.5456     0.1063  -5.130 2.89e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 139.462  on 105  degrees of freedom
## Residual deviance:  58.548  on 103  degrees of freedom
## AIC: 64.548
## 
## Number of Fisher Scoring iterations: 6
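
The coefficients above are on the log-odds scale, which can be hard to read directly. The sketch below copies the three estimates from the summary output by hand and converts them to odds ratios, then computes a predicted probability for one made-up county. The values unemp = 5 and pov = 15 are illustrative assumptions, not a row of ccm, and the vector name b is introduced here only for the example.

```r
# Estimates copied by hand from the summary(WAMod1) output
b <- c(intercept = 0.1560,
       unemployment_rate_2007 = 1.6467,
       poverty_2017 = -0.5456)

# Odds ratios: multiplicative change in the odds of WA = 1
# for a one-unit increase in each predictor, holding the other fixed
odds_ratios <- exp(b[-1])
print(round(odds_ratios, 2))

# Hypothetical county (assumed values, chosen only for illustration)
unemp <- 5
pov <- 15

# Linear predictor (log-odds), then the logistic transform to a probability
log_odds <- b[["intercept"]] +
  b[["unemployment_rate_2007"]] * unemp +
  b[["poverty_2017"]] * pov
prob_wa <- plogis(log_odds)
print(round(prob_wa, 3))
```

With the fitted model object available, exp(coef(WAMod1)) should give the same odds ratios without copying numbers by hand. An odds ratio above 1 (unemployment) raises the odds that a county is in Washington; one below 1 (poverty) lowers them.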

Check Performance

ProbWA = predict(WAMod1,type="response")
PredWA = ProbWA > .5

# Create the confusion matrix
table(ccm$WA,PredWA)
##    PredWA
##     FALSE TRUE
##   0    63    4
##   1     6   33
# Compute the overall accuracy rate
AccRate = mean(PredWA == ccm$WA)
AccRate
## [1] 0.9056604

Of the 39 counties in Washington, 33 were classified correctly and 6 were misclassified as Alabama counties. Of the 67 Alabama counties, 63 were classified correctly and 4 were misclassified as Washington counties.

The overall accuracy was about 90.6% (96 of 106 counties classified correctly).
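
Accuracy alone hides how the errors split between the two states. As a rough sketch, the counts from the confusion matrix can be re-entered by hand to get the per-class rates and a majority-class baseline for comparison. The object name cm and the choice of baseline are assumptions made for this illustration.

```r
# Confusion matrix counts re-entered from the table(ccm$WA, PredWA) output
# (rows = actual WA value, columns = predicted WA)
cm <- matrix(c(63, 6, 4, 33), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

# Share of Washington counties correctly flagged (33 of 39)
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])

# Share of Alabama counties correctly left unflagged (63 of 67)
specificity <- cm["0", "FALSE"] / sum(cm["0", ])

# Overall accuracy (96 of 106), matching AccRate
accuracy <- sum(diag(cm)) / sum(cm)

# Baseline: always guess the larger class, Alabama (67 of 106)
baseline <- max(rowSums(cm)) / sum(cm)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              accuracy = accuracy, baseline = baseline), 3))
```

The model misses Washington counties somewhat more often than it misses Alabama counties, and it clearly beats the roughly 63% accuracy of always guessing Alabama.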

Your Task

Create your own model using no more than two variables.