An Example of a Logistic Model

library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
library(tidyverse)
## ── Attaching packages ──────────────────
## ✔ ggplot2 3.0.0       ✔ purrr   0.2.5  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.1       ✔ stringr 1.4.0  
## ✔ readr   1.1.1       ✔ forcats 0.3.0
## ── Conflicts ── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(broom)

Create a binary variable that marks counties in the state of Washington. Also keep only counties from Alabama or Washington to make the problem easier.

countyComplete %>% 
  filter(state %in% c("Washington","Alabama")) %>% 
  mutate(WA = ifelse(state =="Washington",1,0)) -> ccm

# Make sure it worked
table(ccm$WA)
## 
##  0  1 
## 67 39

Build a Model

The model should identify counties in Washington based on characteristics in the countyComplete (ccm) dataframe.

WAMod = glm(WA ~ area + poverty + median_household_income,data=ccm,family=binomial)
summary(WAMod)
## 
## Call:
## glm(formula = WA ~ area + poverty + median_household_income, 
##     family = binomial, data = ccm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4225  -0.4632  -0.2236   0.2105   2.2696  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -4.670e+00  4.718e+00  -0.990    0.322    
## area                     3.428e-03  8.737e-04   3.923 8.74e-05 ***
## poverty                 -1.766e-01  1.427e-01  -1.237    0.216    
## median_household_income  8.484e-05  6.864e-05   1.236    0.216    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 139.462  on 105  degrees of freedom
## Residual deviance:  65.368  on 102  degrees of freedom
## AIC: 73.368
## 
## Number of Fisher Scoring iterations: 6

Check Performance

ProbWA = predict(WAMod,type="response")
PredWA = ProbWA > .5

# Create the confusion matrix
table(ccm$WA,PredWA)
##    PredWA
##     FALSE TRUE
##   0    62    5
##   1     8   31
# Compute the overall accuracy rate
AccRate = mean(PredWA == ccm$WA)
AccRate
## [1] 0.8773585

Of the 39 counties in Washington, 31 were classified correctly and 8 were incorrectly identified as Alabama counties. In the case of Alabama, 62 counties were identified correctly and 8 were classified as Washington counties.

The overall accuracy was about 88%.

Some Extras to Satisfy my Curiosity

Augment the Model

AugWA = augment(WAMod)
AugWA$state = ccm$state
AugWA$name = ccm$name
AugWA$Pred = PredWA
glimpse(AugWA)
## Observations: 106
## Variables: 14
## $ WA                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ area                    <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78,…
## $ poverty                 <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0…
## $ median_household_income <dbl> 53255, 50147, 33219, 41770, 45549, 31602…
## $ .fitted                 <dbl> 0.01400616, 2.87963665, -3.23291267, -1.…
## $ .se.fit                 <dbl> 0.5255341, 0.8829022, 0.8857175, 0.61730…
## $ .resid                  <dbl> -1.1833637, -2.4225077, -0.2781527, -0.7…
## $ .hat                    <dbl> 0.06904316, 0.03926023, 0.02864931, 0.06…
## $ .sigma                  <dbl> 0.7951811, 0.7659812, 0.8040009, 0.80106…
## $ .cooksd                 <dbl> 2.019688e-02, 1.893614e-01, 2.994098e-04…
## $ .std.resid              <dbl> -1.2264602, -2.4715095, -0.2822249, -0.7…
## $ state                   <fct> Alabama, Alabama, Alabama, Alabama, Alab…
## $ name                    <fct> Autauga County, Baldwin County, Barbour …
## $ Pred                    <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, …

Check the Washington misses.

WaMiss = filter(AugWA,Pred==FALSE & state == "Washington") %>% select(name) 
WaMiss
##               name
## 1    Asotin County
## 2  Columbia County
## 3  Garfield County
## 4    Island County
## 5     Mason County
## 6   Pacific County
## 7  San Juan County
## 8 Wahkiakum County