Multivariate Statistics: Module 6 - Intro to Logistic Regression

Author

Dr. Broda

Load in a few packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

Load in the data

library(haven)
els <- read_dta("ELS_Logistic.dta")

Quick Data Cleaning and Coding Lesson! Using case_when

glimpse(els)
Rows: 16,197
Columns: 55
$ stu_id   <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 101108, 10110…
$ sch_id   <dbl> 1011, 1011, 1011, 1011, 1011, 1011, 1011, 1011, 1011, 1011, 1…
$ strat_id <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 1…
$ psu      <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ f1sch_id <dbl+lbl> 1011, 1011, 1011, 1011, 1011, 1011, 1011,   -8,   -8, 101…
$ f1univ1  <dbl+lbl> 101, 101, 101, 101, 101, 101, 101, 107, 107, 101, 101, 10…
$ f1univ2a <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ f1univ2b <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 7, 7, 1, 1, 1, 1, 1, 1, 1, 1, 7, 5, …
$ f2univ_p <dbl+lbl> 103, 101, 101, 101, 101, 101, 101, 104, 105, 101, 101, 10…
$ bystuwt  <dbl+lbl> 178.9513,  28.2951, 589.7248, 235.7822, 178.9513, 256.965…
$ bysex    <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, …
$ byrace   <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3, 3, 4, 7, …
$ bystlang <dbl+lbl> 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
$ byses1   <dbl+lbl> -0.25,  0.58, -0.85, -0.80, -1.41, -1.07,  0.27, -0.16, -…
$ byses1qu <dbl+lbl> 2, 4, 1, 1, 1, 1, 3, 2, 1, 1, 2, 2, 1, 2, 4, 3, 3, 1, 1, …
$ bygrdrpt <dbl+lbl>  0,  0,  0, 98,  0,  0, 98,  0,  0, 98,  0,  0,  1,  0, 9…
$ byhomlit <dbl+lbl> 0, 3, 2, 1, 1, 2, 1, 3, 0, 2, 0, 3, 2, 1, 2, 0, 1, 0, 2, …
$ byparasp <dbl+lbl> 5, 7, 7, 6, 2, 3, 5, 5, 5, 2, 5, 3, 6, 5, 6, 3, 5, 7, 3, …
$ byiepflg <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
$ bytxcstd <dbl+lbl> 56.21, 57.66, 66.50, 46.46, 36.17, 30.72, 45.46, 68.39, 4…
$ bytxcqu  <dbl+lbl> 3, 4, 4, 2, 1, 1, 2, 4, 1, 2, 4, 2, 1, 3, 4, 1, 1, 3, 2, …
$ bywrtnga <dbl+lbl>  1.191,  1.191,  0.996, -0.137, -0.435, -0.630, -0.246,  …
$ byxtracu <dbl+lbl> 1, 3, 2, 0, 0, 0, 1, 2, 1, 2, 0, 0, 2, 1, 2, 0, 0, 0, 0, …
$ byhmwrk  <dbl+lbl>  7,  5, -9, 11, 10, 13, 12, 12, -9, 23,  9,  4,  7, 11,  …
$ bytvvigm <dbl+lbl> 99,  4,  1, 99,  4,  3,  2,  3, -9,  3,  8,  1,  4,  8,  …
$ f1qwt    <dbl+lbl>  152.9769,   25.3577,  709.4246,  199.7193,  152.9769,  2…
$ f1pnlwt  <dbl+lbl>  155.6312,   25.4906,  725.6926,  205.1919,  155.6312,  2…
$ f1trscwt <dbl+lbl> 284.6529,  42.0937, 730.2420, 280.0837, 287.2398, 324.182…
$ f2qtscwt <dbl+lbl>   0.0000,  50.3749, 755.4681, 245.5287, 265.5541, 304.492…
$ f2qwt    <dbl+lbl>   0.0000,  28.2772, 660.0555, 198.5202, 153.7132, 213.821…
$ f2f1wt   <dbl+lbl>   0.0000,  29.8462, 713.1807, 227.0875, 163.9503, 221.485…
$ f2bywt   <dbl+lbl>   0.0000,  27.8257, 662.1222, 212.3764, 158.2162, 214.429…
$ bys14    <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, …
$ bys15    <dbl+lbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, …
$ bys24d   <dbl+lbl>  1,  1,  1,  1,  1,  2,  2,  1,  2,  1,  1,  1,  2,  1,  …
$ bys24e   <dbl+lbl>  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  …
$ bys24f   <dbl+lbl>  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  …
$ bys33a   <dbl+lbl>  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  …
$ bys33b   <dbl+lbl>  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  …
$ bys43    <dbl+lbl>  2,  2,  3,  5,  2, 12, -9,  0, -9,  2,  0,  2,  0,  2,  …
$ bys62a   <dbl+lbl> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
$ bys62g   <dbl+lbl> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -…
$ bys66a   <dbl+lbl>  4,  1,  1,  1,  1,  6,  1,  1,  1,  4, -9,  1,  1,  1,  …
$ bys66b   <dbl+lbl>  6,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  …
$ bys66c   <dbl+lbl>  3, -1, -1,  6,  1, -3, -1,  6,  1,  6,  1,  1,  1,  2, -…
$ bys66f   <dbl+lbl>  1,  1, -1,  1,  1, -3,  1,  1,  1,  1,  1,  1, -3, -1,  …
$ bys66g   <dbl+lbl> -3, -3, -1,  1, -3, -9,  1,  1, -3, -1,  1,  1, -3, -1,  …
$ bys67    <dbl+lbl> 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
$ bys71d   <dbl+lbl>  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  …
$ bys71e   <dbl+lbl>  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  …
$ bys83a   <dbl+lbl>  2,  5,  2,  2,  1, -9, -1, -1, -1,  2, -1,  6,  2, -1, -…
$ bys83b   <dbl+lbl>  4,  7,  4,  2,  1, -9, -1, -1, -1,  2, -1,  2,  5, -1, -…
$ bys87a   <dbl+lbl>  3,  2,  3,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  …
$ bys87e   <dbl+lbl>  2,  2,  1,  2,  2, -9,  2,  2,  2,  2,  1,  2,  2,  3,  …
$ bys87f   <dbl+lbl>  3,  2,  3,  1,  2,  1,  3,  1,  2,  1,  2,  2,  2,  2,  …

First, let’s create an outcome to use, whether a student has an IEP. This is an important measure in disproportionality research (are certain groups of students disproportionately identified as needing special education services). Let’s check the variable labels and get a breakdown of values:

table(els$byiepflg)

  -9   -8   -4    0    1 
7314  179  648 7053 1003 
str(els$byiepflg)
 dbl+lbl [1:16197] 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
 @ label       : chr "Base year Individualized Education Plan"
 @ format.stata: chr "%12.0f"
 @ labels      : Named num [1:5] -9 -8 -4 0 1
  ..- attr(*, "names")= chr [1:5] "{Missing}" "{Survey component legitimate skip/NA}" "{Nonrespondent}" "No" ...

Now, after creating a new variable, iep, we are going to clean up iep using case_when by recoding it to only have two values- a student has an IEP(= 1) or not (= 0).

els.clean <- els %>%
  mutate(across(everything(), ~na_if(.x, -9))) %>% 
  mutate(across(everything(), ~na_if(.x, -7))) %>% 
  mutate(across(everything(), ~na_if(.x, -4))) %>%
  mutate(.,
         iep = case_when(
           byiepflg == 1 ~ 1,
           byiepflg == 0 ~ 0),
         comptest = bytxcstd,
         iep.fac = as_factor(iep),
         race.fac = as_factor(byrace),
         sex.fac = as_factor(bysex),
         comptest.cent = comptest - mean(comptest, na.rm = TRUE)) %>%
  glimpse()
Rows: 16,197
Columns: 61
$ stu_id        <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 101108, …
$ sch_id        <dbl> 1011, 1011, 1011, 1011, 1011, 1011, 1011, 1011, 1011, 10…
$ strat_id      <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 1…
$ psu           <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ f1sch_id      <dbl+lbl> 1011, 1011, 1011, 1011, 1011, 1011, 1011,   -8,   -8…
$ f1univ1       <dbl+lbl> 101, 101, 101, 101, 101, 101, 101, 107, 107, 101, 10…
$ f1univ2a      <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ f1univ2b      <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 7, 7, 1, 1, 1, 1, 1, 1, 1, 1, 7…
$ f2univ_p      <dbl+lbl> 103, 101, 101, 101, 101, 101, 101, 104, 105, 101, 10…
$ bystuwt       <dbl+lbl> 178.9513,  28.2951, 589.7248, 235.7822, 178.9513, 25…
$ bysex         <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2…
$ byrace        <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3, 3, 4…
$ bystlang      <dbl+lbl> 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0…
$ byses1        <dbl+lbl> -0.25,  0.58, -0.85, -0.80, -1.41, -1.07,  0.27, -0.…
$ byses1qu      <dbl+lbl> 2, 4, 1, 1, 1, 1, 3, 2, 1, 1, 2, 2, 1, 2, 4, 3, 3, 1…
$ bygrdrpt      <dbl+lbl>  0,  0,  0, 98,  0,  0, 98,  0,  0, 98,  0,  0,  1, …
$ byhomlit      <dbl+lbl> 0, 3, 2, 1, 1, 2, 1, 3, 0, 2, 0, 3, 2, 1, 2, 0, 1, 0…
$ byparasp      <dbl+lbl> 5, 7, 7, 6, 2, 3, 5, 5, 5, 2, 5, 3, 6, 5, 6, 3, 5, 7…
$ byiepflg      <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ bytxcstd      <dbl+lbl> 56.21, 57.66, 66.50, 46.46, 36.17, 30.72, 45.46, 68.…
$ bytxcqu       <dbl+lbl> 3, 4, 4, 2, 1, 1, 2, 4, 1, 2, 4, 2, 1, 3, 4, 1, 1, 3…
$ bywrtnga      <dbl+lbl>  1.191,  1.191,  0.996, -0.137, -0.435, -0.630, -0.2…
$ byxtracu      <dbl+lbl> 1, 3, 2, 0, 0, 0, 1, 2, 1, 2, 0, 0, 2, 1, 2, 0, 0, 0…
$ byhmwrk       <dbl+lbl>  7,  5, NA, 11, 10, 13, 12, 12, NA, 23,  9,  4,  7, …
$ bytvvigm      <dbl+lbl> 99,  4,  1, 99,  4,  3,  2,  3, NA,  3,  8,  1,  4, …
$ f1qwt         <dbl+lbl>  152.9769,   25.3577,  709.4246,  199.7193,  152.976…
$ f1pnlwt       <dbl+lbl>  155.6312,   25.4906,  725.6926,  205.1919,  155.631…
$ f1trscwt      <dbl+lbl> 284.6529,  42.0937, 730.2420, 280.0837, 287.2398, 32…
$ f2qtscwt      <dbl+lbl>   0.0000,  50.3749, 755.4681, 245.5287, 265.5541, 30…
$ f2qwt         <dbl+lbl>   0.0000,  28.2772, 660.0555, 198.5202, 153.7132, 21…
$ f2f1wt        <dbl+lbl>   0.0000,  29.8462, 713.1807, 227.0875, 163.9503, 22…
$ f2bywt        <dbl+lbl>   0.0000,  27.8257, 662.1222, 212.3764, 158.2162, 21…
$ bys14         <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2…
$ bys15         <dbl+lbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1…
$ bys24d        <dbl+lbl>  1,  1,  1,  1,  1,  2,  2,  1,  2,  1,  1,  1,  2, …
$ bys24e        <dbl+lbl>  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2, …
$ bys24f        <dbl+lbl>  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2, …
$ bys33a        <dbl+lbl>  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, …
$ bys33b        <dbl+lbl>  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, …
$ bys43         <dbl+lbl>  2,  2,  3,  5,  2, 12, NA,  0, NA,  2,  0,  2,  0, …
$ bys62a        <dbl+lbl> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, …
$ bys62g        <dbl+lbl> -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3, …
$ bys66a        <dbl+lbl>  4,  1,  1,  1,  1,  6,  1,  1,  1,  4, NA,  1,  1, …
$ bys66b        <dbl+lbl>  6,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, …
$ bys66c        <dbl+lbl>  3, -1, -1,  6,  1, -3, -1,  6,  1,  6,  1,  1,  1, …
$ bys66f        <dbl+lbl>  1,  1, -1,  1,  1, -3,  1,  1,  1,  1,  1,  1, -3, …
$ bys66g        <dbl+lbl> -3, -3, -1,  1, -3, NA,  1,  1, -3, -1,  1,  1, -3, …
$ bys67         <dbl+lbl> 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0…
$ bys71d        <dbl+lbl>  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0, …
$ bys71e        <dbl+lbl>  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0, …
$ bys83a        <dbl+lbl>  2,  5,  2,  2,  1, NA, -1, -1, -1,  2, -1,  6,  2, …
$ bys83b        <dbl+lbl>  4,  7,  4,  2,  1, NA, -1, -1, -1,  2, -1,  2,  5, …
$ bys87a        <dbl+lbl>  3,  2,  3,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, …
$ bys87e        <dbl+lbl>  2,  2,  1,  2,  2, NA,  2,  2,  2,  2,  1,  2,  2, …
$ bys87f        <dbl+lbl>  3,  2,  3,  1,  2,  1,  3,  1,  2,  1,  2,  2,  2, …
$ iep           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ comptest      <dbl+lbl> 56.21, 57.66, 66.50, 46.46, 36.17, 30.72, 45.46, 68.…
$ iep.fac       <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ race.fac      <fct> "Hispanic, race specified", "Asian, Hawaii/Pac. Islander…
$ sex.fac       <fct> Female, Female, Female, Female, Female, Male, Male, Male…
$ comptest.cent <dbl> 6.656796, 8.106796, 16.946796, -3.093204, -13.383204, -1…

Let’s take a look at the race/ethnicity measure. This is going to be one of our predictors of interest: ## Breakdown of race/ethnicity measure

summary(els.clean$race.fac)
   {Survey component legitimate skip/NA} 
                                     305 
                         {Nonrespondent} 
                                       0 
Amer. Indian/Alaska Native, non-Hispanic 
                                     130 
Asian, Hawaii/Pac. Islander,non-Hispanic 
                                    1460 
 Black or African American, non-Hispanic 
                                    2020 
             Hispanic, no race specified 
                                     996 
                Hispanic, race specified 
                                    1221 
        More than one race, non-Hispanic 
                                     735 
                     White, non-Hispanic 
                                    8682 
                                    NA's 
                                     648 

Breakdown of sex measure

summary(els.clean$sex.fac)
{Survey component legitimate skip/NA}                       {Nonrespondent} 
                                  179                                     0 
                                 Male                                Female 
                                 7653                                  7717 
                                 NA's 
                                  648 

Summary statistics for comptest

And here is comptest, which will be our second predictor of interest. This is a standardized composite achievement test.

describe(els.clean$comptest)
   vars     n  mean    sd median trimmed   mad min   max range  skew kurtosis
X1    1 16197 49.55 12.62  50.74   49.55 10.62  -8 81.04 89.04 -1.68      6.1
    se
X1 0.1

Density plot of comptest scores

kdensity1 <- ggplot(els.clean, aes(x=comptest)) +
geom_density() +
stat_function(fun = dnorm, n = 15892, args = list(mean = 51, sd = 10), color = "red", linetype = "dashed") +
labs(title="Distribution of Achievement Scores",x="Achievement Scores", y = "Density", caption = "N = 15892. The red line indicates a normal distribution. The black lines represents the observed distribution.")
kdensity1

Part 1: Running and interpreting a logistic regression

Running and interpreting the basic model

Do the odds of having an IEP differ by student sex? Do males and females have different odds of having an IEP? Here is the default/ basic command for running a logistic regression. This code will present the results as logit (log odds) coefficients:

logistic1 <- glm(iep.fac ~ sex.fac, family = binomial, data = els.clean)
summary(logistic1)

Call:
glm(formula = iep.fac ~ sex.fac, family = binomial, data = els.clean)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5966  -0.5966  -0.4260  -0.4260   2.2112  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.63586    0.04287  -38.16   <2e-16 ***
sex.facFemale -0.71808    0.07029  -10.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6054.9  on 8055  degrees of freedom
Residual deviance: 5946.1  on 8054  degrees of freedom
  (8141 observations deleted due to missingness)
AIC: 5950.1

Number of Fisher Scoring iterations: 5

Going deeper with odds ratios

These estimates are logit coefficients. Long story short, negative numbers suggest lower odds relative to the reference group (here - male), and positive numbers suggest higher odds. So here, we would say that female students have significantly lower odds of having an IEP compared to male students. But that’s about all we can say without some more math.

We can exponentiate this coefficient to get an odds ratio- this makes things a little easier to interpret.

female.odds.ratio <- exp(-0.71808)
female.odds.ratio
[1] 0.4876877

When the OR is less than one (.49 in this case), we subtract (1 - OR) which equals .51. This means that for female students, the odds of having an IEP are 51% lower than for male children.

Getting Odds Ratios for a Whole Model

To make things easier when you have more than one predictor, we can exponentiate an entire model. Here, we get ORs and confidence intervals. Any interval that doesn’t contain 1 (notice, not 0) is statistically significant at the .05 level.

exp(cbind(OR = coef(logistic1), confint(logistic1)))
Waiting for profiling to be done...
                     OR     2.5 %    97.5 %
(Intercept)   0.1947857 0.1789419 0.2116958
sex.facFemale 0.4876878 0.4245970 0.5593280

A little fancier: calculating predicted probabilities

We can also get predicted probabilities associated with each estimate. To do this, we need to add up the logits and plug them into the formula below:

# Predicted probability for female student (logit = -1.63586 + -0.71808 = -2.35394)

exp(-2.35394)/(1+exp(-2.35394))
[1] 0.08675311

So, for a female student, the predicted probability of having an IEP is 8.7%.

What about for a male student?

# Predicted probability for a male student (logit = -1.63586)

exp(-1.63586)/(1+exp(-1.63586))
[1] 0.1630292

For a male student, the predicted probability of having an IEP is 16.3%.

Part 2: Adding more predictors

The previous model just included one categorical predictor (sex). Now, let’s try to add a continuous predictor, comptest. We get a slightly different research question: Do the odds of having an IEP differ by student sex, after adjusting for achievement scores in math and reading?

logistic2 <- glm(iep.fac ~ sex.fac + comptest.cent, family = binomial, data = els.clean)
summary(logistic2)

Call:
glm(formula = iep.fac ~ sex.fac + comptest.cent, family = binomial, 
    data = els.clean)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3397  -0.4654  -0.2768  -0.1608   3.4233  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.199746   0.058871 -37.365   <2e-16 ***
sex.facFemale -0.749426   0.080043  -9.363   <2e-16 ***
comptest.cent -0.135052   0.004545 -29.714   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6054.9  on 8055  degrees of freedom
Residual deviance: 4455.3  on 8053  degrees of freedom
  (8141 observations deleted due to missingness)
AIC: 4461.3

Number of Fisher Scoring iterations: 6

We can still get odds ratios for the whole model

exp(cbind(OR = coef(logistic2), confint(logistic2)))
Waiting for profiling to be done...
                     OR      2.5 %    97.5 %
(Intercept)   0.1108313 0.09858415 0.1241805
sex.facFemale 0.4726377 0.40367216 0.5525011
comptest.cent 0.8736703 0.86581804 0.8813840

And we can do predicted probabilities

# Predicted probability for female student with average comptest scores (logit = -2.494531 + -0.775766 = -3.270297)

exp(-3.270297)/(1+exp(-3.270297))
[1] 0.03660435

So, for a female student with average comptest scores, the predicted probability of having an IEP is 3.7%.

What about for a male student?

# Predicted probability for a male student with average comptest scores (logit = -2.494531)

exp(-2.494531)/(1+exp(-2.494531))
[1] 0.07624247

For a male student with average comptest scores, the predicted probability of having an IEP is 7.6%.

Part 3: Interpreting a multi-category predictor

Now, let’s try the same thing and add on a factor variable with more categories (race.fac): Do the odds of having an IEP differ by student race/ethnicity?

logistic3 <- glm(iep.fac ~ race.fac, family = binomial, data = els.clean)
summary(logistic3)

Call:
glm(formula = iep.fac ~ race.fac, family = binomial, data = els.clean)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6571  -0.5313  -0.4642  -0.4642   2.3824  

Coefficients:
                                                 Estimate Std. Error z value
(Intercept)                                        1.0809     0.2655   4.071
race.facAmer. Indian/Alaska Native, non-Hispanic  -2.8037     0.4340  -6.460
race.facAsian, Hawaii/Pac. Islander,non-Hispanic  -3.8584     0.3138 -12.298
race.facBlack or African American, non-Hispanic   -2.6013     0.2793  -9.314
race.facHispanic, no race specified               -2.9675     0.2984  -9.944
race.facHispanic, race specified                  -2.5860     0.2855  -9.058
race.facMore than one race, non-Hispanic          -2.6966     0.2987  -9.028
race.facWhite, non-Hispanic                       -3.2545     0.2697 -12.069
                                                 Pr(>|z|)    
(Intercept)                                      4.68e-05 ***
race.facAmer. Indian/Alaska Native, non-Hispanic 1.05e-10 ***
race.facAsian, Hawaii/Pac. Islander,non-Hispanic  < 2e-16 ***
race.facBlack or African American, non-Hispanic   < 2e-16 ***
race.facHispanic, no race specified               < 2e-16 ***
race.facHispanic, race specified                  < 2e-16 ***
race.facMore than one race, non-Hispanic          < 2e-16 ***
race.facWhite, non-Hispanic                       < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6054.9  on 8055  degrees of freedom
Residual deviance: 5801.4  on 8048  degrees of freedom
  (8141 observations deleted due to missingness)
AIC: 5817.4

Number of Fisher Scoring iterations: 5

Get odds ratios for the whole model…

exp(cbind(OR = coef(logistic3), confint(logistic3)))
Waiting for profiling to be done...
                                                         OR      2.5 %
(Intercept)                                      2.94736842 1.78566164
race.facAmer. Indian/Alaska Native, non-Hispanic 0.06058673 0.02470114
race.facAsian, Hawaii/Pac. Islander,non-Hispanic 0.02110124 0.01115324
race.facBlack or African American, non-Hispanic  0.07417582 0.04191157
race.facHispanic, no race specified              0.05143206 0.02803156
race.facHispanic, race specified                 0.07532143 0.04207640
race.facMore than one race, non-Hispanic         0.06743567 0.03674109
race.facWhite, non-Hispanic                      0.03859906 0.02219447
                                                     97.5 %
(Intercept)                                      5.08791603
race.facAmer. Indian/Alaska Native, non-Hispanic 0.13671892
race.facAsian, Hawaii/Pac. Islander,non-Hispanic 0.03831440
race.facBlack or African American, non-Hispanic  0.12598038
race.facHispanic, no race specified              0.09073685
race.facHispanic, race specified                 0.12953533
race.facMore than one race, non-Hispanic         0.11905296
race.facWhite, non-Hispanic                      0.06426743

Using relevel to choose a new reference category for a nominal variable

For equity or disproportionality research, it would usually be better to have white students as the reference category. Often in equity research, we want to know if the rates or odds of some event (IEP, disciplinary referral, etc.) differ for a group of interest compared to the dominant student group. R uses the lowest number (1 in this case) as the reference category by default. To change this, we would use relevel to specify a new reference category.This would make “White, non-Hispanic” your reference group.

els.clean$race.fac = relevel(els.clean$race.fac, ref="White, non-Hispanic")
logistic4 <- glm(iep.fac~race.fac, family = binomial, data = els.clean)
summary(logistic4)

Call:
glm(formula = iep.fac ~ race.fac, family = binomial, data = els.clean)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6571  -0.5313  -0.4642  -0.4642   2.3824  

Coefficients:
                                                 Estimate Std. Error z value
(Intercept)                                      -2.17361    0.04720 -46.054
race.fac{Survey component legitimate skip/NA}     3.25453    0.26966  12.069
race.facAmer. Indian/Alaska Native, non-Hispanic  0.45085    0.34653   1.301
race.facAsian, Hawaii/Pac. Islander,non-Hispanic -0.60390    0.17372  -3.476
race.facBlack or African American, non-Hispanic   0.65321    0.09874   6.615
race.facHispanic, no race specified               0.28703    0.14423   1.990
race.facHispanic, race specified                  0.66854    0.11505   5.811
race.facMore than one race, non-Hispanic          0.55795    0.14477   3.854
                                                 Pr(>|z|)    
(Intercept)                                       < 2e-16 ***
race.fac{Survey component legitimate skip/NA}     < 2e-16 ***
race.facAmer. Indian/Alaska Native, non-Hispanic 0.193249    
race.facAsian, Hawaii/Pac. Islander,non-Hispanic 0.000509 ***
race.facBlack or African American, non-Hispanic  3.71e-11 ***
race.facHispanic, no race specified              0.046575 *  
race.facHispanic, race specified                 6.22e-09 ***
race.facMore than one race, non-Hispanic         0.000116 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6054.9  on 8055  degrees of freedom
Residual deviance: 5801.4  on 8048  degrees of freedom
  (8141 observations deleted due to missingness)
AIC: 5817.4

Number of Fisher Scoring iterations: 5

Odds ratios

exp(cbind(OR = coef(logistic4), confint(logistic4)))
Waiting for profiling to be done...
                                                         OR      2.5 %
(Intercept)                                       0.1137656  0.1035940
race.fac{Survey component legitimate skip/NA}    25.9073684 15.5599806
race.facAmer. Indian/Alaska Native, non-Hispanic  1.5696429  0.7497972
race.facAsian, Hawaii/Pac. Islander,non-Hispanic  0.5466776  0.3829221
race.facBlack or African American, non-Hispanic   1.9217004  1.5800196
race.facHispanic, no race specified               1.3324694  0.9961699
race.facHispanic, race specified                  1.9513800  1.5516099
race.facMore than one race, non-Hispanic          1.7470807  1.3056327
                                                     97.5 %
(Intercept)                                       0.1246522
race.fac{Survey component legitimate skip/NA}    45.0562748
race.facAmer. Indian/Alaska Native, non-Hispanic  2.9599209
race.facAsian, Hawaii/Pac. Islander,non-Hispanic  0.7580073
race.facBlack or African American, non-Hispanic   2.3273378
race.facHispanic, no race specified               1.7549678
race.facHispanic, race specified                  2.4367763
race.facMore than one race, non-Hispanic          2.3049727

Now, we can put it all together: all predictors we’ve tested before, in one model:

logistic5 <- glm(iep.fac ~ sex.fac + race.fac + comptest.cent, family = binomial, data = els.clean)
summary(logistic5)

Call:
glm(formula = iep.fac ~ sex.fac + race.fac + comptest.cent, family = binomial, 
    data = els.clean)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9710  -0.4435  -0.2439  -0.1273   3.8853  

Coefficients:
                                                  Estimate Std. Error z value
(Intercept)                                      -2.139102   0.068578 -31.192
sex.facFemale                                    -0.759200   0.081446  -9.322
race.fac{Survey component legitimate skip/NA}    -5.997572   0.390574 -15.356
race.facAmer. Indian/Alaska Native, non-Hispanic -0.766918   0.400231  -1.916
race.facAsian, Hawaii/Pac. Islander,non-Hispanic -0.912877   0.190509  -4.792
race.facBlack or African American, non-Hispanic  -0.492592   0.116499  -4.228
race.facHispanic, no race specified              -0.825839   0.166150  -4.970
race.facHispanic, race specified                 -0.534818   0.136820  -3.909
race.facMore than one race, non-Hispanic          0.174215   0.176938   0.985
comptest.cent                                    -0.166020   0.005338 -31.099
                                                 Pr(>|z|)    
(Intercept)                                       < 2e-16 ***
sex.facFemale                                     < 2e-16 ***
race.fac{Survey component legitimate skip/NA}     < 2e-16 ***
race.facAmer. Indian/Alaska Native, non-Hispanic   0.0553 .  
race.facAsian, Hawaii/Pac. Islander,non-Hispanic 1.65e-06 ***
race.facBlack or African American, non-Hispanic  2.35e-05 ***
race.facHispanic, no race specified              6.68e-07 ***
race.facHispanic, race specified                 9.27e-05 ***
race.facMore than one race, non-Hispanic           0.3248    
comptest.cent                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6054.9  on 8055  degrees of freedom
Residual deviance: 4254.5  on 8046  degrees of freedom
  (8141 observations deleted due to missingness)
AIC: 4274.5

Number of Fisher Scoring iterations: 6

Odds ratios

exp(cbind(OR = coef(logistic5), confint(logistic5)))
Waiting for profiling to be done...
                                                          OR      2.5 %
(Intercept)                                      0.117760542 0.10276081
sex.facFemale                                    0.468040816 0.39865887
race.fac{Survey component legitimate skip/NA}    0.002484778 0.00116214
race.facAmer. Indian/Alaska Native, non-Hispanic 0.464442154 0.20158030
race.facAsian, Hawaii/Pac. Islander,non-Hispanic 0.401367640 0.27252187
race.facBlack or African American, non-Hispanic  0.611040741 0.48512204
race.facHispanic, no race specified              0.437867403 0.31385670
race.facHispanic, race specified                 0.585775771 0.44637621
race.facMore than one race, non-Hispanic         1.190311070 0.83603315
comptest.cent                                    0.847029404 0.83808895
                                                      97.5 %
(Intercept)                                      0.134463491
sex.facFemale                                    0.548649129
race.fac{Survey component legitimate skip/NA}    0.005389709
race.facAmer. Indian/Alaska Native, non-Hispanic 0.978742612
race.facAsian, Hawaii/Pac. Islander,non-Hispanic 0.576036671
race.facBlack or African American, non-Hispanic  0.766045615
race.facHispanic, no race specified              0.602405193
race.facHispanic, race specified                 0.763389968
race.facMore than one race, non-Hispanic         1.673898014
comptest.cent                                    0.855816142

Part 4: Checking for Model Fit, Checking Assumptions

Ok, now how about some assumptions? There are a few important assumptions to check when running LR models. We also want to see how well our model “fits” - are our predictions any good? ## Checking for model fit The Hosmer-Lemeshow Goodness of Fit Test is the standard test of model fit. It works ok, but is not always entirely accurate! Notics that first we create a data frame called diagnostics using the same augment command that we learned about last week for checking regression assumptions.

Hosmer-Lemeshow test

library(broom)
diagnostics <- augment(logistic5) %>%
  mutate(.,
         fitted2cat = as_factor(case_when(
           .fitted < 0 ~ 0,
           .fitted > 0 ~ 1
         )),
         ID = row_number())

generalhoslem::logitgof(obs = diagnostics$iep.fac, exp = fitted(logistic5))

    Hosmer and Lemeshow test (binary model)

data:  diagnostics$iep.fac, fitted(logistic5)
X-squared = 16.307, df = 8, p-value = 0.03819

Sensitivity and specificity are also really important statistics to pay attention to. This shows how balanced your model is in its predictions.

Sensitivity

library(caret)
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
sensitivity(diagnostics$fitted2cat, reference = diagnostics$iep.fac)
[1] 0.9788742

Specificity

specificity(diagnostics$fitted2cat, reference = diagnostics$iep.fac)
[1] 0.3240279

Using a confusion matrix to represent model fit

conf_matrix<-table(diagnostics$fitted2cat,diagnostics$iep.fac)
conf_matrix
   
       0    1
  0 6904  678
  1  149  325

No Multicollinearity: same as a regular regression, use car::vif:

car::vif(logistic5)
                  GVIF Df GVIF^(1/(2*Df))
sex.fac       1.008042  1        1.004013
race.fac      2.086655  7        1.053945
comptest.cent 2.089950  1        1.445666

To check for regression outliers:

ggplot(data = diagnostics, mapping = aes(x = .fitted, y = .cooksd, label = ID)) +
geom_point() + geom_text(nudge_x = .25) +
labs(title = "Cook's Distance Plot for Model",
x = "Predicted Value",
y = "Cook's Distance") +

geom_hline(yintercept = 4/7981, linetype = "dotted", color = "red")