\[ \text{Dispersion} = \frac{\text{Residual Deviance}}{\text{Residual Degrees of Freedom}} \]
# First, we load the necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Next, we load our dataset
df_income <- read.csv("censusincome.csv")
# Convert the income level and gender to numeric indicators.
# Note: the raw labels carry a leading space (e.g. " >50K", " Male"), so we match them exactly.
df_income$income_numeric <- ifelse(df_income$income == " >50K", 1, 0)
df_income$gender_numeric <- ifelse(df_income$sex == " Male", 1, 0)
head(df_income)
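As a quick sanity check on the recoding, we can cross-tabulate the raw income labels against the new indicator (a small sketch; every " >50K" row should map to 1 and everything else to 0):
# Sanity check: raw income labels vs. the numeric indicator
table(df_income$income, df_income$income_numeric)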
# Building our logistic regression model
logistic_model <- glm(income_numeric ~ age + hours_per_week + education_num + gender_numeric,
                      data = df_income, family = binomial)
# Evaluating the dispersion statistic: residual deviance over residual degrees of freedom
model_summary <- summary(logistic_model)
dispersion <- model_summary$deviance / model_summary$df.residual
dispersion
## [1] 0.8575165
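As a side note, the dispersion can equivalently be estimated from the Pearson residuals rather than the deviance; the sketch below shows this alternative, and the two estimates are typically close but not identical:
# Alternative dispersion estimate based on Pearson residuals
sum(residuals(logistic_model, type = "pearson")^2) / logistic_model$df.residual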
The dispersion statistic for our model is 0.8575, which is less than 1. This indicates that the model does not exhibit overdispersion; if anything, it shows mild underdispersion, contrary to what we expected earlier when we argued for a quasi-binomial GLM. Underdispersion is generally the rarer case.
Consequently, we do not have a strong case for switching from logistic regression to a quasi-binomial model, since our model showed no overdispersion. For learning purposes, however, we will build a quasi-binomial GLM anyway.
The quasi-binomial GLM is an extension of binomial logistic regression used to handle overdispersion in binomial data, even though we know that is not the situation here. While binomial logistic regression assumes the binomial variance, the quasi-binomial model relaxes this assumption by estimating a separate dispersion parameter, allowing extra variability in the response variable.
Like standard logistic regression, the quasi-binomial model uses the logit link function, so it predicts the log-odds of the outcome from the predictors.
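Concretely, where the binomial model fixes each observation's variance at the binomial value, the quasi-binomial model scales it by a dispersion parameter \(\phi\):

\[ \text{Var}(Y_i) = \phi \, \mu_i (1 - \mu_i) \]

When \(\phi = 1\) this reduces to standard logistic regression; \(\phi > 1\) corresponds to overdispersion, and \(\phi < 1\) to the underdispersion we observed above.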
# Building the quasi-binomial model
quasi_binomial <- glm(income_numeric ~ age + education_num + hours_per_week + gender_numeric,
                      data = df_income, family = quasibinomial(link = "logit"))
# Displaying the coefficients
quasi_binomial$coefficients
##    (Intercept)            age  education_num hours_per_week gender_numeric
##    -9.13339878     0.04560424     0.35511411     0.03563749     1.16115817
Looking at the output above, we note that the coefficients of the quasi-binomial GLM are identical to those of the logistic regression model we built in the previous data dive. This is in line with our earlier finding that a quasi-binomial model would differ little from logistic regression here, since there is no overdispersion in the target variable: the quasi-binomial fit changes only the standard errors, which are scaled by the square root of the estimated dispersion parameter, while the point estimates themselves are unchanged. We verify this with the quick comparison below.
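A small sketch using the two model objects defined above:
# Compare estimates and standard errors across the binomial and quasi-binomial fits.
# The estimates match exactly; the quasi-binomial standard errors are the binomial
# ones scaled by the square root of the Pearson-based dispersion estimate.
se_binomial <- summary(logistic_model)$coefficients[, "Std. Error"]
se_quasi <- summary(quasi_binomial)$coefficients[, "Std. Error"]
round(cbind(binomial = se_binomial, quasibinomial = se_quasi, ratio = se_quasi / se_binomial), 5)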
Since these coefficients match what we had in the Week 10 Data Dive, we will not re-explain what they represent or the insights we gathered from that model, as doing so would be entirely repetitive. We will, however, go ahead and build the relevant diagnostic plots, something we have not done previously for any GLM.
# Plot Residuals vs. Fitted Values
plot(quasi_binomial$fitted.values, residuals(quasi_binomial, type = "pearson"),
     main = "Residuals vs Fitted Values", xlab = "Fitted Values", ylab = "Pearson Residuals")
abline(h = 0, col = "red")
# Residuals vs Predictor: education
plot(df_income$education_num, residuals(quasi_binomial, type = "pearson"),
     main = "Residuals vs education", xlab = "education", ylab = "Pearson Residuals")
abline(h = 0, col = "red")
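Because the outcome is binary, the raw Pearson residuals fall into two curved bands, which makes the plots above hard to read. A common remedy is a binned residual plot: group observations into bins of fitted values and plot the average residual per bin. Below is a minimal sketch using dplyr (already attached via the tidyverse); the choice of 50 bins is arbitrary.
# Binned residuals: average Pearson residual within bins of fitted values
binned <- tibble(fitted = quasi_binomial$fitted.values,
                 resid = residuals(quasi_binomial, type = "pearson")) %>%
  mutate(bin = ntile(fitted, 50)) %>%
  group_by(bin) %>%
  summarise(mean_fitted = mean(fitted), mean_resid = mean(resid))
plot(binned$mean_fitted, binned$mean_resid,
     main = "Binned Residuals vs Fitted Values",
     xlab = "Mean Fitted Value", ylab = "Mean Pearson Residual")
abline(h = 0, col = "red")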