Generalized Linear Models Part 2

Introduction

  • During last week’s Data Dive, we built a logistic regression model, which is one type of Generalized Linear Model (GLM). For this week, we are required to build another GLM. In this data dive we explore the possibility of building a quasi-binomial GLM and, if warranted, we’ll go ahead and build that model. To check whether a quasi-binomial model is appropriate, we test our binomial model for overdispersion (i.e., more variance in the binary income outcome than the binomial distribution allows).

Checking for Overdispersion

  • We know that the (deviance-based) dispersion statistic is given as:

\[ \text{Dispersion} = \frac{\text{Residual Deviance}}{\text{Degrees of Freedom}} \]

  • In R we can compute this statistic as follows:
# First we import the necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Second we load our dataset
df_income <- read.csv("censusincome.csv")

# Convert the income level and gender to numeric indicators
# (the leading spaces in " >50K" and " Male" match how the values
# are stored in this dataset)
df_income$income_numeric <- ifelse(df_income$income == " >50K", 1, 0)
df_income$gender_numeric <- ifelse(df_income$sex == " Male", 1, 0)
head(df_income)
# Building our logistic regression model
logistic_model <- glm(income_numeric ~ age + hours_per_week + education_num + gender_numeric,
                      data = df_income, family = binomial)
# Evaluating the dispersion statistic
dispersion <- summary(logistic_model)$deviance / summary(logistic_model)$df.residual

dispersion
## [1] 0.8575165
  • We note that the dispersion statistic for our model is 0.8575, which is less than 1. Rather than overdispersion, this indicates mild underdispersion, contrary to what we anticipated when making the argument for a quasi-binomial GLM; underdispersion is generally the rarer case. (An alternative, Pearson-based estimate of the dispersion is sketched after this list.)

  • Consequently, we do not have a strong case for switching from logistic regression to a quasi-binomial GLM, since our model showed no overdispersion. However, for learning purposes, we’ll go ahead and build a quasi-binomial GLM anyway.
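
  • As a minimal sketch of that alternative check, the dispersion can also be estimated from the Pearson residuals; this is the estimate the quasi-binomial family itself uses. The variable name below is our own, applied to the logistic_model fitted above:

# A Pearson-based dispersion estimate; as before, values near 1
# suggest the binomial variance assumption is adequate
pearson_dispersion <- sum(residuals(logistic_model, type = "pearson")^2) /
  df.residual(logistic_model)
pearson_dispersion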

Quasi-Binomial GLM

  • The quasi-binomial GLM is an extension of binomial logistic regression used to handle overdispersion in binomial data, although we know this isn’t the case in this instance. While binomial logistic regression fixes the variance at the binomial variance, the quasi-binomial model estimates an additional dispersion parameter, allowing for extra variability in the response variable.

  • Like standard logistic regression, the quasi-binomial model uses the logit link function, so the model predicts the log-odds of the outcome based on the predictors.
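
  • To make the distinction concrete, the two variance functions differ only by the dispersion parameter \(\phi\) (for the binomial, \(\phi\) is fixed at 1):

\[ \text{Binomial: } \text{Var}(Y) = \mu(1 - \mu) \qquad \text{Quasi-binomial: } \text{Var}(Y) = \phi \, \mu(1 - \mu) \]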

# Building the quasi-binomial model
quasi_binomial <- glm(income_numeric ~ age + education_num + hours_per_week + gender_numeric,
                      data = df_income, family = quasibinomial(link = "logit"))

# Displaying the coefficients
quasi_binomial$coefficients
##    (Intercept)            age  education_num hours_per_week gender_numeric 
##    -9.13339878     0.04560424     0.35511411     0.03563749     1.16115817
  • For the coefficients above, we note that the coefficients of the quasi-binomial GLM are indeed identical to those of the logistic regression we built in the previous data dive (a side-by-side check is sketched after this list). This is in line with our initial finding that a quasi-binomial model would not differ much from a logistic regression model, since there is no overdispersion in our target variable.

  • Since these coefficients match what we had in the Week 10 Data Dive, we are not going to re-explain what they represent or the insights we gather from this model, as that would be entirely repetitive. We do, however, go ahead and build the relevant diagnostic plots, as this is something we have not done previously for any GLM.
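
  • The side-by-side check mentioned above, as a minimal sketch: the point estimates agree exactly, and only the standard errors differ, scaled by the square root of the estimated dispersion.

# Compare point estimates: these columns should be identical
cbind(binomial = coef(logistic_model),
      quasibinomial = coef(quasi_binomial))

# Compare standard errors: the quasi-binomial ones are rescaled by
# sqrt(dispersion), where the dispersion is estimated from Pearson residuals
summary(logistic_model)$coefficients[, "Std. Error"]
summary(quasi_binomial)$coefficients[, "Std. Error"]
summary(quasi_binomial)$dispersion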

Diagnostic plots

  • For the diagnostics, we’ll produce the following two plots:

Residuals vs. Fitted Values

# Plot Residuals vs. Fitted Values
plot(quasi_binomial$fitted.values, residuals(quasi_binomial, type = "pearson"),
     main = "Residuals vs Fitted Values", xlab = "Fitted Values", ylab = "Pearson Residuals")
abline(h = 0, col = "red")

  • From the above plot, we see that the residuals are not evenly distributed around zero, suggesting that the linear relationship (on the log-odds scale) between the predictor variables and the income response may not be strong, and that the model is not capturing the relationship in the data effectively. One caveat: with a binary response, Pearson residuals fall into two bands (one per outcome), which makes this raw plot hard to read.
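
  • A binned residual plot is easier to interpret for a binary outcome. The sketch below averages the residuals within bins of the fitted values; the 50-bin choice and the variable names are our own, and we assume glm() dropped no rows, so the residuals line up with df_income.

# Binned residual plot: average Pearson residuals within bins of
# fitted values so departures from zero become visible
binned <- df_income %>%
  mutate(fitted = quasi_binomial$fitted.values,
         resid = residuals(quasi_binomial, type = "pearson"),
         bin = ntile(fitted, 50)) %>%
  group_by(bin) %>%
  summarise(mean_fitted = mean(fitted), mean_resid = mean(resid))

ggplot(binned, aes(x = mean_fitted, y = mean_resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Binned Residuals vs Fitted Values",
       x = "Mean Fitted Value", y = "Mean Pearson Residual")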

Residuals vs. Predictor Variables

  • For the predictor variable here, we choose the education feature (education_num), since we previously found it to be a strong predictor in our model when looking at the model coefficients.
# Residuals vs Predictor: education
plot(df_income$education_num, residuals(quasi_binomial, type = "pearson"),
     main = "Residuals vs education", xlab = "education", ylab = "Pearson Residuals")
abline(h = 0, col = "red")

  • From the above diagnostic plot, we can see that the residuals are roughly centered around zero, meaning that, on average, the model neither overpredicts nor underpredicts at any education level. However, the vertical spread of these residuals is quite large, indicating that the model does not fit our data particularly well given these predictor variables.
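
  • To see that spread more clearly, one option is to summarize the residuals at each education level with boxplots. A minimal sketch (the variable name is our own, and again we assume glm() dropped no rows):

# Spread of Pearson residuals at each education level
df_income %>%
  mutate(pearson_resid = residuals(quasi_binomial, type = "pearson")) %>%
  ggplot(aes(x = factor(education_num), y = pearson_resid)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Pearson Residuals by Education Level",
       x = "education_num", y = "Pearson Residuals")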

Conclusion

  • In this week’s data dive, we implemented a quasi-binomial GLM and found no improvement over the logistic regression we built previously. This is because there was no overdispersion in the income response variable. Lastly, we built two diagnostic plots in an effort to investigate the effectiveness of our GLM.