Generalized Linear Model (GLM)

Introduction

  • In this data dive, we are going to implement a specific type of Generalized Linear Model, a logistic regression model, on our Census Income data set. In earlier data dives, we noted that logistic regression would model our data set better than linear regression, since our target variable is binary in nature, i.e. income either above or below 50K.

Building the Logistic Model

  • For our logistic model, we select the income level feature as our target variable and use gender, age, education level, and hours per week as our four explanatory variables.

  • First, we load the necessary libraries and our census income data set, as shown below:

#load the necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(lindia)
library(boot)
#load the census income data set
df_income <- read.csv("./censusincome.csv")
head(df_income)
  • We then convert gender and income level features to numeric as follows:
#convert the income level and gender labels to numeric indicators
#(note the leading space in the raw labels, e.g. " >50K" and " Male")
df_income$income_numeric <- ifelse(df_income$income == " >50K", 1, 0)
df_income$gender_numeric <- ifelse(df_income$sex == " Male", 1, 0)
head(df_income)
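  • As a quick sanity check on the recoding (a minimal sketch; it relies only on the two columns created above), we can cross-tabulate the original labels against the new numeric columns:

#cross-tabulate the original labels against the numeric encodings;
#every " >50K" row should map to 1 and every " Male" row should map to 1
table(df_income$income, df_income$income_numeric)
table(df_income$sex, df_income$gender_numeric)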
  • Finally, we build our logistic model as shown below using the glm() function in R, passing family = binomial to specify that we are fitting a logistic regression model.
#building the logistic model
logistic_model <- glm(income_numeric ~ age + hours_per_week + education_num + gender_numeric,
                      data = df_income, family = binomial)

#displaying the model coefficients
logistic_model$coefficients
##    (Intercept)            age hours_per_week  education_num gender_numeric 
##    -9.13339878     0.04560424     0.03563749     0.35511411     1.16115817
  • The coefficients above show the effect of each predictor on the log-odds of having income above 50K. The intercept of -9.1334 represents the baseline log-odds of earning above 50K when all predictor variables are equal to zero.

  • Age predictor: For each additional year of age, the log-odds of earning above 50K increases by 0.0456, holding the other variables constant. In terms of the odds ratio, the odds of earning above 50K increase by about 4.7%, since \(e^{0.0456} \approx 1.047\).

  • hours_per_week: For each additional hour worked per week, the log-odds of earning above 50K increases by 0.0356, holding the other predictors constant. This corresponds to an odds ratio of \(e^{0.0356} \approx 1.036\), meaning the odds of being in the above-50K income level increase by about 3.6% for each additional hour worked.

  • Education: This predictor has a coefficient of 0.3551, meaning that for each additional education level, the log-odds of earning above 50K increases by 0.3551, translating to a 42.6% increase in the odds (\(e^{0.3551} \approx 1.426\)), while keeping the other factors constant.

  • Gender: Male is coded as 1 and female as 0. The coefficient of 1.1611 means the log-odds of earning above 50K is higher by 1.1611 for males, translating to an odds ratio of \(e^{1.1611} \approx 3.194\). In other words, males have about 3.19 times the odds of earning above 50K compared to females, holding the other factors constant.

  • In summary, of the per-unit predictors, an increase in education level results in the greatest increase in the odds of earning above 50K (the gender coefficient is larger, but it is a single contrast between two groups rather than a per-unit effect). This makes logical sense, since education typically has a big influence on how much someone earns. These odds ratios can all be computed directly in R, as shown in the sketch below.
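  • The sketch below exponentiates the model coefficients to obtain the odds ratios discussed above and, purely as an illustration, uses predict() to convert the model into a probability for a hypothetical person (the values used, a 40-year-old male with education_num = 13 working 40 hours per week, are invented for the example, not drawn from the data):

#odds ratios: exponentiate the log-odds coefficients
exp(coef(logistic_model))

#predicted probability of earning above 50K for a hypothetical person:
#a 40-year-old male with education_num = 13 working 40 hours per week
new_person <- data.frame(age = 40, hours_per_week = 40,
                         education_num = 13, gender_numeric = 1)
predict(logistic_model, newdata = new_person, type = "response")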

Confidence Interval

  • For the confidence interval, we can pick the education coefficient, since it is the strongest per-unit predictor in our model, and evaluate its CI. We use the formula shown below:

    \[ CI = \hat{\beta} \pm Z \cdot \text{SE}(\hat{\beta}) \]

    where \(\hat{\beta}\) is the estimated coefficient (for education level in this case), and \(Z\) is the critical value from the standard normal distribution. For the 95% confidence level we want to build, the corresponding \(Z\) is approximately 1.96. Lastly, \(\text{SE}(\hat{\beta})\) is the standard error of the coefficient.
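    In R, the critical value need not be hard-coded; it comes straight from the standard normal quantile function:

qnorm(0.975)  #two-sided critical value for a 95% confidence interval
## [1] 1.959964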

model_summary <- summary(logistic_model)

#extract the full coefficient row (estimate, SE, z value, p value) for education
education_coef <- model_summary$coefficients['education_num', ]
education_coef
##     Estimate   Std. Error      z value     Pr(>|z|) 
##  0.355114111  0.006617087 53.666229210  0.000000000
  • From our logistic model, we have \(\hat{\beta} = 0.35511\) and \(SE(\hat{\beta}) = 0.006617\). Substituting these values, we get \(\text{CI} = (0.3421, 0.3681)\) for the log-odds of the education predictor. This translates to \(\text{CI} = (40.8\%, 44.5\%)\) for the percentage increase in the odds. This means that we are \(95\%\) confident that for every additional education unit, there is approximately a \(40.8\%\) to \(44.5\%\) increase in the odds of earning above 50K, while of course holding the other factors constant.
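  • As a quick check, the sketch below reproduces these numbers in R; confint.default() produces the same Wald-type interval as the hand computation above (plain confint() on a glm would instead give a profile-likelihood interval):

#Wald 95% CI for the education coefficient on the log-odds scale
beta <- education_coef['Estimate']
se   <- education_coef['Std. Error']
beta + c(-1, 1) * qnorm(0.975) * se

#the same interval via confint.default(), shown as a proportional
#increase in the odds
exp(confint.default(logistic_model, 'education_num')) - 1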

Conclusion

  • In this data dive, we built a logistic regression model, which is a type of Generalized Linear Model (GLM), and interpreted its coefficients. We also noted the significant influence of the education predictor on income level. Lastly, we evaluated the confidence interval for the education coefficient in our model.