For the binary term, we'll add gender because we want to see the effect of this feature on our model; our earlier analysis in previous data dives also showed variability in income level based on gender.
For the interaction term, we can use either age or education. To determine which one to choose, we evaluate their degree of multicollinearity with the other predictors, as shown below:
#import necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(ggplot2)
library(broom)
library(lindia)
library(boot)
#load our census income dataset
df_income <- read.csv("./censusincome.csv")
#evaluating multicollinearity
corr_matrix <- cor(df_income[, c('hours_per_week', 'age', 'education_num')])
corr_matrix
##                hours_per_week        age education_num
## hours_per_week     1.00000000 0.06875571    0.14812273
## age                0.06875571 1.00000000    0.03652719
## education_num      0.14812273 0.03652719    1.00000000
From the correlation matrix above, we see that multicollinearity is likely low among the variables we're considering as predictor variables to improve our linear regression model.
However, it makes more sense to use age as an interaction term because, this allows the model to capture the effect of hours_per_week has on income level depending on age (age is typically associated with more experience)
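As a quick additional check (a minimal sketch, not part of the original workflow), we can also compute a variance inflation factor for hours_per_week by hand from an auxiliary regression on the other two candidates; given the low correlations above, it should come out close to 1:
#hedged sketch: hand-computed VIF for hours_per_week, i.e. 1 / (1 - R^2) from
#regressing it on the other candidate predictors; values near 1 indicate low
#multicollinearity
aux_model <- lm(hours_per_week ~ age + education_num, data = df_income)
1 / (1 - summary(aux_model)$r.squared)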
With the four predictor variables, we proceed to build the linear regression model as shown below:
#first we convert gender column into numeric
df_income$sex_binary <- ifelse(df_income$sex == ' Male', 1,0)
#we also convert the income column into a binary numeric target
df_income$income_numeric <- ifelse(df_income$income ==' >50K', 1, 0)
#fit the updated linear model with the interaction term and the binary term
model_updated <- lm(income_numeric~hours_per_week * age + sex_binary + education_num, data=df_income)
model_results <- tidy(model_updated, conf.int=TRUE)
model_results
summary(model_updated)
##
## Call:
## lm(formula = income_numeric ~ hours_per_week * age + sex_binary +
## education_num, data = df_income)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.3259 -0.2578 -0.1213  0.1951  1.2700
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -5.768e-01  1.903e-02 -30.307  < 2e-16 ***
## hours_per_week     -1.700e-03  4.643e-04  -3.661 0.000252 ***
## age                 6.301e-04  4.122e-04   1.529 0.126345
## sex_binary          1.505e-01  4.584e-03  32.826  < 2e-16 ***
## education_num       5.093e-02  8.230e-04  61.891  < 2e-16 ***
## hours_per_week:age  1.577e-04  1.074e-05  14.684  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3776 on 32555 degrees of freedom
## Multiple R-squared: 0.2204, Adjusted R-squared: 0.2202
## F-statistic: 1840 on 5 and 32555 DF, p-value: < 2.2e-16
From the above results, we note the following:
A residual standard error of 0.3776, which indicates the average deviation of the observed values from the fitted regression line. This value is relatively high and indicates room for improvement; however, it is a clear improvement over the residual standard error of 0.4162 attained by the model in the Week 8 Data Dive.
An R-squared value of 0.2204, showing that the model explains 22.04% of the variance in income level in our data set.
An F-statistic of 1840 with a p-value \(< 2.2 \times 10^{-16}\), indicating that the model as a whole is statistically significant and makes a significant improvement over the base model, i.e. simply predicting the sample proportion of each income level.
Except for age, which has a p-value of 0.126, the predictor variables all have p-values well below 0.05, indicating their significance in the model.
The hours_per_week variable has a negative coefficient of -0.0017, implying that working more hours is associated with a slightly lower predicted probability of being in the above-50K income level, holding the other factors constant. This is contrary to the Week 8 model, where we saw a slightly positive relationship between income and hours_per_week. One way to explain this phenomenon is Simpson's paradox, where the relationship between a predictor and the target variable is reversed or changed once more predictor variables are introduced; a small sketch illustrating this sign change follows this list. (Because of the interaction term, this coefficient is also best read as the effect of hours_per_week when age is zero.)
The positive and relatively large coefficient for sex_binary indicates that males tend to have a higher income than females when all other factors are held constant. The education_num feature also has a positive and large coefficient, indicating that higher levels of education correspond to a higher probability of being in the above-50K income bracket.
Lastly, we notice a positive coefficient of about 0.00016 for our interaction term, hours_per_week:age. Its p-value is well below 0.05, indicating the term is significant and suggesting that the relationship between hours_per_week and income level changes meaningfully with age (the hours_per_week slope increases as age increases).
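To make the Simpson's paradox point above concrete, here is a minimal sketch that refits income on hours_per_week alone and compares the sign of its slope with the main effect in the full model (this single-predictor refit only approximates the Week 8 model, which is not reproduced here):
#hedged sketch: a one-predictor refit to illustrate the sign change discussed above;
#per the Week 8 finding, this slope is expected to be slightly positive, while the
#main effect in the fuller model is negative
model_simple <- lm(income_numeric ~ hours_per_week, data = df_income)
coef(model_simple)["hours_per_week"]   #hours_per_week on its own
coef(model_updated)["hours_per_week"]  #after adding age, sex, education and the interaction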
#residuals vs. fitted values plot
gg_resfitted(model_updated) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#residual plots against each predictor variable
plots <- gg_resX(model_updated, plot.all = FALSE)
plots$age
plots$hours_per_week
plots$education_num
plots$sex_binary
#histogram of residuals
gg_reshist(model_updated)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#QQ plot of residuals to check normality
gg_qqplot(model_updated)
#Cook's distance plot to flag high-influence observations
gg_cooksd(model_updated, threshold = "matlab")
Our Cook's D plot shows points with higher Cook's D than the rest of the observations, the most notable being observation 15357. This observation, along with the several others sitting well above the horizontal red threshold line, has significant influence on the model's coefficients; removing those points would change the behavior of our model noticeably (a sensitivity check is sketched below).
The several observations with relatively high Cook's D could represent unique subgroups within our census income data set, e.g. top earners, that our model might not handle well.
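As a rough sensitivity check (a sketch using the common 4/n rule of thumb rather than the MATLAB-style threshold drawn in the plot), we can refit the model without the high-influence observations and compare the coefficients:
#hedged sketch: drop observations whose Cook's distance exceeds 4/n and refit;
#assumes no rows were dropped during fitting, so the indices line up with df_income
cooks_d <- cooks.distance(model_updated)
high_influence <- which(cooks_d > 4 / length(cooks_d))
model_trimmed <- lm(income_numeric ~ hours_per_week * age + sex_binary + education_num,
                    data = df_income[-high_influence, ])
round(cbind(full = coef(model_updated), trimmed = coef(model_trimmed)), 5)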
In this Data Dive, we improved our model's performance by introducing both an interaction term and a binary term, as well as adding another predictor variable. This resulted in a higher R-squared value and a reduced residual standard error. We also saw Simpson's paradox at work in the relationship between one of our predictor variables and the target variable. Finally, we made several diagnostic plots to investigate whether the model follows the assumptions of linear regression.
We note that since our target variable is binary, linear regression might not be the most suitable tool for building a model; as a matter of fact, our diagnostic plots have shown this to be the case. However, we understand that the requirement for this data dive was to use linear regression, so we proceeded accordingly.
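For completeness, here is a minimal sketch (outside this data dive's requirement) of the same specification fit as a logistic regression, which is generally the more suitable tool for a binary target:
#hedged sketch: logistic regression alternative for the binary target; coefficients
#are on the log-odds scale, so they are not directly comparable to the linear model's
model_logit <- glm(income_numeric ~ hours_per_week * age + sex_binary + education_num,
                   data = df_income, family = binomial)
summary(model_logit)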