Linear Regression Diagnostics

Introduction

  • In last week’s data dive, we built a linear regression model using hours_per_week as the predictor variable and income as the target variable. However, we noticed that our model had a low R-squared value: only 5.3% of the variability in income category could be explained by the hours_per_week feature. To improve on this, we’ll add both an interaction term and a binary term.

Adding Interaction and Binary Term

  • For the binary term, we’ll add gender because we want to see the effect of this feature on our model; our earlier analysis in previous data dives also showed variability in income level based on gender.

  • For the interaction term, we can use either age or education. To determine which one to choose, we evaluate the degree of multicollinearity among the candidate predictors, as shown below:

    #import necessary libraries
    library(tidyverse)
    ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ## ✔ dplyr     1.1.4     ✔ readr     2.1.5
    ## ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
    ## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
    ## ✔ purrr     1.0.2     
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    library(ggthemes)
    library(ggrepel)
    library(broom)
    library(lindia)
    library(boot)
    #load our census income dataset
    df_income <- read.csv("./censusincome.csv")
    #evaluating multicollinearity
    corr_matrix <- cor(df_income[, c('hours_per_week', 'age', 'education_num')])
    corr_matrix
    ##                hours_per_week        age education_num
    ## hours_per_week     1.00000000 0.06875571    0.14812273
    ## age                0.06875571 1.00000000    0.03652719
    ## education_num      0.14812273 0.03652719    1.00000000
  • From the correlation matrix above, we see that multicollinearity is low among the candidate predictor variables: all pairwise correlations are below 0.15 (a quick VIF check confirming this follows the next bullet).

  • However, it makes more sense to use age in the interaction term because this allows the model to capture how the effect of hours_per_week on income level depends on age (age is typically associated with more experience).
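
  • As a quick sanity check on the low-multicollinearity claim, the variance inflation factors can be read directly off the diagonal of the inverse correlation matrix (a minimal sketch using only base R; values near 1 indicate negligible multicollinearity):

    #VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element of the
    #inverse correlation matrix; values near 1 mean negligible multicollinearity
    diag(solve(corr_matrix))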

Building the Linear Regression Model

  • With the four predictor variables, we proceed to build the linear regression model as shown below:

    #first we convert the gender column into a numeric dummy variable
    #(note the leading space in the raw values, e.g. ' Male')
    df_income$sex_binary <- ifelse(df_income$sex == ' Male', 1, 0)
    
    #we also convert the income column into a 0/1 target
    df_income$income_numeric <- ifelse(df_income$income == ' >50K', 1, 0)
    model_updated <- lm(income_numeric ~ hours_per_week * age + sex_binary + education_num, data = df_income)
    
    model_results <- tidy(model_updated, conf.int=TRUE)
    model_results
    summary(model_updated)
    ## 
    ## Call:
    ## lm(formula = income_numeric ~ hours_per_week * age + sex_binary + 
    ##     education_num, data = df_income)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -1.3259 -0.2578 -0.1213  0.1951  1.2700 
    ## 
    ## Coefficients:
    ##                      Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)        -5.768e-01  1.903e-02 -30.307  < 2e-16 ***
    ## hours_per_week     -1.700e-03  4.643e-04  -3.661 0.000252 ***
    ## age                 6.301e-04  4.122e-04   1.529 0.126345    
    ## sex_binary          1.505e-01  4.584e-03  32.826  < 2e-16 ***
    ## education_num       5.093e-02  8.230e-04  61.891  < 2e-16 ***
    ## hours_per_week:age  1.577e-04  1.074e-05  14.684  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3776 on 32555 degrees of freedom
    ## Multiple R-squared:  0.2204, Adjusted R-squared:  0.2202 
    ## F-statistic:  1840 on 5 and 32555 DF,  p-value: < 2.2e-16
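  • Since broom is already loaded, the fit statistics discussed below can also be collected in a single row with glance() (a small convenience sketch, not part of the original workflow):

    #one-row summary of fit statistics (r.squared, sigma, F-statistic, p-value)
    glance(model_updated)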
  • From the above results, we note the following:

    • a residual standard error of 0.3776, which indicates the average deviation of the observed values from the values on the regression line. This value is relatively high and indicates room for improvement; however, it is a significant improvement over the residual standard error of 0.4162 attained by the model in the Week 8 Data Dive.

    • an R-squared value of 0.2204, showing that the model explains 22.04% of the variance in income level in our data set.

    • an F-statistic of 1840 with a p-value of \(< 2.2 \times 10^{-16}\), indicating that the model as a whole is statistically significant and makes a real improvement over the base model, i.e., simply predicting the sample proportion of each income level.

    • Except for age, which has a p-value of 0.126, the predictor variables all have p-values well below 0.05, indicating their significance in the model.

    • hours_per_week has a negative main-effect coefficient of -0.0017, implying that working more hours is associated with a slightly lower chance of being in the above-50K income level, holding the other factors constant. Note that, with the interaction term in the model, this main effect is the slope of hours_per_week at age = 0, so it should be read together with the interaction (see the sketch after this list). This relationship is contrary to the Week 8 model, where we saw a slightly positive relationship between income and hours_per_week. One way to explain this phenomenon is Simpson’s paradox, where the relationship between a predictor variable and the target variable is reversed or changed upon the introduction of more predictor variables.

    • The positive and relatively large coefficient for sex_binary indicates that males tend to have higher income than females if all the other factors are held constant. The education_num feature also has a positive and large coefficient, indicating that higher levels of education correspond to a higher probability of being in the above-50K income bracket.

    • Lastly, we notice a positive and statistically significant coefficient of about 0.00016 for our interaction term, hours_per_week:age, suggesting that the relationship between hours_per_week and income level changes significantly with age.
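
  • Because of the interaction term, the overall slope of hours_per_week at a given age is the main effect plus the interaction coefficient times age. The sketch below computes this combined slope from the fitted coefficients at a few arbitrary ages; using the estimates above, it turns positive for ages above roughly 11, so for working-age adults the net effect of extra hours is positive even though the main effect alone is negative:

    #combined effect of one extra hour per week at a given age:
    #slope = b_hours_per_week + b_interaction * age
    coefs <- coef(model_updated)
    ages <- c(20, 30, 40, 50, 60)
    data.frame(age = ages,
               hours_slope = coefs["hours_per_week"] + coefs["hours_per_week:age"] * ages)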

Diagnostic Plots

  • For the diagnostics, we generate the following plots:
Residual vs. Fitted Values
gg_resfitted(model_updated) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

  • From the above plot, we see that the residuals do NOT scatter evenly around the horizontal zero line, as they should in the ideal case under the assumptions of linear regression. This implies that our model may not fully capture the relationship between the predictor variables and the target variable, which agrees with our earlier finding that the model explains only a modest share of the variance in income level in our data set.
Residuals vs. Predictor Values
plots <- gg_resX(model_updated, plot.all = FALSE)

plots$age

plots$hours_per_week

plots$education_num

plots$sex_binary

  • For residuals versus age, we don’t see the expected normal scatter of residuals around the zero line; instead, the residuals form distinct groups above and below it, violating one of the assumptions of linear regression. The same behavior appears in the residuals vs. hours_per_week, education, and gender plots: the points “behave” as if the zero residual line did not exist. This means that our model may not capture the variability in our data effectively.

Residual Histogram

gg_reshist(model_updated)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • From the residual histogram above, we see that the distribution of residuals is bimodal, with one peak around 0 and another around 0.5. For linear regression we would expect a unimodal, roughly normal distribution, which is not the case here; hence this assumption is violated, confirming our earlier findings. However, with our target variable being binary, normality of the residuals isn’t expected, since the outcome is not continuous.
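
  • The bimodality follows directly from the binary target: each residual equals either -fitted (when income_numeric is 0) or 1 - fitted (when income_numeric is 1), so the residuals split into two clusters. A quick check of this, grouping residuals by outcome (a small sketch using the fitted model):

#residuals grouped by outcome; each group forms its own cluster around a separate peak
tapply(resid(model_updated), df_income$income_numeric, summary)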

QQ Plots

gg_qqplot(model_updated)

  • In the above plot, there is a noticeable S-shaped curve, which indicates possible heavy tails or a non-normal distribution. The red reference line is where the points would lie if the residuals were normally distributed. In our case, the points start above the reference line, dip below it in the middle, and rise above it again towards the end. This confirms our finding that the residuals are not normally distributed.
Cook’s Distance

gg_cooksd(model_updated, threshold = "matlab")

  • Our Cook’s D plot shows points with higher Cook’s D than the rest of the observations, the most notable being observation 15357. This observation, along with the several others well above the horizontal red threshold line, has a significant influence on the model’s coefficients; removing those points would change the behavior of our model significantly.

  • The several observations with relatively high Cook’s D could represent unique subgroups within our census income data set, e.g. top earners, that our model might not be able to handle (see the sketch below for one way to inspect them).
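
  • To inspect those influential points directly, Cook’s distances can be recomputed with base R and the most influential rows pulled from the data (a minimal sketch; the selected columns assume the census schema used above):

#rank observations by Cook's distance and inspect the most influential row
cooks <- cooks.distance(model_updated)
head(sort(cooks, decreasing = TRUE))
df_income[which.max(cooks), c('age', 'hours_per_week', 'education_num', 'sex', 'income')]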

Conclusion

  • In this data dive, we improved our model’s performance by introducing both an interaction term and a binary term, as well as adding another predictor variable. This resulted in a higher R-squared value and a reduced residual standard error. We also saw Simpson’s paradox in effect in the relationship between one of our predictor variables and the target variable. Finally, we made several diagnostic plots to investigate whether the model satisfies the assumptions of linear regression.

  • We note that, since our target variable is binary, linear regression might not be the most suitable tool for building a model; as a matter of fact, our diagnostic plots have shown this to be the case. However, we understand the requirement for this data dive was to use linear regression, so we proceeded accordingly.
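
  • For reference, a logistic regression with the same specification is the natural alternative for a binary target; below is a minimal sketch using base R’s glm() (not required for this data dive):

#logistic regression on the same predictors; coefficients are on the log-odds scale
model_logit <- glm(income_numeric ~ hours_per_week * age + sex_binary + education_num,
                   data = df_income, family = binomial)
summary(model_logit)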