Week 8 Data Dive

Introduction

In this Week’s Data Dive, we are going carry out ANOVA test across various education levels with income level as the target variable. Additionally, we’ll also build a linear regression model of the response using continuous variable, and then evaluate the fitness of the model. For each of the two tasks, we’ll explain to the reader the insights we have gathered.

Setting up

#Loading the necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

#Loading our census income dataset
df_income <- read.csv("./censusincome.csv")

ANOVA

Given the context of our data set, the income category is the most valuable, and we’ll use this as a the response variable. While, it is not continuous, we can still use it as it is ordered i.e. income above and below 50K.
For the explanatory one, we can select the education column as it is categorical. However, as we see can from the results below we have 16 columns for the education category. We can consolidate the columns to a fewer number in order to make our analysis easier and more meaningful.

length(unique(df_income$education))

## [1] 16

unique(df_income$education)

##  [1] " Bachelors"    " HS-grad"      " 11th"         " Masters"     
##  [5] " 9th"          " Some-college" " Assoc-acdm"   " Assoc-voc"   
##  [9] " 7th-8th"      " Doctorate"    " Prof-school"  " 5th-6th"     
## [13] " 10th"         " 1st-4th"      " Preschool"    " 12th"

We’ll consolidate the education variable into “Basic”, “Intermediate”, “Advanced”, and “Professional” education levels as shown below:

df_income <- df_income |>
  mutate(education_consolidated = case_when(
    education %in% c(" Preschool", " 1st-4th", " 5th-6th", " 7th-8th") ~ " Basic",
    education %in% c(" 9th", " 10th", " 11th", " 12th", " HS-grad") ~ " Intermediate",
    education %in% c(" Assoc-acdm", " Assoc-voc", " Some-college", " Bachelors") ~ " Advanced",
    education %in% c(" Masters", " Doctorate", " Prof-school") ~ " Professional"
  ))

grouped_by_education <- df_income |>
  group_by(education_consolidated) |>
  summarise(count = n())

grouped_by_education

Lastly, we need to convert our income variable into numeric, “1” to represent “>50K” and “0” to represent “<=50K”.

df_income <- df_income |>
  mutate(
    income_numeric = ifelse(df_income$income == " <=50K", 0, 1)
  )

Anova Test

Devising the Hypothesis

Null Hypothesis (\(H_0\)): There is no significant difference in proportions of individuals earning above 50K across the various education levels.
Alternative Hypothesis(\(H_1\)): There is a significant difference in proportions of individuals earning above 50K across the various education levels.

Performing the Anova Test

anova_test <- aov(income_numeric ~ education_consolidated, data=df_income)
summary(anova_test)

##                           Df Sum Sq Mean Sq F value Pr(>F)    
## education_consolidated     3    608  202.55    1234 <2e-16 ***
## Residuals              32557   5345    0.16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above we see an F-Value of 1234 which is very large indicating that there is a huge variation across the education groups i.e. “basic”, “intermediate”, “advanced”, and “professional” levels. We also have a p-value of \(2e-16\) which is very small indicating that education levels have significant effect on income level. With these two observations, we have strong evidence to reject the null hypothesis which stated that there is no significant difference in proportions of individuals earning above 50K for the various education levels.
Below is the F-Statistic for the census income data set:

#no. of unique consolidated education levels
df_education <- length(unique(df_income$education_consolidated)) - 1
df_residuals <- nrow(df_income) - length(unique(df_income$education_consolidated))



ggplot() +
  geom_function(xlim = c(0,10), fun =\(x) df(x, df_education, df_residuals))+
  geom_vline(xintercept = 1, color="orange") +
  labs(title="F Distribution for Census Income Data Set",
       x="F Values", 
       y="Probability Density") +
  theme_minimal()

For the case where the variation within the group is the same as the variation across the group, then the F value is equals to 1 which would have validated the null hypothesis. However, our F-value is so much greater from 1, indicating a huge variation of proportions of individuals above 50K income level across the various education groups.
We however, note that our response variable is binary and this violates one of the assumptions of ANOVA test that assumes the data within each group is continuous and normally distributed. That said, there is no other variable in our data set that would have made much logical sense as a response variable. In light of this, a logistic regression would be a better statistical method to evaluate the variation across the groups against the income level.

Linear Regression Model

For the Linear Regression Model, we choose hours_per_week as our continuous explanatory variable and we maintain income_level as our response variable. We know the output of our model will be as follows:

\[ \hat{y} = \beta_0 + \beta_1 \cdot x \]

where \(\hat{y}\) is the predicted value, \(\beta_0\) is the y-intercept, and \(\beta_1\) is the coefficient for the predictor value (hours-per-week in this case)

#Building the linear regression model
linear_model_hours <- lm(income_numeric ~ hours_per_week, data=df_income)
summary(linear_model_hours)

## 
## Call:
## lm(formula = income_numeric ~ hours_per_week, data = df_income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7066 -0.2373 -0.2373  0.0172  1.0729 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.0808278  0.0078973  -10.23   <2e-16 ***
## hours_per_week  0.0079539  0.0001868   42.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4162 on 32559 degrees of freedom
## Multiple R-squared:  0.05276,    Adjusted R-squared:  0.05273 
## F-statistic:  1813 on 1 and 32559 DF,  p-value: < 2.2e-16

From the above, our linear regression can be represent using the equation shown below:

\[ \begin{align} \hat{y} = -0.080828 + 0.0079539 \; \text{hours-per-week} \end{align} \]

We see a y-intercept of \(\approx -0.081\) on a income scale of 0 to 1 with zero representing 50K and below income level, and 1 representing 50K and above. This would mean someone who works 0 hours per week has an income which is further away from the 50K mark. However, this doesn’t make a lot of logical sense since most people don’t work 0 hours.
We also have a hours per week coefficient of 0.00795 which implies that for every additional hour worked per week, the probability of earning more than 50K increases by approximately 0.008. This suggests that working more hours per week just slightly increases chances of earning more than 50K.
Also, we note the p-value of \(<2.2e-16\) which is really small indicating that the influence of hour-per-week is statistically significant, meaning the hours per week does influence an individual’s income level.
The R-Squared value for our model is approximately 0.053 meaning that only 5.3% of the variability in income category i.e below and above 50k, is explained by the hours-per-week feature.
We have F-statistic value of 1813 which is relatively large indicating that the variance across the education groups overshadows the variance within the groups. This implies that hours per week would be a good predictor of income level.
In explaining whether our model is a good fit, we can use the Residual standard error which is at 0.4162 (on a scale of 0 to 1) indicating that the prediction’s deviate quite a lot from the observed values. This would mean that we might need to consider building a different linear regression using other features, or building a different type of model altogether.
In conclusion, we establish the relationship between income level and the hours-per-week is statistically significant, due to the small value of p and the large value of F-statistic. However, the value of R-squared is relatively small indicating that hours-per-week only explain a small portion of the income level variation. Therefore, we would recommend other variables should be added in conjunction with the hours-per-week feature.