To what extent do hours worked per week, age, and education level predict a person’s annual income?
This paper evaluates how weekly work hours, age, and education relate to individual annual income, using the 2012 American Community Survey (ACS) dataset from OpenIntro.org (2,000 observations). Our analysis focuses on four variables:
income: Total annual income (Continuous Outcome)
hrs_work: Weekly hours worked (Quantitative Predictor)
age: Age in years (Quantitative Predictor)
edu: Highest education level attained, restricted to “college” or “hs or lower” (Categorical Predictor)
Dataset source: OpenIntro ACS12.
We will conduct exploratory data analysis (EDA) and use multiple linear regression to model the relationships between our predictors and annual income.
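library(dplyr)   # select(), filter(), mutate(), %>%
library(ggplot2) # ggplot() and related plotting functions
library(car)     # vif()
acs_data <- read.csv("acs12.csv")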
str(acs_data) # EDA Function 1
## 'data.frame': 2000 obs. of 13 variables:
## $ income : int 60000 0 NA 0 0 1700 NA NA NA 45000 ...
## $ employment : chr "not in labor force" "not in labor force" NA "not in labor force" ...
## $ hrs_work : int 40 NA NA NA NA 40 NA NA NA 84 ...
## $ race : chr "white" "white" "white" "white" ...
## $ age : int 68 88 12 17 77 35 11 7 6 27 ...
## $ gender : chr "female" "male" "female" "male" ...
## $ citizen : chr "yes" "yes" "yes" "yes" ...
## $ time_to_work: int NA NA NA NA NA 15 NA NA NA 40 ...
## $ lang : chr "english" "english" "english" "other" ...
## $ married : chr "no" "no" "no" "no" ...
## $ edu : chr "college" "hs or lower" "hs or lower" "hs or lower" ...
## $ disability : chr "no" "yes" "no" "no" ...
## $ birth_qrtr : chr "jul thru sep" "jan thru mar" "oct thru dec" "oct thru dec" ...
summary(acs_data) # EDA Function 2
## income employment hrs_work race
## Min. : 0 Length:2000 Min. : 1.00 Length:2000
## 1st Qu.: 0 Class :character 1st Qu.:32.00 Class :character
## Median : 3000 Mode :character Median :40.00 Mode :character
## Mean : 23600 Mean :37.98
## 3rd Qu.: 33700 3rd Qu.:40.00
## Max. :450000 Max. :99.00
## NA's :377 NA's :1041
## age gender citizen time_to_work
## Min. : 0.00 Length:2000 Length:2000 Min. : 1
## 1st Qu.:19.75 Class :character Class :character 1st Qu.: 10
## Median :40.00 Mode :character Mode :character Median : 20
## Mean :40.22 Mean : 26
## 3rd Qu.:59.00 3rd Qu.: 30
## Max. :94.00 Max. :163
## NA's :1217
## lang married edu disability
## Length:2000 Length:2000 Length:2000 Length:2000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## birth_qrtr
## Length:2000
## Class :character
## Mode :character
##
##
##
##
cleaned_data <- acs_data %>%
  select(income, hrs_work, age, edu) %>% # dplyr 1: keep the four analysis variables
  filter(!is.na(income), !is.na(hrs_work), !is.na(age),
         edu %in% c("college", "hs or lower")) %>% # dplyr 2: drop missing values and other edu levels
  mutate(edu = factor(edu, levels = c("hs or lower", "college"))) # dplyr 3: make "hs or lower" the reference level
dim(cleaned_data)
## [1] 855 4
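As a quick check on how much data the filter discards, the counts already printed above can be combined directly. This is a minimal sketch using only numbers from the `summary()` and `dim()` output; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the output above (variable names are illustrative)
n_total    <- 2000  # rows in acs_data
n_na_hours <- 1041  # NA's in hrs_work, from summary()
n_retained <- 855   # rows in cleaned_data, from dim()

# Rows that have a recorded hrs_work value
n_with_hours <- n_total - n_na_hours

# Rows with hours that were still dropped by another condition
# (missing income, or an edu value outside the two retained levels)
n_dropped_other <- n_with_hours - n_retained

# Share of the original sample that enters the regression
share_retained <- n_retained / n_total

c(with_hours = n_with_hours,
  dropped_other = n_dropped_other,
  share_retained = share_retained)
```

Under these counts, 959 rows have hours recorded, 104 of those fail another filter condition, and about 43% of the original sample is analyzed. The results therefore describe respondents with recorded work hours rather than the full sample.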
ggplot(cleaned_data, aes(x = hrs_work, y = income, color = edu)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Income vs. Hours Worked", x = "Hours/Week", y = "Income ($)") +
  theme_minimal()
ggplot(cleaned_data, aes(x = age, y = income, color = edu)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Income vs. Age", x = "Age", y = "Income ($)") +
  theme_minimal()
We fit the following OLS model:
\[Income = \beta_0 + \beta_1(hrs\_work) + \beta_2(age) + \beta_3(edu_{college}) + \epsilon\]
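Because edu is a two-level factor with “hs or lower” as the reference level, \(edu_{college}\) is an indicator equal to 1 for college and 0 otherwise. Written out for each group, the model describes two parallel surfaces that differ only in their intercepts:
\[\text{hs or lower: } Income = \beta_0 + \beta_1(hrs\_work) + \beta_2(age) + \epsilon\]
\[\text{college: } Income = (\beta_0 + \beta_3) + \beta_1(hrs\_work) + \beta_2(age) + \epsilon\]
So \(\beta_3\) is the expected income difference between the two education groups at the same hours and age. As specified, the model does not allow the hours or age slopes to differ by education.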
income_model <- lm(income ~ hrs_work + age + edu, data = cleaned_data)
summary(income_model)
##
## Call:
## lm(formula = income ~ hrs_work + age + edu, data = cleaned_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98919 -16340 -5041 8774 315573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26455.14 4647.92 -5.692 1.73e-08 ***
## hrs_work 1089.93 89.43 12.187 < 2e-16 ***
## age 315.12 81.02 3.890 0.000108 ***
## educollege 18488.73 2625.43 7.042 3.90e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35290 on 851 degrees of freedom
## Multiple R-squared: 0.224, Adjusted R-squared: 0.2213
## F-statistic: 81.89 on 3 and 851 DF, p-value: < 2.2e-16
confint(income_model)
## 2.5 % 97.5 %
## (Intercept) -35577.8743 -17332.4045
## hrs_work 914.3884 1265.4618
## age 156.1043 474.1391
## educollege 13335.6531 23641.8009
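The intervals above follow the usual form: estimate ± critical value × standard error. As a hedged illustration, the sketch below rebuilds the hrs_work interval from the rounded values in the coefficient table, so it should agree with `confint()` only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(income_model) output (rounded as printed)
estimate <- 1089.93  # hrs_work coefficient
std_err  <- 89.43    # its standard error
df_resid <- 851      # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Interval endpoints: estimate -/+ t_crit * standard error
ci <- c(lower = estimate - t_crit * std_err,
        upper = estimate + t_crit * std_err)
round(ci, 1)
```

The result matches the reported interval of roughly 914 to 1,265 dollars per additional weekly hour. Because the interval excludes zero, it agrees with the small p-value for hrs_work.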
The fitted ordinary least squares (OLS) regression model is: \[\widehat{Income} = -26455.14 + 1089.93(hrs\_work) + 315.12(age) + 18488.73(edu_{college})\]
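To make the fitted equation concrete, the sketch below evaluates it for one hypothetical person: a 40-year-old college graduate working 40 hours per week. The profile is an illustrative assumption, not a case from the data, and the coefficients are the rounded values shown above.

```r
# Rounded coefficients from the fitted equation above
b <- c(intercept = -26455.14, hrs_work = 1089.93,
       age = 315.12, educollege = 18488.73)

# Hypothetical profile (illustrative, not an observation from the dataset)
hrs_work <- 40  # hours per week
age      <- 40  # years
college  <- 1   # 1 = college, 0 = hs or lower

# Plug the profile into the fitted equation
predicted_income <- b[["intercept"]] +
  b[["hrs_work"]] * hrs_work +
  b[["age"]] * age +
  b[["educollege"]] * college
predicted_income
```

This evaluates to about $48,236. Because the intercept is negative, the same equation can return negative incomes for young respondents working few hours, so predictions near the edges of the data should be read cautiously.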
We evaluate the OLS model assumptions on our filtered dataset (\(N = 855\)) using the standard diagnostic plots and variance inflation factors (VIFs).
par(mfrow = c(2, 2))
plot(income_model)
par(mfrow = c(1, 1))
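Alongside the diagnostic plots, the residual summary printed earlier already hints at how well the normality assumption might hold. The sketch below expresses the extreme residuals in units of the residual standard error, using only numbers from the `summary(income_model)` output. It is a rough screening calculation, not a formal test, and the variable names are illustrative.

```r
# Values copied from summary(income_model)
resid_min    <- -98919
resid_median <- -5041
resid_max    <- 315573
resid_se     <- 35290  # residual standard error

# Extreme residuals scaled by the residual standard error
scaled <- c(min = resid_min / resid_se, max = resid_max / resid_se)
round(scaled, 2)

# OLS residuals average to zero when the model has an intercept,
# so a negative median points to a long right tail
resid_median < 0
```

The largest residual is roughly 9 standard errors above zero while the smallest is under 3 below, and the median is negative. That asymmetry is what one would expect to appear as a heavy upper tail in the Normal Q-Q panel.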
vif(income_model)
## hrs_work age edu
## 1.021944 1.016142 1.007790
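Each VIF above equals \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. Inverting that definition gives a more intuitive reading of the printed values. The sketch below is illustrative and uses only the numbers from the `vif()` output.

```r
# VIF values copied from the vif(income_model) output
vif_values <- c(hrs_work = 1.021944, age = 1.016142, edu = 1.007790)

# Invert VIF = 1 / (1 - R2_j) to recover R2_j, the share of each
# predictor's variance explained by the other predictors
r2_j <- 1 - 1 / vif_values

# sqrt(VIF) is the factor by which each standard error is inflated
# relative to the case of uncorrelated predictors
se_inflation <- sqrt(vif_values)

round(rbind(r2_j, se_inflation), 4)
```

Each predictor shares only about 1 to 2 percent of its variance with the others, and the corresponding standard errors are inflated by about 1 percent.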
The variance inflation factors (VIFs) are low for all three predictors (hrs_work = 1.0219, age = 1.0161, edu = 1.0078). Because they are close to 1.0, multicollinearity is not meaningfully inflating our standard errors.
The model indicates that weekly work hours, age, and college education are each statistically significant predictors of annual income (all coefficient p-values are below 0.001), and the overall fit is significant (\(F = 81.89\), \(p < 2.2 \times 10^{-16}\)). The Adjusted \(R^2\) is 0.2213, meaning the model explains 22.13% of the variance in annual income across the 855 individuals after adjusting for the number of predictors. The large share of unexplained variance is consistent with the limitations of a basic linear model for right-skewed income data.
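As a check on the reported fit statistics, Adjusted \(R^2\) can be recomputed from \(R^2 = 0.224\), \(n = 855\) observations, and \(p = 3\) predictors using the standard formula (the symbols \(n\) and \(p\) are introduced here for illustration):
\[R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} = 1 - (1 - 0.224)\frac{854}{851} \approx 0.2213\]
The adjustment is small because the sample is large relative to the number of predictors.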