A. Introduction

To what extent do hours worked per week, age, and education level predict a person’s annual income?

This paper evaluates how well weekly work hours, age, and education predict individual annual income, using the 2012 American Community Survey (ACS) dataset from OpenIntro.org (2,000 observations). Our analysis focuses on four variables: income (annual income in dollars), hrs_work (hours worked per week), age (in years), and edu (education level).

Dataset source: OpenIntro ACS12.

B. Data Analysis

We will conduct exploratory data analysis (EDA) and use multiple linear regression to model the relationships between our predictors and annual income.

Chunk 1: Initial Dataset Inspection

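library(dplyr)    # select(), filter(), mutate(), %>%
library(ggplot2)  # ggplot() and geoms
library(car)      # vif()

acs_data <- read.csv("acs12.csv")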
str(acs_data)      # EDA Function 1
## 'data.frame':    2000 obs. of  13 variables:
##  $ income      : int  60000 0 NA 0 0 1700 NA NA NA 45000 ...
##  $ employment  : chr  "not in labor force" "not in labor force" NA "not in labor force" ...
##  $ hrs_work    : int  40 NA NA NA NA 40 NA NA NA 84 ...
##  $ race        : chr  "white" "white" "white" "white" ...
##  $ age         : int  68 88 12 17 77 35 11 7 6 27 ...
##  $ gender      : chr  "female" "male" "female" "male" ...
##  $ citizen     : chr  "yes" "yes" "yes" "yes" ...
##  $ time_to_work: int  NA NA NA NA NA 15 NA NA NA 40 ...
##  $ lang        : chr  "english" "english" "english" "other" ...
##  $ married     : chr  "no" "no" "no" "no" ...
##  $ edu         : chr  "college" "hs or lower" "hs or lower" "hs or lower" ...
##  $ disability  : chr  "no" "yes" "no" "no" ...
##  $ birth_qrtr  : chr  "jul thru sep" "jan thru mar" "oct thru dec" "oct thru dec" ...
summary(acs_data)  # EDA Function 2
##      income        employment           hrs_work         race          
##  Min.   :     0   Length:2000        Min.   : 1.00   Length:2000       
##  1st Qu.:     0   Class :character   1st Qu.:32.00   Class :character  
##  Median :  3000   Mode  :character   Median :40.00   Mode  :character  
##  Mean   : 23600                      Mean   :37.98                     
##  3rd Qu.: 33700                      3rd Qu.:40.00                     
##  Max.   :450000                      Max.   :99.00                     
##  NA's   :377                         NA's   :1041                      
##       age           gender            citizen           time_to_work 
##  Min.   : 0.00   Length:2000        Length:2000        Min.   :  1   
##  1st Qu.:19.75   Class :character   Class :character   1st Qu.: 10   
##  Median :40.00   Mode  :character   Mode  :character   Median : 20   
##  Mean   :40.22                                         Mean   : 26   
##  3rd Qu.:59.00                                         3rd Qu.: 30   
##  Max.   :94.00                                         Max.   :163   
##                                                        NA's   :1217  
##      lang             married              edu             disability       
##  Length:2000        Length:2000        Length:2000        Length:2000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   birth_qrtr       
##  Length:2000       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Chunk 2: Data Cleaning and Filtering

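cleaned_data <- acs_data %>%
  # dplyr 1: keep only the response and the three predictors
  select(income, hrs_work, age, edu) %>%
  # dplyr 2: drop rows with missing values; keep only the two edu levels compared here
  filter(!is.na(income), !is.na(hrs_work), !is.na(age),
         edu %in% c("college", "hs or lower")) %>%
  # dplyr 3: make "hs or lower" the reference level of the factor
  mutate(edu = factor(edu, levels = c("hs or lower", "college")))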

dim(cleaned_data)
## [1] 855   4

Chunk 3: Exploratory Visualizations

ggplot(cleaned_data, aes(x = hrs_work, y = income, color = edu)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Income vs. Hours Worked", x = "Hours/Week", y = "Income ($)") + theme_minimal()

ggplot(cleaned_data, aes(x = age, y = income, color = edu)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Income vs. Age", x = "Age", y = "Income ($)") + theme_minimal()

C. Regression Analysis

We fit the following OLS model:

\[Income = \beta_0 + \beta_1(hrs\_work) + \beta_2(age) + \beta_3(edu_{college}) + \epsilon\]
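The indicator \(edu_{college}\) in the equation above comes from R's default treatment coding of a factor. A minimal sketch on a small made-up data frame (the `toy` rows are illustrative assumptions, not ACS observations) shows the design matrix that `lm()` would build from the same formula:

```r
# Toy rows for illustration only; these are not ACS observations.
toy <- data.frame(
  hrs_work = c(40, 20, 50, 35),
  age      = c(30, 45, 52, 28),
  edu      = factor(c("college", "hs or lower", "college", "hs or lower"),
                    levels = c("hs or lower", "college"))
)

# model.matrix() expands the right-hand side of a formula the way lm() does.
# "hs or lower" is the first factor level, so it is the reference category
# and only a 0/1 column for "college" is created.
X <- model.matrix(~ hrs_work + age + edu, data = toy)
print(X)
```

The column named `educollege` equals 1 for college rows and 0 otherwise, which is why the coefficient table reports a single `educollege` term.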

Chunk 4: Model Fitting

income_model <- lm(income ~ hrs_work + age + edu, data = cleaned_data)
summary(income_model)
## 
## Call:
## lm(formula = income ~ hrs_work + age + edu, data = cleaned_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -98919 -16340  -5041   8774 315573 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -26455.14    4647.92  -5.692 1.73e-08 ***
## hrs_work      1089.93      89.43  12.187  < 2e-16 ***
## age            315.12      81.02   3.890 0.000108 ***
## educollege   18488.73    2625.43   7.042 3.90e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35290 on 851 degrees of freedom
## Multiple R-squared:  0.224,  Adjusted R-squared:  0.2213 
## F-statistic: 81.89 on 3 and 851 DF,  p-value: < 2.2e-16
confint(income_model)
##                   2.5 %      97.5 %
## (Intercept) -35577.8743 -17332.4045
## hrs_work       914.3884   1265.4618
## age            156.1043    474.1391
## educollege   13335.6531  23641.8009
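
The intervals printed by `confint()` can be approximately reproduced from the coefficient table. The sketch below assumes only the rounded estimates, standard errors, and residual degrees of freedom shown in the `summary()` output, so small rounding differences from the printed intervals are expected:

```r
# Values copied from the summary(income_model) output above.
est <- c(hrs_work = 1089.93, age = 315.12, educollege = 18488.73)
se  <- c(hrs_work = 89.43,   age = 81.02,  educollege = 2625.43)
df_resid <- 851  # residual degrees of freedom reported by summary()

# Two-sided 95% critical value of the t distribution.
t_crit <- qt(0.975, df = df_resid)

# Interval = estimate -/+ critical value * standard error.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 1))
```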

Coefficient Interpretations

The fitted ordinary least squares (OLS) regression model is: \[\widehat{Income} = -26455.14 + 1089.93(hrs\_work) + 315.12(age) + 18488.73(edu_{college})\]

  • Intercept (\(\beta_0 = -26,455.14\), 95% CI: -35,577.87 to -17,332.40): The predicted income for a hypothetical individual who is 0 years old, works 0 hours per week, and has a high school education or lower. No such person appears in the data, so the intercept anchors the regression plane and has no practical interpretation.
  • Hours Worked (\(\beta_1 = 1,089.93\), 95% CI: 914.39 to 1,265.46): Controlling for age and education, each additional hour worked per week is associated with an average annual income that is $1,089.93 higher (\(p < 2 \times 10^{-16}\)).
  • Age (\(\beta_2 = 315.12\), 95% CI: 156.10 to 474.14): Controlling for hours worked and education, each additional year of age is associated with an average annual income that is $315.12 higher (\(p = 0.000108\)).
  • Education Level (\(\beta_3 = 18,488.73\), 95% CI: 13,335.65 to 23,641.80): Controlling for age and hours worked, holding a college degree is associated with an average annual income that is $18,488.73 higher than for a high school education or lower (\(p = 3.90 \times 10^{-12}\)).
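
To make the coefficients concrete, the fitted equation can be evaluated for a specific profile. The sketch below uses the reported estimates; `predict_income` is a hypothetical helper written for illustration, and the 40-year-old working 40 hours per week is an arbitrary example:

```r
# Coefficients copied from the fitted model reported above.
b <- c(intercept = -26455.14, hrs_work = 1089.93,
       age = 315.12, educollege = 18488.73)

# Hypothetical helper: fitted income for given hours, age, and college status.
predict_income <- function(hrs_work, age, college) {
  unname(b["intercept"] + b["hrs_work"] * hrs_work + b["age"] * age +
           b["educollege"] * as.numeric(college))
}

college_pred <- predict_income(hrs_work = 40, age = 40, college = TRUE)
hs_pred      <- predict_income(hrs_work = 40, age = 40, college = FALSE)

print(c(college = college_pred, hs_or_lower = hs_pred))
# The gap between the two profiles equals the educollege coefficient.
print(college_pred - hs_pred)
```

For this profile the fitted incomes are about $48,236 with a college degree and $29,747 without, and the difference is the $18,488.73 education coefficient.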

D. Model Assumptions and Diagnostics

We evaluate OLS model assumptions on our filtered dataset (\(N = 855\)).

Chunk 5: Diagnostic Plots

par(mfrow = c(2, 2))
plot(income_model)

par(mfrow = c(1, 1))

Chunk 6: Multicollinearity Check

vif(income_model)
## hrs_work      age      edu 
## 1.021944 1.016142 1.007790
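
For intuition about what `vif()` measures, each predictor's variance inflation factor is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the others. The sketch below computes it with base R on simulated data; the data frame `sim` and the helper `vif_by_hand` are assumptions for illustration and do not use the ACS sample:

```r
set.seed(1)  # simulated data for illustration only, not the ACS sample
n <- 200
sim <- data.frame(
  hrs_work = rnorm(n, mean = 38, sd = 12),
  age      = rnorm(n, mean = 42, sd = 14),
  college  = rbinom(n, size = 1, prob = 0.3)  # 0/1 stand-in for edu
)

# VIF for predictor j: regress it on the remaining predictors,
# then compute 1 / (1 - R^2) from that auxiliary regression.
vif_by_hand <- function(data) {
  sapply(names(data), function(j) {
    others <- setdiff(names(data), j)
    r2 <- summary(lm(reformulate(others, response = j), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- vif_by_hand(sim)
print(round(vif_values, 3))
```

Because the simulated predictors are generated independently, the resulting values sit close to 1, the same pattern seen in the model's VIF output.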

Diagnostics Discussion

  • Multicollinearity: Variance Inflation Factors are very low (hrs_work = 1.0219, age = 1.0161, edu = 1.0078). Because all three are close to 1.0, there is no indication that multicollinearity is inflating our standard errors.
  • Linearity: The Residuals vs Fitted plot shows a relatively flat line around 0, suggesting that a linear functional form is acceptable.
  • Normality: The Normal Q-Q plot shows severe upward deviation in the upper tail. The residuals are heavily right-skewed (they range from -98,919 to 315,573), so the model substantially underpredicts the highest earners.
  • Homoscedasticity: The Scale-Location plot displays a distinct fan shape, indicating heteroscedasticity: residual variance grows as fitted income increases.
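
The visual fan shape can also be checked numerically. The sketch below is one possible approach, not part of the original analysis: it computes a studentized Breusch-Pagan statistic by hand in base R on simulated data whose error spread grows with hours worked. All variable names and simulation settings are assumptions for illustration:

```r
set.seed(42)  # simulated data for illustration only, not the ACS sample
n <- 300
hrs <- runif(n, min = 5, max = 60)
# Error spread grows with hours, which produces a fan-shaped residual plot.
income <- 1000 * hrs + rnorm(n, mean = 0, sd = 400 * hrs)

fit <- lm(income ~ hrs)

# Studentized Breusch-Pagan statistic: regress the squared residuals on the
# predictors; n * R^2 is approximately chi-squared with one df per predictor.
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ hrs)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value is evidence against constant residual variance. Applying the same steps to `income_model` would use its three predictors in the auxiliary regression and 3 degrees of freedom.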

E. Conclusion and Future Directions

Key Findings & Fit

The model indicates that weekly work hours, age, and college education are each statistically significant predictors of annual income, and the overall regression is significant (\(F = 81.89\) on 3 and 851 degrees of freedom, \(p < 2.2 \times 10^{-16}\)). The multiple \(R^2\) is 0.224 (adjusted \(R^2 = 0.2213\)), so the model explains roughly 22% of the variance in annual income across the 855 individuals. The large share of unexplained variance, together with the skewed residuals, highlights the limitations of a basic linear model for right-skewed income distributions.
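
As a consistency check on the reported fit statistics, the adjusted \(R^2\) can be recovered from the multiple \(R^2\) of 0.224 using the standard adjustment formula, where \(n = 855\) is the sample size and \(k = 3\) is the number of predictors (the symbols \(n\) and \(k\) are introduced here for illustration):

\[R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1} = 1 - (1 - 0.224)\frac{854}{851} \approx 0.2213\]

With only three predictors and 855 observations the penalty is small, which is why the two values are nearly identical.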

F. References