library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Mulut/Desktop/Classes/Data101/projects/project 3")

acs12 <- read_csv("acs12.csv")
## Rows: 2000 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): employment, race, gender, citizen, lang, married, edu, disability, ...
## dbl (4): income, hrs_work, age, time_to_work
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

How do age, hours worked per week, education level, and marital status affect an individual’s annual income in the United States?

Introduction

This project explores how different demographic and work-related factors contribute to annual income in the United States. Income is one of the most important indicators of economic well-being, and understanding what influences income can help reveal broader patterns in employment, opportunity, and socioeconomic status. In this project, I focus on whether age, hours worked per week, education level, and marital status can explain variation in yearly income.

The dataset used is the American Community Survey, 2012 (acs12), which includes 2,000 observations collected by the U.S. Census Bureau. Each row represents one person living in the U.S., children included (the first rows show ages as low as 6), with information on work hours, income, education, citizenship, language, disability status, and more. Because the dataset contains both quantitative and categorical variables, it works well for studying how personal characteristics relate to income. In this project, I aim to build a multiple linear regression model to determine which factors have the strongest association with income and how these factors compare when considered together.

Data Analysis

To begin, I loaded the dataset and examined its structure using head() and str(). This helped me identify variable types and understand how the data was formatted. I then checked for missing values and kept only observations with positive income. This filter also drops rows where income is missing, since neither zero nor missing incomes are useful for the regression model. I converted education level and marital status into categorical factors to ensure that they were handled correctly during modeling.
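
The missing-value check and the income filter described above can be sketched on the six rows that head(acs12) prints in this report. This is only a small base R illustration: the data frame name first_rows is made up for the example, and the real analysis runs on all 2,000 rows.

```r
# First six rows of acs12 (income, hrs_work, age), copied from the head() output.
first_rows <- data.frame(
  income   = c(60000, 0, NA, 0, 0, 1700),
  hrs_work = c(40, NA, NA, NA, NA, 40),
  age      = c(68, 88, 12, 17, 77, 35)
)

# Count missing values in each column.
na_counts <- colSums(is.na(first_rows))
print(na_counts)

# Keep rows with positive income. Comparing NA with 0 gives NA, and subset()
# drops such rows, just as dplyr::filter() does.
kept <- subset(first_rows, income > 0)
print(nrow(kept))
```

Of these six rows, only the two with positive, non-missing income remain after filtering.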

Next, I generated summary statistics for key variables like age, income, and hours worked. I also created an exploratory scatterplot to visualize the relationship between hours worked per week and income. Then, using mutate() and filter(), I created a cleaned version of the dataset for the model.

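The variables used in the model are income, age, hrs_work, edu, and married. The first rows and structure of the full dataset are shown below: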

head(acs12)
## # A tibble: 6 × 13
##   income employment       hrs_work race    age gender citizen time_to_work lang 
##    <dbl> <chr>               <dbl> <chr> <dbl> <chr>  <chr>          <dbl> <chr>
## 1  60000 not in labor fo…       40 white    68 female yes               NA engl…
## 2      0 not in labor fo…       NA white    88 male   yes               NA engl…
## 3     NA <NA>                   NA white    12 female yes               NA engl…
## 4      0 not in labor fo…       NA white    17 male   yes               NA other
## 5      0 not in labor fo…       NA white    77 female yes               NA other
## 6   1700 employed               40 other    35 female yes               15 other
## # ℹ 4 more variables: married <chr>, edu <chr>, disability <chr>,
## #   birth_qrtr <chr>
str(acs12)
## spc_tbl_ [2,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ income      : num [1:2000] 60000 0 NA 0 0 1700 NA NA NA 45000 ...
##  $ employment  : chr [1:2000] "not in labor force" "not in labor force" NA "not in labor force" ...
##  $ hrs_work    : num [1:2000] 40 NA NA NA NA 40 NA NA NA 84 ...
##  $ race        : chr [1:2000] "white" "white" "white" "white" ...
##  $ age         : num [1:2000] 68 88 12 17 77 35 11 7 6 27 ...
##  $ gender      : chr [1:2000] "female" "male" "female" "male" ...
##  $ citizen     : chr [1:2000] "yes" "yes" "yes" "yes" ...
##  $ time_to_work: num [1:2000] NA NA NA NA NA 15 NA NA NA 40 ...
##  $ lang        : chr [1:2000] "english" "english" "english" "other" ...
##  $ married     : chr [1:2000] "no" "no" "no" "no" ...
##  $ edu         : chr [1:2000] "college" "hs or lower" "hs or lower" "hs or lower" ...
##  $ disability  : chr [1:2000] "no" "yes" "no" "no" ...
##  $ birth_qrtr  : chr [1:2000] "jul thru sep" "jan thru mar" "oct thru dec" "oct thru dec" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   income = col_double(),
##   ..   employment = col_character(),
##   ..   hrs_work = col_double(),
##   ..   race = col_character(),
##   ..   age = col_double(),
##   ..   gender = col_character(),
##   ..   citizen = col_character(),
##   ..   time_to_work = col_double(),
##   ..   lang = col_character(),
##   ..   married = col_character(),
##   ..   edu = col_character(),
##   ..   disability = col_character(),
##   ..   birth_qrtr = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Cleaning the dataset

acs_clean <- acs12 |>
  mutate(
    edu = factor(edu),
    married = factor(married)
  ) |>
  filter(income > 0)  # also drops rows where income is NA
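
One detail of factor() matters later when reading the regression output: levels are sorted alphabetically, so "college" becomes the reference category. The sketch below shows this on a small made-up vector (edu and edu_hs_ref are illustrative names, not objects from the analysis) and how relevel() could pick a different reference if that were preferred.

```r
# The three education labels that appear in acs12, in arbitrary order.
edu <- factor(c("college", "hs or lower", "grad", "hs or lower"))

# factor() sorts levels alphabetically, so "college" comes first and becomes
# the reference level when the factor is used in lm().
print(levels(edu))

# relevel() moves a chosen level to the front without changing the data.
edu_hs_ref <- relevel(edu, ref = "hs or lower")
print(levels(edu_hs_ref))
```

With "hs or lower" as the reference, the education coefficients would instead measure the difference from the lowest education group.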

Key summaries

summary(acs_clean$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      50   10575   30000   42844   52000  450000
summary(acs_clean$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   30.00   42.50   42.31   53.75   94.00
summary(acs_clean$hrs_work)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   35.00   40.00   38.19   40.00   99.00

Grouping and Summarizing by Education Level

edu_summary <- acs_clean |>
  group_by(edu) |>
  summarise(
    n = n(),
    mean_income = mean(income),
    median_income = median(income),
    sd_income = sd(income),
    min_income = min(income),
    max_income = max(income)
  )

edu_summary
## # A tibble: 3 × 7
##   edu             n mean_income median_income sd_income min_income max_income
##   <fct>       <int>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 college       244      51983.         40000    48534.         50     360000
## 2 grad           96     102448.         62000   107602.        130     450000
## 3 hs or lower   554      28491.         20000    33814.        100     360000

Regression Model

Because my research question involves a quantitative outcome variable (income) and multiple predictors, a multiple linear regression model is appropriate. This model allows me to see how each variable relates to income while controlling for the others.
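
Written out, the model might be expressed as below. The symbols (the betas, the error term, and the indicator notation) are chosen here for illustration. The indicator coding assumes R's default treatment contrasts, with "college" and "not married" as the reference categories.

```latex
\text{income}_i = \beta_0
  + \beta_1\,\text{age}_i
  + \beta_2\,\text{hrs\_work}_i
  + \beta_3\,\mathbf{1}[\text{edu}_i = \text{grad}]
  + \beta_4\,\mathbf{1}[\text{edu}_i = \text{hs or lower}]
  + \beta_5\,\mathbf{1}[\text{married}_i = \text{yes}]
  + \varepsilon_i
```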

model <- lm(income ~ age + hrs_work + edu + married, data = acs_clean)

summary(model)
## 
## Call:
## lm(formula = income ~ age + hrs_work + edu + married, data = acs_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -123824  -19756   -5740    9400  322687 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -17113.5     6957.3  -2.460 0.014091 *  
## age               412.9      113.2   3.647 0.000281 ***
## hrs_work         1167.1      122.4   9.534  < 2e-16 ***
## edugrad         44453.1     5738.4   7.747 2.57e-14 ***
## eduhs or lower -19063.9     3654.7  -5.216 2.27e-07 ***
## marriedyes       8943.6     3452.0   2.591 0.009731 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47300 on 888 degrees of freedom
## Multiple R-squared:  0.2855, Adjusted R-squared:  0.2815 
## F-statistic: 70.98 on 5 and 888 DF,  p-value: < 2.2e-16

Interpreting results

The regression results indicate that all four predictors are significantly associated with annual income. The coefficient for hours worked per week is 1,167.1, meaning that each additional hour worked per week is associated with about $1,167 more in yearly income, holding all other variables constant. The coefficient for age is 412.9, so income is about $413 higher for each additional year of age, likely reflecting greater experience or seniority in the workforce. Education also shows strong associations. Because R orders factor levels alphabetically, the baseline education group is "college": individuals with a graduate degree earn about $44,453 more than the college group, while those with a high school education or lower earn about $19,064 less, consistent with large financial returns to higher education. The coefficient for marital status indicates that being married is associated with earning about $8,944 more per year than not being married. The adjusted R-squared value of 0.2815 means that the model explains about 28% of the variation in income, which is reasonable for socioeconomic data, where many unmeasured factors influence earnings. The overall F-test p-value (< 2.2e-16) indicates that the regression is statistically significant as a whole, meaning that the predictors together explain significantly more income variation than a model with no predictors. Together, these results show that age, hours worked, education, and marital status are all meaningfully associated with income differences in the United States.
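
To make the coefficients concrete, the sketch below plugs two hypothetical profiles into the fitted equation, using the estimates printed by summary(model). The helper fitted_income() and the profile values are made up for illustration; in practice predict(model, newdata = ...) would do the same job.

```r
# Coefficients copied from the summary(model) output.
coefs <- c(
  intercept   = -17113.5,
  age         = 412.9,
  hrs_work    = 1167.1,
  edu_grad    = 44453.1,   # relative to the reference level, "college"
  edu_hs      = -19063.9,  # relative to the reference level, "college"
  married_yes = 8943.6
)

# Hypothetical helper: fitted income for one profile.
fitted_income <- function(age, hrs_work, edu = "college", married = "no") {
  unname(
    coefs["intercept"] +
      coefs["age"] * age +
      coefs["hrs_work"] * hrs_work +
      coefs["edu_grad"] * (edu == "grad") +
      coefs["edu_hs"] * (edu == "hs or lower") +
      coefs["married_yes"] * (married == "yes")
  )
}

# Two 40-year-olds working 40 hours per week.
college_single <- fitted_income(40, 40)
grad_married   <- fitted_income(40, 40, edu = "grad", married = "yes")
print(c(college_single = college_single, grad_married = grad_married))
```

The gap between the two profiles is the sum of the edugrad and marriedyes coefficients, since age and hours are held equal.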

Model Diagnostics (linearity, residuals…)

plot(acs_clean$age, acs_clean$income,
     xlab = "Age",
     ylab = "Income",
     main = "Linearity Check: Age vs Income")

# Overlay the simple regression of income on age. abline(model) would use only
# the first two of the six coefficients from the multiple regression.
abline(lm(income ~ age, data = acs_clean), col = 1, lwd = 2)

plot(resid(model), 
     type = "b",
     main = "Residual Plot for Multiple Linear Regression",
     ylab = "Residuals")

abline(h = 0, lty = 2)

par(mfrow = c(2,2))
plot(model)

par(mfrow = c(1,1))

Results

The Residuals vs Fitted plot provides information about the linearity assumption. Ideally, residuals should form a random horizontal band around zero with no visible pattern. In this model, the plot shows some curvature and increasing spread in the residuals at higher fitted values, suggesting that the relationship between the predictors and income is not perfectly linear, especially for higher-income individuals. This means the model captures general linear trends but does not fully represent the underlying relationship for all ranges of income.

The Q–Q plot evaluates the normality of residuals. The upper tail of the plot bends sharply upward, indicating that the residuals deviate from normality due to high-income outliers. This violation means the errors are not perfectly normally distributed, which may affect the accuracy of hypothesis tests and confidence intervals, though regression can still be used if the sample size is large.

The Scale-Location plot shows a clear upward trend, meaning the spread of residuals increases as fitted income increases. This indicates heteroscedasticity, suggesting that the model predicts income less consistently for higher-earning individuals. As a result, the usual standard errors of the coefficients may be unreliable, and caution should be used when interpreting significance levels.
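
One common response to heteroscedasticity is to report heteroscedasticity-consistent standard errors alongside the usual ones. This was not done in the project; the sketch below is a minimal base R version of the HC0 ("White") estimator, with robust_se() as a hypothetical helper name. It is demonstrated on R's built-in cars data rather than on acs12.

```r
# Heteroscedasticity-consistent (HC0) standard errors in base R.
robust_se <- function(fit) {
  X <- model.matrix(fit)        # design matrix, including the intercept
  e <- resid(fit)               # ordinary residuals
  bread <- solve(crossprod(X))  # (X'X)^-1
  meat  <- crossprod(X * e)     # X' diag(e^2) X
  sqrt(diag(bread %*% meat %*% bread))
}

# Demonstration on R's built-in cars data, not on acs12.
demo_fit <- lm(dist ~ speed, data = cars)
classic <- sqrt(diag(vcov(demo_fit)))
robust  <- robust_se(demo_fit)
print(rbind(classic, robust))

# For the model in this report, the call would be robust_se(model).
```

If the robust and classic standard errors differ noticeably for the income model, the significance levels in the regression table should be read with that in mind.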

Check Multicollinearity

cor(acs_clean[, c("age", "hrs_work")], use = "complete.obs")
##                age  hrs_work
## age      1.0000000 0.1183107
## hrs_work 0.1183107 1.0000000

Interpretation of Correlation Matrix

The correlation matrix shows that the relationship between age and hours worked per week is very weak, with a correlation of 0.118. This small positive value indicates that older individuals tend to work slightly more hours on average, but the relationship is too weak to cause multicollinearity between these two predictors. Note that this check covers only the two numeric predictors; it does not assess how education or marital status relate to the others.
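
A check that also covers the categorical predictors is the variance inflation factor (VIF). The sketch below computes it column by column from the design matrix in base R. The helper vif_by_column() is a hypothetical name, it assumes at least two predictors, and for a factor with several levels it gives one value per dummy column rather than a single generalized VIF. It is demonstrated on R's built-in mtcars data, not on acs12.

```r
# Variance inflation factor for every non-intercept column of the design matrix.
vif_by_column <- function(fit) {
  X <- model.matrix(fit)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  sapply(colnames(X), function(col) {
    others <- X[, colnames(X) != col, drop = FALSE]
    r2 <- summary(lm(X[, col] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Demonstration on R's built-in mtcars data, not on acs12.
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_vif <- vif_by_column(demo_fit)
print(demo_vif)

# With only two predictors, VIF equals 1 / (1 - r^2). For age and hrs_work,
# r = 0.1183107 from the correlation matrix in this report.
vif_age_hrs <- 1 / (1 - 0.1183107^2)
print(vif_age_hrs)
```

A VIF this close to 1 agrees with the conclusion that age and hours worked are nearly uncorrelated. For the full model, the call would be vif_by_column(model).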

Conclusion and future studies

The results of this project show that age, hours worked per week, education level, and marital status are all meaningfully associated with differences in annual income in the United States. Hours worked per week is the most statistically significant predictor (t = 9.53), with each additional weekly hour associated with about $1,167 more in yearly earnings. Education also plays a major role: individuals with a graduate degree earn far more than college-educated individuals, who in turn earn more than those with a high school education or less, consistent with the strong economic value of higher education. Age is positively associated with income as well, likely reflecting accumulated experience in the workforce, and married individuals earn more on average than those who are not married. Altogether, the adjusted R-squared value of 0.2815 indicates that these predictors explain about 28% of the variation in income, a reasonable amount for socioeconomic data where many personal and structural factors influence earnings.

However, the diagnostic checks reveal important limitations. The Residuals vs Fitted plot shows curvature and increasing spread at higher fitted values, the Scale-Location plot indicates heteroscedasticity, and the Q–Q plot shows deviations from normality, especially in the upper tail, caused by high-income outliers. While these violations do not invalidate the model, they indicate that income is difficult to model using a simple linear structure and that interpretations should be made with caution.

Future work could improve the model by incorporating additional variables such as occupation, industry, geographic region, and years of work experience, all factors that strongly influence income but are not included here. Additionally, including interaction terms, such as education by hours worked, could reveal combined effects that a purely additive linear model cannot capture. Despite its limitations, this project provides a useful first look at how demographic and work-related factors relate to income and highlights promising directions for deeper analysis.
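
As a sketch of what such an interaction might look like in R's formula syntax, the example below extends the model formula with an hrs_work-by-edu interaction and lists the resulting terms. The name interaction_formula is hypothetical, and this model was not fitted as part of the project.

```r
# Hypothetical extension of the model formula with an hrs_work-by-edu interaction.
interaction_formula <- income ~ age + hrs_work * edu + married

# terms() expands the formula without needing any data: a * b stands for
# a + b + a:b, and the interaction term is listed after the main effects.
term_labels <- attr(terms(interaction_formula), "term.labels")
print(term_labels)

# Fitting it would look like: lm(interaction_formula, data = acs_clean)
```

The hrs_work:edu term would let the income gain per additional hour differ across education groups.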

References

Dataset: American Community Survey, 2012 (acs12), distributed by OpenIntro (openintro.org). Original source: U.S. Census Bureau, https://www.census.gov/programs-surveys/acs