Income Differences Across Education Levels in the United States

Author

Gamaliel Ngoaufon

Published

June 28, 2026

library(tidyverse)

Introduction

Do average annual incomes differ significantly across education levels in the United States?

Education is widely believed to influence earning potential. Individuals with higher levels of education often have access to better employment opportunities and higher-paying careers. This project investigates whether differences in average annual income exist among different education groups using data from the 2012 American Community Survey (ACS), a large-scale demographic survey administered by the U.S. Census Bureau and made publicly available through OpenIntro.org.

The dataset contains 2,000 observations on individuals living in the United States and 13 variables. Each row represents one person, and each column represents one characteristic of that individual. The variables include income (annual income in U.S. dollars), employment (employment status: employed, unemployed, or not in labor force), hrs_work (hours worked per week), race (white, black, asian, or other), age (age in years), gender (male or female), and edu (education level: hs or lower, college, or grad).

For this analysis, the two variables of primary interest are income and edu. After removing observations with missing or zero income values and missing education values, the cleaned dataset contains 894 observations across 6 variables. The dataset is sourced from OpenIntro.org, available at https://www.openintro.org/data/.

Data Analysis

The analysis begins by cleaning the data and removing observations with missing or zero income values and missing education values. Exploratory data analysis is then performed using summary statistics grouped by education level and four visualizations (a histogram, a bar chart, a side-by-side boxplot, and a scatterplot) to examine income distributions within and across education groups. Finally, a one-way ANOVA is used to determine whether average income differs significantly among the three education groups. A Tukey HSD post-hoc test, which would identify the specific pairs of groups that differ, is the natural follow-up but is not reported here.

# import and clean

acs <- read_csv("acs12.csv")
Rows: 2000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): employment, race, gender, citizen, lang, married, edu, disability, ...
dbl (4): income, hrs_work, age, time_to_work

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
acs_clean <- acs |>
  select(income, age, gender, edu, hrs_work, employment) |>
  filter(!is.na(income) & income > 0 & !is.na(edu))

dim(acs_clean)
[1] 894   6
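
To see how a filter like the one above shrinks a dataset, it can help to tally each exclusion reason before dropping rows. The sketch below is a minimal illustration on a made-up six-row data frame (`toy` and all of its values are hypothetical, not ACS records). It applies the same rule as the `filter()` call, written in base R.

```r
# Toy stand-in for the ACS extract (hypothetical values, for illustration only).
toy <- data.frame(
  income = c(52000, 0, NA, 18000, 0, 75000),
  edu    = c("college", "hs or lower", "grad", NA, "college", "grad")
)

# Tally each exclusion reason separately before filtering.
drop_reasons <- c(
  missing_income     = sum(is.na(toy$income)),
  nonpositive_income = sum(!is.na(toy$income) & toy$income <= 0),
  missing_edu        = sum(is.na(toy$edu))
)

# Same rule as the report's filter(): non-missing, positive income and known edu.
keep      <- !is.na(toy$income) & toy$income > 0 & !is.na(toy$edu)
toy_clean <- toy[keep, ]

print(drop_reasons)
nrow(toy_clean)
```

Running the same tallies on `acs` would show how much of the drop from 2,000 to 894 rows comes from each condition.
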
# summary by education

acs_clean |>
  group_by(edu) |>
  summarise(
    Count         = n(),
    Mean_Income   = round(mean(income, na.rm = TRUE), 0),
    Median_Income = median(income, na.rm = TRUE),
    SD            = round(sd(income, na.rm = TRUE), 0)
  )
# A tibble: 3 × 5
  edu         Count Mean_Income Median_Income     SD
  <chr>       <int>       <dbl>         <dbl>  <dbl>
1 college       244       51983         40000  48534
2 grad           96      102448         62000 107602
3 hs or lower   554       28491         20000  33814

The summary table reveals clear differences across the three education groups. Graduate-degree holders earn the highest mean income ($102,448), followed by college graduates ($51,983), and then those with a high school diploma or lower ($28,491). The standard deviations are large within each group, particularly for the graduate group ($107,602), suggesting high variability in individual earnings even among the most educated. These patterns motivate the formal hypothesis test below.
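
The size of these gaps is easier to judge as ratios. The following sketch uses only the rounded means and medians reported in the table above (the data frame `grp` and its column names are introduced here for illustration) to compare each group's mean with the high school or lower group and with its own median.

```r
# Group summaries copied from the table above (rounded as reported).
grp <- data.frame(
  edu    = c("college", "grad", "hs or lower"),
  mean   = c(51983, 102448, 28491),
  median = c(40000, 62000, 20000)
)

# Mean income relative to the "hs or lower" group.
grp$ratio_to_hs <- round(grp$mean / grp$mean[grp$edu == "hs or lower"], 2)

# Mean-to-median ratio: values above 1 are consistent with right skew.
grp$mean_over_median <- round(grp$mean / grp$median, 2)

print(grp)
```

On these figures, graduate-degree holders average roughly 3.6 times the mean income of the high school or lower group, and every group's mean exceeds its median.
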

Histogram

# histogram

options(scipen = 999)

ggplot(acs_clean, aes(x = income)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title    = "Distribution of Annual Income",
    subtitle = "ACS 2012 sample",
    x        = "Annual income (USD)",
    y        = "Count"
  ) +
  theme_dark()

The histogram shows that annual income in this sample is heavily right-skewed: most individuals earn relatively modest incomes, while a small number earn substantially more. The bulk of the distribution is concentrated below $100,000, with a long tail stretching toward higher income values. This skew is typical of real-world income data and means that the mean income is pulled upward by high earners, so it sits above the median.

Education Level Counts

#  edu counts

options(scipen = 999) # avoid scientific notation in axis labels

ggplot(acs_clean, aes(x = edu, fill = edu)) +
  geom_bar() +
  labs(
    title    = "Sample Size by Education Level",
    subtitle = "ACS 2012 sample",
    x        = "Education level",
    y        = "Count"
  ) +
  theme_grey() +
  theme(legend.position = "none")

This chart simply shows how many people fall into each education category in the cleaned dataset. The high school or lower group is by far the largest, followed by college graduates, with graduate degree holders making up the smallest group.
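
The same comparison can be expressed as shares of the cleaned sample. This short sketch reuses the group counts from the summary table (the vector name `counts` is introduced here for illustration).

```r
# Group counts copied from the summary table.
counts <- c("college" = 244, "grad" = 96, "hs or lower" = 554)

# Convert counts to percentage shares of the cleaned sample.
shares <- round(100 * prop.table(counts), 1)

sum(counts)    # total cleaned sample size
print(shares)
```

The high school or lower group accounts for about 62% of the cleaned sample, so the three groups are clearly unbalanced in size.
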

Side-by-Side Boxplot

# boxplot

options(scipen = 999)

ggplot(acs_clean, aes(x = edu, y = income, fill = edu)) +
  geom_boxplot() +
  scale_fill_manual(values = c("college"     = "#5B8DB8",
                                "grad"        = "#E07B8A",
                                "hs or lower" = "#6DBF8A")) +
  labs(
    title    = "Annual Income by Education Level",
    subtitle = "ACS 2012 sample",
    x        = "Education level",
    y        = "Annual income (USD)"
  ) +
  theme_dark() +
  theme(legend.position = "none",
        axis.text.x     = element_text(angle = 25, hjust = 1))

This is the most informative chart in the project. It places the income distribution of all three education groups next to each other, making the differences immediately visible. The graduate group’s box sits noticeably higher than the college group’s, which in turn sits higher than the high school or lower group’s.
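
Each box is built from a handful of quantiles. As a rough sketch of the default `geom_boxplot()` convention (hinges at the first and third quartiles, whiskers reaching to the most extreme points within 1.5 × IQR of the hinges), the code below works through a small made-up income vector. The values in `x` are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical incomes (USD); not taken from the ACS sample.
x <- c(10000, 20000, 30000, 40000, 50000, 200000)

# Quartiles using R's default quantile method.
q   <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Points beyond these fences are drawn individually as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]

# The upper whisker stops at the largest value inside the fence.
whisker_top <- max(x[x <= upper_fence])

c(q1 = q[1], median = q[2], q3 = q[3], whisker_top = whisker_top)
outliers
```

With right-skewed income data, the upper fence is exceeded far more often than the lower one, which is why the points drawn above each box tend to be the high earners.
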

Scatterplot

# scatter

options(scipen = 999)

ggplot(acs_clean, aes(x = age, y = income, color = edu)) +
  geom_point(alpha = 0.3, size = 6) +
  labs(
    title    = "Age vs. Annual Income by Education Level",
    subtitle = "ACS 2012 sample",
    x        = "Age (years)",
    y        = "Annual income (USD)",
    color    = "Education level"
  ) +
  theme_bw()

The scatterplot introduces age as a second variable and colors each data point by education level, allowing you to see how income evolves with age across the three groups simultaneously.

Statistical Analysis

A one-way ANOVA is the appropriate test here because the goal is to compare a continuous outcome (annual income) across three independent groups (education levels). The hypotheses are stated below, where $\mu_{HS}$, $\mu_{College}$, and $\mu_{Grad}$ represent the true mean annual incomes for individuals with a high school diploma or lower, a college degree, and a graduate degree, respectively.

$H_0: \mu_{HS} = \mu_{College} = \mu_{Grad}$

$H_A$: at least one mean differs from the others

The significance level is set at α = 0.05.

# anova-test

anova_model <- aov(income ~ edu, data = acs_clean)
summary(anova_model)
             Df        Sum Sq      Mean Sq F value              Pr(>F)    
edu           2  475563492865 237781746432   91.93 <0.0000000000000002 ***
Residuals   891 2304645681998   2586583257                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
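
The F statistic can be approximately reconstructed from the group sizes, means, and standard deviations in the summary table, which makes it clearer where the number comes from. The sketch below is illustrative only: it uses the rounded table values, so it should match the `aov()` output closely but not exactly, and the eta-squared effect size it prints is an extra measure that is not part of the original output.

```r
# Group sizes, means, and SDs copied from the summary table (rounded as reported).
n <- c(college = 244, grad = 96, hs_or_lower = 554)
m <- c(51983, 102448, 28491)
s <- c(48534, 107602, 33814)

k <- length(n)                  # number of groups
N <- sum(n)                     # total sample size
grand_mean <- sum(n * m) / N    # weighted mean of the group means

# Between-group and within-group sums of squares.
ss_between <- sum(n * (m - grand_mean)^2)
ss_within  <- sum((n - 1) * s^2)

# F = mean square between / mean square within.
f_stat <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_val  <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# Eta squared: share of total variation attributable to group membership.
eta_sq <- ss_between / (ss_between + ss_within)

round(c(F = f_stat, eta_sq = eta_sq), 3)
p_val
```

On these inputs F comes out at about 91.9, in line with the reported 91.93, and eta squared is about 0.17, meaning education group accounts for roughly 17% of the variation in income in this sample.
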

Interpretation

The one-way ANOVA produced F(2, 891) = 91.93 with a p-value < 2 × 10⁻¹⁶, which is far below the significance level of α = 0.05. We therefore reject the null hypothesis and conclude that average annual income is not the same across all three education groups: at least one group's mean income differs significantly from the others. The ANOVA alone does not indicate which groups differ.
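
A significant ANOVA is usually followed by pairwise comparisons. The sketch below shows how a Tukey HSD test could be run with base R's `TukeyHSD()`. It uses simulated data (the data frame `sim`, its group means, and its standard deviation are arbitrary assumptions, not the ACS sample), so the numbers it prints say nothing about the real groups.

```r
set.seed(1)  # reproducible simulated data; NOT the ACS sample

# Simulated incomes for three hypothetical groups (means and SD chosen arbitrarily).
sim <- data.frame(
  edu    = factor(rep(c("college", "grad", "hs or lower"), each = 50)),
  income = c(rnorm(50, mean = 50000, sd = 15000),
             rnorm(50, mean = 90000, sd = 15000),
             rnorm(50, mean = 30000, sd = 15000))
)

# Fit the one-way ANOVA, then compare every pair of groups.
fit   <- aov(income ~ edu, data = sim)
tukey <- TukeyHSD(fit, which = "edu", conf.level = 0.95)

# One row per pair: estimated difference, 95% interval, adjusted p-value.
print(tukey$edu)
```

Applied to the report's data, the analogous call would be `TukeyHSD(anova_model)`; its output is not shown here.
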

Conclusion

This analysis examined whether average annual income differs significantly across education levels among U.S. adults using the 2012 American Community Survey. The one-way ANOVA yielded F(2, 891) = 91.93 with a p-value far below α = 0.05, providing strong evidence to reject the null hypothesis and conclude that mean income differs significantly across education groups.

These findings are consistent with the widely documented education–earnings premium in labor economics and support the view that higher educational attainment is meaningfully associated with greater earning potential. It is important to note, however, that ANOVA establishes a statistical association, not causation: other factors correlated with education (such as age, occupation type, hours worked, or field of study) may also contribute to the income differences observed here.

Future research could extend this analysis by incorporating additional variables (age, gender, hours worked, race, and employment status) into a multiple regression model to estimate the association between education and income after adjusting for these factors. It would also be valuable to replicate this analysis using more recent ACS data to examine whether the education–income gap has widened or narrowed in the years since 2012.

References

OpenIntro.org. (2012). American Community Survey [Data set]. https://www.openintro.org/data/

R Core Team. (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686