Introduction
Research Question: Does a higher level of education reduce the likelihood of being unemployed among U.S. adults aged 25 and older?
This project explores the relationship between education and unemployment using data from the U.S. Census Bureau’s Current Population Survey (CPS). The dataset contains 146,133 person records and 829 variables, including age, gender, education, employment status, income, and other demographic factors.
For this study, I focus on adults aged 25 and older, because most people have completed their formal education by that age.
The key variables used in the analysis are:
A_AGE: Age of the respondent
A_HGA: Highest grade or degree completed
PEMLR: Labor force status recode (employed, unemployed, or not in the labor force)
unemployed: A derived binary variable coded 1 if the respondent is classified as unemployed and 0 otherwise
The data come from the CPS public-use microdata file pppub23.csv, available through the U.S. Census Bureau (https://www.census.gov).
Data Analysis
This section includes data cleaning, variable transformation, and exploratory data analysis (EDA) to understand how unemployment varies by education level.
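# Load packages (tidyverse provides read_csv, filter, select, drop_na, and ggplot)
library(tidyverse)
# Load dataset
cps <- read_csv("pppub23.csv")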
## Rows: 146133 Columns: 829
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): PERIDNUM
## dbl (828): PH_SEQ, P_SEQ, A_LINENO, PF_SEQ, PHF_SEQ, OED_TYP1, OED_TYP2, OED...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview structure
head(cps)
## # A tibble: 6 × 829
##   PERIDNUM PH_SEQ P_SEQ A_LINENO PF_SEQ PHF_SEQ OED_TYP1 OED_TYP2 OED_TYP3 PERRP
##   <chr>     <dbl> <dbl>    <dbl>  <dbl>   <dbl>    <dbl>    <dbl>    <dbl> <dbl>
## 1 1577120…      3     1        1      1       1        0        0        0    40
## 2 1577120…      3     2        2      1       1        0        0        0    51
## 3 4933808…      4     1        1      1       1        0        0        0    40
## 4 4933808…      4     2        2      1       1        0        0        0    42
## 5 9100075…     15     1        1      1       1        0        0        0    41
## 6 6100006…     16     1        1      1       1        0        0        0    40
## # ℹ 819 more variables: PXRRP <dbl>, PXMARITL <dbl>, PXRACE1 <dbl>,
## # PEHSPNON <dbl>, PXHSPNON <dbl>, PEAFEVER <dbl>, PXAFEVER <dbl>,
## # PEAFWHN1 <dbl>, PXAFWHN1 <dbl>, PEAFWHN2 <dbl>, PEAFWHN3 <dbl>,
## # PEAFWHN4 <dbl>, PXSPOUSE <dbl>, PENATVTY <dbl>, PXNATVTY <dbl>,
## # PEMNTVTY <dbl>, PXMNTVTY <dbl>, PEFNTVTY <dbl>, PXFNTVTY <dbl>,
## # PEINUSYR <dbl>, PXINUSYR <dbl>, PEPAR1 <dbl>, PXPAR1 <dbl>, PEPAR2 <dbl>,
## # PXPAR2 <dbl>, PEPAR1TYP <dbl>, PXPAR1TYP <dbl>, PEPAR2TYP <dbl>, …
# names(cps)[1:400]
# Listing the column names helped me find the variables for age, education level, sex, and income.
Cleaning and Recoding
# Keep only adults age 25 and older
cps_clean <- filter(cps, A_AGE >= 25)
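# Recode education levels (A_HGA)
# Note: these cutoffs assume 38 is the highest "high school or less" code and
# 41 is a bachelor's degree; confirm them against the A_HGA entry in the CPS data dictionary.
cps_clean$education_level <- ifelse(cps_clean$A_HGA <= 38, "High school or less",
  ifelse(cps_clean$A_HGA == 39 | cps_clean$A_HGA == 40, "Some college/Associate",
    ifelse(cps_clean$A_HGA == 41, "Bachelor's",
      ifelse(cps_clean$A_HGA >= 42, "Graduate", NA))))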
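# Recode employment status (PEMLR)
# This line codes PEMLR == 2 as 1 (unemployed) and every other code as 0.
# Caution: as I read the CPS data dictionary, codes 1-2 are employed, 3-4 are
# unemployed, and 5-7 are not in the labor force, so this coding should be verified.
cps_clean$unemployed <- ifelse(cps_clean$PEMLR == 2, 1, 0)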
# Remove rows with missing values
cps_clean <- drop_na(cps_clean, education_level, unemployed)
# Display summary of key variables
summary(select(cps_clean, A_AGE, education_level, unemployed))
##      A_AGE      education_level      unemployed
##  Min.   :25.0   Length:98522       Min.   :0.00000
##  1st Qu.:38.0   Class :character   1st Qu.:0.00000
##  Median :51.0   Mode  :character   Median :0.00000
##  Mean   :51.7                      Mean   :0.02244
##  3rd Qu.:65.0                      3rd Qu.:0.00000
##  Max.   :85.0                      Max.   :1.00000
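The recodes above depend on what the numeric A_HGA and PEMLR codes mean. As a cross-check, the sketch below applies the code ranges as I understand them from the CPS data dictionary: A_HGA 31-39 up to a high school diploma, 40-42 some college or an associate degree, 43 bachelor's, 44-46 graduate; PEMLR 1-2 employed, 3-4 unemployed, 5-7 not in the labor force. These ranges are an assumption to confirm against the dictionary for the survey year, and the toy data frame and helper names (`toy`, `recode_education`, `recode_unemployed`) are hypothetical.

```r
# Hypothetical toy data standing in for cps_clean; values are illustrative only.
toy <- data.frame(
  A_HGA = c(35, 39, 40, 42, 43, 44, 46, 39),
  PEMLR = c(1, 4, 2, 3, 1, 5, 1, 7)
)

# Assumed A_HGA ranges (verify against the data dictionary):
# (30, 39] high school or less, (39, 42] some college/associate,
# 43 bachelor's, (43, 46] graduate. Codes outside 31-46 become NA.
recode_education <- function(hga) {
  cut(
    hga,
    breaks = c(30, 39, 42, 43, 46),
    labels = c("High school or less", "Some college/Associate",
               "Bachelor's", "Graduate")
  )
}

# Assumed PEMLR codes: 3-4 unemployed (1), 1-2 employed (0),
# anything else (not in the labor force) becomes NA.
recode_unemployed <- function(pemlr) {
  ifelse(pemlr %in% c(3, 4), 1,
         ifelse(pemlr %in% c(1, 2), 0, NA))
}

toy$education_level <- recode_education(toy$A_HGA)
toy$unemployed <- recode_unemployed(toy$PEMLR)

# Keep only labor force participants so the group mean is an unemployment rate.
labor_force <- toy[!is.na(toy$unemployed), ]
print(labor_force)
```

Under this coding, people outside the labor force drop out of the denominator, so the mean of `unemployed` within each group matches the usual definition of an unemployment rate.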
Exploratory Data Analysis (EDA)
Proportion of Unemployed by Education Level
# Find average unemployment rate for each education level
unemp_by_edu <- summarise(
  group_by(cps_clean, education_level),
  mean_unemployed = mean(unemployed)
)
unemp_by_edu
## # A tibble: 4 × 2
##   education_level        mean_unemployed
##   <chr>                            <dbl>
## 1 Bachelor's                      0.0217
## 2 Graduate                        0.0276
## 3 High school or less             0.0149
## 4 Some college/Associate          0.0189
Bar Chart: Sample Size by Education Level
ggplot(cps_clean, aes(x = education_level)) +
  geom_bar(fill = "#1f77b4") +
  labs(
    title = "Sample Size by Education Level",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal()
Bar Chart: Unemployment Proportion by Education Level
ggplot(unemp_by_edu, aes(x = education_level, y = mean_unemployed)) +
  geom_col(fill = "#ff7f0e") +
  labs(
    title = "Proportion Unemployed by Education Level",
    x = "Education Level",
    y = "Unemployment Rate (Proportion)"
  ) +
  theme_minimal()
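Because education_level is stored as a character column, both bar charts place the groups in alphabetical order. A minimal sketch of one way to fix this is to convert the column to a factor whose levels run from least to most education, since ggplot2 follows factor level order on a discrete axis. The object names (`edu_order`, `unemp_by_edu_toy`, `ordered_tbl`) are hypothetical; the proportions are the ones printed in the summary table above.

```r
# Desired display order, from least to most education.
edu_order <- c("High school or less", "Some college/Associate",
               "Bachelor's", "Graduate")

# Stand-in for unemp_by_edu, using the proportions reported above.
unemp_by_edu_toy <- data.frame(
  education_level = c("Bachelor's", "Graduate",
                      "High school or less", "Some college/Associate"),
  mean_unemployed = c(0.0217, 0.0276, 0.0149, 0.0189)
)

# Convert to a factor with explicit levels; plots and sorts then follow edu_order.
unemp_by_edu_toy$education_level <- factor(
  unemp_by_edu_toy$education_level,
  levels = edu_order
)

# Sorting by the factor confirms the new ordering.
ordered_tbl <- unemp_by_edu_toy[order(unemp_by_edu_toy$education_level), ]
print(ordered_tbl)
```

Applying the same `factor()` call to cps_clean$education_level before plotting would reorder both charts.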
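In this sample, the proportion coded as unemployed rises with education level, from about 1.5% for high school or less to about 2.8% for graduate degrees. This is the opposite of the expected pattern, so the variable coding deserves a second look before these figures are read as unemployment rates.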
Statistical Analysis
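To formally test whether the mean of the unemployed indicator differs by education level, I perform a one-way ANOVA. Because the indicator is binary, each group mean is the proportion of that group coded as unemployed.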
Hypotheses
Let μ₁, μ₂, μ₃, and μ₄ represent the mean unemployment proportions for each education level:
μ₁ = High school or less
μ₂ = Some college/Associate
μ₃ = Bachelor’s
μ₄ = Graduate
Null Hypothesis (H₀): μ₁ = μ₂ = μ₃ = μ₄
Alternative Hypothesis (Hₐ): At least one μ differs
The test is conducted at a significance level of α = 0.05.
anova_fit <- aov(unemployed ~ education_level, data = cps_clean)
summary(anova_fit)
##                    Df Sum Sq Mean Sq F value Pr(>F)
## education_level     3    2.2  0.7199   32.84 <2e-16 ***
## Residuals       98518 2159.2  0.0219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Group means for interpretation
cps_clean |>
  group_by(education_level) |>
  summarise(mean_unemployed = mean(unemployed))
## # A tibble: 4 × 2
##   education_level        mean_unemployed
##   <chr>                            <dbl>
## 1 Bachelor's                      0.0217
## 2 Graduate                        0.0276
## 3 High school or less             0.0149
## 4 Some college/Associate          0.0189
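Because unemployed is a 0/1 indicator, a chi-squared test of independence on the education-by-status table is a common companion to the ANOVA. The sketch below uses base R's `table()` and `chisq.test()` on simulated data. The group sizes, rates, and seed are hypothetical and are not CPS results; on the real data the table would be built from cps_clean.

```r
set.seed(1)

levels_edu <- c("High school or less", "Some college/Associate",
                "Bachelor's", "Graduate")
n_per_group <- 2000

# Simulated 0/1 outcomes with made-up group rates (illustration only).
sim <- data.frame(
  education_level = rep(levels_edu, each = n_per_group),
  unemployed = rbinom(
    4 * n_per_group, size = 1,
    prob = rep(c(0.06, 0.05, 0.03, 0.02), each = n_per_group)
  )
)

# 4 x 2 contingency table: education level by unemployed (0/1).
tab <- table(sim$education_level, sim$unemployed)

# Pearson chi-squared test of independence; df = (4 - 1) * (2 - 1) = 3.
chi_fit <- chisq.test(tab)

print(tab)
print(chi_fit)
```

With four groups the test has 3 degrees of freedom, the same as the numerator degrees of freedom in the ANOVA table, and in large samples the two tests usually lead to the same decision.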
Interpreting Results
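Decision rule: reject H₀ if p < 0.05; otherwise, fail to reject H₀.
The ANOVA gives F(3, 98518) = 32.84 with p < 2e-16, so H₀ is rejected: the proportion coded as unemployed differs significantly across education groups. The group means show the direction of the difference, with the lowest proportion for high school or less (0.0149) and the highest for graduate degrees (0.0276).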
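A significant F test only indicates that at least one group mean differs from the others. One way to see which pairs differ is Tukey's honest significant difference procedure, available in base R as `TukeyHSD()` for `aov` fits. The sketch below runs it on simulated data with hypothetical rates (not CPS results). It stores education_level as a factor, which is the safest input for `TukeyHSD()`; on the real data the same call would take the fitted ANOVA model in place of `toy_fit`.

```r
set.seed(2)

levels_edu <- c("High school or less", "Some college/Associate",
                "Bachelor's", "Graduate")
n_per_group <- 1500

# Simulated data with made-up group rates (illustration only).
sim <- data.frame(
  education_level = factor(rep(levels_edu, each = n_per_group),
                           levels = levels_edu),
  unemployed = rbinom(
    4 * n_per_group, size = 1,
    prob = rep(c(0.06, 0.05, 0.03, 0.02), each = n_per_group)
  )
)

# One-way ANOVA followed by all pairwise comparisons.
toy_fit <- aov(unemployed ~ education_level, data = sim)
tukey <- TukeyHSD(toy_fit, "education_level", conf.level = 0.95)

# Each row is one pair: difference in means, 95% interval, adjusted p-value.
print(tukey)
```

With four groups there are six pairwise comparisons, and a pair whose interval excludes zero differs at the 5% family-wise level.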
Conclusion and Future Directions
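This analysis examined whether education level is associated with unemployment among adults aged 25 and older using CPS data. The ANOVA found statistically significant differences across education groups (F = 32.84, p < 2e-16), so at least one group's proportion differs from the others.
The direction of the differences does not match the research hypothesis: in this sample, the proportion coded as unemployed increases with education level, from 0.0149 for high school or less to 0.0276 for graduate degrees. As coded, the results therefore do not support the idea that more education reduces unemployment risk. Before drawing a substantive conclusion, the A_HGA and PEMLR codes should be checked against the CPS data dictionary, and the sample should be restricted to labor force participants, since the 0 category currently includes everyone not coded as unemployed.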
Future research could extend this analysis using logistic regression, adding predictors such as region, gender, race, and industry to model unemployment probabilities more precisely. It would also be valuable to track these patterns over time to observe whether the education–employment relationship remains stable or changes with economic cycles.
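As a rough illustration of the logistic regression extension, the sketch below fits `glm()` with a binomial family to simulated data and converts the coefficients to odds ratios. The covariate (`age`), the simulated log-odds, and all object names are hypothetical; a real model would use cps_clean and whichever predictors are chosen.

```r
set.seed(3)

levels_edu <- c("High school or less", "Some college/Associate",
                "Bachelor's", "Graduate")
n <- 4000

# Simulated predictors (illustration only, not CPS data).
sim <- data.frame(
  education_level = factor(sample(levels_edu, n, replace = TRUE),
                           levels = levels_edu),
  age = sample(25:64, n, replace = TRUE)
)

# Made-up log-odds of unemployment by education level, with a small age effect.
base_logit <- c(-2.5, -2.8, -3.2, -3.5)[as.integer(sim$education_level)]
sim$unemployed <- rbinom(n, size = 1,
                         prob = plogis(base_logit - 0.01 * (sim$age - 45)))

# Logistic regression; "High school or less" is the reference category.
logit_fit <- glm(unemployed ~ education_level + age,
                 data = sim, family = binomial)

# Exponentiated coefficients are odds ratios relative to the reference group.
odds_ratios <- exp(coef(logit_fit))

print(summary(logit_fit)$coefficients)
print(round(odds_ratios, 3))
```

An odds ratio below 1 for an education level would mean lower odds of unemployment than the reference group, holding age constant.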
References
U.S. Census Bureau. (2023). Current Population Survey (CPS): Education and Employment Data [Data set]. Retrieved from https://www.census.gov