Introduction
Research question: Is there an association between education level and unemployment status among U.S. adults aged 25 and older?
In this project I use data from the Current Population Survey (CPS) public-use microdata file pppub23.csv. The CPS is a large national survey run by the U.S. Census Bureau that collects information on labor force participation, demographics, and education for people living in the United States. The raw file contains 146,133 records and 829 variables.
For this analysis, I focus on three variables:
A_AGE – age of the respondent
A_HGA – highest grade or degree completed
PEMLR – labor force status
Because most people have completed their formal education by age 25, I restrict the sample to adults aged 25 and older. I then recode A_HGA into four education groups and PEMLR into a binary employed/unemployed indicator, dropping respondents whose labor force status falls outside those two categories. The goal is to see whether unemployment status is related to education level.
The CPS data and documentation can be accessed from the U.S. Census Bureau website: https://www.census.gov/
Data Analysis
This section shows how I cleaned the data and created the variables needed for the analysis. I also include some simple summaries and graphs to understand the distribution of education and unemployment in the sample.
Load and inspect the data
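# Load the tidyverse (readr, dplyr, tidyr, ggplot2), which the code below relies on
library(tidyverse)
cps <- read_csv("pppub23.csv")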
## Rows: 146133 Columns: 829
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): PERIDNUM
## dbl (828): PH_SEQ, P_SEQ, A_LINENO, PF_SEQ, PHF_SEQ, OED_TYP1, OED_TYP2, OED...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# names(cps)[1:400]
# Listing the column names helped me find the variables for age, education level, sex, and income.
# Look at the structure of the key variables
str(cps[c("A_AGE", "A_HGA", "PEMLR")])
## tibble [146,133 × 3] (S3: tbl_df/tbl/data.frame)
## $ A_AGE: num [1:146133] 66 68 52 51 78 65 68 74 74 76 ...
## $ A_HGA: num [1:146133] 39 39 39 40 39 40 41 34 33 44 ...
## $ PEMLR: num [1:146133] 5 5 6 1 5 5 5 5 5 5 ...
# Count missing values
colSums(is.na(cps[c("A_AGE", "A_HGA", "PEMLR")]))
## A_AGE A_HGA PEMLR 
##     0     0     0
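Before recoding, it can help to tabulate the raw codes so that any cutoffs can be compared with the data dictionary. The following is a minimal base R sketch; the toy data frame `cps_head` is hypothetical and holds only the first ten values printed by str() above, not the full file.

```r
# Toy data: the first ten values shown in the str() output (not the full file)
cps_head <- data.frame(
  A_AGE = c(66, 68, 52, 51, 78, 65, 68, 74, 74, 76),
  A_HGA = c(39, 39, 39, 40, 39, 40, 41, 34, 33, 44),
  PEMLR = c(5, 5, 6, 1, 5, 5, 5, 5, 5, 5)
)

# Count how often each raw code occurs, keeping NA visible if present
hga_counts   <- table(cps_head$A_HGA, useNA = "ifany")
pemlr_counts <- table(cps_head$PEMLR, useNA = "ifany")

hga_counts
pemlr_counts
```

Running the same two table() calls on the full cps data would show which codes actually occur before any groups are defined.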
Filter adults 25+ and recode education and unemployment
# Keep only adults age 25 and older
cps_clean <- filter(cps, A_AGE >= 25)
# Recode education levels (A_HGA) using nested ifelse
# Cutoffs as used in this analysis; confirm them against the A_HGA codes in the CPS data dictionary
cps_clean$education_level <- ifelse(cps_clean$A_HGA <= 38, "High school or less",
  ifelse(cps_clean$A_HGA == 39 | cps_clean$A_HGA == 40, "Some college/Associate",
    ifelse(cps_clean$A_HGA == 41, "Bachelor's",
      ifelse(cps_clean$A_HGA >= 42, "Graduate", NA))))
# Recode labor force status (PEMLR): code 2 becomes 1 (unemployed), code 1 becomes 0 (employed),
# and every other code becomes NA; confirm these codes against the CPS data dictionary
cps_clean$unemployed <- ifelse(cps_clean$PEMLR == 2, 1,
  ifelse(cps_clean$PEMLR == 1, 0, NA))
# Drop rows with missing values
cps_clean <- drop_na(cps_clean, education_level, unemployed)
# Summary of cleaned variables
summary(select(cps_clean, A_AGE, education_level, unemployed))
##      A_AGE       education_level      unemployed     
##  Min.   :25.00   Length:60724       Min.   :0.00000  
##  1st Qu.:35.00   Class :character   1st Qu.:0.00000  
##  Median :44.00   Mode  :character   Median :0.00000  
##  Mean   :45.46                      Mean   :0.03641  
##  3rd Qu.:55.00                      3rd Qu.:0.00000  
##  Max.   :85.00                      Max.   :1.00000
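The recodes above depend entirely on how A_HGA and PEMLR are numbered, so they are worth comparing with the CPS data dictionary. The sketch below shows one way to write the recode with explicit code ranges. The mapping is an assumption based on the published dictionary as I understand it (A_HGA 31 to 39 for up to a high school diploma, 40 to 42 for some college or an associate degree, 43 for a bachelor's degree, 44 to 46 for graduate degrees; PEMLR 1 and 2 for employed, 3 and 4 for unemployed) and should be confirmed for this file. The helper names `recode_education()` and `recode_unemployed()` are hypothetical, and using a different mapping would change the counts and test results reported in this document.

```r
# Hypothetical helper: map A_HGA codes to four ordered education groups.
# ASSUMPTION: code ranges follow the CPS data dictionary; verify before use.
recode_education <- function(hga) {
  groups <- c("High school or less", "Some college/Associate",
              "Bachelor's", "Graduate")
  out <- rep(NA_character_, length(hga))
  out[hga >= 31 & hga <= 39] <- groups[1]
  out[hga >= 40 & hga <= 42] <- groups[2]
  out[hga == 43]             <- groups[3]
  out[hga >= 44 & hga <= 46] <- groups[4]
  factor(out, levels = groups)
}

# Hypothetical helper: 1 = unemployed, 0 = employed, NA = any other status.
# ASSUMPTION: PEMLR 1-2 = employed, 3-4 = unemployed; verify before use.
recode_unemployed <- function(pemlr) {
  ifelse(pemlr %in% c(3, 4), 1,
         ifelse(pemlr %in% c(1, 2), 0, NA))
}

# Small illustrative input (made-up values, not CPS records)
toy <- data.frame(A_HGA = c(34, 39, 40, 42, 43, 46, 0),
                  PEMLR = c(1, 4, 2, 5, 3, 1, 0))
toy$education_level <- recode_education(toy$A_HGA)
toy$unemployed      <- recode_unemployed(toy$PEMLR)
toy
```

Storing education as a factor with explicit levels also keeps tables and plots in education order rather than alphabetical order.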
Exploratory summaries and graphs
# Frequency table of education level
table(cps_clean$education_level)
## 
##             Bachelor's               Graduate    High school or less 
##                   2865                  30479                   3926 
## Some college/Associate 
##                  23454
# Frequency table of unemployment status
table(cps_clean$unemployed)
## 
##     0     1 
## 58513  2211
# Bar chart of number of respondents by education level
ggplot(cps_clean, aes(x = education_level)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Respondents by Education Level",
       x = "Education Level",
       y = "Count") +
  theme_minimal()
# Simple unemployment rate by education level
unemp_rate <- cps_clean %>%
  group_by(education_level) %>%
  summarise(unemployment_rate = mean(unemployed))
unemp_rate
## # A tibble: 4 × 2
##   education_level        unemployment_rate
##   <chr>                              <dbl>
## 1 Bachelor's                        0.0335
## 2 Graduate                          0.0387
## 3 High school or less               0.0346
## 4 Some college/Associate            0.0341
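These summaries and plots show that all four education categories are represented in the sample, although the group sizes are very uneven. The unemployment rates are close together, ranging from about 3.35% (Bachelor's) to 3.87% (Graduate), and they do not show a clear decline with higher education. A formal statistical test is needed to check whether these differences are larger than chance alone would produce.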
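To judge how much the group rates differ relative to their sampling uncertainty, the rates can be recomputed from the cell counts with an approximate 95% interval. This is a rough sketch that uses the counts from the contingency table in the Statistical Analysis section and a simple normal approximation; it ignores the CPS survey weights and design, and the object names are illustrative.

```r
# Counts copied from the contingency table in this report
groups     <- c("Bachelor's", "Graduate", "High school or less",
                "Some college/Associate")
employed   <- c(2769, 29300, 3790, 22654)
unemployed <- c(96, 1179, 136, 800)

n    <- employed + unemployed
rate <- unemployed / n

# Normal-approximation 95% interval for each proportion
se    <- sqrt(rate * (1 - rate) / n)
lower <- rate - qnorm(0.975) * se
upper <- rate + qnorm(0.975) * se

rate_table <- data.frame(group = groups, n = n,
                         rate  = round(rate, 4),
                         lower = round(lower, 4),
                         upper = round(upper, 4))
rate_table
```

Wide intervals for the smaller groups are a reminder that their rates are estimated less precisely than those of the larger groups.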
Statistical Analysis
Both variables in this study are categorical:
education_level – four categories (“High school or less”, “Some college/Associate”, “Bachelor’s”, “Graduate”)
unemployed – 0 = employed, 1 = unemployed
Because we are testing whether two categorical variables are related, the appropriate test is a Chi-Square Test of Independence.
Hypotheses
Let the two variables be:
Education level
Unemployment status
(H₀): There is no association between education level and unemployment status among U.S. adults aged 25 and older. In other words, unemployment status is independent of education level.
(Hₐ): There is an association between education level and unemployment status. The distribution of employed and unemployed people is not the same across all education groups.
We will test these hypotheses at the α = 0.05 significance level.
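For reference, the test compares the observed cell counts with the counts expected if the two variables were independent. In standard notation (the symbols below are introduced here for illustration and do not appear elsewhere in this report):

```latex
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{4} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (4 - 1)(2 - 1) = 3
```

Here O_ij is the observed count for education group i and employment status j, E_ij is the corresponding expected count, R_i and C_j are the row and column totals, and n is the total sample size. With four education groups and two employment categories, the test has 3 degrees of freedom.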
Contingency table and Chi-Square test
# Contingency table: rows = education level, columns = employment status
edu_unemp_table <- table(cps_clean$education_level, cps_clean$unemployed)
edu_unemp_table
##                         
##                              0     1
##   Bachelor's              2769    96
##   Graduate               29300  1179
##   High school or less     3790   136
##   Some college/Associate 22654   800
# Chi-square test of independence
chi_result <- chisq.test(edu_unemp_table)
chi_result
##
## Pearson's Chi-squared test
##
## data: edu_unemp_table
## X-squared = 9.0622, df = 3, p-value = 0.02848
The output of chisq.test() includes the Chi-Square test statistic, the degrees of freedom, and the p-value.
If the p-value < 0.05, we reject the null hypothesis and conclude that there is evidence of an association between education level and unemployment status.
If the p-value ≥ 0.05, we fail to reject the null hypothesis and conclude that the data do not provide strong evidence of an association.
Using the Chi-Square test, we obtained a p-value of 0.02848. Because this value is less than α = 0.05, we reject the null hypothesis and conclude that education level and unemployment status are significantly associated.
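Because the sample is large (n = 60,724), even a small difference between groups can reach statistical significance, so it helps to look at an effect size alongside the p-value. The sketch below rebuilds the contingency table from the counts printed above, reruns chisq.test(), and computes Cramér's V. The effect size is not part of the original analysis and is added here only as an illustration; the object names are hypothetical.

```r
# Rebuild the 4 x 2 table from the printed counts
counts <- matrix(c(2769,   96,
                   29300, 1179,
                   3790,  136,
                   22654,  800),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(
                   education  = c("Bachelor's", "Graduate",
                                  "High school or less",
                                  "Some college/Associate"),
                   unemployed = c("0", "1")))

test <- chisq.test(counts)

# Expected counts under independence: (row total * column total) / n
test$expected

# Cramer's V: sqrt(chi-square / (n * (min(rows, cols) - 1)))
n <- sum(counts)
cramers_v <- sqrt(unname(test$statistic) / (n * (min(dim(counts)) - 1)))
cramers_v
```

Values of V near 0 indicate a weak association. Here V comes out close to 0.01, so the significant result should not be read as a large practical difference between the groups.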
Conclusion and Future Directions
The goal of this project was to examine whether education level is associated with unemployment status among U.S. adults aged 25 and older. After cleaning the CPS dataset and organizing education into four categories, I constructed a two-way contingency table and used a Chi-Square Test of Independence to statistically evaluate the relationship.
The results of the Chi-Square test show a Chi-Square statistic of 9.0622 with 3 degrees of freedom, and a p-value of 0.02848. Because this p-value is less than the significance level α = 0.05, we reject the null hypothesis. This means the data provide statistically significant evidence of an association between education level and unemployment status. In practical terms, unemployment is not evenly distributed across education groups: the observed rates range from about 3.35% to 3.87%.
A Chi-Square test shows only that the groups differ, not the direction of the difference or its cause. In this sample the rates are within about half a percentage point of each other, and the highest rate appears in the group coded as Graduate, so the results do not by themselves show that more education lowers unemployment. The common expectation is that adults with higher levels of education have access to more stable jobs or more competitive skills, while those with lower educational attainment face greater challenges in securing employment. Confirming that pattern here would require checking the education and labor force codes against the CPS documentation and controlling for other factors.
Future Directions
Several extensions could deepen this analysis. Future work could:
Use a logistic regression model to measure how education predicts unemployment while controlling for other factors such as age, gender, race, and region.
Examine unemployment duration rather than just employment status, to understand long-term impacts.
References
U.S. Census Bureau. (2023). Current Population Survey (CPS): Education and Employment Data [Data set]. Retrieved from https://www.census.gov