Project2

Introduction

Research Question: Does a higher level of education reduce the likelihood of being unemployed among U.S. adults aged 25 and older?

This project explores the relationship between education and unemployment using data from the U.S. Census Bureau’s Current Population Survey (CPS). The dataset contains thousands of observations on adults, including information such as age, gender, education, employment status, income, and other demographic factors.

For this study, I focus on adults aged 25 and older, as this group typically has completed formal education and is actively participating in the labor market.

The key variables used in the analysis are:

A_AGE: Age of the respondent

A_HGA: Highest grade or degree completed

PEMLR: Employment status (coded to indicate unemployed or employed)

unemployed: A derived binary variable coded 1 if unemployed and 0 if employed

The data are sourced from the CPS Public Use Microdata Sample, available through the U.S. Census Bureau (https://www.census.gov).

Data Analysis

This section includes data cleaning, variable transformation, and exploratory data analysis (EDA) to understand how unemployment varies by education level.

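# Load packages (read_csv, filter, drop_na, and ggplot all come from the tidyverse)
library(tidyverse)

# Load dataset

cps <- read_csv("pppub23.csv")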

## Rows: 146133 Columns: 829
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): PERIDNUM
## dbl (828): PH_SEQ, P_SEQ, A_LINENO, PF_SEQ, PHF_SEQ, OED_TYP1, OED_TYP2, OED...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Preview structure

head(cps)

## # A tibble: 6 × 829
##   PERIDNUM PH_SEQ P_SEQ A_LINENO PF_SEQ PHF_SEQ OED_TYP1 OED_TYP2 OED_TYP3 PERRP
##   <chr>     <dbl> <dbl>    <dbl>  <dbl>   <dbl>    <dbl>    <dbl>    <dbl> <dbl>
## 1 1577120…      3     1        1      1       1        0        0        0    40
## 2 1577120…      3     2        2      1       1        0        0        0    51
## 3 4933808…      4     1        1      1       1        0        0        0    40
## 4 4933808…      4     2        2      1       1        0        0        0    42
## 5 9100075…     15     1        1      1       1        0        0        0    41
## 6 6100006…     16     1        1      1       1        0        0        0    40
## # ℹ 819 more variables: PXRRP <dbl>, PXMARITL <dbl>, PXRACE1 <dbl>,
## #   PEHSPNON <dbl>, PXHSPNON <dbl>, PEAFEVER <dbl>, PXAFEVER <dbl>,
## #   PEAFWHN1 <dbl>, PXAFWHN1 <dbl>, PEAFWHN2 <dbl>, PEAFWHN3 <dbl>,
## #   PEAFWHN4 <dbl>, PXSPOUSE <dbl>, PENATVTY <dbl>, PXNATVTY <dbl>,
## #   PEMNTVTY <dbl>, PXMNTVTY <dbl>, PEFNTVTY <dbl>, PXFNTVTY <dbl>,
## #   PEINUSYR <dbl>, PXINUSYR <dbl>, PEPAR1 <dbl>, PXPAR1 <dbl>, PEPAR2 <dbl>,
## #   PXPAR2 <dbl>, PEPAR1TYP <dbl>, PXPAR1TYP <dbl>, PEPAR2TYP <dbl>, …

# names(cps)[1:400]

# Listing the column names helped me locate the variables for age, education level, sex, and income.

Cleaning and Recoding

# Keep only adults age 25 and older
cps_clean <- filter(cps, A_AGE >= 25)

# Recode education levels (A_HGA)

cps_clean$education_level <- ifelse(cps_clean$A_HGA <= 38, "High school or less",
                             ifelse(cps_clean$A_HGA == 39 | cps_clean$A_HGA == 40, "Some college/Associate",
                             ifelse(cps_clean$A_HGA == 41, "Bachelor's",
                             ifelse(cps_clean$A_HGA >= 42, "Graduate", NA))))

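# Recode employment status (PEMLR)
# PEMLR == 2 is treated here as unemployed (coded 1).
# Every other PEMLR value is coded 0, which includes respondents not in the labor force.
cps_clean$unemployed <- ifelse(cps_clean$PEMLR == 2, 1, 0)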

# Remove rows with missing values
cps_clean <- drop_na(cps_clean, education_level, unemployed)

# Display summary of key variables
summary(select(cps_clean, A_AGE, education_level, unemployed))

##      A_AGE      education_level      unemployed     
##  Min.   :25.0   Length:98522       Min.   :0.00000  
##  1st Qu.:38.0   Class :character   1st Qu.:0.00000  
##  Median :51.0   Mode  :character   Median :0.00000  
##  Mean   :51.7                      Mean   :0.02244  
##  3rd Qu.:65.0                      3rd Qu.:0.00000  
##  Max.   :85.0                      Max.   :1.00000
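
The recodes above depend entirely on the numeric codes behind A_HGA and PEMLR, so they are worth cross-checking. The sketch below is a hedged alternative, not the coding used for the results in this report. It assumes the CPS ASEC data dictionary values as I understand them: for A_HGA, 39 = high school graduate, 40 = some college, 41 and 42 = associate degree, 43 = bachelor's, 44 to 46 = graduate degrees; for PEMLR, 1 and 2 = employed, 3 and 4 = unemployed, 5 to 7 = not in the labor force. Those code values should be verified against the dictionary for the survey year, and the toy data frame is hypothetical.

```r
# Hypothetical toy rows standing in for cps_clean
toy <- data.frame(
  A_HGA = c(35, 39, 40, 42, 43, 44, 46),
  PEMLR = c(1, 4, 2, 3, 5, 1, 7)
)

# Education: right-closed intervals (-Inf,39], (39,42], (42,43], (43,Inf]
# (assumed code meanings; verify against the ASEC data dictionary)
toy$education_level <- cut(
  toy$A_HGA,
  breaks = c(-Inf, 39, 42, 43, Inf),
  labels = c("High school or less", "Some college/Associate",
             "Bachelor's", "Graduate")
)

# Unemployment: 1 = unemployed (3, 4), 0 = employed (1, 2),
# NA = not in the labor force, so those rows drop out of the rate
toy$unemployed <- ifelse(toy$PEMLR %in% c(3, 4), 1,
                  ifelse(toy$PEMLR %in% c(1, 2), 0, NA))

toy
```

Setting respondents who are not in the labor force to NA keeps the denominator restricted to the labor force, which is how an unemployment rate is usually defined.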

Exploratory Data Analysis (EDA)

Proportion of Unemployed by Education Level

# Find average unemployment rate for each education level

unemp_by_edu <- summarise(
  group_by(cps_clean, education_level),
  mean_unemployed = mean(unemployed)
)


unemp_by_edu

## # A tibble: 4 × 2
##   education_level        mean_unemployed
##   <chr>                            <dbl>
## 1 Bachelor's                      0.0217
## 2 Graduate                        0.0276
## 3 High school or less             0.0149
## 4 Some college/Associate          0.0189
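
The table above lists the groups alphabetically because education_level is a character column. A minimal sketch of one way to put the categories in their natural order is to convert the column to a factor with explicit levels before summarising. The toy data frame and the name edu_order are hypothetical; only the four labels come from the recode above.

```r
# Hypothetical toy rows standing in for cps_clean
toy <- data.frame(
  education_level = c("Graduate", "Bachelor's", "High school or less",
                      "Some college/Associate", "Graduate",
                      "High school or less"),
  unemployed = c(0, 1, 0, 0, 1, 0)
)

# Explicit level order, lowest to highest education
edu_order <- c("High school or less", "Some college/Associate",
               "Bachelor's", "Graduate")
toy$education_level <- factor(toy$education_level, levels = edu_order)

# Group proportions now come back in level order, not alphabetical order
rates <- tapply(toy$unemployed, toy$education_level, mean)
rates
```

The same factor ordering carries over to ggplot2, which places bars on the x-axis in level order.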

Bar Chart: Sample Size by Education Level

ggplot(cps_clean, aes(x = education_level)) +
  geom_bar(fill = "#1f77b4") +
  labs(
    title = "Sample Size by Education Level",
    x = "Education Level",
    y = "Count"
  ) +
  theme_minimal()

Bar Chart: Unemployment Proportion by Education Level

ggplot(unemp_by_edu, aes(x = education_level, y = mean_unemployed)) +
  geom_col(fill = "#ff7f0e") +
  labs(
    title = "Proportion Unemployed by Education Level",
    x = "Education Level",
    y = "Unemployment Rate (Proportion)"
  ) +
  theme_minimal()

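The exploratory results do not match the expected pattern. As coded, the proportion unemployed rises with education level: 0.0149 for high school or less, 0.0189 for some college/associate, 0.0217 for bachelor's, and 0.0276 for graduate. The differences are small in absolute terms (about 1.3 percentage points between the lowest and highest groups), and because the table and charts sort the groups alphabetically, the ordering is easy to misread.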

Statistical Analysis

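To formally test whether the unemployment proportion differs by education level, I perform a one-way ANOVA on the binary unemployed variable. Because unemployed is coded 0/1, each group mean is that group's proportion unemployed.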

Hypotheses

Let μ₁, μ₂, μ₃, and μ₄ represent the mean unemployment proportions for each education level:

μ₁ = High school or less

μ₂ = Some college/Associate

μ₃ = Bachelor’s

μ₄ = Graduate

Null Hypothesis (H₀): μ₁ = μ₂ = μ₃ = μ₄

Alternative Hypothesis (Hₐ): At least one μ differs

The test is conducted at a significance level of α = 0.05.

anova_fit <- aov(unemployed ~ education_level, data = cps_clean)
summary(anova_fit)

##                    Df Sum Sq Mean Sq F value Pr(>F)    
## education_level     3    2.2  0.7199   32.84 <2e-16 ***
## Residuals       98518 2159.2  0.0219                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
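
With nearly 100,000 observations, even very small differences can produce a tiny p-value, so it helps to look at effect size alongside significance. The sketch below recomputes the F statistic from the printed table and derives eta squared (the share of variance in unemployed explained by education_level). The inputs are copied from the output above; the variable names are my own, and the printed values are rounded, so the results are approximate.

```r
# Values copied from the ANOVA table above
df_between <- 3
df_within  <- 98518
ms_between <- 0.7199
ss_within  <- 2159.2

# Sum Sq for education_level is printed rounded (2.2); rebuild it from Mean Sq
ss_between <- ms_between * df_between
ms_within  <- ss_within / df_within

# F statistic = between-group mean square / within-group mean square
f_stat <- ms_between / ms_within

# Eta squared = between-group sum of squares / total sum of squares
eta_sq <- ss_between / (ss_between + ss_within)

# Upper-tail p-value of the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(c(F = f_stat, eta_sq = eta_sq), 4)
p_value
```

Eta squared comes out at roughly 0.001, meaning education level accounts for about 0.1% of the variance in the unemployed indicator even though the test is highly significant.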

# Group means for interpretation

cps_clean |>
  group_by(education_level) |>
  summarise(mean_unemployed = mean(unemployed))

## # A tibble: 4 × 2
##   education_level        mean_unemployed
##   <chr>                            <dbl>
## 1 Bachelor's                      0.0217
## 2 Graduate                        0.0276
## 3 High school or less             0.0149
## 4 Some college/Associate          0.0189
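
The ANOVA only indicates that at least one group differs. One common follow-up, sketched here under assumptions, is Tukey's HSD, which compares every pair of groups while adjusting for multiple comparisons. The data below are simulated with made-up probabilities, so the output illustrates the mechanics only and says nothing about the CPS results.

```r
# Simulated stand-in for cps_clean; the probabilities are invented, not CPS estimates
set.seed(1)
edu_order <- c("High school or less", "Some college/Associate",
               "Bachelor's", "Graduate")
toy <- data.frame(
  education_level = factor(rep(edu_order, each = 2000), levels = edu_order),
  unemployed = rbinom(8000, 1, rep(c(0.02, 0.03, 0.04, 0.05), each = 2000))
)

# Same model form as the report's ANOVA
fit <- aov(unemployed ~ education_level, data = toy)

# Pairwise differences in group means with adjusted p-values
tukey <- TukeyHSD(fit, "education_level")
tukey$education_level
```

Each row of the result is one pair of education groups, with the estimated difference, a confidence interval, and an adjusted p-value.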

Interpreting Results

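The ANOVA gives F = 32.84 on 3 and 98,518 degrees of freedom, with p < 2e-16. Because p < 0.05, I reject H₀: there is significant evidence that the unemployment proportion differs across education groups.

Had p been ≥ 0.05, I would have failed to reject H₀, meaning the data would not have shown strong evidence of differences.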
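
Because unemployed is a 0/1 variable, a chi-square test of independence is a common alternative to ANOVA for the same question. The sketch below shows the mechanics with base R's chisq.test and prop.test on simulated data. The probabilities are invented for illustration and are not CPS estimates.

```r
# Simulated stand-in for cps_clean; the probabilities are invented, not CPS estimates
set.seed(2)
edu_order <- c("High school or less", "Some college/Associate",
               "Bachelor's", "Graduate")
toy <- data.frame(
  education_level = factor(rep(edu_order, each = 2000), levels = edu_order),
  unemployed = rbinom(8000, 1, rep(c(0.02, 0.03, 0.04, 0.05), each = 2000))
)

# 4 x 2 contingency table: education level by unemployment status
tab <- table(toy$education_level, toy$unemployed)
tab

# Chi-square test of independence (3 degrees of freedom for a 4 x 2 table)
chi <- chisq.test(tab)
chi

# Same question framed as a test of equal proportions across the four groups
prop.test(x = tab[, "1"], n = rowSums(tab))
```

Both tests work directly on counts, so they avoid the normality and equal-variance assumptions that ANOVA makes about the outcome.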

Conclusion and Future Directions

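This analysis examined whether education level is associated with unemployment among adults aged 25 and older using CPS data. The ANOVA shows that the differences across education groups are statistically significant (F = 32.84, p < 2e-16), so at least one group's unemployment proportion differs from the others.

The direction of the differences, however, does not support the hypothesis that more education reduces unemployment risk. As coded, the proportion unemployed rises from 0.0149 for high school or less to 0.0276 for graduate. Because this runs against the usual expectation, the A_HGA and PEMLR recodes should be checked against the CPS data dictionary before the result is interpreted substantively. The analysis is also descriptive: it does not adjust for other factors and cannot establish that education causes the differences.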

Future research could extend this analysis using logistic regression, adding predictors such as region, gender, race, and industry to model unemployment probabilities more precisely. It would also be valuable to track these patterns over time to observe whether the education–employment relationship remains stable or changes with economic cycles.
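
As a rough illustration of the logistic regression extension, the sketch below fits a model with glm and family = binomial on simulated data. The data, the probabilities, and the choice of A_AGE as the only extra predictor are assumptions for the example. A real model would use the CPS variables for region, gender, race, and industry.

```r
# Simulated stand-in for cps_clean; the probabilities are invented, not CPS estimates
set.seed(42)
n <- 4000
edu_order <- c("High school or less", "Some college/Associate",
               "Bachelor's", "Graduate")
toy <- data.frame(
  education_level = factor(sample(edu_order, n, replace = TRUE),
                           levels = edu_order),
  A_AGE = sample(25:85, n, replace = TRUE)
)

# Assign each row an invented unemployment probability by education level
p_true <- c(0.05, 0.04, 0.03, 0.02)[as.integer(toy$education_level)]
toy$unemployed <- rbinom(n, 1, p_true)

# Logistic regression: log-odds of unemployment by education and age
fit <- glm(unemployed ~ education_level + A_AGE,
           data = toy, family = binomial)

# Odds ratios relative to the reference level ("High school or less")
exp(coef(fit))
```

An odds ratio below 1 for an education level means lower odds of unemployment than the reference group, holding age constant.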

References

U.S. Census Bureau. (2023). Current Population Survey (CPS): Education and Employment Data [Data set]. Retrieved from https://www.census.gov


Zebidian

2025-11-08