Sample Project

Fall 2025
(updated 01 Oct 25)

Introduction

Overview

This project explores gender pay differences in the U.S. using data from the March 2022 Current Population Survey (CPS). The analysis focuses on understanding baseline earnings distributions, career differences, and factors contributing to the observed wage gap. By applying regression models and controlling for education, demographics, and household characteristics, the project aims to estimate and explain the gender wage gap.

Data

March 2022 CPS

Standard CPS coverage

The CPS surveys roughly 60,000 households across the United States each month. It primarily gathers data on employment, unemployment, and the labor force.

ASEC supplement information

The ASEC collects additional data on income, poverty, health insurance, government program participation, education, and household structure.

March 2022 CPS Extract

library(here)  # here() builds file paths relative to the project root
cpsmar_e <- read.csv(here("data", "cpsmar_e.csv"))

Creating the extract

From the person file, I selected age, earnings, hours, weeks, race, marital status, and education. From the household file, I selected variables such as the number of people in the household, income, tenure, state of residence, and family type.

Extract sample information

I restricted the data to individuals aged 16 and over and households with non-missing income information. The final extract contains 50,000 observations and 15 variables.

Analysis sample

cpsmar_a <- cpsmar_e %>%
  filter(
    age >= 23, 
    age <= 62,
    earnings > 0
  ) %>%
  mutate(
    gender = if_else(female == 1, "Female", "Male"),
    wage = earnings / (hours * weeks),  # hourly wage: annual earnings over annual hours worked
    lwage = log(wage),
    Black = case_when(race == 2 ~ 1, TRUE ~ 0),
    south = case_when(region == 3 ~ 1, TRUE ~ 0),
    married = case_when((marital == 1 | marital == 2 | marital == 3) ~ 1, TRUE ~ 0),
    age_centered = age - 23
  )

The analysis sample contains 46,194 observations.

Baseline earnings distributions

Plotting earnings distributions

figure1 <- ggplot(cpsmar_a, aes(x = earnings, group = gender, fill = gender)) +
  geom_density(alpha = 0.4) +
  labs(
    title="Figure 1. Distribution of earnings by gender",
    x="Earnings",
    y="Density"
  ) +
  theme_minimal()

earnings_fvm <- cpsmar_a %>%
  group_by(gender) %>%
  summarize(avg_earnings = round(mean(earnings, na.rm = TRUE), 0))

avg_earnings_f <- earnings_fvm %>% 
  filter(gender == "Female") %>% 
  pull(avg_earnings) # Extracts the average earnings value for "Female".

avg_earnings_m <- earnings_fvm %>% 
  filter(gender == "Male") %>% 
  pull(avg_earnings) # Extracts the average earnings value for "Male".

Distribution of earnings by gender

Baseline comparisons

Figure 1 description

The earnings distributions are notably skewed, and men generally earn more than women: the male earnings density is more spread out than the female density.

Earnings information

The average earnings for women in the sample are $45,000, while for men they are $60,000. The dollar difference is $15,000, which is 33.3% of women’s average earnings (equivalently, women’s average is 25% below men’s).
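The size of the percentage gap depends on which group's average serves as the base. A minimal sketch in base R, using the two averages quoted above (the variable names are illustrative and not taken from the project code):

```r
# Average earnings quoted in the text
avg_f <- 45000  # women
avg_m <- 60000  # men

gap_dollars  <- avg_m - avg_f               # dollar difference
gap_pct_of_f <- 100 * gap_dollars / avg_f   # gap as a share of women's average
gap_pct_of_m <- 100 * gap_dollars / avg_m   # gap as a share of men's average

round(c(dollars = gap_dollars,
        pct_of_women = gap_pct_of_f,
        pct_of_men = gap_pct_of_m), 1)
```

Stating the base alongside the percentage avoids ambiguity when the two groups' averages differ this much.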

The career gender gap

Wages and hours differences

Table 1. Wages and hours by gender
gender	wage	hours	weeks
Female	25.07	42.26	51.90
Male	32.19	44.03	51.89

Documenting the differences

Table 1 shows that men earn higher average hourly wages than women, a difference of $7.12 ($32.19 versus $25.07). Men also work 1.77 more hours per week on average (44.03 versus 42.26), while weeks worked are nearly identical, so both wages and hours contribute to men’s higher total earnings.
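The code that produced Table 1 is not shown, so the following is only a sketch of one way such a table could be built. It assumes the hourly wage is annual earnings divided by annual hours (hours per week times weeks worked) and uses a small made-up data frame, `toy`, in place of the analysis sample; it is written in base R so it runs without the project's packages.

```r
# Made-up records standing in for the analysis sample (values are illustrative)
toy <- data.frame(
  gender   = c("Female", "Female", "Male", "Male"),
  earnings = c(40000, 52000, 50000, 78000),
  hours    = c(40, 40, 40, 50),
  weeks    = c(50, 52, 50, 52)
)

# Hourly wage: annual earnings divided by annual hours worked
toy$wage <- toy$earnings / (toy$hours * toy$weeks)

# Group means of wage, hours, and weeks by gender
table1_sketch <- aggregate(cbind(wage, hours, weeks) ~ gender, data = toy, FUN = mean)
table1_sketch
```

Dividing by `hours * weeks` rather than by weeks alone is what makes the result an hourly figure comparable to the wages in Table 1.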

Plotting career log wage profiles

cef_fvm_w <- cpsmar_a %>%
  group_by(gender, age) %>%
  summarize(avg_lwage = mean(lwage, na.rm = TRUE), .groups = "drop")  # drop grouping so gender can be removed later

figure2 <- ggplot(cef_fvm_w, aes(x = age, y = avg_lwage, color = gender)) +
  geom_point() +
  geom_line() +
  labs(
    title = "Figure 2. Career log-wage profiles for women and men",
    x = "Age",
    y = "Average log wage"
  ) +
  theme_minimal()

Career log wage profiles

Estimating wage differences over a career

males <- cef_fvm_w %>%
  filter(gender == "Male") %>%
  rename(avg_lwage_male = avg_lwage) %>%
  select(-gender) 

females <- cef_fvm_w %>%
  filter(gender == "Female") %>%
  rename(avg_lwage_female = avg_lwage) %>%
  select(-gender)

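# Join the male and female profiles by age and average the log-wage gap within bins.
# The bins are defined on raw age and the sample starts at age 23, so after
# filter(age <= 30) only the "21-30" bin contains observations.
diff_fvm <- inner_join(males, females, by = "age") %>%
  filter(age <= 30) %>%
  mutate(
    diff = avg_lwage_male - avg_lwage_female,  # gap in log points
    age_group = cut(
      age, 
      breaks = c(-1, 10, 20, 30), 
      labels = c("1-10", "11-20", "21-30")
    )
  ) %>%
  group_by(age_group) %>%
  summarize(mean_diff = mean(diff) * 100)  # log points x 100, approximately percent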

table2 <- kable(
  diff_fvm,
  digits = 2,
  col.names = c("Year Range", "Avg Pct Difference"),
  align = "cc",
  caption = "Table 2. Percent wage differences, first 30 years"
) %>%
  kable_styling(position = "center")

Evolution of the gender wage gap

Table 2. Percent wage differences, first 30 years
Year Range	Avg Pct Difference
21-30	11.77

Discussing the gender wage gap evolution

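Table 2 contains a single populated range. The bins are defined on raw age, the analysis sample starts at age 23, and only ages 30 and under are kept, so every remaining observation (ages 23 to 30) falls in the "21-30" bin. Within that group, men’s average log wage exceeds women’s by 11.77 log points, or about 12 percent, indicating a sizable gap in early career wages.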
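The labels in Table 2 suggest bins defined on career years rather than raw age. The sketch below contrasts the two in base R. It assumes career year 1 corresponds to age 23 (the youngest age in the analysis sample), and the gap values in `gaps` are made up purely for illustration.

```r
# Binning raw ages 23-30 with the breaks from the project code:
# every age lands in the last bin
raw_age_bins <- cut(23:30, breaks = c(-1, 10, 20, 30),
                    labels = c("1-10", "11-20", "21-30"))
table(raw_age_bins)

# Made-up male-minus-female gaps in mean log wages, one per age
gaps <- data.frame(
  age = 23:52,
  gap = 0.05 + 0.01 * (0:29)
)

# Assumption: career year 1 is age 23
gaps$year <- gaps$age - 23 + 1
gaps$year_range <- cut(gaps$year, breaks = c(0, 10, 20, 30),
                       labels = c("1-10", "11-20", "21-30"))

# Average gap per career-year range, in log points x 100
table2_sketch <- aggregate(gap ~ year_range, data = gaps,
                           FUN = function(g) 100 * mean(g))
table2_sketch
```

With bins defined on career years, each of the three ranges receives ten ages, so the table would have one row per decade of the career.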

Explaining the gender wage gap

Fitting the log wage profiles

quad_formula <- y ~ x + I(x^2)  # Quadratic in age; geom_smooth() expects a formula in x and y
figure3 <- figure2 +  
  geom_smooth(
    method = "lm",                # Fit a linear model
    formula = quad_formula,       # Quadratic fit to the plotted average log wages
    aes(group = gender),          # Group the fits by gender
    se = FALSE                    # Do not display standard error ribbons
    ) +
  stat_poly_eq(
    aes(label = after_stat(eq.label)),  # Display the equation of the quadratic fit
    formula = y ~ x + I(x^2),           # Use 'x' and 'y' for stat_poly_eq
    parse = TRUE                        # Parse the equation for LaTeX-style rendering
    ) +
  labs(
    title="Figure 3. Career log-wage profiles with quadratic fits",
    x="Age",
    y="Average log wage"
    ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Log wage profiles with quadratic fits

Gender differences in education

Table 3. Educational attainment by gender
gender	HSGrad	SomeColl	CollDeg
Female	0.20	0.25	0.51
Male	0.28	0.24	0.41

Gender differences in demographics

Table 4. Demographic characteristics by gender
gender	Black	south	married	age	age_centered
Female	0.13	0.38	0.58	42.36	19.36
Male	0.09	0.37	0.66	42.21	19.21

Documenting differences in characteristics

Differences in educational attainment

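Women in the sample have more education than men on average: 51% of women hold a college degree compared with 41% of men, while men are more likely to have stopped after graduating high school (28% versus 20%). The shares with some college are similar (25% and 24%).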

Differences in demographic characteristics

Women are less likely than men to be married (58% versus 66%), and a larger share of women are Black (13% versus 9%). The shares living in the South are nearly identical (38% and 37%), as are average ages (42.4 and 42.2 years).
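The entries in Table 4 are group means of 0/1 indicators, so each can be read as a share. A minimal base R sketch of that calculation follows, using the marital and region codes from the data documentation; the data frame `toy` and its values are made up for illustration.

```r
# Made-up records; marital and region use the codes listed in the appendix
toy <- data.frame(
  gender  = c("Female", "Female", "Female", "Female",
              "Male", "Male", "Male", "Male"),
  marital = c(1, 5, 7, 2, 1, 1, 3, 7),
  region  = c(3, 1, 3, 4, 3, 2, 2, 3)
)

# Indicators: married if marital code is 1, 2, or 3; south if region code is 3
toy$married <- as.integer(toy$marital %in% 1:3)
toy$south   <- as.integer(toy$region == 3)

# The mean of a 0/1 indicator within a group is the share of that group
table4_sketch <- aggregate(cbind(married, south) ~ gender, data = toy, FUN = mean)
table4_sketch
```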

Controlling for education and demographic characteristics

# Make "Male" the reference category so the gender coefficient is named genderFemale
cpsmar_a <- cpsmar_a %>%
  mutate(gender = relevel(factor(gender), ref = "Male"))

# Create a subset of unmarried individuals without children under 6
singles <- cpsmar_a %>%
  filter(
    married == 0,  # Restrict to unmarried individuals
    child_u6 == 0  # Restrict to individuals without children under 6
  )

# Define regression models with incrementally added controls
models <- list(
  "Baseline"      = lm(lwage ~ gender + 
                       age_centered + I(age_centered^2),
                       data = cpsmar_a),
  "Add Education" = lm(lwage ~ gender + 
                       age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg,
                       data = cpsmar_a),
  "Add Person"    = lm(lwage ~ gender + 
                       age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
                       Black + south + married,
                       data = cpsmar_a),
  # Note: earnings is also the numerator of wage, so it is mechanically related to lwage
  "Add Household" = lm(lwage ~ gender + 
                       age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
                       Black + south + married +
                       earnings + child_u6,
                       data = cpsmar_a),
  "Only Singles"  = lm(lwage ~ gender + 
                       age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
                       Black + south + earnings,
                       data = singles)
)

Reporting the results

cm <- c(
  'genderFemale'      = 'Female',            # Map variable names to labels
  'age_centered'      = 'Age', 
  'I(age_centered^2)' = 'Age$^2$', 
  '(Intercept)'       = 'Constant'
)

gm <- tibble::tribble(
  ~raw, ~clean, ~fmt,
  "nobs", "$N$", 0,                   # Number of observations
  "r.squared", "$R^2$", 2            # R-squared values with 2 decimal places
)

rows <- tribble(
  ~term, ~Baseline, ~Add_Education, ~Add_Person, ~Add_Household, ~Only_Singles,
  'Education controls',   ' ',         'X',           'X',          'X',           'X',
  'Demographic controls', ' ',         ' ',           'X',          'X',           'X',
  'Household controls',   ' ',         ' ',           ' ',          'X',           'X'
)
attr(rows, 'position') <- c(9, 10, 11)  # Add the rows at these positions

table5 <- modelsummary(
  models,
  add_rows = rows,                 # Additional rows for the table
  coef_map = cm,                   # Use the coefficient map defined above
  gof_map = gm,                    # Use the goodness-of-fit map defined above
  vcov = c("robust","robust","robust","robust","robust"),  # Robust standard errors
  title = "Table 5. OLS estimates of the gender wage gap", # Title of the table
  notes = "Robust standard errors in parentheses.",        # Notes below the table
  escape = FALSE                  # Allow rendering of LaTeX-style symbols
)

Explaining the gender wage gap

Table 5. OLS estimates of the gender wage gap
	Baseline	Add Education	Add Person	Add Household	Only Singles
Age	0.041	0.034	0.028	0.015	0.015
	(0.001)	(0.001)	(0.001)	(0.001)	(0.001)
Age$^2$	-0.001	-0.001	0.000	0.000	0.000
	(0.000)	(0.000)	(0.000)	(0.000)	(0.000)
Constant	2.571	1.986	2.007	2.034	2.064
	(0.011)	(0.017)	(0.017)	(0.014)	(0.024)
Education controls		X	X	X	X
Demographic controls			X	X	X
Household controls				X	X
$N$	46194	46194	46194	46194	15378
$R^2$	0.05	0.21	0.23	0.58	0.52
Robust standard errors in parentheses.

Documenting the findings

The analysis shows that the gender wage gap persists even after controlling for various factors. In the baseline model, women earn significantly less than men, as indicated by the negative and statistically significant coefficient on the Female variable. Adding controls for education reduces the gap slightly, suggesting that differences in educational attainment explain part of the disparity, and raises R-squared from 0.05 to 0.21. Demographic controls (race, region, and marital status) further narrow the wage gap.
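Because the dependent variable is the log wage, a coefficient on a 0/1 indicator is measured in log points, and the exact percent difference is 100 * (exp(b) - 1). A small base R sketch of the conversion follows; the helper name `log_gap_to_pct` and the coefficient value `b_female` are hypothetical and are not estimates from Table 5.

```r
# Convert a log-point difference b into an exact percent difference
log_gap_to_pct <- function(b) 100 * (exp(b) - 1)

b_female <- -0.20                     # hypothetical coefficient, for illustration only
pct_female <- log_gap_to_pct(b_female)
round(pct_female, 1)                  # about -18.1

# The 11.77 log-point gap reported in Table 2, expressed in percent
pct_table2 <- log_gap_to_pct(0.1177)
round(pct_table2, 1)                  # about 12.5
```

For small coefficients the log-point value times 100 is a close approximation, but the two diverge as the gap grows.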

Conclusion

Summary

This project analyzed the gender wage gap using data from the March 2022 CPS, focusing on differences in earnings distributions, career progression, and the influence of education and demographics. The analysis revealed that (1) on average, women earn significantly less than men; (2) educational attainment and demographic characteristics account for some of the gap, but not all; and (3) even after controlling for a range of factors, the wage gap persists, suggesting other unmeasured influences.

Appendix

Data documentation

# Define the variables and their descriptions
variables <- data.frame(
  Variable = c(
    "age", 
    "earnings", 
    "hours", 
    "race", 
    "marital", 
    "HSGrad", 
    "SomeColl", 
    "CollDeg", 
    "region", 
    "female", 
    "hisp", 
    "fulltime"
  ),
  Definition = c(
    "Years; capped at 85",
    "Annual earnings in dollars",
    "Hours worked per week",
    "Respondent’s race (1 = White only, 2 = Black only, 3 = AI only, 4 = Asian only, 5 = Hawaiian/Pacific Islander only, etc.)",
    "Marital status (1 = Married civilian, 2 = Married AF, 3 = Married absent, 4 = Widowed, 5 = Divorced, 6 = Separated, 7 = Never married)",
    "= 1 if the respondent graduated high school",
    "= 1 if the respondent attended some college but did not graduate",
    "= 1 if the respondent graduated with a college degree",
    "Household region (1 = Northeast, 2 = Midwest, 3 = South, 4 = West)",
    "= 1 if the respondent is female",
    "= 1 if Hispanic, Spanish, or Latino",
    "= 1 if employed full-time"
  )
)

List of main variables with definitions

This is a list of the main variables used in this project with their definitions.

Table A1. List of Main Variables with Definitions
Variable	Definition
age	Years; capped at 85
earnings	Annual earnings in dollars
hours	Hours worked per week
race	Respondent’s race (1 = White only, 2 = Black only, 3 = AI only, 4 = Asian only, 5 = Hawaiian/Pacific Islander only, etc.)
marital	Marital status (1 = Married civilian, 2 = Married AF, 3 = Married absent, 4 = Widowed, 5 = Divorced, 6 = Separated, 7 = Never married)
HSGrad	= 1 if the respondent graduated high school
SomeColl	= 1 if the respondent attended some college but did not graduate
CollDeg	= 1 if the respondent graduated with a college degree
region	Household region (1 = Northeast, 2 = Midwest, 3 = South, 4 = West)
female	= 1 if the respondent is female
hisp	= 1 if Hispanic, Spanish, or Latino
fulltime	= 1 if employed full-time