Summer 2025
(updated 24 Jul 25)

Academic honesty statement

I have been academically honest in all of my work and will not tolerate academic dishonesty of others, consistent with UGA’s Academic Honesty Policy.

Sign the academic honesty statement by typing your name on the Signature line.

Signature: Banks Wilson

We will not accept submissions that omit a signed Academic Honesty statement.

Introduction

Understanding the gender wage gap is essential for assessing fairness and equity in the labor market. Despite progress in educational attainment and workforce participation, women continue to earn less than men on average. This project explores the extent and potential causes of these differences, using data and regression analysis to identify patterns and drivers of inequality.

Overview

This project investigates gender-based wage disparities in the United States using data from the March 2022 Current Population Survey (CPS). The analysis examines earnings distributions, career patterns, and other contributing factors to the gender wage gap. By applying regression models and controlling for variables such as education, demographic characteristics, and household composition, the project aims to quantify and better understand the drivers of income differences between men and women.

Data

March 2022 CPS

This project uses microdata from the March 2022 Current Population Survey (CPS), obtained from IPUMS. The CPS is a monthly survey conducted by the U.S. Census Bureau and the Bureau of Labor Statistics, and the March supplement provides detailed demographic, labor force, and income information. The IPUMS extract used in this analysis contains 163,324 individuals and 397 variables.

March 2022 CPS Extract

cpsmar_e <- read_csv(here("data", "cpsmar_e.csv")) 

We use the March 2022 CPS extract, which contains data on 60,000 individuals and 397 variables.

Analysis sample

cpsmar_a <- cpsmar_e %>%
  # Keep workers aged 23 to 62 with positive annual earnings
  filter(
    age >= 23,
    age <= 62,
    earnings > 0
  ) %>%
  mutate(
    gender = if_else(female == 1, "Female", "Male"),
    wage = earnings / (weeks * hours),  # hourly wage = annual earnings / annual hours
    lwage = log(wage),
    Black = case_when(race == 2 ~ 1, TRUE ~ 0),
    south = case_when(region == 3 ~ 1, TRUE ~ 0),
    # marital codes 1 to 3 are the married categories (see the Appendix)
    married = case_when(marital %in% c(1, 2, 3) ~ 1, TRUE ~ 0),
    age_centered = age - 23  # years since age 23
  )

The analysis sample contains 46,194 observations: 20,030 women and 26,164 men.
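
The hourly wage above is annual earnings divided by annual hours (weeks worked times usual weekly hours). A minimal sketch on made-up records shows the construction and the log transform. It assumes `weeks` and `hours` are positive for every retained worker, which the sample filter does not check; the `toy` data frame and its values are hypothetical, not CPS data.

```r
# Hypothetical records; values are illustrative only, not drawn from the CPS
toy <- data.frame(
  earnings = c(52000, 31200, 78000),  # annual earnings in dollars
  weeks    = c(52, 40, 52),           # weeks worked last year
  hours    = c(40, 30, 50)            # usual hours per week
)

# A zero in weeks or hours would give an infinite wage, so check before dividing
stopifnot(all(toy$weeks > 0), all(toy$hours > 0))

# Hourly wage = annual earnings / (weeks worked * weekly hours)
toy$wage  <- toy$earnings / (toy$weeks * toy$hours)
toy$lwage <- log(toy$wage)

print(toy)
```

If the real extract contains workers with zero weeks or hours, adding `weeks > 0` and `hours > 0` to the filter would be one way to guard against infinite wages.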

Baseline earnings distributions

Plotting earnings distributions

figure1 <- ggplot(cpsmar_a, aes(x = earnings, group = gender, fill = gender)) +
  geom_density(alpha = 0.4) +
  labs(
    title="Figure 1. Distribution of earnings by gender",
    x="Earnings",
    y="Density"
    )+
  theme_minimal()
earnings_fvm <- cpsmar_a %>%
  group_by(gender) %>%
  summarize(avg_earnings = round(mean(earnings, na.rm = TRUE),0))

avg_earnings_f <- earnings_fvm %>% 
  filter(gender == "Female") %>% 
  pull(avg_earnings) # `pull` extracts the "avg_earnings" value for "Female" from earnings_fvm, a single value since the data only record two genders.

avg_earnings_m <- earnings_fvm %>% 
  filter(gender == "Male") %>% 
  pull(avg_earnings) # `pull` extracts the "avg_earnings" value for "Male" from earnings_fvm, a single value since the data only record two genders. 

Distribution of earnings by gender

Baseline comparisons

On average, women earn substantially less than men. This raw gap is large before any adjustment for differences in age, education, or hours worked.

The career gender gap

Wages and hours differences

Table 1. Wages and hours by gender
         Female                   Male
         N       mean    SD       N       mean    SD
wage     20030   30.65   30.92    26164   37.91   41.07
hours    20030   42.26   6.10     26164   44.03   7.77

Documenting the differences

Table 1 shows that men earn substantially more per hour than women: the average hourly wage gap is $7.26 ($37.91 versus $30.65). Men also work 1.77 more hours per week on average (44.03 versus 42.26), which widens the gap in total earnings further.
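
The gaps quoted from Table 1 are simple differences in means. As a quick check, the arithmetic can be reproduced from the table's rounded values; the object names below are illustrative.

```r
# Means copied from Table 1 (rounded to two decimals in the table)
mean_wage  <- c(Female = 30.65, Male = 37.91)  # dollars per hour
mean_hours <- c(Female = 42.26, Male = 44.03)  # hours per week

wage_gap  <- unname(mean_wage["Male"]  - mean_wage["Female"])
hours_gap <- unname(mean_hours["Male"] - mean_hours["Female"])

# Women's mean hourly wage as a share of men's
wage_ratio <- unname(mean_wage["Female"] / mean_wage["Male"])

cat(sprintf("Hourly wage gap: $%.2f\n", wage_gap))
cat(sprintf("Weekly hours gap: %.2f\n", hours_gap))
cat(sprintf("Female/male wage ratio: %.3f\n", wage_ratio))
```

On these rounded means, women's average hourly wage is about 81 percent of men's.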

Plotting career log wage profiles

cef_fvm_w <- cpsmar_a %>%
  group_by(gender, age_centered) %>%
  summarize(avg_lwage = mean(lwage, na.rm = TRUE))

figure2 <- ggplot(cef_fvm_w, aes(x = age_centered, y = avg_lwage, color = gender, linetype = gender, linewidth = gender)) +
  geom_point() +
  geom_line() +
  scale_linetype_manual(values = c("Female" = "longdash", "Male" = "solid")) + 
  scale_linewidth_manual(values = c("Female" = 0.7, "Male" = 0.5)) + 
  guides(linewidth = "none") +
  labs(
    title="Figure 2. Career log-wage profiles for women and men",
    x="Year",
    y="Average log wage"
    )+
  theme_minimal()

Career log wage profiles

Estimating wage differences over a career

males <- cef_fvm_w %>%
  filter(gender == "Male") %>%
  rename(avg_lwage_male = avg_lwage) %>%
  select(-gender) 
females <- cef_fvm_w %>%
  filter(gender == "Female") %>%
  rename(avg_lwage_female = avg_lwage) %>%
  select(-gender)

diff_fvm <- inner_join(males, females, by = "age_centered") %>%
  filter(age_centered <= 30) %>%
  mutate(
    diff = avg_lwage_male - avg_lwage_female,
    age_group = cut(
      age_centered, 
      breaks = c(-1, 10, 20, 30), 
      labels = c("1-10", "11-20", "21-30"))
    ) %>%
  group_by(age_group) %>%
  summarize(mean_diff = mean(diff)*100) 
  
table2 <- kable(
  diff_fvm,
  digits = 2,
  col.names = c("Year Range", "Avg Pct Difference"),
  align = "cc",
  caption = "Table 2. Percent wage differences, first 30 years"
  ) %>%
  kable_styling(position = "center")
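
The binning step above relies on `cut()`, whose intervals are right-closed by default. A small sketch, using only the breaks and labels from the code above, shows which values of `age_centered` fall in each group.

```r
# Values of age_centered kept by filter(age_centered <= 30)
years <- 0:30

bins <- cut(
  years,
  breaks = c(-1, 10, 20, 30),            # intervals (-1,10], (10,20], (20,30]
  labels = c("1-10", "11-20", "21-30")
)

# Number of distinct years in each bin
counts <- table(bins)
print(counts)

# Smallest and largest year covered by each label
print(tapply(years, bins, range))
```

With a lower break of -1, the group labeled "1-10" also includes year 0 (age 23), so it spans eleven years while the other two groups span ten each.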

Evolution of the gender wage gap

Table 2. Percent wage differences, first 30 years
Year Range Avg Pct Difference
1-10 8.64
11-20 16.42
21-30 18.00

Discussing the gender wage gap evolution

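Table 2 shows that the gender wage gap widens over the first 30 years of a career. Measured as 100 times the difference in average log wages, men out-earn women by roughly 8.6 percent in years 1-10, 16.4 percent in years 11-20, and 18.0 percent in years 21-30. The gap is already present early in the career and roughly doubles by the second decade.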
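
The entries in Table 2 are differences in average log wages multiplied by 100, which approximate percent differences when the gap is small. As a sketch of how close that approximation is, the exact percent gap implied by a log difference d is 100 * (exp(d) - 1); the vector names below are illustrative.

```r
# Log-point gaps copied from Table 2 (100 * difference in mean log wages)
log_points <- c("1-10" = 8.64, "11-20" = 16.42, "21-30" = 18.00)

# Exact percent difference implied by each log gap
exact_pct <- 100 * (exp(log_points / 100) - 1)

print(round(rbind(log_points, exact_pct), 1))
```

The exact figures are slightly larger than the log-point values, and the difference grows with the size of the gap.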

Explaining the gender wage gap

Fitting the log wage profiles

formula <- y ~ x + I(x^2)  # quadratic in years since age 23
figure3 <- figure2 +  
  geom_smooth(
    method = "lm", 
    formula = formula, 
    aes(group = gender), 
    se = FALSE
    ) +
  stat_poly_eq(  # from the ggpmisc package; prints each fitted equation on the plot
    aes(label =  after_stat(eq.label)),
    formula = formula,
    parse = TRUE
    ) +
  labs(
    title="Figure 3. Career log-wage profiles with quadratic fits",
    x="Year",
    y="Average log wage"
    )+
  theme_minimal()+
  theme(legend.position = "bottom")

Log wage profiles with quadratic fits
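
The quadratic fits in Figure 3 have the form avg_lwage = b0 + b1*x + b2*x^2, where x is years since age 23. A minimal sketch on synthetic, noise-free data shows how `lm()` recovers such a profile and where it peaks, at x = -b1 / (2*b2). The coefficients 2.8, 0.04, and -0.001 are made up for illustration and are not estimates from the CPS.

```r
# Synthetic profile; coefficients are hypothetical, chosen only for illustration
x <- 0:39
y <- 2.8 + 0.04 * x - 0.001 * x^2

fit <- lm(y ~ x + I(x^2))
b   <- unname(coef(fit))  # b[1] intercept, b[2] linear term, b[3] quadratic term

# A concave quadratic (b[3] < 0) reaches its maximum where the slope is zero
peak_year <- -b[2] / (2 * b[3])

print(round(b, 4))
print(peak_year)
```

Applied to the fitted equations printed on Figure 3, the same formula would give the year at which each gender's average log wage is highest.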

Gender differences in education

Table 3. Educational attainment by gender
           Female           Male
           N       Mean     N       Mean
CollDeg    20030   0.51     26164   0.41
SomeColl   20030   0.25     26164   0.24
HSGrad     20030   0.20     26164   0.28

Gender differences in demographics

Table 4. Demographic characteristics by gender
           Female           Male
           N       Mean     N       Mean
Black      20030   0.13     26164   0.09
hisp       20030   0.17     26164   0.21
south      20030   0.38     26164   0.37
city       20030   0.68     26164   0.68
married    20030   0.58     26164   0.66
child_u6   20030   0.19     26164   0.23

Documenting differences in characteristics

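Women in the sample are less likely than men to be married (58 versus 66 percent) and less likely to have children under age 6 (19 versus 23 percent). Women are more likely to hold a college degree (51 versus 41 percent), while men are more likely to have stopped at a high school diploma (28 versus 20 percent).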

Controlling for education and demographic characteristics

singles <- cpsmar_a %>%
  filter(
    married==0,
    child_u6==0
    )
models <- list(
  # The 0/1 indicator `female` is the regressor, so its coefficient name matches
  # the 'female' entry in the coefficient map used for Table 5.
  "Baseline"      = lm(lwage ~ female +
                       age_centered + I(age_centered^2),
                       data = cpsmar_a),
  "Add Education" = lm(lwage ~ female +
                       age_centered + I(age_centered^2) + CollDeg + SomeColl + HSGrad,
                       data = cpsmar_a),
  "Add Person"    = lm(lwage ~ female +
                       age_centered + I(age_centered^2) + CollDeg + SomeColl + HSGrad +
                       Black + hisp + south + married,
                       data = cpsmar_a),
  "Add Household" = lm(lwage ~ female +
                       age_centered + I(age_centered^2) + CollDeg + SomeColl + HSGrad +
                       Black + hisp + south + married +
                       earnings + child_u6,
                       data = cpsmar_a),
  "Only Singles"  = lm(lwage ~ female +
                       age_centered + I(age_centered^2) + CollDeg + SomeColl + HSGrad +
                       Black + hisp + south + married, 
                       data = singles)
)

Reporting the results

cm <- c(
  'female'            = 'Female',
  'age_centered'      = 'Age',
  'I(age_centered^2)' = 'Age$^2$',
  '(Intercept)'       = 'Constant'
)
gm <-  tibble::tribble(
  ~raw, ~clean, ~fmt,
  "nobs", "$N$", 0,
  "r.squared", "$R^2$", 2
)
rows <- tribble(~term, ~Baseline, ~Add_Education, ~Add_Person, ~Add_Household, ~Only_Singles,
  'Education controls',   ' ',         'X',           'X',          'X',           'X',
  'Demographic controls', ' ',         ' ',           'X',          'X',           'X',
  'Household controls',   ' ',         ' ',           ' ',          'X',           'X'
)
attr(rows, 'position') <- c(9, 10, 11) # Positions where you want these rows to appear

table5 <- modelsummary(
  models,
  add_rows = rows,
  coef_map = cm,
  gof_map = gm,
  vcov = c("robust","robust","robust","robust","robust"),
  title = "Table 5. OLS estimates of the gender wage gap",
  notes = "Robust standard errors in parentheses.",
  escape = FALSE
  )

Explaining the gender wage gap

Table 5. OLS estimates of the gender wage gap
                      Baseline   Add Education   Add Person   Add Household   Only Singles
Age                   0.039      0.032           0.025        0.013           0.024
                      (0.001)    (0.001)         (0.001)      (0.001)         (0.002)
Age²                  -0.001     -0.001          -0.000       -0.000          -0.000
                      (0.000)    (0.000)         (0.000)      (0.000)         (0.000)
Constant              2.815      2.243           2.348        2.335           2.461
                      (0.010)    (0.017)         (0.018)      (0.015)         (0.032)
Education controls               X               X            X               X
Demographic controls                             X            X               X
Household controls                                            X               X
N                     46194      46194           46194        46194           15378
R²                    0.04       0.21            0.22         0.55            0.16
Robust standard errors in parentheses.
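
Table 5 reports heteroskedasticity-robust standard errors. As a rough illustration of what that means, the sketch below computes one common variant (HC1) by hand on simulated data and compares it with the classical standard errors. Which variant `modelsummary` applies for `"robust"` is not stated in the text, so HC1 is an assumption chosen for simplicity, and the simulated data are hypothetical, not the CPS sample.

```r
# Simulated data whose error variance grows with x (hypothetical, not CPS data)
set.seed(1)
n <- 200
x <- runif(n, 0, 39)
y <- 2.8 + 0.03 * x + rnorm(n, sd = 0.2 + 0.02 * x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)  # n x k design matrix, including the intercept
e   <- resid(fit)         # OLS residuals
k   <- ncol(X)

# Sandwich estimator: (X'X)^-1 [sum of e_i^2 x_i x_i'] (X'X)^-1, scaled by n/(n-k)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)  # X * e scales row i of X by e_i
V_hc1 <- bread %*% meat %*% bread * n / (n - k)

se_robust  <- sqrt(diag(V_hc1))
se_classic <- sqrt(diag(vcov(fit)))

print(round(rbind(se_classic, se_robust), 4))
```

The coefficient estimates are the same under either approach; only the standard errors, and therefore the significance tests, change.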

Documenting the findings

The analysis shows that the gender wage gap persists even after controlling for various factors. In the baseline model, women earn significantly less than men, as indicated by the negative and statistically significant coefficient for the Female variable. Adding controls for education reduces the gap slightly, suggesting that differences in educational attainment explain part of the disparity. Demographic controls, such as race and region, further narrow the wage gap.
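
How much of the gap a control accounts for depends on how that characteristic is distributed across men and women. A minimal sketch on made-up, noise-free data shows the mechanics: the helper functions, the college shares, and the coefficients -0.2 and 0.5 are all hypothetical and are not estimates from the CPS.

```r
# Build a hypothetical, noise-free sample with n women followed by n men.
# share_f and share_m are assumed college shares for women and men.
make_sample <- function(share_f, share_m, n = 100) {
  n_f <- round(share_f * n)
  n_m <- round(share_m * n)
  female <- rep(c(1, 0), each = n)
  coll   <- c(rep(c(1, 0), times = c(n_f, n - n_f)),
              rep(c(1, 0), times = c(n_m, n - n_m)))
  data.frame(female, coll, lwage = 3 - 0.2 * female + 0.5 * coll)
}

# Coefficient on female without and with the education control
gap_estimates <- function(d) {
  c(no_control   = unname(coef(lm(lwage ~ female, data = d))["female"]),
    with_control = unname(coef(lm(lwage ~ female + coll, data = d))["female"]))
}

men_more_educated   <- gap_estimates(make_sample(share_f = 0.4, share_m = 0.6))
women_more_educated <- gap_estimates(make_sample(share_f = 0.6, share_m = 0.4))

print(round(rbind(men_more_educated, women_more_educated), 3))
```

In this constructed example the controlled gap is -0.2 in both cases. The uncontrolled gap is larger in magnitude (-0.3) when men hold more degrees and smaller (-0.1) when women do, so adding the control can move the estimate in either direction.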

Conclusion

Summary

This project analyzed the gender wage gap using data from the March 2022 CPS, focusing on differences in earnings distributions, career progression, and the influence of education and demographics. Three findings stand out. First, women earn less than men on average. Second, educational attainment and demographic characteristics account for some of the gap, but not all of it. Third, the gap persists even after controlling for a range of factors, which suggests that other, unmeasured influences are at work.

Appendix

Data documentation

# Define the variables and their descriptions
variables <- data.frame(
  Variable = c(
    "age", 
    "earnings", 
    "hours", 
    "race", 
    "marital", 
    "HSGrad", 
    "SomeColl", 
    "CollDeg", 
    "region", 
    "female", 
    "hisp", 
    "fulltime"
    ),
  Definition = c(
    "years; capped at 85",
    "Annual Earnings in Dollars",
    "Hours worked per week",
    "respondent’s race (1 = White only, 2 = Black only, 3 = AI only, 4 = Asian only, 5 = Hawaiian/Pacific Islander only (HP), 6 = White-Black, 7 = White-AI, 8 = White-Asian, 9 = White-HP, 10 = Black-AI, 11 = Black-Asian, 12 = Black-HP, 13 = AI-Asian, 14 = AI-HP, 15 = Asian-HP, 16 = White-Black-AI, 17 = White-Black-Asian, 18 = White-Black-HP, 19 = White-AI-Asian, 20 = White-AI-HP, 21 = White-Asian-HP, 22 = Black-AI-Asian, 23 = White-Black-AI-Asian, 24 = White-AI-Asian-HP, 25 = Other 3 race comb., 26 = Other 4 or 5 race comb.)",
    "marital status (1 = Married civilian, 2 = Married AF, 3 = Married absent, 4 = Widowed, 5 = Divorced, 6 = Separated, 7 = Never married)",
    "= 1, if the respondent graduated high school",
    "= 1, if the respondent attended some amount of college, but never completed it",
    "= 1, if the respondent graduated with a college degree",
    "household region (1 = Northeast, 2 = Midwest, 3 = South, 4 = West)",
    "= 1 if the respondent is female",
    "= 1 if Hispanic, Spanish, or Latino",
    "= 1 if employed full time"
  )
)

List of main variables with definitions

This is a list of the main variables used in this project with their definitions.

Variable Definition
age years; capped at 85
earnings Annual Earnings in Dollars
hours Hours worked per week
race respondent’s race (1 = White only, 2 = Black only, 3 = AI only, 4 = Asian only, 5 = Hawaiian/Pacific Islander only (HP), 6 = White-Black, 7 = White-AI, 8 = White-Asian, 9 = White-HP, 10 = Black-AI, 11 = Black-Asian, 12 = Black-HP, 13 = AI-Asian, 14 = AI-HP, 15 = Asian-HP, 16 = White-Black-AI, 17 = White-Black-Asian, 18 = White-Black-HP, 19 = White-AI-Asian, 20 = White-AI-HP, 21 = White-Asian-HP, 22 = Black-AI-Asian, 23 = White-Black-AI-Asian, 24 = White-AI-Asian-HP, 25 = Other 3 race comb., 26 = Other 4 or 5 race comb.)
marital marital status (1 = Married civilian, 2 = Married AF, 3 = Married absent, 4 = Widowed, 5 = Divorced, 6 = Separated, 7 = Never married)
HSGrad = 1, if the respondent graduated high school
SomeColl = 1, if the respondent attended some amount of college, but never completed it
CollDeg = 1, if the respondent graduated with a college degree
region household region (1 = Northeast, 2 = Midwest, 3 = South, 4 = West)
female = 1 if the respondent is female
hisp = 1 if Hispanic, Spanish, or Latino
fulltime = 1 if employed full time