This project explores gender pay differences in the U.S. using data from the March 2022 Current Population Survey (CPS). The analysis focuses on understanding baseline earnings distributions, career differences, and factors contributing to the observed wage gap. By applying regression models and controlling for education, demographics, and household characteristics, the project aims to estimate and explain the gender wage gap.
The CPS surveys roughly 60,000 households across the United States each month. It primarily gathers data on employment, unemployment, and the labor force.
The Annual Social and Economic Supplement (ASEC), the annual supplement to the March CPS, collects additional data on income, poverty, health insurance, government program participation, education, and household structure.
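# Assumes the packages used below (here, tidyverse, knitr, kableExtra, ggpmisc, modelsummary) are loaded in a setup chunk
cpsmar_e <- read.csv(here("data", "cpsmar_e.csv"))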
From the person file, I selected age, earnings, hours, weeks, race, marital status, and education. For the household file, I selected variables like number of people in household, income, tenure, state of residence, and family type.
I restricted the data to individuals aged 16 and over and households with non-missing income information. The final extract contains 50,000 observations and 15 variables.
cpsmar_a <- cpsmar_e %>%
filter(
age >= 23,
age <= 62,
earnings > 0
) %>%
mutate(
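gender = factor(if_else(female == 1, "Female", "Male"), levels = c("Male", "Female")), # Male is the reference level, so lm() reports genderFemale
wage = earnings / (hours * weeks), # Hourly wage = annual earnings / annual hours worked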
lwage = log(wage),
Black = case_when(race == 2 ~ 1, TRUE ~ 0),
south = case_when(region == 3 ~ 1, TRUE ~ 0),
married = case_when((marital == 1 | marital == 2 | marital == 3) ~ 1, TRUE ~ 0),
age_centered = age - 23
)
The analysis sample, restricted to individuals aged 23 to 62 with positive earnings, contains 46,194 observations.
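An hourly wage is conventionally computed as annual earnings divided by annual hours worked, that is, weekly hours times weeks worked. A minimal base R sketch of that calculation follows; the data frame `toy` and its values are hypothetical and are not drawn from the CPS extract.

```r
# Hypothetical records; values are illustrative, not CPS data
toy <- data.frame(
  earnings = c(52000, 31200),  # annual earnings in dollars
  hours    = c(40, 30),        # usual hours worked per week
  weeks    = c(52, 52)         # weeks worked per year
)

# Hourly wage = annual earnings / (hours per week * weeks per year)
toy$wage  <- toy$earnings / (toy$hours * toy$weeks)
toy$lwage <- log(toy$wage)     # log wage, the outcome used in the regressions

print(toy)
```

Dividing by hours as well as weeks matters for a gender comparison, because it separates differences in pay per hour from differences in time worked.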
figure1 <- ggplot(cpsmar_a, aes(x = earnings, group = gender, fill = gender)) +
geom_density(alpha = 0.4) +
labs(
title="Figure 1. Distribution of earnings by gender",
x="Earnings",
y="Density"
) +
theme_minimal()
earnings_fvm <- cpsmar_a %>%
group_by(gender) %>%
summarize(avg_earnings = round(mean(earnings, na.rm = TRUE), 0))
avg_earnings_f <- earnings_fvm %>%
filter(gender == "Female") %>%
pull(avg_earnings) # Extracts the average earnings value for "Female".
avg_earnings_m <- earnings_fvm %>%
filter(gender == "Male") %>%
pull(avg_earnings) # Extracts the average earnings value for "Male".
The distribution of earnings is right-skewed for both groups. The male density is more spread out toward high earnings, so men are more likely than women to appear in the upper part of the distribution.
Average earnings in the sample are $45,000 for women and $60,000 for men, a difference of $15,000. Measured against women's average earnings, men earn 33.3% more.
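The size of the percentage gap depends on which group's average serves as the base. Using the rounded averages reported above, the two common conventions give:

```latex
\[
\frac{60{,}000 - 45{,}000}{45{,}000} \approx 33.3\% \quad \text{(relative to women's average)},
\qquad
\frac{60{,}000 - 45{,}000}{60{,}000} = 25\% \quad \text{(relative to men's average)}.
\]
```

Equivalently, women's average earnings are 75% of men's.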
| gender | wage | hours | weeks |
|---|---|---|---|
| Female | 25.07 | 42.26 | 51.90 |
| Male | 32.19 | 44.03 | 51.89 |
Table 1 shows that men earn higher hourly wages than women on average, a difference of $7.12 per hour. Men also work 1.77 more hours per week, while weeks worked per year are nearly identical (51.89 versus 51.90), so the gap in total earnings reflects both higher hourly wages and longer weekly hours.
cef_fvm_w <- cpsmar_a %>%
group_by(gender, age) %>%
summarize(avg_lwage = mean(lwage, na.rm = TRUE), .groups = "drop") # Drop grouping so select(-gender) works below
figure2 <- ggplot(cef_fvm_w, aes(x = age, y = avg_lwage, color = gender)) +
geom_point() +
geom_line() +
labs(
title = "Figure 2. Career log-wage profiles for women and men",
x = "Age",
y = "Average log wage"
) +
theme_minimal()
males <- cef_fvm_w %>%
filter(gender == "Male") %>%
rename(avg_lwage_male = avg_lwage) %>%
select(-gender)
females <- cef_fvm_w %>%
filter(gender == "Female") %>%
rename(avg_lwage_female = avg_lwage) %>%
select(-gender)
diff_fvm <- inner_join(males, females, by = "age") %>%
filter(age <= 30) %>%
mutate(
diff = avg_lwage_male - avg_lwage_female,
age_group = cut(
age,
breaks = c(-1, 10, 20, 30),
labels = c("1-10", "11-20", "21-30")
)
) %>%
group_by(age_group) %>%
summarize(mean_diff = mean(diff) * 100)
table2 <- kable(
diff_fvm,
digits = 2,
col.names = c("Year Range", "Avg Pct Difference"),
align = "cc",
caption = "Table 2. Percent wage differences, first 30 years"
) %>%
kable_styling(position = "center")
| Year Range | Avg Pct Difference |
|---|---|
| 21-30 | 11.77 |
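Table 2 reports only the 21-30 bin because the analysis sample starts at age 23, so no observations fall into the 1-10 or 11-20 bins. For workers aged 23 to 30, men's average log wage exceeds women's by 11.77 log points, or approximately 12 percent, indicating that a gap is already present early in the career.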
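Table 2 has a single row because of how the age bins interact with the sample's age range, and its entry is a difference in average log wages rather than a literal percentage. The base R sketch below reuses the breaks and labels from the Table 2 code to count which bins ages 23 to 30 fall into, then converts the reported 11.77 log points into an exact percentage difference. The variable names are illustrative.

```r
# Ages present in the analysis sample at or below the age-30 cutoff
ages <- 23:30

# Same breaks and labels as the Table 2 code
bins <- cut(ages, breaks = c(-1, 10, 20, 30),
            labels = c("1-10", "11-20", "21-30"))
bin_counts <- table(bins)
print(bin_counts)  # only the "21-30" bin is populated

# Convert a gap in log points to an exact percentage difference
log_points <- 11.77                              # value reported in Table 2
pct_exact  <- 100 * (exp(log_points / 100) - 1)  # roughly 12.5
print(round(pct_exact, 1))
```

For gaps of this size the log-point figure slightly understates the exact percentage difference, and the two diverge further as the gap grows.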
quad_formula <- y ~ x + I(x^2) # Quadratic in age; ggplot2 smoothers expect the formula in terms of x and y
figure3 <- figure2 +
geom_smooth(
method = "lm", # Fit a linear model
formula = quad_formula, # Quadratic fit of average log wage on age
aes(group = gender), # Group the fits by gender
se = FALSE # Do not display standard error ribbons
) +
stat_poly_eq(
aes(label = after_stat(eq.label)), # Display the equation of the quadratic fit
formula = quad_formula, # Same formula as the fitted curve
parse = TRUE # Parse the label as a plotmath expression
) +
labs(
title="Figure 3. Career log-wage profiles with quadratic fits",
x="Age",
y="Average log wage"
) +
theme_minimal() +
theme(legend.position = "bottom")
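A quadratic profile of the form `y ~ x + I(x^2)` implies a peak age whenever the squared term is negative. The sketch below fits such a profile with `lm()` and recovers the peak. The data are a hypothetical, noise-free profile with made-up coefficients, not estimates from the CPS sample.

```r
# Hypothetical, noise-free log-wage profile over the sample's age range
age   <- 23:62
lwage <- 1.5 + 0.08 * age - 0.0009 * age^2   # made-up coefficients

# Quadratic fit, the same functional form as y ~ x + I(x^2)
fit <- lm(lwage ~ age + I(age^2))
b   <- unname(coef(fit))   # intercept, linear term, quadratic term

# A concave quadratic peaks where the derivative b1 + 2 * b2 * age is zero
peak_age <- -b[2] / (2 * b[3])
print(round(peak_age, 1))
```

Applying the same calculation to each gender's fitted coefficients would show whether log wages flatten at different ages for women and men.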
| gender | HSGrad | SomeColl | CollDeg |
|---|---|---|---|
| Female | 0.20 | 0.25 | 0.51 |
| Male | 0.28 | 0.24 | 0.41 |
| gender | Black | south | married | age | age_centered |
|---|---|---|---|---|---|
| Female | 0.13 | 0.38 | 0.58 | 42.36 | 19.36 |
| Male | 0.09 | 0.37 | 0.66 | 42.21 | 19.21 |
Women in the sample are more educated than men: 51% of women hold a college degree compared with 41% of men, while 28% of men stopped after graduating high school compared with 20% of women.
Men are more likely to be married (66% versus 58%), and the share of Black respondents is higher among women (13% versus 9%). The shares living in the South (38% and 37%) and the average ages (42.4 and 42.2) are nearly identical across genders.
# Create a subset of unmarried individuals without children under 6
singles <- cpsmar_a %>%
filter(
married == 0, # Restrict to unmarried individuals
child_u6 == 0 # Restrict to individuals without children under 6
)
# Define regression models with incrementally added controls
models <- list(
"Baseline" = lm(lwage ~ gender +
age_centered + I(age_centered^2),
data = cpsmar_a),
"Add Education" = lm(lwage ~ gender +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg,
data = cpsmar_a),
"Add Person" = lm(lwage ~ gender +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
Black + south + married,
data = cpsmar_a),
"Add Household" = lm(lwage ~ gender +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
Black + south + married +
earnings + child_u6,
data = cpsmar_a),
"Only Singles" = lm(lwage ~ gender +
age_centered + I(age_centered^2) + HSGrad + SomeColl + CollDeg +
Black + south + earnings,
data = singles)
)
cm <- c(
'genderFemale' = 'Female', # Map variable names to labels
'age_centered' = 'Age',
'I(age_centered^2)' = 'Age$^2$',
'(Intercept)' = 'Constant'
)
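A `coef_map` in `modelsummary` matches coefficients by their exact term names and leaves out terms it does not list, so the name that `lm()` assigns to the gender dummy matters. That name depends on the reference level: a character column is converted to a factor with alphabetically sorted levels, which makes "Female" the reference and yields a `genderMale` term. The base R sketch below illustrates this on made-up data; `toy`, `gender_f`, and the log wages are hypothetical.

```r
# Hypothetical data: four workers with made-up log wages
toy <- data.frame(
  lwage  = c(3.0, 3.2, 2.8, 2.9),
  gender = c("Male", "Male", "Female", "Female")
)

# Character column: levels are sorted alphabetically, so "Female" is the
# reference level and the reported dummy is for "Male"
fit_chr <- lm(lwage ~ gender, data = toy)
print(names(coef(fit_chr)))

# Explicit levels with "Male" first: the reported dummy is for "Female"
toy$gender_f <- factor(toy$gender, levels = c("Male", "Female"))
fit_fct <- lm(lwage ~ gender_f, data = toy)
print(names(coef(fit_fct)))
```

Setting the levels explicitly, or regressing on the 0/1 `female` indicator directly, keeps the term name and the map entry in agreement.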
gm <- tibble::tribble(
~raw, ~clean, ~fmt,
"nobs", "$N$", 0, # Number of observations
"r.squared", "$R^2$", 2 # R-squared values with 2 decimal places
)
rows <- tribble(
~term, ~Baseline, ~Add_Education, ~Add_Person, ~Add_Household, ~Only_Singles,
'Education controls', ' ', 'X', 'X', 'X', 'X',
'Demographic controls', ' ', ' ', 'X', 'X', 'X',
'Household controls', ' ', ' ', ' ', 'X', 'X'
)
attr(rows, 'position') <- c(9, 10, 11) # Add the rows at these positions
table5 <- modelsummary(
models,
add_rows = rows, # Additional rows for the table
coef_map = cm, # Use the coefficient map defined above
gof_map = gm, # Use the goodness-of-fit map defined above
vcov = c("robust","robust","robust","robust","robust"), # Robust standard errors
title = "Table 5. OLS estimates of the gender wage gap", # Title of the table
notes = "Robust standard errors in parentheses.", # Notes below the table
escape = FALSE # Allow rendering of LaTeX-style symbols
)
| | Baseline | Add Education | Add Person | Add Household | Only Singles |
|---|---|---|---|---|---|
| Age | 0.041 | 0.034 | 0.028 | 0.015 | 0.015 |
| | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) |
| Age$^2$ | -0.001 | -0.001 | 0.000 | 0.000 | 0.000 |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) |
| Constant | 2.571 | 1.986 | 2.007 | 2.034 | 2.064 |
| | (0.011) | (0.017) | (0.017) | (0.014) | (0.024) |
| Education controls | | X | X | X | X |
| Demographic controls | | | X | X | X |
| Household controls | | | | X | X |
| $N$ | 46194 | 46194 | 46194 | 46194 | 15378 |
| $R^2$ | 0.05 | 0.21 | 0.23 | 0.58 | 0.52 |

Robust standard errors in parentheses.
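The analysis indicates that the gender wage gap persists after controlling for a range of factors. In the baseline model, which controls only for a quadratic in age, the coefficient on Female is negative and statistically significant, so women earn less than men of the same age. Adding education controls reduces the gap slightly, suggesting that differences in educational attainment explain part of the disparity, and demographic controls such as race, region, and marital status narrow it further. The Female coefficient is not listed in Table 5 as rendered, so the size of these changes cannot be read from the table itself; the table does show the controls raising $R^2$ from 0.05 in the baseline to 0.23 once education and demographic controls are included.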
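For reference, the fullest specification in the `models` list can be written as below. The notation is introduced here for illustration and does not appear in the source: $\beta_1$ is the coefficient on the Female indicator, and $X_i$ collects the education, demographic, and household controls.

```latex
\[
\ln(\text{wage}_i) = \beta_0 + \beta_1\,\text{Female}_i
  + \beta_2\,(\text{age}_i - 23) + \beta_3\,(\text{age}_i - 23)^2
  + X_i'\gamma + \varepsilon_i
\]
```

Because the outcome is in logs, $100\,(e^{\beta_1} - 1)$ gives the approximate percentage wage difference for women relative to otherwise similar men, which is close to $100\,\beta_1$ when $\beta_1$ is small.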
This project analyzed the gender wage gap using data from the March 2022 CPS, focusing on differences in earnings distributions, career progression, and the influence of education and demographics. The analysis produced three findings. First, women earn less than men on average. Second, educational attainment and demographic characteristics account for some of the gap, but not all of it. Third, the wage gap persists even after controlling for a range of factors, suggesting that other, unmeasured influences are at work.
# Define the variables and their descriptions
variables <- data.frame(
Variable = c(
"age",
"earnings",
"hours",
"race",
"marital",
"HSGrad",
"SomeColl",
"CollDeg",
"region",
"female",
"hisp",
"fulltime"
),
Definition = c(
"Years; capped at 85",
"Annual earnings in dollars",
"Hours worked per week",
"Respondent’s race (1 = White only, 2 = Black only, 3 = AI only, 4 = Asian only, 5 = Hawaiian/Pacific Islander only, etc.)",
"Marital status (1 = Married civilian, 2 = Married AF, 3 = Married absent, 4 = Widowed, 5 = Divorced, 6 = Separated, 7 = Never married)",
"= 1 if the respondent graduated high school",
"= 1 if the respondent attended some college but did not graduate",
"= 1 if the respondent graduated with a college degree",
"Household region (1 = Northeast, 2 = Midwest, 3 = South, 4 = West)",
"= 1 if the respondent is female",
"= 1 if Hispanic, Spanish, or Latino",
"= 1 if employed full-time"
)
)
This is a list of the main variables used in this project with their definitions.
| Variable | Definition |
|---|---|
| age | Years; capped at 85 |
| earnings | Annual earnings in dollars |
| hours | Hours worked per week |
| race | Respondent’s race (1 = White only, 2 = Black only, 3 = AI only, 4 = Asian only, 5 = Hawaiian/Pacific Islander only, etc.) |
| marital | Marital status (1 = Married civilian, 2 = Married AF, 3 = Married absent, 4 = Widowed, 5 = Divorced, 6 = Separated, 7 = Never married) |
| HSGrad | = 1 if the respondent graduated high school |
| SomeColl | = 1 if the respondent attended some college but did not graduate |
| CollDeg | = 1 if the respondent graduated with a college degree |
| region | Household region (1 = Northeast, 2 = Midwest, 3 = South, 4 = West) |
| female | = 1 if the respondent is female |
| hisp | = 1 if Hispanic, Spanish, or Latino |
| fulltime | = 1 if employed full-time |