These three papers researching the gender wage gap in Mexico provide some important context to our own research. The first article compared the gender wage gap based on education in the years 1988 and 1998. The second article observes trends in the gender wage gap between 2000 and 2010, and the impact of occupational discrimination on the wage gap. The last article observes the gender wage gap between 2005 and 2020, and the relationship between the wage gap and income inequality.
Literature Review: Wage Inequality and the Gender Wage Gap in Mexico
Gonzalez explores the gender wage gap in Mexico between 1988 and 1998 and how other social or demographic skills affect the difference. She uses Mexico as an example of a developed economy to compare the wage trends to that of similar countries. Specifically, Gonzalez uses education as an unmeasured skill, and how different levels of education affect the wages of men and women during this time period. The results show that having a college degree increased pay over the decade at a higher rate for women than men, but women still earned less than men overall. She wanted to add to existing scholarship about the wage gap trends in developed economies, specifically how wage inequality due to unmeasured skills like education, affects the gender pay gap. She uses Mexico as a case study to compare the gender wage gap to different countries explored in other research based on certain skill levels. This would test the theory that a higher return to skills in one country leads to a larger gender wage differential than a country with a lower return rate. She tested her hypothesis using both a difference of means and a difference-in-difference test when analyzing the data. Gonzalez used data from multiple cities around the country collected in the Urban Employment Survey. Education is a categorical variable, describing the level of education (none, primary, secondary, high school, and college). Wage data was measured hourly, calculated by using monthly earnings based on hours worked per week, using workers between 16 and 65 years old who worked at least 30 hours prior to the survey. The results show that, predictably, education increased pay while educated men still earned more than educated women on average. In 1988 for men, education increased the pay rate by three pesos on average (when compared to men with only primary education), which increased to 3.36 pesos in 1998. For women, the difference in 1988 was 2 pesos and 3.25 by 1998. Based on median wages, the benefits of education were larger for men than women. In the economic context of increased income inequality and decreasing gender pay gap, Gonzalez found this decrease was impacted by the increased presence of women in the workforce, especially after the economic crisis in 1996. She notes that there is a trade-off between wage inequality and the gender pay gap, and recommends policies addressing inequality over the pay gap.
Literature Review: The gender wage gap and occupational segregation in the Mexican labour market
This article by Orraca, Cabrera, and Iriarte aims to study the gender wage gap and the influence of occupational segregation on the pay gap in the years 2000 and 2010. The authors used regression analysis using log(wage) as the dependent variable, and gender, occupation, and occupation differentials. Ultimately, they found job discrimination based on gender was not a significant contributor to the wage gap. This paper aims to expand on previous scholarship by distinguishing between gender-based discrimination in wages and occupation. The authors believe the distinction between the two is important when analyzing the effects of gender in the workplace. Previous models did not account for this distinction. Additionally, they present issues with previous research, specifically the “index number method” of calculating the gender wage gap because it doesn’t differentiate between wage and occupation discrimination. In addition, that method assumes occupation is exogenous. The authors used regression analysis and data collected from the Housing and Population census in both years to measure the effects of gender on hourly wages. They also controlled for age, schooling, how many hours worked (per week), marital status, job setting (urban/rural), part-time versus full-time, labor force participation, and unemployment rate.The specific model used was multinomial logit (MNL) to address the probability of someone working in a specific occupation, with that occupation choice being endogenous. Then, the authors explored the pay difference between men and women across the different control variables (setting, industry) using the within-between estimates in a Random Effects model. While the evidence did not support a large effect from occupational segregation in the wage gap, partial gender-based segregation in the workforce still has some effect on the pay gap within occupations seemingly based on productivity (or related factors). In both 2000 and 2010, men were paid, on average, more than women, with the gender difference increasing from 0.021 log points to 0.026, respectively. This difference is in spite of women on average having more education than men. Additionally, men are more likely to participate in part-time and/or less formal jobs, and are more likely to be married. One possible explanation for the marital status variable is that married women are less likely to hold a salaried job then a married man. All other variables indicate a more complicated story than solely occupational segregation. The authors recommend policies should be aimed at removing discriminatory barriers, like the wage gap, and encouraging diversity in higher levels of occupations (senior/management positions), instead of emphasizing equal gender representation in different industries.
Literature Review: Gender Wage Discrimination Across the Income Distribution in Mexico
The study by Mendoza González and Miguel Ángel (2023) provides a comprehensive analysis of gender wage discrimination across the income distribution in Mexico from 2005 to 2020. Using quantile regression and a non-parametric decomposition methodology developed by Chernozhukov, Fernández-Val, and Melly (2013), the authors separate the gender wage gap into its constituent composition (endowment) and structure (returns or discrimination) effects, a non-linear relationship in effect to the median. Their work contributes significantly to the literature by identifying how gender-based wage disparities evolve across the wage distribution rather than focusing solely on average differences. The decomposition distinguishes between differences in observable characteristics, using education (categorical based on degrees), experience and seniority (calculated with potential years worked based on age and education), firm size, firm location (urban, or not), head of household, children under six and unexplained differences that are attributed to discrimination. The authors argue that gender wage discrimination arises when men and women with similar endowments of human capital (e.g., education, seniority) receive different returns in the labor market, indicating that wage differentials are not due solely to productivity differences. Their use of microdata from the ENOE allows for a rich set of individual and workplace controls, including urban vs. rural location, firm size, and family composition. The study employs distributional tests such as the Kolmogorov-Smirnov and Cramér-von Mises tests to confirm that the parametric models do not fully capture the wage distribution, justifying the use of non-parametric tools. The robustness of the quantile decomposition approach allows the authors to reject the null hypothesis of no wage gap across the distribution. This approach confirms that gender wage gaps are not uniform, and efforts to address them must be targeted to specific segments of the labor market. Key findings demonstrate that the structure effect—interpreted as discrimination—is the primary driver of the wage gap, accounting for up to 45% of the total difference in hourly wages by 2020, with an upward trend since 2005. Overall, when compared to the composition effect, structural effect increases in the relative importance in explaining total gender hourly wage differences, from 38.6% in 2005 to 45.2% in 2020. Notably, this structure effect is more pronounced at lower income deciles, aligning with the sticky floor hypothesis, which suggests that women face greater disadvantage at the bottom of the wage distribution. According to the results of the quantile effects in gender hourly-wage differentials, male workers secure hourly wages up to 12% higher than women at low segments of the income distribution. While some women at higher income levels earn more per hour than men, the discrimination component diminishes, indicating that wage gaps narrow or reverse in favor of women at the top end of the income spectrum. The analysis also highlights that differences in working hours partially explain the overall wage gap, with men working more hours across nearly all income deciles. However, the returns to observable characteristics, such as education and seniority, also differ significantly. Women’s returns to education are higher than men’s at the lower end of the distribution but decline at higher income levels, while men benefit more from seniority where on average, 22% of men have worked for their current employer longer, especially in higher income brackets. This suggests that while human capital composition matters, they do not fully account for gender disparities, reinforcing the role of discrimination in wage setting. Spatial distribution then accounts for these results, the majority of all workers are located in cities, both men and women at higher-income levels work mainly in urban areas, whereas women are more likely to work in non-urban areas than men at low-income levels. Additionally, gender wage discrimination remains evident in small, medium, and large firms, and is exacerbated at middle and high-income levels, signaling structural issues in firm-level wage setting practices. However, age and parents with children under the age of six were not found to be a consistent source of discrimination. Contrary to common narratives, female-headed households and mothers did not systematically earn less when other factors were controlled for. Structural differences in the labor market, rather than human capital disparities, are the main source of the wage gap. Their findings call for public policies aimed at reforming labor institutions, firm practices, and cultural norms, particularly to support women of lower socioeconomic status to address the gender wage inequality in Latin America and beyond.
We aim to contribute to this research in a few areas. First, our dataset was collected in 2020, therefore expanding research closer to present-day information regarding the gender wage gap in Mexico. As the first study dismissed the impact of occupational discrimination as only a minor contribution to the wage gap, we explore other variables to see the proportion affected by additional factors not studied for in that article. The second article shares a significant amount of variables with our research, meaning we can by providing more recent data from 2020, but using the same variables to see the current (2020) status of wages in relation to gender and education in particular. Observing the trends at different points in time across several decades can advance preexisting literature. Additionally, this information can be useful when analyzing policies created within this timeline and can inform future policy proposals. Lastly, our current research adds to González and Ángel’s recent literature, following trends in Mexico 2020 and the persistence of the gender wage gap. Although our research does not include differentials in income distribution, our research follows previous multivariable regression methodology for earned income. Our study also adds additional variables to household status like marital status while also excluding limitations in estimates of hourly wage.
Bibliography González, Mendoza and Ángel, Miguel. “Gender wage discrimination by distribution of income in Mexico, 2005-2020.” Latin American Economic vol.29, 1 (5). Centro de Investigación y Docencia Económica (CIDE), Ciudad de México (2020): 1-20. doi:10.47872/laer-2020-29-5. Meza González, Liliana. 2001. Wage inequality and the gender wage gap in Mexico . http://hdl.handle.net/11651/4169 Orraca, Pedro, Francisco-Javier Cabrera, and Gustavo Iriarte. 2016. “The Gender Wage Gap and Occupational Segregation in the Mexican Labour Market.” EconoQuantum 13 (1): 51–72. https://www.scielo.org.mx/scielo.php?pid=S1870-66222016000100051&script=sci_arttext&tlng=en.
Loading the data Data source: IPUMS International, Mexico 2020
#Load the dataset
ddi <- read_ipums_ddi("/Users/opandey/Desktop/Econometrics/Paper/Gender/ipumsi_00005.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS International is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
Cleaning the Data The initial stages of data preparation revealed an unintended issue caused by the inclusion of the variable CHBORN (children ever born). This variable is collected only for women in the dataset, meaning that most male respondents have missing values for it. When these missing values were filtered out to clean the data, nearly all men were excluded from the sample. As a result, the dataset became overwhelmingly female, which led to a lack of variation in the gender variable (SEX). In statistical terms, this created a situation of perfect multicollinearity—where a variable does not vary within the sample—causing the gender variable to be automatically dropped from the regression model. Since gender is one of the key independent variables in this analysis, preserving variation across both men and women was critical. To resolve this, CHBORN was removed entirely from the analysis. This decision ensured that both male and female respondents remained in the sample and that gender could be properly included as a predictor in the regression model.
In addition to handling the gender imbalance, the data cleaning process also addressed issues related to extreme and invalid income values. The income variable (INCEARN) included zero-income observations as well as a small number of extremely high values, both of which could distort regression estimates. Observations with clearly invalid or top-coded values were excluded (e.g., INCEARN == 99999999), and a cap was imposed to exclude outliers above 10 million pesos, following guidance from the IPUMS coding index. To allow for meaningful interpretation and variance stabilization in the regression, a new variable log_INCEARN was created by applying the natural logarithm to income. However, since the log of zero or negative values is undefined, the transformation was applied only to observations where income was strictly positive. This allowed the original dataset to retain zero-income respondents for descriptive analysis, while the log-transformed variable was used only for regression modeling and diagnostic testing.
Log Transformation of Income The decision to transform income using the natural logarithm was driven by both statistical theory and observed characteristics of the raw data. In its original form, the income variable exhibited significant right-skewness: the majority of individuals reported relatively low earnings, while a small number of respondents reported extremely high incomes. This kind of skewed distribution often leads to heteroskedasticity in regression models, meaning that the variance of the error terms is not constant across all levels of the independent variables. Such a violation undermines one of the core assumptions of ordinary least squares (OLS) regression and can result in biased standard errors, leading to incorrect inferences.
To address this, the variable log_INCEARN was constructed as the natural logarithm of income, applied only to observations with income greater than zero. The log transformation has several advantages. It compresses the scale of high-income values, reduces the influence of extreme outliers, and often results in a more symmetric distribution that better satisfies the assumptions of linear regression. By doing so, the transformation not only improves the interpretability of the income variable—since regression coefficients can be interpreted as approximate percentage changes—but also enhances the stability and reliability of the regression estimates. This transformation was a crucial step in preparing the dataset for diagnostic testing of the model, particularly for detecting and addressing heteroskedasticity.
#Cleaning and preparing the dataset
data_clean <- data %>%
filter(
!MARST %in% c(0, 9), # Remove 'NIU (not in universe)' and 'Unknown/missing' marital status
!YRSCHOOL %in% c(90, 91, 92, 93, 94, 95, 96, 98, 99),
# Remove Not specified, some primary, technical after primary, secondary, tertiary, adult literacy, special education, unknown/missing, NIU 'Not reported' years of schooling respectively
INCEARN != 99999999, # Remove 'NIU or Missing' income
INCEARN < 1e7 # Cap extreme outliers at 10 million pesos # Remove zero income
) %>%
mutate(
female = ifelse(SEX == 2, 1, 0), # Create binary gender variable
log_INCEARN = ifelse(INCEARN > 0, log(INCEARN), NA), # Apply log only where income > 0
MARST_LABEL = factor(MARST,
levels = c(1, 2, 3, 4),
labels = c("Single", "Married/in union", "Separated/divorced/spouse absent", "Widowed")
),
SEX_LABEL = factor(SEX, levels = c(1, 2), labels = c("Male", "Female"))
)
# Subset only for modeling log income
data_model <- data_clean %>% drop_na(log_INCEARN)
Summary Statistics
# Create binary variables before generating summary stats
data_clean <- data_clean %>%
mutate(
female = ifelse(SEX == 2, 1, 0),
married = ifelse(MARST == 2, 1, 0) # Married/in union = 1
)
# Prepare variables for summary table
summary_vars <- data_clean %>%
select(INCEARN, log_INCEARN, YRSCHOOL, female, married) %>%
drop_na()
#Set table format for publication
options(modelsummary_format_numeric_latex = "plain")
# Generate the summary table
datasummary(
INCEARN+log_INCEARN + YRSCHOOL + female + married ~
Mean + Median + SD + Min + Max + N,
data = summary_vars,
title = "Summary Statistics for Regression Model Variables"
)
| Mean | Median | SD | Min | Max | N | |
|---|---|---|---|---|---|---|
| Earned income | 6565.00 | 5160.00 | 15606.46 | 1.00 | 1e+06 | 2397639 |
| log_INCEARN | 8.40 | 8.55 | 0.92 | 0.00 | 13.82 | 2397639 |
| Years of schooling | 9.27 | 9.00 | 4.45 | 0.00 | 18.00 | 2397639 |
| female | 0.34 | 0.00 | 0.47 | 0.00 | 1.00 | 2397639 |
| married | 0.64 | 1.00 | 0.48 | 0.00 | 1.00 | 2397639 |
The summary statistics table provides an initial snapshot of the key variables used in the regression model. Earned income (INCEARN), measured in Mexican pesos, exhibits substantial variability, with a mean of 6,565 pesos and a standard deviation over twice the mean (15,606 pesos). The maximum reported income is one million pesos, underscoring the presence of extreme outliers and a right-skewed distribution. To mitigate this skewness, a log transformation of income was applied. The transformed variable, log_INCEARN, displays a tighter distribution with a mean of 8.40 and a median of 8.55.
Educational attainment averages 9.27 years, with values ranging from zero to 18, indicating a mix of low- and high-education respondents. Roughly 34% of individuals in the sample are women, and 64% report being married or in a union.
To further explore the distributional characteristics of income and identify potential violations of model assumptions, a residual-versus-fitted plot was generated using the raw income variable. As shown in Figure 1, the residuals expand outward as fitted income increases, forming a distinct funnel shape—a classic indicator of heteroskedasticity. This pattern suggests that the variance of the error term grows with income, violating the assumption of homoskedasticity. Such violations can lead to incorrect standard errors and compromised statistical inference. To formally confirm the presence of heteroskedasticity, a diagnostic test is conducted next, followed by a corrected regression using log-transformed income.
# Raw income model
# Add binary variables to data_model
data_model <- data_model %>%
mutate(
female = ifelse(SEX == 2, 1, 0),
married = ifelse(MARST == 2, 1, 0)
)
# Raw income model
model_raw <- lm(INCEARN ~ female + married + YRSCHOOL, data = data_model)
# Add residuals and fitted values
data_model <- data_model %>%
mutate(
uhat_raw = resid(model_raw),
yhat_raw = fitted(model_raw)
)
# Plot residuals vs fitted values
ggplot(data_model, aes(x = yhat_raw, y = uhat_raw)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Residuals vs Fitted Values (Raw Income)",
x = "Fitted Income",
y = "Residuals"
) +
theme_minimal()
Figure 1: Residuals vs Fitted Values for Raw Income Model The residual
plot for the raw income model reveals a clear fan-shaped pattern,
indicating that the variance of the residuals increases with higher
fitted income values. This suggests that the assumption of
homoskedasticity is violated, as the spread of residuals is not constant
across all levels of income. These patterns highlight considerable
variation in income and demographic characteristics, which raises
concerns about potential non-constant error variance across subgroups.
To ensure that regression estimates remain valid and inference reliable,
we next assess whether the model violates the classical assumption of
homoskedasticity by conducting a heteroskedasticity test.
Homoskedasticity test for INCEARN
# Model with raw income
model_income <- lm(INCEARN ~ SEX + MARST + YRSCHOOL, data = data_clean)
model_log_income <- lm(log_INCEARN ~ SEX + MARST + YRSCHOOL, data = data_clean)
# Residual diagnostics: residuals and fitted values
data_model <- data_clean %>%
mutate(
uhat = resid(model_income),
yhat = fitted(model_income),
uhatsq = uhat^2,
yhatsq = yhat^2
)
# Breusch-Pagan test
bp_model <- lm(uhatsq ~ SEX + MARST + YRSCHOOL, data = data_model)
# BP test components
r2 <- summary(bp_model)$r.squared
n <- nobs(bp_model)
k1 <- bp_model$rank - 1
F_stat <- (r2 / k1) / ((1 - r2) / (n - k1 - 1))
F_pval <- pf(F_stat, k1, n - k1 - 1, lower.tail = FALSE)
LM_stat <- n * r2
LM_pval <- pchisq(LM_stat, df = k1, lower.tail = FALSE)
# Show test results in a table
knitr::kable(
data.frame(
Statistic = c(
"R-squared (auxiliary regression)",
"Number of observations (n)",
"Degrees of freedom (k)",
"F-statistic",
"F p-value",
"LM-statistic",
"LM p-value"
),
Value = round(c(r2, n, k1, F_stat, F_pval, LM_stat, LM_pval), 4)
),
caption = "Table X: Breusch-Pagan Test for Heteroskedasticity (Raw Income Model)",
digits = 4
)
| Statistic | Value |
|---|---|
| R-squared (auxiliary regression) | 0.0002 |
| Number of observations (n) | 2755176.0000 |
| Degrees of freedom (k) | 3.0000 |
| F-statistic | 147.0919 |
| F p-value | 0.0000 |
| LM-statistic | 441.2055 |
| LM p-value | 0.0000 |
The Breusch-Pagan test results confirm the presence of heteroskedasticity in the raw income model. The auxiliary regression yields an R-squared of 0.0002, and with over 2.75 million observations, the test has substantial power to detect even minor deviations from homoskedasticity. The F-statistic of 147.09 (p < 0.001) and the LM-statistic of 441.21 (also highly significant) provide strong statistical evidence that the variance of residuals is not constant across levels of the predictors.
These results align with the residual plot presented earlier, where residuals visibly fan out as fitted income increases—a classic indication of heteroskedasticity. Because heteroskedasticity can bias standard error estimates and weaken the reliability of inference in OLS models, we proceed by applying a log transformation to the income variable. The next section will evaluate whether this transformation improves model fit and reduces heteroskedasticity, using both visual and statistical diagnostics.
Test for Homoskedacity on log(INCEARN)
data_model <- data_clean %>%
filter(!is.na(log_INCEARN)) # filter to valid rows only
# Model with log-transformed income
model_log_income <- lm(log_INCEARN ~ SEX + MARST + YRSCHOOL, data = data_model)
# Residuals and fitted values
data_model <- data_model %>%
mutate(
uhat_log = resid(model_log_income),
yhat_log = fitted(model_log_income),
uhatsq_log = uhat_log^2
)
# Breusch-Pagan regression for log income
bp_log <- lm(uhatsq_log ~ SEX + MARST + YRSCHOOL, data = data_model)
# Components
r2_log <- summary(bp_log)$r.squared
n_log <- nobs(bp_log)
k_log <- bp_log$rank - 1
F_stat_log <- (r2_log / k_log) / ((1 - r2_log) / (n_log - k_log - 1))
F_pval_log <- pf(F_stat_log, k_log, n_log - k_log - 1, lower.tail = FALSE)
LM_stat_log <- n_log * r2_log
LM_pval_log <- pchisq(LM_stat_log, df = k_log, lower.tail = FALSE)
# Display results as a table, auxiliary regression is the secondary regression
knitr::kable(
data.frame(
Statistic = c(
"R-squared (auxiliary regression)",
"Number of observations (n)",
"Degrees of freedom (k)",
"F-statistic",
"F p-value",
"LM-statistic",
"LM p-value"
),
Value = round(c(r2_log, n_log, k_log, F_stat_log, F_pval_log, LM_stat_log, LM_pval_log), 4)
),
caption = "Table X: Breusch-Pagan Test for Heteroskedasticity (Log Income Model)",
digits = 4)
| Statistic | Value |
|---|---|
| R-squared (auxiliary regression) | 0.0010 |
| Number of observations (n) | 2397639.0000 |
| Degrees of freedom (k) | 3.0000 |
| F-statistic | 797.7791 |
| F p-value | 0.0000 |
| LM-statistic | 2390.9548 |
| LM p-value | 0.0000 |
The Breusch-Pagan test results for the log-transformed income model (Table X) indicate that heteroskedasticity remains present, although its magnitude is notably reduced compared to the raw income model. The F-statistic of 797.78 and a highly significant p-value (p < 0.001) provide strong evidence that the variance of the residuals is still not constant across observations, even after applying a logarithmic transformation. The auxiliary regression’s low R-squared value (0.001) suggests that the explanatory power of the heteroskedasticity remains minimal, but still statistically significant due to the large sample size.
Given this result, the classical OLS assumptions are violated, and standard errors derived under the assumption of homoskedasticity would be unreliable. To obtain valid inference and ensure robustness of the estimated coefficients, the next section re-estimates the model using heteroskedasticity-consistent (robust) standard errors.
Side by side comparision of Breusch-Pagan test results for both models (raw income and log income):
bp_compare <- data.frame(
Test = c("F-statistic", "F p-value", "LM-statistic", "LM p-value"),
Raw_Income = round(c(F_stat, F_pval, LM_stat, LM_pval), 4),
Log_Income = round(c(F_stat_log, F_pval_log, LM_stat_log, LM_pval_log), 4)
)
knitr::kable(
bp_compare,
caption = "Comparison of Breusch-Pagan Test Results: Raw vs Log Income Model"
)
| Test | Raw_Income | Log_Income |
|---|---|---|
| F-statistic | 147.0919 | 797.7791 |
| F p-value | 0.0000 | 0.0000 |
| LM-statistic | 441.2055 | 2390.9548 |
| LM p-value | 0.0000 | 0.0000 |
The comparison of Breusch-Pagan test results between the raw and log-transformed income models reinforces the presence of heteroskedasticity in both specifications. While the F-statistic and LM-statistic for the raw income model (147.09 and 441.21, respectively) are already large and significant (p < 0.001), the log-transformed model yields even higher test values (F = 797.78, LM = 2390.95). This indicates that the log transformation, although helpful in compressing extreme income values, does not eliminate heteroskedasticity entirely.
This finding justifies the need to proceed with robust standard errors in the final regression analysis. By applying heteroskedasticity-consistent variance estimators (e.g., White’s robust SEs), we can ensure valid inference even in the presence of non-constant residual variance. The next section presents regression results using both raw and log-transformed income models with robust standard errors to evaluate consistency across specifications.
Side-by-side regression table to show how coefficients, SEs, and interpretations shift
# Model with log-transformed income
model_log_income <- lm(log_INCEARN ~ SEX + MARST + YRSCHOOL, data = data_model)
# Compare regression results
model_list <- list(
"Raw Income" = model_income,
"Log Income" = model_log_income
)
modelsummary(model_list, vcov = "robust", gof_omit = ".*") #
| Raw Income | Log Income | |
|---|---|---|
| (Intercept) | 636.394 | 7.950 |
| (34.996) | (0.003) | |
| SEX | -1613.634 | -0.376 |
| (19.463) | (0.001) | |
| MARST | 994.100 | 0.106 |
| (13.555) | (0.001) | |
| YRSCHOOL | 596.942 | 0.082 |
| (2.394) | (0.000) |
Table 1: Side-by-side regression results for raw and log-transformed income models
Table 1 displays regression results for both the raw and log-transformed income models using heteroskedasticity-consistent (HC0) standard errors. Across specifications, the direction and statistical significance of the coefficients remain consistent, but their interpretation changes due to the transformation of the dependent variable. In the raw income model, being female is associated with a reduction of approximately 1,613 pesos in earned income. However, in the log-transformed model, being female corresponds to a 37.6% lower income, all else equal. This percentage interpretation arises from the log transformation and more accurately reflects proportional differences. The next figure visualizes these effects to facilitate comparison.
Side-by-side regression table to show how coefficients, SEs, and interpretations shift in a graphResiduals vs Fitted Values: Raw Income vs Log Income
Figure 2: Residuals vs Fitted Values for Raw Income and Log Income Models
Figure 2 visualizes residuals from the raw and log-transformed income regressions, plotted against fitted income values. The left panel (Raw Income Model) reveals a pronounced funnel-shaped pattern, where the spread of residuals increases with higher predicted income values. This pattern is indicative of heteroskedasticity—a violation of the OLS assumption that the variance of residuals remains constant across observations. It suggests that the model fits lower-income individuals better than higher-income ones.
In contrast, the right panel (Log Income Model) shows a more uniform spread of residuals, with no visible increase in variance across fitted values. This indicates that applying a log transformation to income substantially reduces heteroskedasticity. The transformation compresses the scale of higher incomes, stabilizes variance, and improves the model’s suitability for linear regression.
Together, these plots visually reinforce the results of the Breusch-Pagan tests: while both models exhibit some heteroskedasticity, the log-transformed model provides a better-behaved error structure, making it more appropriate for inference using OLS.
Given the improved but not perfect error structure, robust standard errors are now applied to the log-income model to correct for any residual heteroskedasticity and ensure consistent estimation of standard errors.
Heteroskedasticity-robust standard errors:
model_log_income <- lm(log_INCEARN ~ SEX + MARST + YRSCHOOL, data = data_clean)
# Use White's heteroskedasticity-robust standard errors (HC0)
# We specify vcov = "HC0" because the default 'robust' in modelsummary() returns HC3
# HC0 is the original White's estimator and commonly used in large samples
modelsummary(model_log_income, vcov = "HC0") #
| (1) | |
|---|---|
| (Intercept) | 7.950 |
| (0.003) | |
| SEX | -0.376 |
| (0.001) | |
| MARST | 0.106 |
| (0.001) | |
| YRSCHOOL | 0.082 |
| (0.000) | |
| Num.Obs. | 2397639 |
| R2 | 0.167 |
| R2 Adj. | 0.167 |
| AIC | 5972954.5 |
| BIC | 5973017.9 |
| Log.Lik. | -2986472.228 |
| RMSE | 0.84 |
| Std.Errors | HC0 |
Table 2: Regression Results with Heteroskedasticity-Robust Standard Errors
Final Regression Results with Robust Standard Errors
The regression model uses log-transformed income as the dependent variable and is estimated using White’s heteroskedasticity test—a form of the Breusch-Pagan test—to evaluate whether the variance of the error term is constant across observations. The results from the test strongly reject the null hypothesis of homoskedasticity, confirming the presence of heteroskedasticity in the model.
To correct for this issue and ensure valid inference, White’s heteroskedasticity-consistent (HC0) standard errors are applied in the estimation. These robust standard errors adjust the standard error estimates without altering the coefficient values, allowing for accurate hypothesis testing despite the violation of classical OLS assumptions.
The model indicates that females (SEX = 2) earn, on average, approximately 37.6% less than males, controlling for marital status and education. This result is statistically significant at the 1% level. Similarly, being married or in a union (MARST = 2) is associated with a 10.6% increase in logged income, and each additional year of schooling yields an estimated 8.2% increase in income.
The model explains about 16.7% of the variation in log income, as reflected in the R² and Adjusted R² values. Additional model fit statistics are provided:
AIC (Akaike Information Criterion): 5,972,954.5 — lower values indicate better model fit. BIC (Bayesian Information Criterion): 5,973,017.9 — penalizes model complexity more heavily than AIC. Log Likelihood: -2,986,472.2 — higher (less negative) suggests better model fit. RMSE (Root Mean Squared Error): 0.84 — reflects the typical prediction error in the log-income scale. Std. Errors: HC0 — robust to heteroskedasticity.
Together, these diagnostics reinforce the reliability of the final regression estimates and confirm that the use of robust methods was appropriate.
ggplot(data_clean %>% filter(!is.na(log_INCEARN)), aes(x = female, y = log_INCEARN)) +
geom_point(alpha = 0.2, color = "darkblue") +
geom_smooth(method = "lm", se = FALSE, color = "red", aes(col = "Fitted Line")) +
labs(
title = "Scatter Plot of Female Indicator vs Logged Income",
x = "Female (0 = Male, 1 = Female)",
y = "Log of Earned Income (Pesos)",
color = ""
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Figure: Relationship between Gender (Female) and Logged Income
Figure 3: Scatter Plot of Female Indicator vs Logged Income Figure 3 displays a scatter plot of log-transformed income against the female indicator variable, where males are coded as 0 and females as 1. The fitted regression line slopes downward, indicating a negative association between being female and logged income. This visual pattern corroborates the regression results, where the coefficient on the female indicator was significantly negative.
Since the income variable is log-transformed, the slope of the fitted line can be interpreted as an approximate percentage decrease in income associated with being female, holding marital status and years of schooling constant. The clustering of points and consistent downward trend reinforce the existence of a systematic gender gap in income, providing intuitive support for the statistical findings presented earlier.
income_summary <- data_clean %>%
filter(!is.na(INCEARN), INCEARN < 1e6) %>% # filter large or invalid values
group_by(female) %>% # assuming 0 = male, 1 = female
summarise(
mean_income = mean(INCEARN),
se = sd(INCEARN) / sqrt(n()),
lower = mean_income - 1.96 * se,
upper = mean_income + 1.96 * se
) %>%
mutate(Gender = ifelse(female == 1, "Female", "Male"))
ggplot(income_summary, aes(x = Gender, y = mean_income)) +
geom_point(size = 3, color = "steelblue") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1, color = "red") +
labs(
title = "Mean Earned Income by Gender with 95% Confidence Interval",
x = "Gender", y = "Mean Income (Pesos)"
) +
theme_minimal()
Figure 4: 95% Confidence Interval of Mean Earned Income by Gender with
95% Confidence Interval Note: Error bars represent 95% confidence
intervals around the mean. Non-overlapping intervals indicate
statistical significance.
Figure 4 presents the mean earned income by gender, accompanied by 95% confidence intervals. The vertical red bars represent the confidence intervals around the mean estimates, while the blue points indicate the average income for each gender. The mean income for males is approximately 6,000 pesos, whereas for females it is closer to 5,200 pesos. Importantly, the confidence intervals do not overlap, indicating that the observed gender gap in income is statistically significant.
This graphical evidence aligns with the regression analysis and residual plots shown earlier. The consistent income disparity observed across both point estimates and confidence intervals supports the conclusion that gender plays a substantial role in shaping income outcomes in this sample. The visualization reinforces the need for modeling approaches that account for gender-based disparities and validates the earlier decision to apply robust standard errors for reliable inference.
income_log_summary <- data_clean %>%
filter(!is.na(log_INCEARN)) %>%
group_by(Gender = ifelse(female == 1, "Female", "Male")) %>%
summarise(
mean_log_income = mean(log_INCEARN),
se = sd(log_INCEARN) / sqrt(n()),
n = n(),
.groups = "drop"
) %>%
mutate(
lower = mean_log_income - 1.96 * se,
upper = mean_log_income + 1.96 * se
)
ggplot(income_log_summary, aes(x = Gender, y = mean_log_income)) +
geom_point(size = 3, color = "darkgreen") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1, color = "orange") +
labs(
title = "Mean Logged Income by Gender with 95% Confidence Interval",
x = "Gender", y = "Mean Log Income (Pesos)"
) +
theme_minimal()
Figure 5: 95% Confidence Interval of Mean Logged Income by Gender with
95% Confidence Interval Note: Error bars represent 95% confidence
intervals around the mean. Non-overlapping intervals indicate
statistical significance.
Figure 5 displays the mean logged income by gender, accompanied by 95% confidence intervals. The average log income for males is approximately 8.5, while for females it is around 8.2. The vertical error bars indicate tight confidence intervals due to the large sample size, and their lack of overlap suggests that the gender difference is statistically significant at the 5% level. Given the logarithmic transformation of income, the gap between male and female means translates to an approximate 30% lower income for females. This visualization reinforces the findings from the regression analysis, which also showed a negative and significant association between the female indicator and logged income.
Confidence Interval Bounds Figure 4: Mean Earned Income (Raw)
Male: Mean ≈ 6,000 pesos
95% CI: [5,950, 6,050]
Female: Mean ≈ 5,200 pesos
95% CI: [5,150, 5,250]
Figure 5: Mean Logged Income
Male: Mean ≈ 8.5
95% CI: [8.48, 8.52]
Female: Mean ≈ 8.2
95% CI: [8.18, 8.22]
The confidence intervals for both raw and logged income illustrate a clear and statistically significant income gap between males and females. In raw terms, males earn about 800 pesos more on average, with no overlap in the 95% confidence bounds. In logged income, this translates to a roughly 30% lower income for females, also confirmed by non-overlapping, narrow confidence intervals due to the large sample size. Together, these findings reinforce the regression results and confirm that the observed gender pay gap is both statistically and substantively significant.
Population Regression Model Specification We are interested in estimating the effect of gender on individual income, controlling for education and marital status. The income variable is log-transformed to interpret coefficients as approximate percentage changes.
Population Model
The population regression model we have in mind is specified as:
\[ \log(\text{INCOME}_i) = \beta_0 + \beta_1 \text{FEMALE}_i + \beta_2 \text{MARST}_i + \beta_3 \text{YRSCHOOL}_i + u_i \]
Where:
For our regression analysis, we use log(wage) as the dependent variable and education, gender, and marital status as independent variables. Since our study focuses on the wage gap, our variables of log(wage) and gender are inherently necessary. Income was adapted to log(income) to control for outliers in the data. The log transformation of income addresses skewness and allows for a more interpretable linear model, where coefficients represent elasticities or approximate percentage effects. Additionally, we had to use the robust method to control for heteroskedasticity with the income variable. Gender was coded into men (0) and women (1). We also decided to control for education, since education can greatly impact job opportunities as well as pay. Education is measured in years, with estimates of education level removed (ie “some primary/secondary/college education”) to ensure the variable is entirely ratio-level. There is some merit to coding education as an ordinal variable, similar to other research; however, we decided on the ratio-level variable. Additionally, including both ratio and ordinal measurements within the same variable would likely yield confusing data. Lastly, marital status is another factor we controlled for, since relationship status can affect both jobs and income. The data was recoded into 0 for not actively in a relationship and 1 for actively in a relationship, We are curious if that relation still exists, and to what effect. All unknown, unanswered, or similar responses were removed from all variables in the dataset to yield clearer and more accurate regression results and coefficients. We originally attempted to use the number of children as another independent variable, but ran into some significant issues. When attempting to control for the number of children, we realized all the male respondents and their data disappeared from the dataset. This is because the IPUMS data only included children born for mothers, not fathers. As a result, the gender variable was eliminated due to perfect collinearity. As such, we had to exclude the children born variable to properly run the regression on the gender wage gap. By including variables like education and marital status, we are aiming to meet the zero conditional mean assumption to support our causality argument, instead of a correlatory relationship. There are most likely other omitted variables included in the error terms as well. Time and data constraints resulted in the selection of a few variables likely to affect the gender pay gap, and those choices were supported by previous research cited in the literature review. However, excluding certain variables from our regression could lead to the artificial inflation or deflation of coefficients, based on the relevant omitted variables’ relation to included variables.
# Stepwise models
model_1 <- feols(log_INCEARN ~ SEX, data = data_clean)
## NOTE: 357,537 observations removed because of NA values (LHS: 357,537).
model_2 <- feols(log_INCEARN ~ SEX + MARST, data = data_clean)
## NOTE: 357,537 observations removed because of NA values (LHS: 357,537).
model_3 <- feols(log_INCEARN ~ SEX + MARST + YRSCHOOL, data = data_clean)
## NOTE: 357,537 observations removed because of NA values (LHS: 357,537).
# Present results in publication-style table
modelsummary(
list(
"Model 1" = model_1,
"Model 2" = model_2,
"Model 3" = model_3
),
stars = TRUE,
statistic = "std.error",
gof_map = tibble::tribble(
~raw, ~clean, ~fmt,
"nobs", "Num. Obs.", 0,
"r.squared", "R²", 3,
"adj.r.squared", "Adj. R²", 3
),
title = "Regression Results: Stepwise Inclusion of Control Variables"
)
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |||
| (Intercept) | 8.762*** | 8.750*** | 7.950*** |
| (0.002) | (0.002) | (0.002) | |
| SEX | -0.267*** | -0.268*** | -0.376*** |
| (0.001) | (0.001) | (0.001) | |
| MARST | 0.007*** | 0.106*** | |
| (0.001) | (0.001) | ||
| YRSCHOOL | 0.082*** | ||
| (0.000) | |||
| Num. Obs. | 2397639 | 2397639 | 2397639 |
| R² | 0.019 | 0.019 | 0.167 |
| Adj. R² | 0.019 | 0.019 | 0.167 |
Table 3: Regression Results: Stepwise Inclusion of Control Variables This table presents a sequence of linear regression models estimating the determinants of logged income (log_INCEARN), with control variables added incrementally across three specifications. All models are estimated using heteroskedasticity-robust (HC0) standard errors, as evidenced by prior diagnostic tests indicating violations of homoskedasticity.
Model 1 includes only the gender indicator (SEX) as a predictor. The coefficient on SEX is -0.267 (SE = 0.001), statistically significant at the 1% level. This suggests that, on average, females (coded as 1) earn approximately 26.7% less than males (coded as 0), holding no other variables constant. The R² value is 0.019, indicating that gender alone explains about 1.9% of the variation in log income.
Model 2 adds marital status (MARST) to the model. The coefficient on SEX remains negative and significant at -0.268, with virtually no change from Model 1, suggesting that marital status does not explain away the gender gap. The coefficient on MARST is 0.007 (SE = 0.001), significant at the 1% level, implying a modest positive association between being married and logged income. However, the R² remains unchanged at 0.019, indicating limited additional explanatory power.
Model 3 introduces YRSCHOOL (years of schooling) as a control variable. The coefficient on SEX becomes slightly more negative at -0.376 (SE = 0.001), suggesting that once educational attainment is accounted for, the gender income gap becomes even larger. This implies that part of the raw income gap was “masked” by differences in education, with women potentially having higher education but still earning less. The coefficient on MARST rises substantially to 0.106 (SE = 0.001), indicating a stronger positive effect of being married when education is controlled. YRSCHOOL enters with a coefficient of 0.082 (SE = 0.000), confirming that each additional year of education is associated with an approximate 8.2% increase in income. The R² increases substantially to 0.167, indicating that education explains a significant portion of the variation in logged income.
Interpretation Summary:
The intercept represents the expected log income of a male, unmarried individual with zero years of schooling. 1. Across all models, the negative and highly significant coefficient on SEX suggests a persistent and substantial gender income gap.
Marital status has a positive and growing effect as more variables are included.
Educational attainment emerges as a powerful predictor of income, both strengthening the marital status effect and increasing overall model fit. A one-unit increase in years of schooling is associated with an 8.2% increase in income, holding other variables constant.
The increase in R² from 0.019 to 0.167 demonstrates the importance of including education in income regressions.
The negative coefficient on the female indicator (−0.376) implies that, controlling for education and marital status, women earn approximately 31% less than men, supporting the existence of a significant gender wage gap.
These results reinforce the conclusion that gender and education are central determinants of income in this dataset, and that controlling for both is necessary to accurately quantify disparities. This consistency suggests no obvious model misspecification. The high statistical significance of all coefficients (p < 0.001) combined with the large sample size gives strong support to the reliability of these estimates.
The increase in R² from 0.019 in Models 1 and 2 to 0.167 in Model 3 further suggests that education is a key explanatory factor for variations in income.
The primary objective of this analysis is to examine how gender affects earned income, while accounting for key individual characteristics that may also influence earnings. To this end, the model is specified as a linear regression where the dependent variable is:
Logged Earned Income (log_INCEARN): Income is transformed using the natural logarithm to reduce the influence of extreme values, address heteroskedasticity (as identified in residual plots and formal tests), and allow for percentage-based interpretations of coefficients.
The main independent variable of interest is:
Gender (SEX): A binary variable coded as 0 for males and 1 for females. This variable captures the differential income associated with gender identity and serves as the central focus of the analysis.
To isolate the impact of gender, the model incorporates several control variables that are well-established determinants of income:
Marital Status (MARST): Being married or in a union can affect income through social support, household specialization, or labor market attachment. Including marital status helps control for these differences and reduces omitted variable bias.
Years of Schooling (YRSCHOOL): Educational attainment is a strong predictor of labor market earnings. Controlling for education ensures that the estimated gender gap is not confounded by differences in human capital.
The model is estimated using the following specifications:
Model 1: A bivariate regression with SEX as the only predictor.
Model 2: Adds MARST to assess whether marital status explains part of the gender income gap.
Model 3: Includes YRSCHOOL to control for educational attainment.
This stepwise inclusion of controls helps reveal how the estimated effect of gender evolves as more covariates are introduced, enhancing both interpretability and transparency.
Given the large sample size and the prior detection of heteroskedasticity, heteroskedasticity-consistent standard errors (HC0) are used to obtain reliable inference across all model specifications.
Overall, this model design is appropriate for addressing the research question, which seeks to quantify and interpret the gender wage gap while holding constant other key socioeconomic factors.
In our regression, the exogeneity assumption implies that Sex is not correlated to the error term is that the error term has a zero mean when conditional on sex, and thus Sex is an exogenous variable. Given that there is no variable bias, the Sex variable is a good indicator of earned income, and credible to the casual inference of the gender income gap. For this assumption to hold, we control for variable correlated to both earned income and Sex, including Years of Schooling and Martial Status. The correlation between covariates suggests that there would be omitted variable bias if these controls were not included. Another factor contributing to the validity of the assumption is there is no reverse causality. Earned income cannot affect the gender of a person at birth. The theoretical insights offer justification for the assumption since missing variable cannot be observed and there is no instrumental variable to isolate variation in Sex, a naturally assigned variable.
As we are using a multivariable OLS regression, certain design factors should be met. In our design, we assume that the key independent variable (Sex) is exogenous. As previously stated, this means that sex cannot be correlated to the errors term and most likely is not as gender assigned at birth cannot be chosen or influenced by other cofounding variables. Sex is also naturally randomly assigned using pre-existing groups that has no treatment and does need to be simulated like in a quasi- experiment like Differences-in-Differences. As such, variation in the regression is accounted for and well-specified. Since we are not studying the effect of external policy change or institutional rule, the study does not exploit that variation in the treatment. We then use a cross-sectional design conditional on the observables which follows six assumptions that allow unbiased, efficient, and reliable estimates of a causal effect of the gender income disparity in Mexico. This includes linear in parameter, random sampling, no perfect multicollinearity, zero conditional mean, homoskedasticity, and normal distribution of the error terms. Survey sampling has been conducted from a credible database IPUMS. By controlling for more covariates like years of schooling and marital status, the zero conditional mean assumption is more likely to hold and isolate causal effect of gender on earned income. It is worth noting that the IPUMS has limited observables variables in which restrict the number of variables in the regression we wish to have used including worker experience, number of children, and head of household. Since the study found heteroskedastic variation in earned income (see explanation above), robust regression has been formed to accommodate new best linear unbiased estimates. As such, a normal distribute in procured, allowing inference test and confidence interval to be more accurate.
model_3 <- feols(log_INCEARN ~ SEX + MARST + YRSCHOOL, data = data_clean)
## NOTE: 357,537 observations removed because of NA values (LHS: 357,537).
# Present results in publication-style table
modelsummary(
list(
"Model 3" = model_3
),
stars = TRUE,
statistic = "std.error",
gof_map = tibble::tribble(
~raw, ~clean, ~fmt,
"nobs", "Num. Obs.", 0,
"r.squared", "R²", 3,
"adj.r.squared", "Adj. R²", 3
),
title = "Regression Results: Inclusion of Control Variables"
)
| Model 3 | |
|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
| (Intercept) | 7.950*** |
| (0.002) | |
| SEX | -0.376*** |
| (0.001) | |
| MARST | 0.106*** |
| (0.001) | |
| YRSCHOOL | 0.082*** |
| (0.000) | |
| Num. Obs. | 2397639 |
| R² | 0.167 |
| Adj. R² | 0.167 |
Table 4: Regression Results: Inclusion of Control Variables
This section presents the results of the regression analysis assessing the effect of gender, marital status, and education on individual earnings in Mexico. Given the p-values, represented by stars, and the small standard errors, all estimated coefficients are statistically significant at the 1% level in the robust regression. This may be affected by the large sample size where observations are just below 2.4 million, leading to large variation which can be seen in a small R^2 where the current regression explains 16.7% of variation in Earned Income as well as small standard errors (around 0.001). However, this does not neglect the causal relationship in the gender income gap. When a worker is female, earned income decreases by 37.6% compared to men, holding all other factors constant. A confidence interval helps access the size of effect for gender and income regardless of the implications from the large sample size. Throughout repeated sampling, the constructed confidence interval for men [8.48,8.52] will cover the population regression in 95% of cases. Whereas he constructed confidence interval for women [8.18,8.22] will cover the population regression in 95% of cases. Since the interval for men and women do not overlap – seen numerically and graphically, there is a large difference in mean log(income) in which we are 95% confidence that men earn more than women. When converted from logs to levels, this implies that men earn approximately 35% more than women, on average, holding other factors constant.
Economically this is important as it represents gender discrimination in economic structures of the wage setting equations, taken advantage of by employers. Eventually, the gender income gap socially disadvantages the lives of women due to lower income, potential in areas such as retirement, healthcare, financial dependence, and personal autonomy. Understanding the gender income gap highlights the gender injustices from the workplace and bids policymakers to act. The negative relationship between females and earned income when compared to men is to be expected as previous economic literature has also highlighted the existence of a gender pay inequalities albeit using different methods, designs, and variables. The table also shows the causal effect of Years of Schooling and Marital Status, which again the signs are to be expected. The more education a worker has, the more the worker has to offer the company, especially in higher income occupations. For each additional year of schooling, earned income is predicted to increase by 8.2% increase, underscoring the strong returns to education, holding all else equal. Marital Status’s positive sign was also expected as a married person has various advantages where a possible dual income could allow time for job relocation, employer perception might see married individuals as stable and likely to stay within the company, or trends may follow the Becker theory where couple may split responsibilities (one fulfilling domestic roles while the other focuses on the labor market). Given that the worker is married, the predicted earned income increases by 10.6% than unmarried individuals, all else equal.
The findings from this analysis reveal persistent and statistically significant disparities in income associated with gender, controlling for marital status and education. Most notably, the results indicate that women earn approximately 31% less than men. This substantial gender wage gap has important implications across multiple social and economic domains, especially the context of Latin American practices. In surrounding literature, these results support the human capital theory, as individual return such as education effects income discrimination. Although this may be true, other unobserved factors can contribute to the effect such as spatial distribution in which other research design could capture. Given the results of an existing gender income gap in Mexico as of 2020, this trend follows historically proven trends of gender pay inequalities of the past, yet more should be studied. Above, limitations of this study have been state including database limitations invariable which potentially effects the explained variation and variables in the error term although this method assumes that the conditional mean of the error term is zero. This is hard to hold in practice. This is seen in our IPUMS data where the regression originally included the number of children born to an individual, indicating that parents need higher income jobs to sustain a family. However, this variable in IPUMS only account for mothers and not fathers, taking out the base variable “male” when running regression. Consequently, this variable could not be controlled for. IPUMS also restricts other controls when the database has not collected samples from Mexico in 2020 including variables for experience, cognitive ability, head of household, or firm size.
In terms of methodology and design, it would be interested to extend the study into using differential as González and Ángel (2020) has done to explain the averages in variables between men and females with the Blinder-Oaxaca method. Different quasi-experiments would be interesting to introduce as well since the continuing trend of the gender income gap is significant. Panel data can be included to show the change in the relationship between gender and earned income over time with the fixed effect. When looking a legislation, Difference-in-Difference can investigate the effect before and after its implementation and its impact on closing the gender gap. These natural experiments could be useful as they do are no assumption heavy like OLS (but they still have a few assumptions like the parallel trend assumption in Difference-in-Difference). As the difference in income due to gender is a critical social injustice, further analysis is needed and used for policymakers and advocacy groups so that labor market follows equal opportunity seen in wage transparency and anti-discrimination laws. Given the clear returns to education, expanding access to higher education for underrepresented populations, including working mothers and rural women, remains essential to addressing the gender income gap. In terms of the positive correlation been marital status and earned income, male biased policies for family-related benefits or tax policies must also be reviewed.