library(tidyverse)
library(readxl)
wages <- read_excel("Wage_GenderDS.xlsx") |>
  mutate(
    l_wage = log(Wage),
    Gender = ifelse(Female == 1, "Women", "Men")
  )

ECON 465 Quiz: Wage Gender Gap
Part 1: Distribution of Wage
1. Histogram of Wage
ggplot(wages, aes(x = Wage)) +
  geom_histogram(binwidth = 20, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Hourly Wage",
       x = "Wage (dollars/hour)", y = "Count") +
  theme_minimal()

The distribution is right-skewed. Most workers earn between 30 and 150 dollars per hour, but a long tail of high earners pulls the mean above the median, as is typical of wage data.
2. Boxplot of Wage by Gender
ggplot(wages, aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Men" = "pink", "Women" = "salmon")) +
  labs(title = "Wage Distribution by Gender",
       x = "Gender", y = "Wage (dollars/hour)") +
  theme_minimal() +
  theme(legend.position = "none")

The median wage is higher for men (111 dollars) than for women (83.50 dollars), and men have a wider IQR. Both groups have high-wage outliers, with a slightly higher maximum for men (384 vs. 364 dollars).
3. Summary Statistics by Gender
wages |>
  group_by(Gender) |>
  summarise(
    Mean   = round(mean(Wage), 2),
    Median = round(median(Wage), 2),
    SD     = round(sd(Wage), 2),
    Min    = min(Wage),
    Max    = max(Wage)
  )

# A tibble: 2 × 6
  Gender  Mean Median    SD   Min   Max
  <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Men    125.   111    57.3    38   384
2 Women   97.3   83.5  46.3    32   364
mean_men <- wages |> filter(Female == 0) |> pull(Wage) |> mean()
mean_women <- wages |> filter(Female == 1) |> pull(Wage) |> mean()
cat("Mean wage Men: $", round(mean_men, 2), "\n")
Mean wage Men: $ 125.13
cat("Mean wage Women: $", round(mean_women, 2), "\n")
Mean wage Women: $ 97.33
cat("Raw wage gap: $", round(mean_men - mean_women, 2), "\n")
Raw wage gap: $ 27.81
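The raw wage gap is 27.81 dollars per hour: men earn 125.13 dollars on average, compared with 97.33 dollars for women. (The gap is computed from the unrounded means, so it differs by a cent from the difference of the rounded figures.)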
Part 2: Log Transformation
1. Histogram of log(Wage)
ggplot(wages, aes(x = l_wage)) +
  geom_histogram(binwidth = 0.2, fill = "purple", color = "white") +
  labs(title = "Distribution of log(Wage)",
       x = "log(Wage)", y = "Count") +
  theme_minimal()

The raw Wage histogram was right-skewed. After the log transformation, l_wage is much more symmetric and approximately bell-shaped. This is consistent with wages being approximately log-normally distributed, although a histogram alone does not prove it.
2. Boxplot of log(Wage) by Gender
ggplot(wages, aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Men" = "steelblue", "Women" = "salmon")) +
  labs(title = "log(Wage) Distribution by Gender",
       x = "Gender", y = "log(Wage)") +
  theme_minimal() +
  theme(legend.position = "none")

The gap remains visible, but on the log scale the vertical distance reflects a proportional difference rather than a dollar difference. Economists prefer l_wage because (1) differences in logs approximate percentage differences, which are comparable across wage levels; and (2) log wages are far less skewed, so regression results are less sensitive to extreme values and the error term is closer to normal.
3. Approximate Percentage Gap
mean_lw_men <- wages |> filter(Female == 0) |> pull(l_wage) |> mean()
mean_lw_women <- wages |> filter(Female == 1) |> pull(l_wage) |> mean()
pct_gap <- 100 * (mean_lw_men - mean_lw_women)
cat("Mean l_wage Men: ", round(mean_lw_men, 4), "\n")
Mean l_wage Men: 4.7336
cat("Mean l_wage Women: ", round(mean_lw_women, 4), "\n")
Mean l_wage Women: 4.483
cat("Approx % gap: ", round(pct_gap, 2), "%\n")
Approx % gap: 25.06 %
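The difference in mean log wages is 0.2506, so the approximate percentage wage gap is 25.06%: men earn roughly 25% more per hour than women. The log-difference approximation is most accurate for small gaps; at this size it somewhat understates the exact percentage difference between the groups' geometric mean wages.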
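The 25% figure relies on the approximation log(a) - log(b) ≈ (a - b)/b. As a rough check, the sketch below converts the log gap into exact percentage terms. It hard-codes the two rounded means from the output above so that it runs without the data file; the variable names are illustrative, not part of the original analysis.

```r
# Rounded group means of l_wage, copied from the printed output above
lw_men   <- 4.7336
lw_women <- 4.4830

log_gap <- lw_men - lw_women                 # difference in mean log wages

approx_pct   <- 100 * log_gap                # log-point approximation
men_vs_women <- 100 * (exp(log_gap) - 1)     # exact: men's geometric mean relative to women's
women_vs_men <- 100 * (1 - exp(-log_gap))    # exact: women's geometric mean relative to men's

print(round(c(approx = approx_pct,
              men_more = men_vs_women,
              women_less = women_vs_men), 2))
```

With these inputs the exact figures come out near 28.5% (men relative to women) and 22.2% (women relative to men), which bracket the 25% approximation. Both compare geometric means, not the arithmetic means reported in Part 1.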
Part 3: Exploring Confounders
1. Education Levels by Gender
wages |>
  count(Gender, Educ) |>
  pivot_wider(names_from = Gender, values_from = n, values_fill = 0) |>
  arrange(Educ)

# A tibble: 4 × 3
   Educ   Men Women
  <dbl> <int> <int>
1     1   108    88
2     2    77    57
3     3    72    33
4     4    59     6
wages |>
  group_by(Gender) |>
  count(Educ) |>
  mutate(Proportion = round(n / sum(n), 3))

# A tibble: 8 × 4
# Groups:   Gender [2]
  Gender  Educ     n Proportion
  <chr>  <dbl> <int>      <dbl>
1 Men        1   108      0.342
2 Men        2    77      0.244
3 Men        3    72      0.228
4 Men        4    59      0.187
5 Women      1    88      0.478
6 Women      2    57      0.31
7 Women      3    33      0.179
8 Women      4     6      0.033
The most common education level for both groups is Educ 1: 88 women (47.8%) and 108 men (34.2%). Men are spread more evenly across the higher levels, whereas women are concentrated at the lowest level; only 3.3% of women reach Educ 4, compared with 18.7% of men.
2. Part-time Work by Gender
wages |>
  group_by(Gender) |>
  summarise(
    N_Total    = n(),
    N_Parttime = sum(Parttime),
    Proportion = round(mean(Parttime), 3)
  )

# A tibble: 2 × 4
  Gender N_Total N_Parttime Proportion
  <chr>    <int>      <dbl>      <dbl>
1 Men        316         71      0.225
2 Women      184        103      0.56
Women are much more likely to work part-time (56.0%) than men (22.5%). Part-time workers typically earn lower hourly wages and accumulate less experience, so this difference contributes to the raw wage gap independently of any pay discrimination.
3. Age Distribution
wages |>
  group_by(Gender) |>
  summarise(
    Mean_Age   = round(mean(Age), 2),
    Median_Age = median(Age)
  )

# A tibble: 2 × 3
  Gender Mean_Age Median_Age
  <chr>     <dbl>      <dbl>
1 Men        40.0         39
2 Women      39.9         39
Mean and median age are virtually identical between men (mean: 40.05, median: 39) and women (mean: 39.94, median: 39). Age is unlikely to explain the wage gap in this dataset.
Part 4: Interpretation
1. Why use log(Wage) instead of Wage?
Economists use log wages for two reasons:
Percentage interpretation: Differences in log wages approximate percentage differences. Saying that the gap is about 25% is more meaningful than saying it is about 28 dollars, since a given dollar difference means different things at different wage levels.
Statistical properties: Raw wages are strongly right-skewed, so a few very high earners can dominate OLS estimates, and the regression errors are unlikely to be normal. Log wages are approximately normally distributed, which reduces the influence of extreme values and makes standard inference more reliable.
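The effect of the log transformation on skewness can be illustrated with simulated data. The sketch below draws hypothetical log-normal wages (the parameters are arbitrary assumptions, not estimates from the quiz dataset) and compares the mean and median before and after taking logs.

```r
# Hypothetical simulated wages (not the quiz data): log-normal draws are right-skewed
set.seed(1)
sim_wage <- rlnorm(10000, meanlog = 4.6, sdlog = 0.5)

# Simple skew indicator: relative gap between mean and median
rel_gap <- function(x) (mean(x) - median(x)) / median(x)

skew_raw <- rel_gap(sim_wage)        # positive: the long right tail lifts the mean
skew_log <- rel_gap(log(sim_wage))   # near zero: the log scale is roughly symmetric

print(round(c(raw = skew_raw, log = skew_log), 3))
```

A mean well above the median on the raw scale, and nearly equal to it on the log scale, matches the pattern seen in the two histograms.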
2. Is the raw wage gap the same as discrimination?
No. The exploration reveals two important confounders:
Part-time work: 56% of women work part-time, compared with only 22.5% of men. Part-time positions tend to offer lower hourly wages and fewer advancement opportunities, which lowers the average female wage even in the absence of pay discrimination.
Education: Women are more concentrated at the lowest education level (Educ 1: 48% of women vs. 34% of men). Part of the raw gap therefore reflects differences in human capital rather than discrimination.
The raw gap of 27.81 dollars per hour is an unconditional difference. To assess discrimination, one must control for education, experience, hours worked, and occupation. Only the gap that remains unexplained after these controls comes closer to a measure of discrimination, and even that residual may reflect unobserved factors.
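One common way to control for such confounders is a regression of log wage on a gender indicator plus the controls. The sketch below is only an illustration: it uses simulated stand-in data whose column names mirror the quiz dataset, and the data-generating process (including the assumed conditional gap of -0.10 in logs) is invented for the example, not estimated from Wage_GenderDS.xlsx.

```r
# Simulated stand-in data: column names mirror the quiz dataset, values are made up
set.seed(465)
n <- 500
sim <- data.frame(
  Female = rbinom(n, 1, 0.4),
  Educ   = sample(1:4, n, replace = TRUE),
  Age    = sample(20:60, n, replace = TRUE)
)
# Assumption: part-time work is more likely for women
sim$Parttime <- rbinom(n, 1, ifelse(sim$Female == 1, 0.55, 0.20))
# Assumed data-generating process with a conditional gender gap of -0.10 in logs
sim$l_wage <- 4 + 0.15 * sim$Educ + 0.01 * sim$Age - 0.05 * sim$Parttime -
  0.10 * sim$Female + rnorm(n, sd = 0.3)

# Raw (unconditional) gap vs. gap after controlling for observed characteristics
raw_fit <- lm(l_wage ~ Female, data = sim)
adj_fit <- lm(l_wage ~ Female + factor(Educ) + Parttime + Age, data = sim)

gaps <- c(raw      = unname(coef(raw_fit)["Female"]),
          adjusted = unname(coef(adj_fit)["Female"]))
print(round(gaps, 3))
```

On the real data, the same two calls to lm() with data = wages would give the raw and adjusted log gaps; the coefficient on Female in the second model is the part of the gap not explained by the included controls.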
*AI Use Log: I wrote the code and analysis myself. I consulted AI only to troubleshoot specific errors that came up during the process.