ECON 465 Quiz: Wage Gender Gap

Author

Sude Arslan

library(tidyverse)
library(readxl)

wages <- read_excel("Wage_GenderDS.xlsx") |>
  mutate(
    l_wage = log(Wage),
    Gender = ifelse(Female == 1, "Women", "Men")
  )

Part 1: Distribution of Wage

1. Histogram of Wage

ggplot(wages, aes(x = Wage)) +
  geom_histogram(binwidth = 20, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Hourly Wage",
       x = "Wage (dollar/hour)", y = "Count") +
  theme_minimal()

The distribution is right-skewed. Most workers earn between 30 and 150 dollars/hour, but a long tail of high-wage earners pulls the mean above the median, which is typical of wage data.

2. Boxplot of Wage by Gender

ggplot(wages, aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Men" = "steelblue", "Women" = "salmon")) +
  labs(title = "Wage Distribution by Gender",
       x = "Gender", y = "Wage (dollar/hour)") +
  theme_minimal() +
  theme(legend.position = "none")

Men have a higher median wage (111 dollars/hour) than women (83.50 dollars/hour) and a wider IQR. Both groups have high-wage outliers, and the maximum is slightly higher for men (384 vs 364 dollars/hour).

3. Summary Statistics by Gender

wages |>
  group_by(Gender) |>
  summarise(
    Mean   = round(mean(Wage), 2),
    Median = round(median(Wage), 2),
    SD     = round(sd(Wage), 2),
    Min    = min(Wage),
    Max    = max(Wage)
  )
# A tibble: 2 × 6
  Gender  Mean Median    SD   Min   Max
  <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Men    125.   111    57.3    38   384
2 Women   97.3   83.5  46.3    32   364
mean_men   <- wages |> filter(Female == 0) |> pull(Wage) |> mean()
mean_women <- wages |> filter(Female == 1) |> pull(Wage) |> mean()
cat("Mean wage Men:   $", round(mean_men, 2), "\n")
Mean wage Men:   $ 125.13 
cat("Mean wage Women: $", round(mean_women, 2), "\n")
Mean wage Women: $ 97.33 
cat("Raw wage gap:    $", round(mean_men - mean_women, 2), "\n")
Raw wage gap:    $ 27.81 

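The raw wage gap is 27.81 dollars/hour: men earn 125.13 dollars/hour on average, compared with 97.33 dollars/hour for women. The gap is computed from the unrounded means, which is why it differs by one cent from 125.13 - 97.33.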

Part 2: Log Transformation

1. Histogram of log(Wage)

ggplot(wages, aes(x = l_wage)) +
  geom_histogram(binwidth = 0.2, fill = "purple", color = "white") +
  labs(title = "Distribution of log(Wage)",
       x = "log(Wage)", y = "Count") +
  theme_minimal()

The raw Wage histogram was right-skewed. After the log transformation, l_wage is much more symmetric and approximately bell-shaped. This is consistent with wages being approximately log-normally distributed, although a histogram alone does not confirm it.
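One way to put a number on this visual comparison is the sample skewness of each variable. The sketch below is only an illustration: skewness is a hypothetical helper written here in base R, not a function from a package loaded in this document.

```r
# Hypothetical helper (not from a package): moment-based sample skewness.
# Values near 0 suggest a roughly symmetric distribution; positive values
# indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  z <- x - mean(x)           # centre the data
  mean(z^3) / mean(z^2)^1.5  # third standardized moment
}
```

With the data loaded, skewness(wages$Wage) should be clearly positive and skewness(wages$l_wage) closer to zero if the histograms are being read correctly. Calling qqnorm(wages$l_wage) followed by qqline(wages$l_wage) gives a complementary visual check against the normal distribution.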

2. Boxplot of log(Wage) by Gender

ggplot(wages, aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Men" = "steelblue", "Women" = "salmon")) +
  labs(title = "log(Wage) Distribution by Gender",
       x = "Gender", y = "log(Wage)") +
  theme_minimal() +
  theme(legend.position = "none")

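The gap remains visible on the log scale, where equal vertical distances correspond to equal proportional differences. Economists prefer l_wage because: (1) differences in logs correspond approximately to percentage differences, which are comparable across wage levels; (2) taking logs compresses the right tail, so a few very high wages have less influence and regression residuals tend to be closer to symmetric.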

3. Approximate Percentage Gap

mean_lw_men   <- wages |> filter(Female == 0) |> pull(l_wage) |> mean()
mean_lw_women <- wages |> filter(Female == 1) |> pull(l_wage) |> mean()
pct_gap       <- 100 * (mean_lw_men - mean_lw_women)

cat("Mean l_wage Men:   ", round(mean_lw_men, 4), "\n")
Mean l_wage Men:    4.7336 
cat("Mean l_wage Women: ", round(mean_lw_women, 4), "\n")
Mean l_wage Women:  4.483 
cat("Approx % gap:      ", round(pct_gap, 2), "%\n")
Approx % gap:       25.06 %

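The approximate percentage wage gap is 25.06%, so men earn roughly 25% more per hour than women. Because the log approximation understates larger gaps, the exact figure implied by the difference in mean logs is exp(0.2506) - 1, or about 28%.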
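The difference in mean logs is a log-point gap, and it can be converted to an exact percentage with exp(). The sketch below reuses the rounded means printed above rather than the data file, so the results are accurate only to that rounding. Strictly, the conversion compares geometric rather than arithmetic mean wages. It suggests that men earn about 28% more than women, or equivalently that women earn about 22% less than men.

```r
# Mean log wages as printed above (rounded to 4 decimals).
mean_lw_men   <- 4.7336
mean_lw_women <- 4.4830

log_gap <- mean_lw_men - mean_lw_women   # difference in mean logs

approx_pct <- 100 * log_gap              # log-point approximation
men_more   <- 100 * (exp(log_gap) - 1)   # men relative to women
women_less <- 100 * (1 - exp(-log_gap))  # women relative to men

cat("Log-point approximation:", round(approx_pct, 1), "\n")
cat("Men earn more by (%):   ", round(men_more, 1), "\n")
cat("Women earn less by (%): ", round(women_less, 1), "\n")
```

The two exact figures differ because they use different bases (women's wages in the first case, men's in the second), while the log-point gap is symmetric.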

Part 3: Exploring Confounders

1. Education Levels by Gender

wages |>
  count(Gender, Educ) |>
  pivot_wider(names_from = Gender, values_from = n, values_fill = 0) |>
  arrange(Educ)
# A tibble: 4 × 3
   Educ   Men Women
  <dbl> <int> <int>
1     1   108    88
2     2    77    57
3     3    72    33
4     4    59     6
wages |>
  group_by(Gender) |>
  count(Educ) |>
  mutate(Proportion = round(n / sum(n), 3))
# A tibble: 8 × 4
# Groups:   Gender [2]
  Gender  Educ     n Proportion
  <chr>  <dbl> <int>      <dbl>
1 Men        1   108      0.342
2 Men        2    77      0.244
3 Men        3    72      0.228
4 Men        4    59      0.187
5 Women      1    88      0.478
6 Women      2    57      0.31 
7 Women      3    33      0.179
8 Women      4     6      0.033

The most common education level is Educ 1 for both groups: 88 women (47.8%) and 108 men (34.2%). Men are spread more evenly across the higher levels, while women are disproportionately concentrated at the lowest level; only 3.3% of women are at Educ 4, compared with 18.7% of men.
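As a rough check that these differences are larger than chance would produce, a chi-squared test of independence can be run directly on the counts printed above. This is a sketch that goes slightly beyond what the quiz asks for; educ_counts and educ_test are names introduced here for illustration.

```r
# Counts copied from the table above (rows: gender, columns: Educ 1-4).
educ_counts <- matrix(
  c(108, 77, 72, 59,
     88, 57, 33,  6),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Men", "Women"), Educ = as.character(1:4))
)

# Pearson chi-squared test of independence between gender and education.
educ_test <- chisq.test(educ_counts)
print(educ_test)

# Row proportions, matching the Proportion column above.
round(prop.table(educ_counts, margin = 1), 3)
```

A small p-value would indicate that the education distribution differs by gender, which supports treating education as a potential confounder.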

2. Part-time Work by Gender

wages |>
  group_by(Gender) |>
  summarise(
    N_Total    = n(),
    N_Parttime = sum(Parttime),
    Proportion = round(mean(Parttime), 3)
  )
# A tibble: 2 × 4
  Gender N_Total N_Parttime Proportion
  <chr>    <int>      <dbl>      <dbl>
1 Men        316         71      0.225
2 Women      184        103      0.56 

Women are much more likely to work part-time (56.0%) than men (22.5%). If part-time workers earn lower hourly wages and accumulate less experience, as is commonly the case, this difference in composition accounts for part of the raw wage gap independently of pay discrimination.
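Whether part-time workers actually earn less per hour in this sample can be checked rather than assumed. The sketch below defines a hypothetical helper, wage_by_parttime, that expects the Gender, Parttime (0/1) and Wage columns used above.

```r
library(dplyr)

# Hypothetical helper: mean hourly wage by gender and part-time status.
# Assumes columns Gender, Parttime (0/1), and Wage, as in the wages data.
wage_by_parttime <- function(data) {
  data |>
    group_by(Gender, Parttime) |>
    summarise(
      N         = n(),
      Mean_Wage = round(mean(Wage), 2),
      .groups   = "drop"
    )
}
```

Calling wage_by_parttime(wages) would show whether the part-time penalty appears within each gender, and whether a gender gap remains among full-time workers alone.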

3. Age Distribution

wages |>
  group_by(Gender) |>
  summarise(
    Mean_Age   = round(mean(Age), 2),
    Median_Age = median(Age)
  )
# A tibble: 2 × 3
  Gender Mean_Age Median_Age
  <chr>     <dbl>      <dbl>
1 Men        40.0         39
2 Women      39.9         39

Mean and median age are virtually identical between men (mean: 40.05, median: 39) and women (mean: 39.94, median: 39). Age is unlikely to explain the wage gap in this dataset.

Part 4: Interpretation

1. Why use log(Wage) instead of Wage?

Two reasons economists use log wages:

  1. Percentage interpretation: Differences in log wages correspond approximately to percentage differences. Saying “the gap is about 25%” is more meaningful than “the gap is about 28 dollars/hour”, since a dollar difference means different things at different wage levels.

  2. Statistical properties: Raw wages are strongly right-skewed, so a few very high earners can dominate the estimates. OLS does not require the outcome itself to be normally distributed, but log wages are closer to symmetric, which reduces the influence of extreme values and tends to make regression residuals better behaved.

2. Is the raw wage gap the same as discrimination?

No. The exploration reveals two important confounders:

  • Part-time work: 56% of women work part-time vs only 22.5% of men. Part-time positions typically offer lower hourly wages and fewer advancement opportunities, which lowers average female wages even without pay discrimination within a given job.

  • Education: Women are more concentrated at lower education levels (Educ 1 = 48% of women vs 34% of men). Part of the raw gap reflects differences in human capital, not discrimination.

The raw gap of 27.81 dollars/hour is an unconditional difference. To assess discrimination, one must control for education, experience, hours worked, and occupation. The gap that remains after these controls is a closer, though still imperfect, measure of discrimination, because it can also reflect differences that are not observed in the data.
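A conditional gap of this kind could be estimated with a regression of log wages on the gender indicator plus controls. The sketch below is one possible specification, not the required one: adjusted_gap is a hypothetical helper, and it uses only the variables already explored here (Female, Educ, Parttime, Age). Experience and occupation do not appear among the variables used in this analysis, so the resulting coefficient would still not isolate discrimination.

```r
# Hypothetical helper: gender gap in log wages after adjusting for controls.
# Assumes columns l_wage, Female (0/1), Educ, Parttime (0/1), and Age.
adjusted_gap <- function(data) {
  fit <- lm(l_wage ~ Female + factor(Educ) + Parttime + Age, data = data)
  unname(coef(fit)["Female"])   # log-point gap, women relative to men
}
```

Calling adjusted_gap(wages) returns the coefficient on Female, which can be compared with the raw log gap of about -0.25. A coefficient closer to zero would indicate that education, part-time status and age account for part of the raw gap.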


*AI Use Log: I wrote the code and analysis myself. I consulted AI only to troubleshoot specific errors that came up during the process.