ECON 465 – Quiz: Wage Gender Gap

Author

Ömer Faruk Yılmaz

Published

April 4, 2027

library(tidyverse)  # dplyr (glimpse, mutate, summarize) and ggplot2
library(readxl)     # read_excel()
df <- read_excel("~/ECON465_DataScience/data/Wage_GenderDS.xlsx")
glimpse(df)
Rows: 500
Columns: 6
$ Observation <dbl> 119, 2, 41, 65, 246, 254, 74, 12, 9, 237, 79, 294, 182, 25…
$ Wage        <dbl> 32, 34, 37, 38, 38, 38, 39, 40, 42, 43, 44, 45, 46, 46, 47…
$ Female      <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ Age         <dbl> 31, 42, 31, 33, 21, 28, 31, 28, 25, 25, 44, 25, 31, 42, 38…
$ Educ        <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1…
$ Parttime    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…

1. Histogram of Wage

ggplot(df, aes(x = Wage)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Histogram of Hourly Wage", x = "Wage ($/hour)", y = "Count") +
  theme_minimal()

The distribution of raw wages is right-skewed. Most workers earn relatively modest hourly wages, while a small minority of very high earners stretches the tail of the distribution to the right.


2. Boxplot of Wage by Gender

df <- df |> mutate(Gender = ifelse(Female == 1, "Women", "Men"))

ggplot(df, aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.size = 1.5) +
  scale_fill_manual(values = c("Men" = "steelblue", "Women" = "orange")) +
  labs(title = "Boxplot of Wage by Gender", x = "Gender", y = "Wage ($/hour)") +
  theme_minimal() +
  theme(legend.position = "none")

Men's median wage is visibly higher than women's. The IQR for men is wider, and men have more extreme high-wage outliers.


3. Summary Statistics by Gender

summary_stats <- df |>
  group_by(Gender) |>
  summarize(
    Mean   = round(mean(Wage, na.rm = TRUE), 2),
    Median = round(median(Wage, na.rm = TRUE), 2),
    SD     = round(sd(Wage, na.rm = TRUE), 2),
    Min    = round(min(Wage, na.rm = TRUE), 2),
    Max    = round(max(Wage, na.rm = TRUE), 2),
    .groups = "drop"
  )
summary_stats
# A tibble: 2 × 6
  Gender  Mean Median    SD   Min   Max
  <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Men    125.   111    57.3    38   384
2 Women   97.3   83.5  46.3    32   364
mean_men   <- df |> filter(Female == 0) |> summarize(m = mean(Wage)) |> pull(m)
mean_women <- df |> filter(Female == 1) |> summarize(m = mean(Wage)) |> pull(m)
raw_gap <- round(mean_men - mean_women, 2)
cat("Raw wage gap: $", raw_gap, "/ hour\n")
Raw wage gap: $ 27.81 / hour

The raw wage gap is $27.81/hour. This does not control for education, age, or part-time status.
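The same raw gap can also be read off a regression of wage on the Female dummy alone, which is a useful bridge to adding controls later. The sketch below is only an illustration on simulated data, not the quiz dataset: the data frame `sim`, the sample size, and all wage parameters are assumptions made up for the example.

```r
# Simulated stand-in data (NOT the quiz dataset); all parameters are assumptions.
set.seed(465)
n <- 500
sim <- data.frame(Female = rbinom(n, 1, 0.37))
# Hypothetical log-normal hourly wages, lower on average when Female == 1
sim$Wage <- rlnorm(n, meanlog = 4.7 - 0.25 * sim$Female, sdlog = 0.4)

# Raw gap as a difference in group means (men minus women)
raw_gap_sim <- mean(sim$Wage[sim$Female == 0]) - mean(sim$Wage[sim$Female == 1])

# Regression on the Female dummy only: the slope is mean(women) - mean(men)
fit_raw <- lm(Wage ~ Female, data = sim)
coef_female <- unname(coef(fit_raw)["Female"])

cat("Difference in means (men - women):", round(raw_gap_sim, 2), "\n")
cat("Coefficient on Female:            ", round(coef_female, 2), "\n")
```

The coefficient on Female is the negative of the men-minus-women difference in means, so a regression with no controls adds no information beyond the raw gap.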


Part 2: Log Transformation

1. Create log(Wage) and Histogram

df <- df |> mutate(l_wage = log(Wage))

ggplot(df, aes(x = l_wage)) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(title = "Histogram of log(Wage)", x = "log(Wage)", y = "Count") +
  theme_minimal()

After the log transformation, l_wage is much more symmetric and closer to bell-shaped than raw wages.
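The visual impression of symmetry can be backed up with a number. The sketch below computes a simple moment-based skewness coefficient before and after taking logs. It runs on simulated log-normal wages rather than the quiz data, and the helper `skewness()` is a hypothetical function written for this example (base R has no built-in skewness function).

```r
# Simulated stand-in for hourly wages (NOT the quiz data); parameters are assumptions.
set.seed(465)
wage_sim <- rlnorm(500, meanlog = 4.6, sdlog = 0.45)

# Hypothetical helper: moment-based skewness (mean cubed z-score, no bias correction)
skewness <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  mean(z^3)
}

skew_raw <- skewness(wage_sim)       # clearly positive for a right-skewed variable
skew_log <- skewness(log(wage_sim))  # close to zero if the log is roughly symmetric

cat("Skewness of wage:     ", round(skew_raw, 3), "\n")
cat("Skewness of log(wage):", round(skew_log, 3), "\n")
```

Applied to the quiz data, the same helper could be called as `skewness(df$Wage)` and `skewness(df$l_wage)`.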


2. Boxplot of l_wage by Gender

ggplot(df, aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.size = 1.5) +
  scale_fill_manual(values = c("Men" = "steelblue", "Women" = "salmon")) +
  labs(title = "Boxplot of log(Wage) by Gender", x = "Gender", y = "log(Wage)") +
  theme_minimal() +
  theme(legend.position = "none")

The gap remains visible, but the distributions become more symmetrical. Economists usually prefer the logarithm of wage, l_wage, because it (1) makes the variance more stable across groups, (2) gives differences a relative (percentage) interpretation, and (3) is closer to normally distributed, since wages are roughly log-normal.


3. Approximate Percentage Gap

mean_lwage_men   <- df |> filter(Female == 0) |> summarize(m = mean(l_wage, na.rm = TRUE)) |> pull(m)
mean_lwage_women <- df |> filter(Female == 1) |> summarize(m = mean(l_wage, na.rm = TRUE)) |> pull(m)
pct_gap <- round(100 * (mean_lwage_men - mean_lwage_women), 2)

cat("Mean log(wage) - Men:  ", round(mean_lwage_men, 4), "\n")
Mean log(wage) - Men:   4.7336 
cat("Mean log(wage) - Women:", round(mean_lwage_women, 4), "\n")
Mean log(wage) - Women: 4.483 
cat("Approximate % gap:     ", pct_gap, "%\n")
Approximate % gap:      25.06 %

Based on the difference in mean log wages, men earn approximately 25% more per hour than women (a gap of 25.06 log points).
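Reading a log difference as a percentage is an approximation that works best for small gaps. A difference d in mean logs corresponds to an exact relative gap of exp(d) - 1, which here is the gap between the geometric mean wages of the two groups. The sketch below redoes the arithmetic using the two means printed above; the variable names are chosen for this example only.

```r
# Mean log wages copied from the output above
lw_men   <- 4.7336
lw_women <- 4.4830

log_gap    <- lw_men - lw_women          # difference in mean log wages
approx_pct <- 100 * log_gap              # log-point approximation
exact_pct  <- 100 * (exp(log_gap) - 1)   # exact relative gap implied by log_gap

cat("Approximate % gap:", round(approx_pct, 2), "%\n")
cat("Exact % gap:      ", round(exact_pct, 2), "%\n")
```

With a gap this large, the exact figure comes out at about 28.5%, noticeably above the 25.06% approximation.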


Part 3: Exploring Confounders

1. Education Levels by Gender

educ_table <- df |>
  group_by(Gender, Educ) |>
  summarize(n = n(), .groups = "drop") |>
  group_by(Gender) |>
  mutate(proportion = round(n / sum(n), 3)) |>
  arrange(Gender, Educ)
educ_table
# A tibble: 8 × 4
# Groups:   Gender [2]
  Gender  Educ     n proportion
  <chr>  <dbl> <int>      <dbl>
1 Men        1   108      0.342
2 Men        2    77      0.244
3 Men        3    72      0.228
4 Men        4    59      0.187
5 Women      1    88      0.478
6 Women      2    57      0.31 
7 Women      3    33      0.179
8 Women      4     6      0.033
ggplot(df, aes(x = factor(Educ), fill = Gender)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("Men" = "steelblue", "Women" = "salmon")) +
  labs(title = "Education Level by Gender",
       x = "Education Level (1=Lowest, 4=Highest)", y = "Count", fill = "Gender") +
  theme_minimal()

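Women are more concentrated in the lower education levels: 78.8% of women are in levels 1 or 2, compared with 58.6% of men, and only 3.3% of women reach level 4, compared with 18.7% of men. If higher education is associated with higher wages, this difference could partly explain the wage gap.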


2. Part-Time Work by Gender

parttime_stats <- df |>
  group_by(Gender) |>
  summarize(
    n_total    = n(),
    n_parttime = sum(Parttime == 1, na.rm = TRUE),
    proportion = round(mean(Parttime == 1, na.rm = TRUE), 3),
    .groups = "drop"
  )
parttime_stats
# A tibble: 2 × 4
  Gender n_total n_parttime proportion
  <chr>    <int>      <int>      <dbl>
1 Men        316         71      0.225
2 Women      184        103      0.56 

Women work part-time far more often than men (56.0% vs. 22.5%). If part-time jobs pay less per hour, this lowers women's average hourly wage, so part-time status is an important confounder when interpreting the wage gap.
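One way to see how a confounder works is to compare the overall gap with the gap inside each part-time group. The sketch below does this on simulated data, not the quiz dataset. The part-time shares by gender are taken from the table above, but the sample size and the wage effects (0.10 for gender, 0.30 for part-time, in log points) are assumptions invented for the illustration.

```r
# Simulated stand-in data (NOT the quiz dataset); effect sizes are assumptions.
set.seed(465)
n <- 5000
Female   <- rbinom(n, 1, 0.37)
# Part-time shares by gender taken from the table above (56.0% vs. 22.5%)
Parttime <- rbinom(n, 1, ifelse(Female == 1, 0.56, 0.225))
# Hypothetical log wage: lower for women and lower for part-time workers
l_wage   <- 4.8 - 0.10 * Female - 0.30 * Parttime + rnorm(n, sd = 0.3)
sim <- data.frame(Female, Parttime, l_wage)

# Overall gap in mean log wage (men minus women)
raw_log_gap <- with(sim, mean(l_wage[Female == 0]) - mean(l_wage[Female == 1]))

# Mean log wage in each Parttime x Female cell, then the gap within each row
cell_means <- with(sim, tapply(l_wage, list(Parttime = Parttime, Female = Female), mean))
within_gap <- cell_means[, "0"] - cell_means[, "1"]

cat("Overall gap:             ", round(raw_log_gap, 3), "\n")
cat("Gap among full-time (0): ", round(within_gap["0"], 3), "\n")
cat("Gap among part-time (1): ", round(within_gap["1"], 3), "\n")
```

In this simulation the gap inside each group is smaller than the overall gap, because part of the overall gap comes from women being over-represented among lower-paid part-time workers.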


3. Age Distribution

age_stats <- df |>
  group_by(Gender) |>
  summarize(
    Mean_Age   = round(mean(Age, na.rm = TRUE), 2),
    Median_Age = round(median(Age, na.rm = TRUE), 2),
    .groups = "drop"
  )
age_stats
# A tibble: 2 × 3
  Gender Mean_Age Median_Age
  <chr>     <dbl>      <dbl>
1 Men        40.0         39
2 Women      39.9         39
ggplot(df, aes(x = Gender, y = Age, fill = Gender)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Men" = "steelblue", "Women" = "salmon")) +
  labs(title = "Age Distribution by Gender", x = "Gender", y = "Age") +
  theme_minimal() + theme(legend.position = "none")

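Men and women have nearly identical age distributions in this sample (mean age 40.0 vs. 39.9, median 39 for both). Age would be a confounder if men were notably older and therefore more experienced, but here it is unlikely to explain much of the wage gap.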


Part 4: Interpretation

1. Why use log(wage) instead of wage?

Percent change: The difference in logs approximates the percentage difference, giving a straightforward measure that does not depend on the units in which wages are recorded.

Statistics: Wage data are right-skewed, while log wages are closer to normally distributed, which fits the usual regression assumptions better and reduces the influence of extreme values.


2. Is the raw wage gap the same as discrimination?

No. The raw difference does not account for any observable characteristics. Education, part-time work, and age may all affect the size of the gap. A multiple regression that controls for these variables would help isolate the part of the gap that they cannot explain.
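A minimal sketch of such a regression is shown below. It runs on simulated data, not the quiz dataset: the education and part-time shares by gender follow the tables in Part 3, while the sample size, the age range, and every wage effect are assumptions chosen only to make the example run. The variable names mirror the quiz columns so the same two `lm()` calls could be tried on `df`.

```r
# Simulated stand-in data (NOT the quiz dataset); all wage effects are assumptions.
set.seed(465)
n <- 5000
Female <- rbinom(n, 1, 0.37)

# Education shares by gender taken from the Part 3 table (levels 1 to 4)
p_men   <- c(0.342, 0.244, 0.228, 0.187)
p_women <- c(0.478, 0.310, 0.179, 0.033)
Educ <- ifelse(Female == 1,
               sample(1:4, n, replace = TRUE, prob = p_women),
               sample(1:4, n, replace = TRUE, prob = p_men))

Parttime <- rbinom(n, 1, ifelse(Female == 1, 0.56, 0.225))
Age      <- round(runif(n, 21, 60))   # assumed age range

# Hypothetical log wage equation with a "true" gender effect of -0.10
l_wage <- 4.2 - 0.10 * Female + 0.15 * Educ - 0.20 * Parttime +
  0.005 * Age + rnorm(n, sd = 0.3)
sim <- data.frame(l_wage, Female, Educ, Parttime, Age)

# Raw gap: Female dummy only
fit_raw <- lm(l_wage ~ Female, data = sim)
# Adjusted gap: add the observable controls
fit_adj <- lm(l_wage ~ Female + Age + factor(Educ) + Parttime, data = sim)

b_raw <- unname(coef(fit_raw)["Female"])
b_adj <- unname(coef(fit_adj)["Female"])

cat("Female coefficient, no controls:  ", round(b_raw, 3), "\n")
cat("Female coefficient, with controls:", round(b_adj, 3), "\n")
```

In this simulation the coefficient on Female shrinks toward the assumed value of -0.10 once the controls are added. On real data, whatever gap remains after controls is the unexplained part, which may reflect discrimination but also factors that are not observed in the dataset.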


AI Use Log

I used Claude (Anthropic) for this quiz. Claude helped me fix bugs in the R code and clarify concepts that I found difficult to understand.