Wage Gender Gap – Distribution and Log Values

Author

cerenmuratsu

Published

April 16, 2026

Code
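library(tidyverse)   # dplyr, tidyr, ggplot2
library(readxl)      # read_excel()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

wage <- read_excel("Wage_GenderDS.xlsx")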

# Add log wage and gender label
wage <- wage |>
  mutate(
    l_wage  = log(Wage),
    gender  = factor(Female, levels = c(0, 1), labels = c("Men", "Women"))
  )

glimpse(wage)
Rows: 500
Columns: 8
$ Observation <dbl> 119, 2, 41, 65, 246, 254, 74, 12, 9, 237, 79, 294, 182, 25…
$ Wage        <dbl> 32, 34, 37, 38, 38, 38, 39, 40, 42, 43, 44, 45, 46, 46, 47…
$ Female      <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ Age         <dbl> 31, 42, 31, 33, 21, 28, 31, 28, 25, 25, 44, 25, 31, 42, 38…
$ Educ        <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1…
$ Parttime    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ l_wage      <dbl> 3.465736, 3.526361, 3.610918, 3.637586, 3.637586, 3.637586…
$ gender      <fct> Women, Women, Women, Women, Women, Men, Women, Women, Wome…

Part 1: Distribution of Wage

Q1 · Histogram of Wage

Code
ggplot(wage, aes(x = Wage)) +
  geom_histogram(bins = 30, fill = "#4472C4", colour = "white", alpha = 0.85) +
  labs(
    title = "Distribution of Hourly Wage",
    x     = "Hourly Wage ($)",
    y     = "Count"
  ) +
  theme_minimal(base_size = 13)

Shape: The histogram is right-skewed (positively skewed). Most workers earn wages in the lower-to-middle range (roughly $32–$150), while a long right tail extends toward higher wages (up to ~$384). This is the typical pattern for income distributions: the majority cluster near the median, and a smaller number of high earners pull the mean above the median.


Q2 · Boxplot of Wage by Gender

Code
ggplot(wage, aes(x = gender, y = Wage, fill = gender)) +
  geom_boxplot(alpha = 0.75, outlier.colour = "red", outlier.size = 1.5) +
  scale_fill_manual(values = c("#4472C4", "#ED7D31")) +
  labs(
    title = "Hourly Wage by Gender",
    x     = NULL,
    y     = "Hourly Wage ($)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Code
wage |>
  group_by(gender) |>
  summarise(
    Median = median(Wage),
    Q1     = quantile(Wage, 0.25),
    Q3     = quantile(Wage, 0.75),
    IQR    = IQR(Wage)
  ) |>
  kable(digits = 1, caption = "Boxplot Summary – Wage by Gender") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Boxplot Summary – Wage by Gender
gender  Median    Q1     Q3   IQR
Men      111.0  83.8  158.5  74.8
Women     83.5  67.8  116.0  48.2

Comparison:

  • Median: Men’s median wage ($111.0) is higher than women’s ($83.5), so the typical male worker earns about $27.50 more per hour.
  • IQR: Men have a wider IQR ($74.8 vs $48.2), indicating greater wage dispersion among men.
  • Outliers: Both groups have high-wage outliers (red dots), with the men’s extending slightly further (maximum $384 vs $364 for women).
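The red dots follow the default rule of `geom_boxplot()`, which flags points lying more than 1.5 × IQR beyond the quartiles. As a rough check, the fences can be recomputed from the rounded values in the summary table; the vector names below are illustrative, and the results are approximate because the inputs are rounded.

```r
# Quartiles and IQR copied from the summary table (rounded to 1 decimal)
q1  <- c(Men = 83.8,  Women = 67.8)
q3  <- c(Men = 158.5, Women = 116.0)
iqr <- c(Men = 74.8,  Women = 48.2)

# Default boxplot fences: 1.5 * IQR beyond the quartiles
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

round(rbind(lower_fence, upper_fence), 1)
```

Both lower fences come out negative, so no wage can be flagged as a low outlier. Every red dot sits above an upper fence of roughly $270.7 for men and $188.3 for women.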

Q3 · Summary Statistics by Gender

Code
stats <- wage |>
  group_by(gender) |>
  summarise(
    Mean   = mean(Wage),
    Median = median(Wage),
    SD     = sd(Wage),
    Min    = min(Wage),
    Max    = max(Wage),
    N      = n()
  )

stats |>
  kable(digits = 2, caption = "Summary Statistics of Wage by Gender") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Summary Statistics of Wage by Gender
gender    Mean  Median     SD  Min  Max    N
Men     125.13   111.0  57.34   38  384  316
Women    97.33    83.5  46.31   32  364  184
Code
mean_men   <- stats$Mean[stats$gender == "Men"]
mean_women <- stats$Mean[stats$gender == "Women"]
raw_gap    <- mean_men - mean_women

cat("Mean wage – Men:  $", round(mean_men, 2), "\n")
Mean wage – Men:  $ 125.13 
Code
cat("Mean wage – Women:$", round(mean_women, 2), "\n")
Mean wage – Women:$ 97.33 
Code
cat("Raw wage gap (Men – Women): $", round(raw_gap, 2), "\n")
Raw wage gap (Men – Women): $ 27.81 

Raw wage gap: The difference in mean wages between men and women is approximately $27.81 per hour, with men earning more on average.
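A dollar gap is easier to judge relative to a baseline. The sketch below re-expresses the gap as a percentage of each group's mean, using the rounded means from the summary table; the variable names are illustrative, and the dollar gap computed from rounded inputs ($27.80) differs by a cent from the unrounded $27.81.

```r
# Mean hourly wages copied from the summary table (rounded to 2 decimals)
m_men   <- 125.13
m_women <- 97.33

gap <- m_men - m_women   # dollar gap from the rounded means

# The same gap expressed relative to each group's mean
gap_pct_of_women <- 100 * gap / m_women  # how much more men earn
gap_pct_of_men   <- 100 * gap / m_men    # how much less women earn

round(c(gap_pct_of_women, gap_pct_of_men), 1)
```

On these figures, men earn roughly 28.6% more than women, or equivalently women earn roughly 22.2% less than men. The two numbers differ only because the baseline changes.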


Part 2: Log Transformation

Q1 · Create log(Wage) and Histogram

Code
ggplot(wage, aes(x = l_wage)) +
  geom_histogram(bins = 30, fill = "#70AD47", colour = "white", alpha = 0.85) +
  labs(
    title = "Distribution of log(Hourly Wage)",
    x     = "log(Wage)",
    y     = "Count"
  ) +
  theme_minimal(base_size = 13)

Comparison with raw Wage: The log-transformed histogram is much more symmetric and approximately bell-shaped, in contrast to the strong right skew of the raw wage histogram. The log transformation compresses large values and spreads out small ones, so the distribution moves closer to normal. This tends to make the residuals of a wage regression better behaved, which helps inference in small samples.
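The visual comparison can be backed by a skewness statistic. The sketch below uses simulated log-normal values rather than the actual dataset, and the `skewness()` helper is a hypothetical moment-based function written for this example, not part of any package loaded above.

```r
# Simulated log-normal "wages" -- NOT the Wage_GenderDS data
set.seed(123)
sim_wage <- rlnorm(500, meanlog = 4.6, sdlog = 0.4)

# Moment-based sample skewness (close to 0 for a symmetric distribution)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_wage)
skew_log <- skewness(log(sim_wage))

round(c(raw = skew_raw, log = skew_log), 2)
```

With the real data, the same helper could be applied to `wage$Wage` and `wage$l_wage`. A clearly positive value for the raw wage and a value near zero for the log wage would match what the two histograms show.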


Q2 · Boxplot of l_wage by Gender

Code
ggplot(wage, aes(x = gender, y = l_wage, fill = gender)) +
  geom_boxplot(alpha = 0.75, outlier.colour = "red", outlier.size = 1.5) +
  scale_fill_manual(values = c("#4472C4", "#ED7D31")) +
  labs(
    title = "log(Wage) by Gender",
    x     = NULL,
    y     = "log(Wage)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Does the gap change? The gap between men and women remains visible after the log transformation, but it appears more proportional and symmetric. Economists prefer l_wage for two main reasons:

  1. Proportional interpretation: Differences in log wages represent percentage differences, which are more meaningful than raw dollar gaps (a $20 gap means different things at $50/hr vs $500/hr).
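  2. Better-behaved residuals: Log wages are closer to normally distributed, so the residuals of a log-wage regression tend to be more symmetric. OLS does not need normality to be unbiased, but approximately normal errors make t-tests and confidence intervals more reliable.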
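The proportional-interpretation point can be checked with the example from the list: the same $20 gap at a $50 and a $500 base wage. This is a minimal sketch using only those two illustrative wage pairs.

```r
# The same $20 gap at two different wage levels
low_pair  <- c(50, 70)     # $50/hr vs $70/hr
high_pair <- c(500, 520)   # $500/hr vs $520/hr

log_gap_low  <- diff(log(low_pair))   # log(70)  - log(50)
log_gap_high <- diff(log(high_pair))  # log(520) - log(500)

round(c(low = log_gap_low, high = log_gap_high), 3)
```

The log gap is about 0.34 at the low wage level but only about 0.04 at the high level, in line with a 40% versus a 4% difference. Identical dollar gaps therefore map to very different log gaps.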

Q3 · Approximate Percentage Gap

Code
l_stats <- wage |>
  group_by(gender) |>
  summarise(mean_l_wage = mean(l_wage))

l_stats |>
  kable(digits = 4, caption = "Mean of log(Wage) by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Mean of log(Wage) by Gender
gender  mean_l_wage
Men          4.7336
Women        4.4830
Code
mean_l_men   <- l_stats$mean_l_wage[l_stats$gender == "Men"]
mean_l_women <- l_stats$mean_l_wage[l_stats$gender == "Women"]
pct_gap      <- 100 * (mean_l_men - mean_l_women)

cat("Mean log(Wage) – Men:   ", round(mean_l_men, 4), "\n")
Mean log(Wage) – Men:    4.7336 
Code
cat("Mean log(Wage) – Women: ", round(mean_l_women, 4), "\n")
Mean log(Wage) – Women:  4.483 
Code
cat("Approximate % gap:      ", round(pct_gap, 2), "%\n")
Approximate % gap:       25.06 %

Approximate percentage gap: The difference in mean log wages is 0.2506, so men earn approximately 25.1% more per hour than women on average.
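Multiplying a log difference by 100 is only an approximation, and it becomes looser as the gap grows. The sketch below compares it with the exact conversion, 100 × (exp(Δ) − 1), using the rounded means from the table; the variable names are illustrative.

```r
# Mean log wages copied from the table (rounded to 4 decimals)
mean_l_men_tbl   <- 4.7336
mean_l_women_tbl <- 4.4830

log_diff <- mean_l_men_tbl - mean_l_women_tbl

approx_gap <- 100 * log_diff              # approximation: 100 * difference in logs
exact_gap  <- 100 * (exp(log_diff) - 1)   # exact ratio of geometric means

round(c(approx = approx_gap, exact = exact_gap), 2)
```

The exact figure is about 28.5%, somewhat above the 25.1% approximation. Both describe the ratio of geometric mean wages, not arithmetic means, so neither has to match the raw dollar gap exactly.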


Part 3: Exploring Confounders

Q1 · Education Levels by Gender

Code
educ_table <- wage |>
  group_by(gender, Educ) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(gender) |>
  mutate(Proportion = round(n / sum(n), 3)) |>
  pivot_wider(
    names_from  = gender,
    values_from = c(n, Proportion),
    names_glue  = "{gender}_{.value}"
  )

educ_table |>
  kable(caption = "Education Level Frequency by Gender") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Education Level Frequency by Gender
Educ  Men_n  Women_n  Men_Proportion  Women_Proportion
1       108       88           0.342             0.478
2        77       57           0.244             0.310
3        72       33           0.228             0.179
4        59        6           0.187             0.033
Code
wage |>
  group_by(gender, Educ) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(gender) |>
  slice_max(n, n = 1) |>
  kable(caption = "Most Common Education Level by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Most Common Education Level by Gender
gender  Educ    n
Men        1  108
Women      1   88

Interpretation: Educ 1 is the most common level for both genders, but women are more concentrated there (47.8% vs 34.2% of men), while men are far more likely to be at Educ 4 (18.7% vs 3.3%). Assuming higher Educ codes indicate more education, this difference in composition could explain part of the observed wage gap.
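One way to check whether the education distributions differ by more than chance would allow is a chi-squared test of independence. This is an optional sketch, not part of the original assignment; it rebuilds the contingency table from the counts reported above, and the object names are illustrative.

```r
# Counts copied from the education table (rows: Educ 1-4)
educ_counts <- matrix(
  c(108, 77, 72, 59,    # Men
     88, 57, 33,  6),   # Women
  ncol = 2,
  dimnames = list(Educ = paste0("Educ", 1:4), gender = c("Men", "Women"))
)

# Chi-squared test of independence between education level and gender
educ_test <- chisq.test(educ_counts)
educ_test
```

A small p-value here would indicate that education level and gender are not independent in this sample, which supports treating education as a potential confounder.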


Q2 · Part-Time Work by Gender

Code
pt <- wage |>
  group_by(gender) |>
  summarise(
    N_total    = n(),
    N_parttime = sum(Parttime == 1),
    Proportion = mean(Parttime == 1)
  )

pt |>
  kable(digits = 3, caption = "Part-Time Work Proportion by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Part-Time Work Proportion by Gender
gender  N_total  N_parttime  Proportion
Men         316          71       0.225
Women       184         103       0.560

Interpretation: Women are far more likely to work part-time than men (56.0% vs 22.5%). If part-time jobs pay lower hourly wages than full-time jobs, this difference in composition lowers women’s average hourly wage even when men and women are paid the same within each job type. Part-time roles may also offer fewer opportunities for advancement, bonuses, or employer-provided benefits that affect total compensation.
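The difference in part-time shares can be tested formally with a two-sample test of proportions. The sketch below is an optional check built from the counts in the table above; the object names are illustrative.

```r
# Part-time counts and group sizes copied from the table
n_parttime <- c(Men = 71,  Women = 103)
n_total    <- c(Men = 316, Women = 184)

# Two-sample test for equality of proportions
pt_test <- prop.test(x = n_parttime, n = n_total)

round(pt_test$estimate, 3)  # part-time share in each group
pt_test$p.value
```

A very small p-value would confirm that the gap in part-time shares is unlikely to be a sampling accident.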


Q3 · Age Distribution by Gender

Code
wage |>
  group_by(gender) |>
  summarise(
    Mean_Age   = mean(Age),
    Median_Age = median(Age),
    SD_Age     = sd(Age)
  ) |>
  kable(digits = 2, caption = "Age Distribution by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Age Distribution by Gender
gender  Mean_Age  Median_Age  SD_Age
Men        40.05          39   10.60
Women      39.94          39   11.26
Code
ggplot(wage, aes(x = gender, y = Age, fill = gender)) +
  geom_boxplot(alpha = 0.75) +
  scale_fill_manual(values = c("#4472C4", "#ED7D31")) +
  labs(title = "Age Distribution by Gender", x = NULL, y = "Age") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Interpretation: Men and women in this sample have nearly identical ages (mean 40.05 vs 39.94, median 39 for both), so age is unlikely to explain the wage gap. Had men been older on average, they might have accumulated more experience and seniority, which raises wages through the human capital channel rather than through discrimination.


Part 4: Interpretation

Q1 · Why use log(Wage) instead of Wage?

Two key reasons why economists use log wages when analysing gender gaps:

  1. Percentage interpretation. A coefficient on a dummy variable (e.g., Female) in a log-wage regression directly estimates the approximate percentage wage penalty/premium. This is more economically meaningful than a raw dollar difference, which does not account for the scale of wages.

  2. Normality and regression validity. Raw wages are right-skewed, so the residuals of a regression on raw wages tend to be skewed as well. Log wages are approximately normally distributed, making inference (t-tests, F-tests, confidence intervals) more reliable and the model less sensitive to high-wage outliers.


Q2 · Is the Raw Wage Gap the Same as Discrimination?

No. The raw (unconditional) wage gap is a simple average difference, but it does not control for factors that legitimately differ between men and women and independently affect wages. From our analysis:

  • Education: Men and women differ in their distribution across education levels (Educ 1–4): 47.8% of women are at Educ 1 versus 34.2% of men, and only 3.3% of women are at Educ 4 versus 18.7% of men. Some of the wage gap may therefore reflect returns to education, not discrimination.
  • Part-time work: Women work part-time at a much higher rate (56.0% vs 22.5%). Part-time positions typically pay less and offer fewer advancement opportunities, so this compositional difference widens the observed gap.
  • Age/experience: Average ages are nearly identical (40.05 vs 39.94), so age contributes little in this sample. In general, age differences could reflect differences in labour market experience, a key determinant of wages under human capital theory.

To isolate the portion attributable to discrimination (unexplained gap), we would need a multiple regression of l_wage on Female, Educ, Age, Parttime, and other controls. The coefficient on Female after controlling for these variables gives the adjusted gap, that is, the part that cannot be explained by observable characteristics. Only that residual portion may reflect discriminatory treatment.
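The regression described above can be sketched as follows. To keep the example self-contained it runs on simulated data with the same column names as `wage`; the data-generating coefficients are made-up values for illustration only, and treating Educ as categorical via `factor()` is an assumption about how the levels should enter the model.

```r
# Simulated data with the same column names as `wage` -- NOT the real data.
# All coefficients below are made-up values chosen for illustration.
set.seed(42)
n <- 500
sim <- data.frame(
  Female = rbinom(n, 1, 0.37),
  Educ   = sample(1:4, n, replace = TRUE),
  Age    = sample(21:60, n, replace = TRUE)
)
sim$Parttime <- rbinom(n, 1, ifelse(sim$Female == 1, 0.56, 0.22))
sim$l_wage   <- 4 + 0.20 * sim$Educ + 0.01 * sim$Age -
  0.10 * sim$Parttime - 0.05 * sim$Female + rnorm(n, sd = 0.3)

# Raw gap: regress log wage on the Female dummy only
fit_raw <- lm(l_wage ~ Female, data = sim)

# Adjusted gap: add the observable controls (Educ treated as categorical)
fit_adj <- lm(l_wage ~ Female + factor(Educ) + Age + Parttime, data = sim)

c(raw = coef(fit_raw)[["Female"]], adjusted = coef(fit_adj)[["Female"]])
```

On the real data, the same two `lm()` calls would be run with `data = wage`. Comparing the two Female coefficients shows how much of the raw log gap is accounted for by education, age, and part-time status.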


AI Use Log: This analysis was structured with the assistance of an AI assistant (Claude, Anthropic). The AI helped generate R code for data visualisation, summary statistics, and the Quarto document structure. All statistical interpretations and written answers reflect the student’s own understanding.