Wage Gender Gap – Distribution and Log Values

Author

cerenmuratsu

Published

April 16, 2026

Code

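# Packages used throughout (assumed to be loaded in a hidden setup chunk)
library(tidyverse)   # dplyr, tidyr, ggplot2
library(readxl)      # read_excel()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

wage <- read_excel("Wage_GenderDS.xlsx")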

# Add log wage and gender label
wage <- wage |>
  mutate(
    l_wage  = log(Wage),
    gender  = factor(Female, levels = c(0, 1), labels = c("Men", "Women"))
  )

glimpse(wage)

Rows: 500
Columns: 8
$ Observation <dbl> 119, 2, 41, 65, 246, 254, 74, 12, 9, 237, 79, 294, 182, 25…
$ Wage        <dbl> 32, 34, 37, 38, 38, 38, 39, 40, 42, 43, 44, 45, 46, 46, 47…
$ Female      <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ Age         <dbl> 31, 42, 31, 33, 21, 28, 31, 28, 25, 25, 44, 25, 31, 42, 38…
$ Educ        <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1…
$ Parttime    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ l_wage      <dbl> 3.465736, 3.526361, 3.610918, 3.637586, 3.637586, 3.637586…
$ gender      <fct> Women, Women, Women, Women, Women, Men, Women, Women, Wome…

Part 1: Distribution of Wage

Q1 · Histogram of Wage

Code

ggplot(wage, aes(x = Wage)) +
  geom_histogram(bins = 30, fill = "#4472C4", colour = "white", alpha = 0.85) +
  labs(
    title = "Distribution of Hourly Wage",
    x     = "Hourly Wage ($)",
    y     = "Count"
  ) +
  theme_minimal(base_size = 13)

Shape: The histogram is right-skewed (positively skewed). Most workers earn wages in the lower-to-middle range (roughly $32–$150), while a long right tail extends toward higher wages (up to ~$384). This is the typical pattern for income distributions: the majority cluster near the median, and a smaller number of high earners pull the mean above the median.

Q2 · Boxplot of Wage by Gender

Code

ggplot(wage, aes(x = gender, y = Wage, fill = gender)) +
  geom_boxplot(alpha = 0.75, outlier.colour = "red", outlier.size = 1.5) +
  scale_fill_manual(values = c("#4472C4", "#ED7D31")) +
  labs(
    title = "Hourly Wage by Gender",
    x     = NULL,
    y     = "Hourly Wage ($)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Code

wage |>
  group_by(gender) |>
  summarise(
    Median = median(Wage),
    Q1     = quantile(Wage, 0.25),
    Q3     = quantile(Wage, 0.75),
    IQR    = IQR(Wage)
  ) |>
  kable(digits = 1, caption = "Boxplot Summary – Wage by Gender") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Boxplot Summary – Wage by Gender
gender	Median	Q1	Q3	IQR
Men	111.0	83.8	158.5	74.8
Women	83.5	67.8	116.0	48.2

Comparison:

Median: Men’s median wage ($111.0) is well above women’s ($83.5), so the typical male worker earns about $27.50 more per hour.
IQR: Men have a wider IQR ($74.8 vs $48.2), indicating greater wage dispersion among men in the middle half of the distribution.
Outliers: Both groups have high-wage outliers (red dots), but men appear to have more extreme upper outliers.
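The red points are values beyond the boxplot whiskers. As a rough check, the fences implied by the 1.5 × IQR rule (the default in geom_boxplot) can be recomputed from the rounded table values above. This is only a sketch: the inputs are rounded, and ggplot2 computes its hinges slightly differently from quantile(), so the fences are approximate.

```r
# Quartiles copied from the "Boxplot Summary" table (rounded to 1 decimal)
box <- data.frame(
  gender = c("Men", "Women"),
  Q1     = c(83.8, 67.8),
  Q3     = c(158.5, 116.0),
  IQR    = c(74.8, 48.2)
)

# Tukey fences: points outside [lower_fence, upper_fence] are drawn as outliers
box$lower_fence <- box$Q1 - 1.5 * box$IQR
box$upper_fence <- box$Q3 + 1.5 * box$IQR

print(box)
```

Both lower fences are negative, so no low-wage outliers are possible. Wages above roughly $271 (men) and $188 (women) are flagged as outliers.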

Q3 · Summary Statistics by Gender

Code

stats <- wage |>
  group_by(gender) |>
  summarise(
    Mean   = mean(Wage),
    Median = median(Wage),
    SD     = sd(Wage),
    Min    = min(Wage),
    Max    = max(Wage),
    N      = n()
  )

stats |>
  kable(digits = 2, caption = "Summary Statistics of Wage by Gender") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Summary Statistics of Wage by Gender
gender	Mean	Median	SD	Min	Max	N
Men	125.13	111.0	57.34	38	384	316
Women	97.33	83.5	46.31	32	364	184

Code

mean_men   <- stats$Mean[stats$gender == "Men"]
mean_women <- stats$Mean[stats$gender == "Women"]
raw_gap    <- mean_men - mean_women

cat("Mean wage – Men:  $", round(mean_men, 2), "\n")

Mean wage – Men:  $ 125.13

Code

cat("Mean wage – Women:$", round(mean_women, 2), "\n")

Mean wage – Women:$ 97.33

Code

cat("Raw wage gap (Men – Women): $", round(raw_gap, 2), "\n")

Raw wage gap (Men – Women): $ 27.81

Raw wage gap: The difference in mean wages between men and women is approximately $27.81 per hour, with men earning more on average.
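A dollar gap is easier to judge next to the wage levels it comes from. The sketch below expresses the raw gap as a share of each group's mean, using the rounded means from the summary table (so the dollar gap comes out as 27.80 rather than the 27.81 printed above).

```r
# Group means copied from the "Summary Statistics of Wage by Gender" table
mean_men   <- 125.13
mean_women <- 97.33

raw_gap <- mean_men - mean_women             # dollar gap from rounded means

gap_vs_women <- 100 * raw_gap / mean_women   # men earn this % more than women
gap_vs_men   <- 100 * raw_gap / mean_men     # women earn this % less than men

cat("Gap relative to women's mean:", round(gap_vs_women, 1), "%\n")
cat("Gap relative to men's mean:  ", round(gap_vs_men, 1), "%\n")
```

The two percentages differ because they use different bases, so it is worth stating which group is the reference whenever a percentage gap is reported.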

Part 2: Log Transformation

Q1 · Create log(Wage) and Histogram

Code

ggplot(wage, aes(x = l_wage)) +
  geom_histogram(bins = 30, fill = "#70AD47", colour = "white", alpha = 0.85) +
  labs(
    title = "Distribution of log(Hourly Wage)",
    x     = "log(Wage)",
    y     = "Count"
  ) +
  theme_minimal(base_size = 13)

Comparison with raw Wage: The log-transformed histogram is much more symmetric and approximately bell-shaped, in contrast to the strong right skew of the raw wage histogram. The log transformation compresses large values and spreads out small ones, which pulls in the right tail. A roughly symmetric outcome makes a linear regression less sensitive to a handful of high earners. Note that OLS assumes normality of the errors (and only for exact small-sample inference), not of the wage variable itself.
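The effect of the log transformation on skewness can be illustrated with simulated data. The sketch below draws from a log-normal distribution with arbitrary, assumed parameters; it is not the assignment data, and skewness() is a small helper defined here rather than a function from a package.

```r
set.seed(1)

# SIMULATED right-skewed "wages" (log-normal); NOT the assignment data
sim_wage <- rlnorm(500, meanlog = 4.6, sdlog = 0.45)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

cat("Skewness of raw values:", round(skewness(sim_wage), 2), "\n")
cat("Skewness of log values:", round(skewness(log(sim_wage)), 2), "\n")
```

A skewness near zero indicates a roughly symmetric distribution. Applying the same helper to wage$Wage and wage$l_wage would quantify what the two histograms show visually.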

Q2 · Boxplot of l_wage by Gender

Code

ggplot(wage, aes(x = gender, y = l_wage, fill = gender)) +
  geom_boxplot(alpha = 0.75, outlier.colour = "red", outlier.size = 1.5) +
  scale_fill_manual(values = c("#4472C4", "#ED7D31")) +
  labs(
    title = "log(Wage) by Gender",
    x     = NULL,
    y     = "log(Wage)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Does the gap change? The gap between men and women remains visible after the log transformation, but the two boxes are more symmetric and the upper outliers are pulled in. Economists prefer l_wage for two main reasons:

Proportional interpretation: Differences in log wages approximate percentage differences, which are more meaningful than raw dollar gaps (a $20 gap means different things at $50/hr vs $500/hr).
Better-behaved residuals: Log wages are closer to symmetric, so regression residuals tend to be less skewed and less heteroskedastic, which makes OLS inference more reliable.
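The proportional interpretation can be made concrete with the $20 example. The wage pairs below are hypothetical and chosen only to match the $50/hr and $500/hr figures in the text.

```r
# HYPOTHETICAL wage pairs: the same $20 gap at two different pay scales
low_base  <- 50
high_base <- 500
gap       <- 20

# A difference in logs equals the log of the wage ratio
log_gap_low  <- log(low_base + gap)  - log(low_base)
log_gap_high <- log(high_base + gap) - log(high_base)

cat("$20 gap at $50/hr: ", round(100 * log_gap_low, 1),  "log points\n")
cat("$20 gap at $500/hr:", round(100 * log_gap_high, 1), "log points\n")
```

The same dollar gap is a large proportional difference at the low pay scale and a small one at the high pay scale, which is exactly the distinction the log scale captures.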

Q3 · Approximate Percentage Gap

Code

l_stats <- wage |>
  group_by(gender) |>
  summarise(mean_l_wage = mean(l_wage))

l_stats |>
  kable(digits = 4, caption = "Mean of log(Wage) by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Mean of log(Wage) by Gender
gender	mean_l_wage
Men	4.7336
Women	4.4830

Code

mean_l_men   <- l_stats$mean_l_wage[l_stats$gender == "Men"]
mean_l_women <- l_stats$mean_l_wage[l_stats$gender == "Women"]
pct_gap      <- 100 * (mean_l_men - mean_l_women)

cat("Mean log(Wage) – Men:   ", round(mean_l_men, 4), "\n")

Mean log(Wage) – Men:    4.7336

Code

cat("Mean log(Wage) – Women: ", round(mean_l_women, 4), "\n")

Mean log(Wage) – Women:  4.483

Code

cat("Approximate % gap:      ", round(pct_gap, 2), "%\n")

Approximate % gap:       25.06 %

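Approximate percentage gap: The difference in mean log wages is 0.2506, so men earn roughly 25.1 log points more per hour than women on average. Log points approximate percentages well only for small gaps; for a gap this large, the exact figure implied by the log difference is exp(0.2506) − 1 ≈ 28.5%.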
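The conversion between a log-wage difference and an exact percentage can be sketched as follows, using the rounded means from the table above. Because means of logs correspond to geometric means, the exact figures describe the ratio of geometric mean wages, not of arithmetic means.

```r
# Mean log wages copied from the "Mean of log(Wage) by Gender" table
mean_l_men   <- 4.7336
mean_l_women <- 4.4830

log_gap <- mean_l_men - mean_l_women          # difference in log points / 100

approx_pct     <- 100 * log_gap               # log-point approximation
exact_pct_more <- 100 * (exp(log_gap) - 1)    # men relative to women
exact_pct_less <- 100 * (1 - exp(-log_gap))   # women relative to men

cat("Log-point approximation:", round(approx_pct, 1), "%\n")
cat("Exact, men vs women:    ", round(exact_pct_more, 1), "% more\n")
cat("Exact, women vs men:    ", round(exact_pct_less, 1), "% less\n")
```

The approximation sits between the two exact figures, which is why it is convenient for small gaps but should be converted with exp() when the gap is large.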

Part 3: Exploring Confounders

Q1 · Education Levels by Gender

Code

educ_table <- wage |>
  group_by(gender, Educ) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(gender) |>
  mutate(Proportion = round(n / sum(n), 3)) |>
  pivot_wider(
    names_from  = gender,
    values_from = c(n, Proportion),
    names_glue  = "{gender}_{.value}"
  )

educ_table |>
  kable(caption = "Education Level Frequency by Gender") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Education Level Frequency by Gender
Educ	Men_n	Women_n	Men_Proportion	Women_Proportion
1	108	88	0.342	0.478
2	77	57	0.244	0.310
3	72	33	0.228	0.179
4	59	6	0.187	0.033

Code

wage |>
  group_by(gender, Educ) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(gender) |>
  slice_max(n, n = 1) |>
  kable(caption = "Most Common Education Level by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Most Common Education Level by Gender
gender	Educ	n
Men	1	108
Women	1	88

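Interpretation: Level 1 is the most common education level for both genders, but women are more concentrated there (47.8% vs 34.2% of men), while men are far more likely to be at level 4 (18.7% vs 3.3%). Assuming higher Educ codes denote more schooling, this difference in education composition could explain part of the observed wage gap.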
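The within-gender shares can be reproduced directly from the counts in the frequency table. The sketch below uses base R only; grouping codes 1 and 2 together is an illustrative choice, not part of the assignment.

```r
# Counts copied from the "Education Level Frequency by Gender" table
educ_counts <- matrix(
  c(108, 77, 72, 59,    # Men,   Educ 1 to 4
     88, 57, 33,  6),   # Women, Educ 1 to 4
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Men", "Women"), Educ = as.character(1:4))
)

# Row proportions: share of each gender at each education level
educ_share <- prop.table(educ_counts, margin = 1)
print(round(educ_share, 3))

# Share of each gender in the two lowest codes (Educ 1 or 2)
low_two <- rowSums(educ_share[, 1:2])
print(round(low_two, 3))
```

Comparing row proportions rather than raw counts matters here because the sample contains many more men (316) than women (184).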

Q2 · Part-Time Work by Gender

Code

pt <- wage |>
  group_by(gender) |>
  summarise(
    N_total    = n(),
    N_parttime = sum(Parttime == 1),
    Proportion = mean(Parttime == 1)
  )

pt |>
  kable(digits = 3, caption = "Part-Time Work Proportion by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Part-Time Work Proportion by Gender
gender	N_total	N_parttime	Proportion
Men	316	71	0.225
Women	184	103	0.560

Interpretation: Women are far more likely to work part-time than men (56.0% vs 22.5%). If part-time jobs pay a lower hourly rate than full-time jobs, this difference in composition lowers women’s average hourly wage even when men and women in the same type of job are paid equally. Part-time roles may also offer fewer opportunities for advancement, bonuses, or employer-provided benefits that affect total compensation.
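A composition effect of this kind can be illustrated with a small calculation. The part-time shares come from the table above; the two hourly wages are hypothetical numbers chosen for illustration, and the sketch assumes part-time work pays a lower hourly rate that is identical for men and women.

```r
# Part-time shares copied from the "Part-Time Work Proportion by Gender" table
share_pt <- c(Men = 0.225, Women = 0.560)

# HYPOTHETICAL hourly wages, the same for both genders within each job type
wage_full <- 100
wage_part <- 80

# Each group's mean is a weighted average of full-time and part-time wages
group_mean      <- (1 - share_pt) * wage_full + share_pt * wage_part
composition_gap <- unname(group_mean["Men"] - group_mean["Women"])

print(group_mean)
cat("Gap due to composition alone: $", round(composition_gap, 2), "\n")
```

In this stylised setup a gap appears even though nobody is paid differently for the same type of job, which is why part-time status should be controlled for before interpreting the raw gap.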

Q3 · Age Distribution by Gender

Code

wage |>
  group_by(gender) |>
  summarise(
    Mean_Age   = mean(Age),
    Median_Age = median(Age),
    SD_Age     = sd(Age)
  ) |>
  kable(digits = 2, caption = "Age Distribution by Gender") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Age Distribution by Gender
gender	Mean_Age	Median_Age	SD_Age
Men	40.05	39	10.60
Women	39.94	39	11.26

Code

ggplot(wage, aes(x = gender, y = Age, fill = gender)) +
  geom_boxplot(alpha = 0.75) +
  scale_fill_manual(values = c("#4472C4", "#ED7D31")) +
  labs(title = "Age Distribution by Gender", x = NULL, y = "Age") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Interpretation: Men and women in this sample have nearly identical ages (means of 40.05 and 39.94, both medians 39), so age is unlikely to explain the wage gap. Had men been older on average, part of the gap could have reflected accumulated experience and seniority, which raise wages through the human capital channel rather than through discrimination.
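One way to judge whether an age difference is large enough to matter is to scale it by the spread of ages. The sketch below computes a standardized mean difference from the table values; treating values well below 0.1 as negligible is a common rule of thumb, stated here as an assumption.

```r
# Values copied from the "Age Distribution by Gender" table
mean_age <- c(Men = 40.05, Women = 39.94)
sd_age   <- c(Men = 10.60, Women = 11.26)

# Standardized mean difference: gap in means over the root mean square of the SDs
pooled_sd <- sqrt(mean(sd_age^2))
smd <- unname((mean_age["Men"] - mean_age["Women"]) / pooled_sd)

cat("Standardized age difference:", round(smd, 3), "\n")
```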

Part 4: Interpretation

Q1 · Why use log(Wage) instead of Wage?

Two key reasons why economists use log wages when analysing gender gaps:

Percentage interpretation. A coefficient on a dummy variable (e.g., Female) in a log-wage regression directly estimates the approximate percentage wage penalty/premium. This is more economically meaningful than a raw dollar difference, which does not account for the scale of wages.
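Better-behaved regressions. Raw wages are right-skewed, so a few very high earners can dominate a regression in levels, and its residuals tend to be skewed and heteroskedastic. Log wages are much closer to symmetric, which makes inference (t-tests, F-tests, confidence intervals) more reliable in finite samples and the model less sensitive to high-wage outliers. Strictly, OLS assumes normal errors rather than a normal outcome, and needs even that only for exact small-sample inference.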

Q2 · Is the Raw Wage Gap the Same as Discrimination?

No. The raw (unconditional) wage gap is a simple average difference, but it does not control for factors that legitimately differ between men and women and independently affect wages. From our analysis:

Education: Men and women differ in their distribution across education levels (Educ 1–4): 47.8% of women are at level 1 compared with 34.2% of men, and only 3.3% of women are at level 4 compared with 18.7% of men. Some of the wage gap therefore likely reflects returns to education, not discrimination.
Part-time work: Women work part-time at a much higher rate (56.0% vs 22.5%). Part-time positions typically pay less and offer fewer advancement opportunities, so this compositional difference widens the observed gap.
Age/experience: Mean and median ages are almost identical across genders, so age contributes little here. Age is still worth controlling for, because labour market experience is a key determinant of wages under human capital theory.

To isolate the unexplained portion of the gap, we would need a multiple regression of l_wage on Female, Educ (entered as a categorical variable), Age, Parttime, and other controls. The coefficient on Female after controlling for these variables gives the adjusted gap: the part that cannot be explained by observable characteristics. Only that residual portion may reflect discriminatory treatment, and it can also pick up factors that are not recorded in the data.
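A regression of this form can be sketched as follows. To keep the example self-contained it runs on simulated data that reuses the column names of wage; every coefficient in the data-generating process is an assumption made for illustration, and none of the output describes the real sample. On the actual data the adjusted model would be lm(l_wage ~ Female + factor(Educ) + Age + Parttime, data = wage).

```r
set.seed(42)
n <- 500

# SIMULATED data with the same column names as `wage`; NOT the real sample
sim <- data.frame(
  Female = rbinom(n, 1, 0.37),
  Educ   = sample(1:4, n, replace = TRUE),
  Age    = round(runif(n, 20, 60))
)

# Women are more likely to work part-time in this simulation (assumed rates)
sim$Parttime <- rbinom(n, 1, ifelse(sim$Female == 1, 0.56, 0.22))

# ASSUMED data-generating process: the true adjusted gap is -0.10 log points
sim$l_wage <- 4 + 0.15 * sim$Educ + 0.01 * sim$Age -
  0.20 * sim$Parttime - 0.10 * sim$Female + rnorm(n, sd = 0.3)

# Raw gap: Female only. Adjusted gap: Female plus observable controls.
raw_fit <- lm(l_wage ~ Female, data = sim)
adj_fit <- lm(l_wage ~ Female + factor(Educ) + Age + Parttime, data = sim)

cat("Raw gap (log points):     ", round(coef(raw_fit)["Female"], 3), "\n")
cat("Adjusted gap (log points):", round(coef(adj_fit)["Female"], 3), "\n")
```

Because simulated women work part-time more often and part-time work pays less, the raw coefficient is expected to be more negative than the adjusted one. Wrapping Educ in factor() estimates a separate effect for each level instead of assuming equal steps between codes.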

AI Use Log: This analysis was structured with the assistance of an AI assistant (Claude, Anthropic). The AI helped generate R code for data visualisation, summary statistics, and the Quarto document structure. All statistical interpretations and written answers reflect the student’s own understanding.