Wage Gender Gap – Distribution and Log Values

Author: Ozan

Published: April 16, 2026

Setup

Code
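library(tidyverse)  # dplyr, tidyr, ggplot2
library(readxl)     # read_excel()

# Load the data, then add the natural log of the wage and a gender label
df <- read_excel("Wage_GenderDS.xlsx")
df$l_wage <- log(df$Wage)
df$Gender <- ifelse(df$Female == 1, "Women", "Men")

# Gender subsamples used for the group means below
men   <- df |> filter(Female == 0)
women <- df |> filter(Female == 1)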

Part 1: Distribution of Wage

1. Histogram of Wage

Code
ggplot(df, aes(x = Wage)) +
  geom_histogram(binwidth = 20, fill = "#2c7bb6", color = "white") +
  labs(title = "Histogram of Hourly Wage",
       x = "Hourly Wage", y = "Count") +
  theme_minimal()

Shape: The histogram is right-skewed (positively skewed). The bulk of observations are concentrated at lower wage levels (roughly 30–150), with a long right tail extending toward wages above 300. This is a typical feature of raw wage distributions — a small number of high earners pull the mean above the median and create the rightward tail.


2. Boxplot of Wage by Gender

Code
ggplot(df, aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, alpha = 0.7) +
  scale_fill_manual(values = c("Men" = "#2c7bb6", "Women" = "#d7191c")) +
  labs(title = "Hourly Wage by Gender",
       x = "Gender", y = "Hourly Wage") +
  theme_minimal() +
  theme(legend.position = "none")

Comparison:

| Statistic | Men | Women |
|---|---|---|
| Median | ~111 | ~83.5 |
| IQR (Q1–Q3) | 83.75–158.50 | 67.75–116.00 |
| Outliers | Several above ~260 | Fewer, one near 364 |

  • Median: Men’s median wage (≈ $111) is about $27.50 higher than women’s (≈ $83.50), so the gap is already present at the centre of the distribution and is not driven only by a few high earners.
  • IQR: Men have a wider IQR ($74.75) than women ($48.25), meaning the middle 50% of men’s wages is more dispersed.
  • Outliers: Both groups have high-end outliers (red circles). Men have more of them, while women have fewer, including one near $364.
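
The IQR figures and the boxplot's outlier rule can be checked by hand. The sketch below uses only the quartiles reported in the comparison table and assumes ggplot2's default whisker rule, under which points more than 1.5 * IQR beyond the box are drawn as outliers. The variable names are illustrative, and the boxplot's hinges may differ slightly from these quartiles.

```r
# Quartiles as reported in the comparison table (not recomputed from the data)
q1 <- c(Men = 83.75, Women = 67.75)
q3 <- c(Men = 158.50, Women = 116.00)

iqr <- q3 - q1                  # interquartile range per group
upper_fence <- q3 + 1.5 * iqr   # default rule: points above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr   # points below this are drawn as outliers

print(rbind(IQR = iqr, Upper = upper_fence, Lower = lower_fence))
```

Under that rule, wages above roughly $271 for men or $188 for women are plotted as outliers. Both lower fences are negative, so no low-wage outliers can appear in either group.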

3. Summary Statistics

Code
df |>
  group_by(Gender) |>
  summarise(
    Mean   = round(mean(Wage), 2),
    Median = round(median(Wage), 2),
    SD     = round(sd(Wage), 2),
    Min    = min(Wage),
    Max    = max(Wage)
  ) |>
  knitr::kable(caption = "Summary Statistics for Hourly Wage by Gender")
Summary Statistics for Hourly Wage by Gender

| Gender | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
| Men | 125.13 | 111.0 | 57.34 | 38 | 384 |
| Women | 97.33 | 83.5 | 46.31 | 32 | 364 |
Code
raw_gap <- mean(men$Wage) - mean(women$Wage)
cat("Raw wage gap (mean men – mean women): $", round(raw_gap, 2))
Raw wage gap (mean men – mean women): $ 27.81

Interpretation: The mean hourly wage for men is approximately $125.13 versus $97.33 for women. The raw wage gap is approximately $27.81 per hour. Men also have a higher standard deviation ($57.34 vs $46.31), confirming that their wages are more spread out.
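
A dollar gap is easier to compare with log-scale results once it is expressed relative to a base wage. The sketch below uses the rounded means from the summary table. Which group serves as the base is a choice, so both versions are shown.

```r
# Group means as printed in the summary table (rounded to 2 decimals)
mean_men   <- 125.13
mean_women <- 97.33

gap <- mean_men - mean_women            # 27.80 from rounded means; 27.81 from the raw data

# The same dollar gap expressed relative to each possible base
pct_of_men   <- 100 * gap / mean_men    # how far women's mean is below men's
pct_of_women <- 100 * gap / mean_women  # how far men's mean is above women's

cat("Gap as % of men's mean:  ", round(pct_of_men, 1), "\n")
cat("Gap as % of women's mean:", round(pct_of_women, 1), "\n")
```

On these figures, women's mean wage is about 22% below men's, or equivalently men's mean wage is about 29% above women's.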


Part 2: Log Transformation

1. Histogram of l_wage

Code
ggplot(df, aes(x = l_wage)) +
  geom_histogram(binwidth = 0.15, fill = "#1a9641", color = "white") +
  labs(title = "Histogram of log(Wage)",
       x = "log(Hourly Wage)", y = "Count") +
  theme_minimal()

Shape comparison: The log-transformed wage histogram is approximately symmetric and bell-shaped, in sharp contrast to the right-skewed raw Wage histogram. Taking logs compresses the long upper tail and stretches the lower range, pulling the distribution toward symmetry. This is consistent with the common empirical finding that wages are approximately log-normally distributed.
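
The effect of the log transformation on skewness can be illustrated numerically. The sketch below does not use the wage data: it draws a simulated log-normal sample (the meanlog and sdlog values are arbitrary assumptions) and compares a simple moment-based skewness measure before and after taking logs. The helper skewness() is defined here because base R does not provide one.

```r
# Moment-based sample skewness: 0 for a symmetric distribution, > 0 for a right tail
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(1)                                            # reproducible simulated draws
sim_wage <- rlnorm(5000, meanlog = 4.6, sdlog = 0.45)  # simulated, NOT the survey data

skew_raw <- skewness(sim_wage)        # positive: long right tail
skew_log <- skewness(log(sim_wage))   # near zero: roughly symmetric

cat("Skewness of raw values:", round(skew_raw, 2), "\n")
cat("Skewness of log values:", round(skew_log, 2), "\n")
```

The same helper could be applied to df$Wage and df$l_wage to put a number on the visual comparison of the two histograms.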


2. Boxplot of l_wage by Gender

Code
ggplot(df, aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, alpha = 0.7) +
  scale_fill_manual(values = c("Men" = "#2c7bb6", "Women" = "#d7191c")) +
  labs(title = "log(Wage) by Gender",
       x = "Gender", y = "log(Hourly Wage)") +
  theme_minimal() +
  theme(legend.position = "none")

Does the gap change? The gap between men and women remains visible after the log transformation. On the log scale, equal vertical distances correspond to equal proportional differences, so the shift between the two boxes reads as a relative gap rather than a dollar gap. In raw wages, high-end outliers exaggerated the visual spread for men; after logging, both distributions look more symmetric and more comparable in spread, which makes the systematic gap clearer.

Why economists prefer l_wage:

  1. Interpretability as percentage changes: Differences in log wages translate directly into approximate percentage differences, making results easier to interpret (e.g., a coefficient of 0.25 ≈ 25% gap).
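  2. Statistical appropriateness: Log wages are much closer to symmetric and approximately normal, so regression residuals tend to be better behaved and extreme earners have less influence on the estimates. OLS does not strictly require a normally distributed outcome, but inference is more reliable when the errors are not heavily skewed.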

3. Approximate Percentage Gap

Code
mean_lw_men   <- mean(men$l_wage)
mean_lw_women <- mean(women$l_wage)
pct_gap       <- 100 * (mean_lw_men - mean_lw_women)

cat("Mean log(Wage) – Men:  ", round(mean_lw_men, 4), "\n")
Mean log(Wage) – Men:   4.7336 
Code
cat("Mean log(Wage) – Women:", round(mean_lw_women, 4), "\n")
Mean log(Wage) – Women: 4.483 
Code
cat("Approximate % gap:     ", round(pct_gap, 2), "%\n")
Approximate % gap:      25.06 %

Interpretation: The mean log wage is approximately 4.7336 for men and 4.4830 for women. The difference is ≈ 0.2506 log points, which corresponds to an approximate percentage gap of 25.06%. That is, men earn roughly 25% more per hour than women in this sample when the gap is measured in log points. This is slightly smaller than the exact figure from the exponential formula (exp(0.2506) - 1 ≈ 28.5%), but the log-difference approximation is widely used for quick interpretation.
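
The relationship between the log-point gap and the exact percentage gap can be computed directly. This sketch uses the two mean log wages from the printed output. Note that exponentiating a difference in mean logs gives the ratio of geometric means, which is not the same quantity as the ratio of the arithmetic means in Part 1.

```r
# Mean log wages as printed in the output above
mean_lw_men   <- 4.7336
mean_lw_women <- 4.4830

d <- mean_lw_men - mean_lw_women   # gap in log points

approx_pct <- 100 * d              # log-difference approximation
exact_pct  <- 100 * (exp(d) - 1)   # exact percentage difference in geometric means

cat("Log-point gap:  ", round(d, 4), "\n")
cat("Approximate gap:", round(approx_pct, 2), "%\n")
cat("Exact gap:      ", round(exact_pct, 2), "%\n")
```

The approximation is close for small gaps and understates the exact figure more as the gap grows. At about 0.25 log points the two already differ by roughly 3.4 percentage points.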


Part 3: Exploring Confounders

1. Education Levels by Gender

Code
educ_table <- df |>
  count(Gender, Educ) |>
  pivot_wider(names_from = Gender, values_from = n, values_fill = 0) |>
  arrange(Educ)

knitr::kable(educ_table, caption = "Frequency of Education Levels by Gender")
Frequency of Education Levels by Gender

| Educ | Men | Women |
|---|---|---|
| 1 | 108 | 88 |
| 2 | 77 | 57 |
| 3 | 72 | 33 |
| 4 | 59 | 6 |
Code
cat("Most common Educ level – Women:", 
    as.integer(names(which.max(table(women$Educ)))), "\n")
Most common Educ level – Women: 1 
Code
cat("Most common Educ level – Men:  ", 
    as.integer(names(which.max(table(men$Educ)))), "\n")
Most common Educ level – Men:   1 

Findings:

  • The most common education level among women is Level 1 (88 out of 184, ≈ 47.8%).
  • The most common education level among men is also Level 1 (108 out of 316, ≈ 34.2%), but men are much more represented at Levels 3 and 4 (72 and 59, respectively) compared to women (33 and 6).
  • Men are substantially more likely to hold higher education levels (3–4), while women are more concentrated at the lower end (1–2). This educational gap may explain part of the wage gap, since higher education typically corresponds to higher wages.
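
Because the two groups differ in size (316 men, 184 women), raw counts are easier to compare as within-gender shares. The sketch below rebuilds the frequency table from the printed counts and converts it to column proportions. The chi-squared test at the end is an optional addition, not part of the original analysis.

```r
# Counts copied from the frequency table (rows = Educ level, columns = gender)
educ_counts <- matrix(
  c(108, 77, 72, 59,    # Men, Educ levels 1-4
     88, 57, 33,  6),   # Women, Educ levels 1-4
  nrow = 4,
  dimnames = list(Educ = as.character(1:4), Gender = c("Men", "Women"))
)

# Share of each gender at each education level (each column sums to 1)
educ_share <- prop.table(educ_counts, margin = 2)
print(round(100 * educ_share, 1))

# Share of each gender at levels 3 and 4 combined
high_educ <- colSums(educ_share[3:4, ])
print(round(100 * high_educ, 1))

# Optional: test whether the education distribution differs by gender
print(chisq.test(educ_counts))
```

About 41.5% of men are at levels 3 or 4, compared with 21.2% of women.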

2. Part-Time Work by Gender

Code
pt_table <- df |>
  group_by(Gender) |>
  summarise(`Part-time Proportion` = round(mean(Parttime), 4),
            `Part-time Count`      = sum(Parttime),
            `Total`                = n())

knitr::kable(pt_table, caption = "Part-Time Work Proportions by Gender")
Part-Time Work Proportions by Gender

| Gender | Part-time Proportion | Part-time Count | Total |
|---|---|---|---|
| Men | 0.2247 | 71 | 316 |
| Women | 0.5598 | 103 | 184 |

Findings:

  • 56.0% of women in the sample work part-time, compared to only 22.5% of men.
  • Part-time workers typically earn lower total or hourly wages due to less experience accumulation, fewer employer-sponsored benefits, and reduced bargaining power.
  • This large difference in part-time prevalence makes part-time status an important potential confounder: part of the raw wage gap may reflect the wage penalty for part-time work rather than direct discrimination based on gender. Controlling for part-time status in a regression would be expected to reduce the estimated gap.
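
The expected effect of controlling for part-time status can be sketched with a pair of regressions. The code below runs on simulated data, not on the survey. The part-time shares mirror the printed table, but the effect sizes (-0.10 for Female, -0.20 for Parttime), the intercept, and the noise level are arbitrary assumptions chosen only to show the mechanics. The column names follow the dataset.

```r
set.seed(42)   # reproducible simulated data
n <- 2000

# Simulated sample, NOT the survey data. Part-time shares mirror the table;
# the wage effects below are assumed values for illustration only.
sim <- data.frame(Female = rbinom(n, 1, 0.37))
sim$Parttime <- rbinom(n, 1, ifelse(sim$Female == 1, 0.56, 0.22))
sim$l_wage   <- 4.7 - 0.10 * sim$Female - 0.20 * sim$Parttime + rnorm(n, sd = 0.4)

# Raw gap: regress log wage on the gender dummy only
fit_raw  <- lm(l_wage ~ Female, data = sim)

# Adjusted gap: add part-time status as a control
fit_ctrl <- lm(l_wage ~ Female + Parttime, data = sim)

b_raw  <- coef(fit_raw)[["Female"]]
b_ctrl <- coef(fit_ctrl)[["Female"]]

cat("Female coefficient, no controls:      ", round(b_raw, 3), "\n")
cat("Female coefficient, part-time control:", round(b_ctrl, 3), "\n")
```

On the real data, the analogous call would be lm(l_wage ~ Female + Parttime + factor(Educ), data = df). How much the Female coefficient shrinks there is an empirical question that this simulation does not answer.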

3. Age Distribution

Code
df |>
  group_by(Gender) |>
  summarise(Mean_Age   = round(mean(Age), 2),
            Median_Age = median(Age)) |>
  knitr::kable(caption = "Age Distribution by Gender")
Age Distribution by Gender

| Gender | Mean_Age | Median_Age |
|---|---|---|
| Men | 40.05 | 39 |
| Women | 39.94 | 39 |
Code
ggplot(df, aes(x = Age, fill = Gender)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("Men" = "#2c7bb6", "Women" = "#d7191c")) +
  labs(title = "Age Distribution by Gender",
       x = "Age", y = "Density") +
  theme_minimal()

Findings: The mean and median ages are nearly identical (men: mean 40.05, median 39; women: mean 39.94, median 39), and the age distributions look very similar. Differences in age therefore cannot explain the observed wage gap in this dataset. Any wage differential linked to accumulated work experience would have to come from differences at a given age (e.g., career interruptions) rather than from the raw age variable.


Part 4: Interpretation

1. Why Use log(Wage) Instead of Wage?

Two key reasons economists use the logarithm of wages:

  1. Percentage interpretation of coefficients. In a regression of l_wage on explanatory variables, each coefficient can be read directly as an approximate percentage effect. For example, a coefficient of 0.10 on an education dummy means that group earns roughly 10% more — a unit that is economically meaningful and scale-invariant. Raw wage regressions yield coefficients in dollars, which are harder to compare across studies, time periods, or countries with different wage levels.

  2. Correcting for skewness and satisfying regression assumptions. Raw wages are heavily right-skewed; a small number of very high earners creates influential outliers that distort OLS estimates and inflate standard errors. Log wages are approximately normally distributed (log-normal property), which stabilises variance (reduces heteroskedasticity), brings outliers closer to the bulk of data, and produces more reliable inference. This also makes residuals better behaved in regression analysis.


2. Is the Raw Wage Gap the Same as Discrimination?

No — the raw gap is not equivalent to discrimination. The exploration of Educ and Parttime reveals at least two substantial confounders:

  • Education: Men in the sample are significantly more represented at higher education levels (Levels 3 and 4), which are associated with higher-paying occupations and positions. If part of the wage gap simply reflects the return to education, it cannot be attributed to gender discrimination per se.

  • Part-time work: Women are more than twice as likely to work part-time (56.0% vs 22.5%). Part-time roles typically carry wage penalties due to lower accumulated experience, reduced access to training, and employer preferences for full-time commitment. A wage gap that stems from part-time status differences is a structural or labour-supply issue, not necessarily direct employer discrimination.

That said, these confounders may themselves be endogenous to gender: women may choose (or be constrained to choose) part-time work because of caregiving responsibilities rooted in social norms, or may face barriers that limit their access to higher education pathways. Therefore, even after controlling for education and part-time status, a residual wage gap could reflect indirect discrimination embedded in institutional structures. The raw gap is likely to overstate direct employer discrimination; a regression with controls would give a more informative estimate, but its interpretation still requires caution.


AI Use Log

This quiz was completed with assistance from Claude (Anthropic, claude-sonnet-4-20250514), accessed via claude.ai in April 2026.

Specific uses:

| Task | AI Contribution |
|---|---|
| R code structure | Suggested ggplot2 syntax for histograms, boxplots, and density plots; knitr::kable for tables |
| Statistical computation | Verified summary statistic outputs and log-wage gap calculation |
| Written interpretation | Drafted initial explanations for shape, gap comparison, confounder analysis, and Part 4 short answers; reviewed and edited by student |
| Quarto formatting | Suggested YAML header options and code chunk settings |

All numerical outputs were verified independently by running the code. All written interpretations were reviewed, edited, and confirmed to reflect the student’s own understanding.