Quiz: Wage Gender Gap – Distribution and Log Values

Author

Efe Colak

Setup

library(ggplot2)
library(dplyr)
library(readxl)

df <- read_excel("Wage_GenderDS.xlsx")
head(df)
# A tibble: 6 × 6
  Observation  Wage Female   Age  Educ Parttime
        <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
1         119    32      1    31     1        1
2           2    34      1    42     1        1
3          41    37      1    31     1        1
4          65    38      1    33     1        1
5         246    38      1    21     2        1
6         254    38      0    28     2        1

Part 1: Distribution of Wage

1. Histogram of Wage

ggplot(df, aes(x = Wage)) +
  geom_histogram(binwidth = 20, fill = "blue", color = "white") +
  labs(title = "Histogram of Raw Hourly Wage", x = "Wage", y = "Frequency") +
  theme_minimal()

Shape: The distribution is skewed to the right (positively skewed). Most workers are clustered in the lower-to-middle wage range, and a long right tail contains only a few high earners. This is a typical shape for a wage distribution: the high earners at the top pull the mean above the median.

2. Boxplot of Wage by Gender

df$Gender <- ifelse(df$Female == 1, "Women", "Men")

ggplot(df, aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Wage by Gender", x = "", y = "Wage") +
  theme_minimal() +
  theme(legend.position = "none")

Comparison:

  • Median wage: Men's median wage (111) is well above women's (83.5), so the typical male worker earns more per hour than the typical female worker.
  • IQR: Men have a wider interquartile range, indicating greater dispersion in their wages. Women's wages are concentrated more tightly at lower levels.
  • Outliers: Both groups have high-wage outliers, but the men's outliers lie further out, indicating a longer upper tail in the male wage distribution.

3. Summary Statistics

df %>%
  group_by(Gender) %>%
  summarise(
    Mean   = round(mean(Wage), 2),
    Median = round(median(Wage), 2),
    SD     = round(sd(Wage), 2),
    Min    = min(Wage),
    Max    = max(Wage)
  )
# A tibble: 2 × 6
  Gender  Mean Median    SD   Min   Max
  <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 Men    125.   111    57.3    38   384
2 Women   97.3   83.5  46.3    32   364
mean_men   <- mean(df$Wage[df$Female == 0])
mean_women <- mean(df$Wage[df$Female == 1])

cat("Raw wage gap (Men - Women):", round(mean_men - mean_women, 2))
Raw wage gap (Men - Women): 27.81

Raw wage gap: On average, men earn 27.81 more per hour than women. This is purely a descriptive number: it does not yet account for differences in education, age, or working hours. It tells us that there is a gap, but not why the gap exists.


Part 2: Log Transformation

1. Create log(Wage) and Histogram

df$l_wage <- log(df$Wage)

ggplot(df, aes(x = l_wage)) +
  geom_histogram(binwidth = 0.15, fill = "darkorange", color = "white") +
  labs(title = "Histogram of log(Wage)", x = "log(Wage)", y = "Frequency") +
  theme_minimal()

Comparison with raw Wage: After the log transformation the histogram is far more symmetric and bell-shaped, in sharp contrast to the strong right skew of the raw Wage histogram. The log transformation compresses the long right tail, pulling the high earners closer to the middle of the distribution. This makes the data better suited to methods that work best with roughly symmetric, normally distributed values.


2. Boxplot of l_wage by Gender

ggplot(df, aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot() +
  labs(title = "log(Wage) by Gender", x = "", y = "log(Wage)") +
  theme_minimal() +
  theme(legend.position = "none")

Does a log transformation alter the apparent gap?: The difference between men and women is still clearly visible after the log transformation. The distributions become more symmetric and the extreme outliers look less dramatic, but the gap between the two groups remains.

Why economists prefer l_wage:

  1. Percentage interpretation: In a log-wage regression, the coefficient on a variable such as Female can be read as an approximate percentage difference. A one-dollar difference in raw wages means something different at different wage levels, which makes comparisons harder.
  2. Improved statistical properties: Raw wages are skewed to the right, and their variance tends to increase with the wage level (heteroskedasticity). Log wages are closer to normally distributed and have a more stable variance, which brings the data closer to the assumptions behind standard OLS inference.
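The percentage interpretation in point 1 can be illustrated with a minimal sketch. The wage levels below (100 and 1000) are made up for illustration and are not taken from the dataset; the point is only that the same proportional raise gives very different raw changes but the same log change.

```r
# Hypothetical wage levels, chosen only for illustration (not from the data)
low  <- 100
high <- 1000

# Raw change from a 10% raise depends on the starting level
raw_change_low  <- 1.10 * low  - low
raw_change_high <- 1.10 * high - high

# Log change from a 10% raise is the same at every level: log(1.10)
log_change_low  <- log(1.10 * low)  - log(low)
log_change_high <- log(1.10 * high) - log(high)

c(raw_low = raw_change_low, raw_high = raw_change_high)
c(log_low = log_change_low, log_high = log_change_high)
```

Both log changes equal log(1.10), about 0.095, which is why a difference in logs can be read as an approximate percentage difference regardless of the wage level.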

3. Approximate Percentage Gap

mean_lw_men   <- mean(df$l_wage[df$Female == 0])
mean_lw_women <- mean(df$l_wage[df$Female == 1])

cat("Mean log(Wage) - Men:  ", round(mean_lw_men, 4), "\n")
Mean log(Wage) - Men:   4.7336 
cat("Mean log(Wage) - Women:", round(mean_lw_women, 4), "\n")
Mean log(Wage) - Women: 4.483 
cat("Approximate % gap:     ", round(100 * (mean_lw_men - mean_lw_women), 2), "%")
Approximate % gap:      25.06 %

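Interpretation: The difference in mean log wages is 0.2506, so the gap is approximately 25%: men in this sample earn, on average, roughly 25 percent more than women. The log-difference approximation is most accurate for small gaps. At this size it understates the implied percentage somewhat, since exp(0.2506) − 1 ≈ 0.285, or about 28.5%.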
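As a quick check on how close the approximation is, the log difference can be converted into the percentage it implies. This is a small sketch that reuses the two means printed above as literal values; the names `log_gap`, `approx_pct`, and `exact_pct` are introduced here for illustration.

```r
# Mean log wages copied from the printed output above
mean_lw_men   <- 4.7336
mean_lw_women <- 4.483

log_gap <- mean_lw_men - mean_lw_women

# Approximation: read the log difference directly as a percentage
approx_pct <- 100 * log_gap

# Percentage implied by the log difference: 100 * (exp(gap) - 1)
exact_pct <- 100 * (exp(log_gap) - 1)

round(c(approx = approx_pct, exact = exact_pct), 2)
```

The second figure comes out a few percentage points above the first (about 28.5 versus 25.06), which shows that the shortcut is only rough for a gap of this size. Strictly, it compares the geometric rather than arithmetic mean wages of the two groups.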


Part 3: Exploring Confounders

1. Education Levels by Gender

table(df$Gender, df$Educ)
       
          1   2   3   4
  Men   108  77  72  59
  Women  88  57  33   6
ggplot(df, aes(x = factor(Educ), fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(title = "Education Level by Gender", x = "Education Level", y = "Count") +
  theme_minimal()

Which education level is most common?

  • Women: Education level 1 is the most common (88 of 184 women, about 48%). Women are heavily concentrated in the lower levels of education.
  • Men: Education level 1 is also the most common among men (108 of 316 men, about 34%), but men are distributed more evenly across the four levels.

The difference matters because higher levels of education are generally associated with higher wages. The heavier concentration of women in level 1 could be one of the reasons why women earn less on average.
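Because there are more men than women in the sample, raw counts are hard to compare. A minimal sketch of the within-gender shares, re-entering the counts from the table above as a matrix (the object names `counts` and `shares` are introduced here for illustration):

```r
# Counts copied from the table(df$Gender, df$Educ) output above
counts <- matrix(
  c(108, 77, 72, 59,
     88, 57, 33,  6),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Men", "Women"),
                  Educ   = c("1", "2", "3", "4"))
)

# Row percentages: share of each gender at each education level
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares
```

With the original data frame, the same shares should be obtainable from `prop.table(table(df$Gender, df$Educ), margin = 1)`.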


2. Part-Time Work by Gender

df %>%
  group_by(Gender) %>%
  summarise(Parttime_Proportion = round(mean(Parttime), 3))
# A tibble: 2 × 2
  Gender Parttime_Proportion
  <chr>                <dbl>
1 Men                  0.225
2 Women                0.56 

What could be the impact of this on the wage gap? 56% of women work part-time, compared with only 22.5% of men. Part-time employees tend to have lower hourly wages and less experience, and they are more likely to be in lower-paying jobs. This large difference in part-time rates is probably one of the most important structural contributors to the raw wage gap. It is not necessarily a sign of discrimination: it may reflect differences in caregiving roles or personal preferences. It does mean, however, that we cannot read the raw gap as a measure of pure discrimination.

3. Age Distribution

df %>%
  group_by(Gender) %>%
  summarise(
    Mean_Age   = round(mean(Age), 2),
    Median_Age = median(Age)
  )
# A tibble: 2 × 3
  Gender Mean_Age Median_Age
  <chr>     <dbl>      <dbl>
1 Men        40.0         39
2 Women      39.9         39

Are they similar? Yes. The mean and median ages are nearly identical for the two groups (mean of about 40 and median of 39 in both). This means age is unlikely to explain the wage gap in this dataset. If men were systematically older, we might attribute part of the wage gap to greater accumulated work experience, but that is not the case here.


Part 4: Interpretation

1. Why use log(wage) instead of wage?

Economists prefer log wages to raw wages for two main reasons:

  1. Interpretation of results in percentages. When we regress log(wage) on gender and control variables, the coefficient on the female dummy can be read directly as an approximate percentage wage difference (for example, a coefficient of -0.15 would mean women earn roughly 15 percent less than men, other things being equal). A coefficient from a raw-wage regression would be in dollars, which is harder to interpret and to compare across wage levels or over time.

  2. Raw wages are poorly suited to OLS. The distribution of raw wages is strongly right-skewed and heteroskedastic: the variance of wages increases with the wage level. Standard OLS inference assumes errors with constant variance, and exact small-sample tests additionally assume normally distributed errors. Log wages come much closer to satisfying these assumptions, and so give more reliable standard errors and inference.

2. Is the raw wage gap the same as discrimination?

No, the raw wage gap is not the same as discrimination. The $27.81 hourly gap (roughly 25% by the log approximation) is a descriptive statistic that mixes together many different factors. Our exploration of the data reveals at least two important confounders:

  • Part-time work: Women are more than twice as likely to work part-time (56% vs. 22.5%). Since part-time workers tend to earn lower wages, this structural difference alone may account for a sizeable portion of the observed gap, without any wage discrimination being involved.

  • Education: Women in this sample are more concentrated in lower education levels (Educ = 1), while men are more evenly spread across levels 1–4. Because education and wages are positively related, this compositional difference also contributes to the raw gap.

To isolate the portion of the wage gap that might reflect discrimination, we would need a multivariate regression that controls for education, age, part-time status, occupation, and other relevant variables. The unexplained residual after such controls is a closer, though still imperfect, measure of discrimination, since factors that are not observed in the data (career interruptions, for example) can still bias the estimate.
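The idea can be sketched on simulated data. Everything below is hypothetical: the data frame `sim`, the sample size, and all coefficients in the data-generating step are assumptions made for illustration, and only the variable names and the two part-time rates echo the dataset above. The sketch is not a result from the actual wage data.

```r
set.seed(1)
n <- 2000

# Simulated (hypothetical) sample; not the real dataset
Female   <- rbinom(n, 1, 0.4)
# Part-time rates loosely echo the 56% vs 22.5% reported above
Parttime <- rbinom(n, 1, ifelse(Female == 1, 0.56, 0.225))
Educ     <- sample(1:4, n, replace = TRUE)
Age      <- round(runif(n, 20, 60))

# Assumed data-generating process: part-time lowers log wages,
# and there is a separate direct gender effect of -0.10
l_wage <- 4.2 + 0.15 * Educ + 0.005 * Age -
  0.20 * Parttime - 0.10 * Female + rnorm(n, sd = 0.3)

sim <- data.frame(l_wage, Female, Educ, Age, Parttime)

# Raw gap: gender only
raw_fit <- lm(l_wage ~ Female, data = sim)
# Adjusted gap: gender plus controls
adj_fit <- lm(l_wage ~ Female + factor(Educ) + Age + Parttime, data = sim)

raw_gap <- coef(raw_fit)["Female"]
adj_gap <- coef(adj_fit)["Female"]
c(raw = unname(raw_gap), adjusted = unname(adj_gap))
```

In this simulation the raw coefficient on Female is larger in magnitude than the adjusted one, because part of the raw gap comes from women's higher part-time rate. On the real data, the analogous call would presumably be `lm(l_wage ~ Female + factor(Educ) + Age + Parttime, data = df)`, and its Female coefficient would be the gap left unexplained by these controls.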

AI Use Log

I got help from AI with some of the code and the page layout, to make sure the published .qmd looks clean and is easy to read.

End of analysis.