library(tidyverse)
library(readxl)
df <- read_excel("Wage_GenderDS.xlsx")
df$l_wage <- log(df$Wage)
df$Gender <- ifelse(df$Female == 1, "Women", "Men")
men <- df |> filter(Female == 0)
women <- df |> filter(Female == 1)

ggplot(df, aes(x = Wage)) +
  geom_histogram(binwidth = 20, fill = "#2c7bb6", color = "white") +
  labs(title = "Histogram of Hourly Wage",
       x = "Hourly Wage", y = "Count") +
  theme_minimal()

Shape: The histogram is right-skewed (positively skewed). The bulk of observations are concentrated at lower wage levels (roughly 30–150), with a long right tail extending toward wages above 300. This is a typical feature of raw wage distributions: a small number of high earners pulls the mean above the median and creates the rightward tail.

ggplot(df, aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, alpha = 0.7) +
  scale_fill_manual(values = c("Men" = "#2c7bb6", "Women" = "#d7191c")) +
  labs(title = "Hourly Wage by Gender",
       x = "Gender", y = "Hourly Wage") +
  theme_minimal() +
  theme(legend.position = "none")

Comparison:

| Statistic | Men | Women |
|---|---|---|
| Median | ~111 | ~83.5 |
| IQR (Q1–Q3) | 83.75–158.50 | 67.75–116.00 |
| Outliers | Several above ~260 | Fewer, one near 364 |
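
The outlier row depends on where the boxplot whiskers end. As a rough check, assuming `geom_boxplot()` is left at its default of flagging points more than 1.5 × IQR beyond the quartiles, the fences can be recomputed from the quartiles reported in the table. The helper name `iqr_fences` is hypothetical and the quartiles are typed in by hand rather than re-read from the data file.

```r
# Quartiles copied from the comparison table above (not re-read from the data file)
quartiles <- data.frame(
  Gender = c("Men", "Women"),
  Q1 = c(83.75, 67.75),
  Q3 = c(158.50, 116.00)
)

# Hypothetical helper: Tukey fences at k * IQR beyond each quartile
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  data.frame(IQR = iqr, Lower = q1 - k * iqr, Upper = q3 + k * iqr)
}

fences <- cbind(quartiles, iqr_fences(quartiles$Q1, quartiles$Q3))
print(fences)
```

Under that assumption, any wage above the `Upper` value for a group would be drawn as an individual outlier point; the lower fences fall below zero, so no low-wage outliers are possible here.
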
df |>
  group_by(Gender) |>
  summarise(
    Mean = round(mean(Wage), 2),
    Median = round(median(Wage), 2),
    SD = round(sd(Wage), 2),
    Min = min(Wage),
    Max = max(Wage)
  ) |>
  knitr::kable(caption = "Summary Statistics for Hourly Wage by Gender")

| Gender | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
| Men | 125.13 | 111.0 | 57.34 | 38 | 384 |
| Women | 97.33 | 83.5 | 46.31 | 32 | 364 |

raw_gap <- mean(men$Wage) - mean(women$Wage)
cat("Raw wage gap (mean men – mean women): $", round(raw_gap, 2))

Raw wage gap (mean men – mean women): $ 27.81

Interpretation: The mean hourly wage is approximately $125.13 for men versus $97.33 for women, a raw wage gap of approximately $27.81 per hour (computed from the unrounded means, hence the one-cent difference from 125.13 − 97.33). Men also have a higher standard deviation ($57.34 vs $46.31), indicating that their wages are more spread out.

ggplot(df, aes(x = l_wage)) +
  geom_histogram(binwidth = 0.15, fill = "#1a9641", color = "white") +
  labs(title = "Histogram of log(Wage)",
       x = "log(Hourly Wage)", y = "Count") +
  theme_minimal()

Shape comparison: The log-transformed wage histogram is approximately symmetric and bell-shaped, closely resembling a normal distribution. This is in sharp contrast to the raw Wage histogram, which was heavily right-skewed. The log transformation compresses the long upper tail and spreads out the lower range, pulling the distribution toward symmetry. This is the well-known log-normality property of wages.

ggplot(df, aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, alpha = 0.7) +
  scale_fill_manual(values = c("Men" = "#2c7bb6", "Women" = "#d7191c")) +
  labs(title = "log(Wage) by Gender",
       x = "Gender", y = "log(Hourly Wage)") +
  theme_minimal() +
  theme(legend.position = "none")

Does the gap change? The gap between men and women remains visible after the log transformation, but it now reads as a proportional difference rather than a dollar difference. In raw wages, outliers exaggerated the visual spread for men; after logging, both distributions look more symmetric and comparable in spread, which makes the systematic gap clearer.

Why economists prefer l_wage:

mean_lw_men <- mean(men$l_wage)
mean_lw_women <- mean(women$l_wage)
pct_gap <- 100 * (mean_lw_men - mean_lw_women)
cat("Mean log(Wage) – Men: ", round(mean_lw_men, 4), "\n")
Mean log(Wage) – Men: 4.7336
cat("Mean log(Wage) – Women:", round(mean_lw_women, 4), "\n")
Mean log(Wage) – Women: 4.483
cat("Approximate % gap: ", round(pct_gap, 2), "%\n")
Approximate % gap: 25.06 %

Interpretation: The mean log wage is approximately 4.7336 for men and 4.4830 for women. The difference is ≈ 0.2506, which corresponds to an approximate percentage gap of 25.06%. That is, men earn roughly 25% more per hour than women in this sample when the gap is measured on the log scale. The approximation is slightly smaller than the exact conversion, exp(0.2506) - 1 ≈ 28.5%, but the log difference is widely used because it is quick to interpret. Strictly, a difference in mean logs compares geometric rather than arithmetic mean wages.
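
The relationship between the log-difference approximation and the exact conversion can be checked directly. A minimal sketch, using the two rounded means printed in the output above rather than the data file, so the results are only as precise as those four-decimal values:

```r
# Mean log wages as printed in the output above (rounded to 4 decimals)
mean_lw_men   <- 4.7336
mean_lw_women <- 4.4830

log_gap   <- mean_lw_men - mean_lw_women   # difference in mean logs
approx_pc <- 100 * log_gap                 # log-difference approximation
exact_pc  <- 100 * (exp(log_gap) - 1)      # exact conversion of the log gap

cat("Log gap:      ", round(log_gap, 4), "\n")
cat("Approximate %:", round(approx_pc, 2), "\n")
cat("Exact %:      ", round(exact_pc, 2), "\n")
```

The two figures agree closely for small log gaps and drift apart as the gap grows, which is why the approximation is usually reserved for differences well below 0.3.
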

educ_table <- df |>
  count(Gender, Educ) |>
  pivot_wider(names_from = Gender, values_from = n, values_fill = 0) |>
  arrange(Educ)
knitr::kable(educ_table, caption = "Frequency of Education Levels by Gender")

| Educ | Men | Women |
|---|---|---|
| 1 | 108 | 88 |
| 2 | 77 | 57 |
| 3 | 72 | 33 |
| 4 | 59 | 6 |

cat("Most common Educ level – Women:",
    as.integer(names(which.max(table(women$Educ)))), "\n")
Most common Educ level – Women: 1
cat("Most common Educ level – Men: ",
    as.integer(names(which.max(table(men$Educ)))), "\n")
Most common Educ level – Men: 1

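Findings: Level 1 is the most common education level for both women (88 of 184) and men (108 of 316). Men are far more represented at the top of the scale: 59 men hold Level 4 compared with only 6 women.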
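
Because the sample contains 316 men and 184 women, raw counts are hard to compare across genders. A short sketch that converts the frequency table above into within-gender shares; the counts are typed in from the table rather than recomputed from the data file:

```r
# Counts copied from the frequency table above
educ_counts <- matrix(
  c(108, 77, 72, 59,
     88, 57, 33,  6),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Men", "Women"), Educ = as.character(1:4))
)

# Row proportions: share of each gender at each education level
educ_share <- prop.table(educ_counts, margin = 1)
print(round(100 * educ_share, 1))

# Share of each gender at Levels 3 and 4 combined
high_educ <- rowSums(educ_share[, c("3", "4")])
print(round(100 * high_educ, 1))
```

Row proportions put both genders on a common 0 to 100% scale, so the Level 3 and Level 4 shares can be compared directly.
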

pt_table <- df |>
  group_by(Gender) |>
  summarise(`Part-time Proportion` = round(mean(Parttime), 4),
            `Part-time Count` = sum(Parttime),
            `Total` = n())
knitr::kable(pt_table, caption = "Part-Time Work Proportions by Gender")

| Gender | Part-time Proportion | Part-time Count | Total |
|---|---|---|---|
| Men | 0.2247 | 71 | 316 |
| Women | 0.5598 | 103 | 184 |
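Findings: About 56% of women in the sample (103 of 184) work part-time, compared with about 22% of men (71 of 316).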
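
The size of the part-time difference can be summarised as a ratio of the two shares. A minimal sketch based on the counts in the part-time table above, typed in by hand rather than recomputed from the data file:

```r
# Counts copied from the part-time table above
pt <- data.frame(
  Gender   = c("Men", "Women"),
  PartTime = c(71, 103),
  Total    = c(316, 184)
)

pt$Share <- pt$PartTime / pt$Total   # within-gender part-time share

# Ratio of the women's share to the men's share
pt_ratio <- pt$Share[pt$Gender == "Women"] / pt$Share[pt$Gender == "Men"]

print(pt)
cat("Women-to-men part-time ratio:", round(pt_ratio, 2), "\n")
```
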

df |>
  group_by(Gender) |>
  summarise(Mean_Age = round(mean(Age), 2),
            Median_Age = median(Age)) |>
  knitr::kable(caption = "Age Distribution by Gender")

| Gender | Mean_Age | Median_Age |
|---|---|---|
| Men | 40.05 | 39 |
| Women | 39.94 | 39 |

ggplot(df, aes(x = Age, fill = Gender)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("Men" = "#2c7bb6", "Women" = "#d7191c")) +
  labs(title = "Age Distribution by Gender",
       x = "Age", y = "Density") +
  theme_minimal()

Findings: The mean and median ages are nearly identical (men: mean 40.05, median 39; women: mean 39.94, median 39), and the age distributions are visually very similar as well. Age differences therefore cannot explain the observed wage gap in this dataset. Any wage differential attributable to accumulated work experience would have to come from within-age differences (e.g., career interruptions) rather than from the raw age variable.

Two key reasons economists use the logarithm of wages:

1. Percentage interpretation of coefficients. In a regression of l_wage on explanatory variables, each coefficient can be read directly as an approximate percentage effect. For example, a coefficient of 0.10 on an education dummy means that group earns roughly 10% more, a unit that is economically meaningful and scale-invariant. Raw wage regressions yield coefficients in dollars, which are harder to compare across studies, time periods, or countries with different wage levels.
2. Correcting for skewness and satisfying regression assumptions. Raw wages are heavily right-skewed; a small number of very high earners creates influential outliers that can distort OLS estimates and inflate standard errors. Log wages are approximately normally distributed (the log-normal property), which tends to stabilise variance (reducing heteroskedasticity), brings outliers closer to the bulk of the data, and produces more reliable inference. It also makes regression residuals better behaved.
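
The skewness argument can be illustrated on simulated data. The sketch below does not use the quiz dataset: it draws wages from a log-normal distribution with assumed parameters (`meanlog = 4.6`, `sdlog = 0.45`, chosen only to give wage-like magnitudes) and compares a moment-based skewness measure before and after taking logs. The `skewness` helper is hypothetical and written in base R.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated wages, NOT the quiz data: log-normal with assumed parameters
sim_wage <- rlnorm(5000, meanlog = 4.6, sdlog = 0.45)

# Hypothetical helper: moment-based sample skewness, base R only
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_wage)
skew_log <- skewness(log(sim_wage))

cat("Skewness of simulated wage:     ", round(skew_raw, 2), "\n")
cat("Skewness of log(simulated wage):", round(skew_log, 2), "\n")
cat("Mean above median on raw scale: ", mean(sim_wage) > median(sim_wage), "\n")
```

A clearly positive skewness on the raw scale and a value near zero after logging mirror the contrast between the two histograms.
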
No, the raw gap is not equivalent to discrimination. The exploration of Educ and Parttime reveals at least two substantial confounders:

Education: Men in the sample are much more heavily represented at the higher education levels: about 41% of men (131 of 316) are at Levels 3 and 4, compared with about 21% of women (39 of 184). These levels are associated with higher-paying occupations and positions. If part of the wage gap simply reflects the return to education, it cannot be attributed to gender discrimination per se.

Part-time work: Women are more than twice as likely to work part-time (56% vs 22%). Part-time roles typically carry wage penalties due to lower accumulated experience, reduced access to training, and employer preferences for full-time commitment. A wage gap that stems from differences in part-time status is a structural or labour-supply issue, not necessarily direct employer discrimination.

That said, these confounders may themselves be endogenous to gender: women may choose (or be constrained to choose) part-time work because of caregiving responsibilities rooted in social norms, or may face barriers that limit their access to higher education pathways. Even after controlling for education and part-time status, a residual wage gap could therefore reflect indirect discrimination embedded in institutional structures. The raw gap likely overstates outright discrimination; a fully controlled regression would yield a more precise estimate, but even then interpretation requires caution.
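
The idea of a controlled regression can be sketched as follows. This is an illustration on simulated data, not an analysis of the quiz dataset: the sample size, the part-time probabilities, and every coefficient in the data-generating line are assumptions chosen only so that part of the raw gap runs through part-time status. Education is simulated independently of gender for simplicity.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 2000

# Simulated data, NOT the quiz dataset; all effect sizes below are assumptions
female   <- rbinom(n, 1, 0.4)
parttime <- rbinom(n, 1, ifelse(female == 1, 0.56, 0.22))  # women more often part-time
educ     <- sample(1:4, n, replace = TRUE)
l_wage   <- 4.3 + 0.12 * educ - 0.20 * parttime - 0.10 * female +
  rnorm(n, sd = 0.35)
sim <- data.frame(l_wage, female, parttime, educ = factor(educ))

raw_fit  <- lm(l_wage ~ female, data = sim)                    # raw log gap
ctrl_fit <- lm(l_wage ~ female + parttime + educ, data = sim)  # adjusted gap

raw_coef  <- unname(coef(raw_fit)["female"])
ctrl_coef <- unname(coef(ctrl_fit)["female"])

cat("Raw female coefficient:     ", round(raw_coef, 3), "\n")
cat("Adjusted female coefficient:", round(ctrl_coef, 3), "\n")
```

In this simulation the adjusted coefficient is closer to zero than the raw one because part of the raw gap is carried by part-time status. Whether the same holds in the actual data, and by how much, would have to be checked by fitting the models to `df`.
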
This quiz was completed with assistance from Claude (Anthropic, claude-sonnet-4-20250514), accessed via claude.ai in April 2026.
Specific uses:
| Task | AI Contribution |
|---|---|
| R code structure | Suggested ggplot2 syntax for histograms, boxplots, and density plots; knitr::kable for tables |
| Statistical computation | Verified summary statistic outputs and log-wage gap calculation |
| Written interpretation | Drafted initial explanations for shape, gap comparison, confounder analysis, and Part 4 short answers; reviewed and edited by student |
| Quarto formatting | Suggested YAML header options and code chunk settings |
All numerical outputs were verified independently by running the code. All written interpretations were reviewed, edited, and confirmed to reflect the student’s own understanding.