Quiz 2 – Wage Gender Gap

Author

İsmet Erdal Tunç

## AI Use Log

I used ChatGPT to help me structure the Quarto document, generate R code for the requested graphs and summary tables, and improve the wording of my explanations. I checked the outputs myself and interpreted the results based on the dataset.

## Setup

library(readxl)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(knitr)

# Read the dataset
df <- read_excel("Wage Gender.xlsx")

# Check variable names
names(df)

[1] "Observation" "Wage"        "Female"      "Age"         "Educ"       
[6] "Parttime"

# Create log wage and a Gender label; convert Parttime and Educ to factors
df <- df %>%
  mutate(
    l_wage = log(Wage),
    Gender = ifelse(Female == 1, "Women", "Men"),
    Parttime = as.factor(Parttime),
    Educ = as.factor(Educ)
  )

Part 1: Distribution of Wage

1. Histogram of Wage

ggplot(df, aes(x = Wage)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Raw Hourly Wage",
    x = "Wage",
    y = "Frequency"
  ) +
  theme_minimal()

Explanation:

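The distribution of raw hourly wage is right-skewed: most people earn wages in the lower-to-middle range, while a smaller number earn much higher wages, creating a long right tail. The summary statistics in item 3 are consistent with this, since the mean exceeds the median for both men (125.13 vs. 111.0) and women (97.33 vs. 83.5).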

2. Boxplot of Wage by Gender

ggplot(df, aes(x = Gender, y = Wage)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Boxplot of Wage by Gender",
    x = "Gender",
    y = "Wage"
  ) +
  theme_minimal()

Explanation:

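This boxplot compares the wage distributions of men and women in terms of the median, the interquartile range (IQR), and outliers. The summary statistics in item 3 show a higher median wage for men (111.0) than for women (83.5), which indicates a raw wage gap in favor of men. A larger IQR for one group would indicate more wage variation within that group; men's larger standard deviation (57.34 vs. 46.31) suggests their wages are more dispersed.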
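The quantities read off a boxplot can also be computed directly. The following is a minimal sketch in base R, using a small made-up data frame (`toy`, which is not the quiz dataset) to show how the median, the IQR, and the usual 1.5 × IQR upper fence could be obtained for each group.

```r
# Hypothetical toy data for illustration only (not the quiz dataset)
toy <- data.frame(
  Gender = rep(c("Men", "Women"), each = 6),
  Wage   = c(60, 90, 105, 120, 150, 300, 50, 70, 80, 90, 110, 250)
)

# Median and interquartile range within each gender
medians <- tapply(toy$Wage, toy$Gender, median)
iqrs    <- tapply(toy$Wage, toy$Gender, IQR)

# Upper fence: values above Q3 + 1.5 * IQR are the points a boxplot
# draws individually as outliers
upper_fence <- tapply(toy$Wage, toy$Gender, function(w) {
  quantile(w, 0.75, names = FALSE) + 1.5 * IQR(w)
})

print(rbind(median = medians, IQR = iqrs, upper_fence = upper_fence))
```

Applying the same `tapply()` calls to `df$Wage` and `df$Gender` should give the numbers behind the boxplot above.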

3. Summary statistics by gender

summary_stats <- df %>%
  group_by(Gender) %>%
  summarise(
    mean_wage = mean(Wage, na.rm = TRUE),
    median_wage = median(Wage, na.rm = TRUE),
    sd_wage = sd(Wage, na.rm = TRUE),
    min_wage = min(Wage, na.rm = TRUE),
    max_wage = max(Wage, na.rm = TRUE)
  )

kable(summary_stats, digits = 2, caption = "Summary Statistics of Wage by Gender")

Summary Statistics of Wage by Gender
Gender	mean_wage	median_wage	sd_wage	min_wage	max_wage
Men	125.13	111.0	57.34	38	384
Women	97.33	83.5	46.31	32	364

4. Raw wage gap in dollars

# Mean wage for men (Female == 0) and women (Female == 1)
mean_men <- mean(df$Wage[df$Female == 0], na.rm = TRUE)
mean_women <- mean(df$Wage[df$Female == 1], na.rm = TRUE)

raw_gap <- mean_men - mean_women
raw_gap

[1] 27.80682

Explanation:

The raw wage gap in dollars is the mean wage of men minus the mean wage of women. The result is about 27.81, so men earn roughly 27.81 dollars more per hour than women on average.
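A gap in wage units is easier to judge against a baseline. As a rough sketch, the snippet below expresses the raw gap as a share of women's mean wage, using the rounded group means from the summary table above; choosing women's mean as the baseline is an assumption made here for illustration. With these inputs the gap comes to roughly 28.6% of women's mean wage.

```r
# Group means as reported (rounded) in the summary table above
mean_men   <- 125.13
mean_women <- 97.33

raw_gap <- mean_men - mean_women        # gap in wage units
gap_pct <- 100 * raw_gap / mean_women   # gap relative to women's mean wage

cat(sprintf("Raw gap: %.2f, which is %.1f%% of women's mean wage\n",
            raw_gap, gap_pct))
```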

Part 2: Log Transformation

1. Histogram of log(Wage)

ggplot(df, aes(x = l_wage)) +
  geom_histogram(bins = 30, fill = "orange", color = "black") +
  labs(
    title = "Histogram of Log Wage",
    x = "log(Wage)",
    y = "Frequency"
  ) +
  theme_minimal()

Explanation:

Compared to the raw wage histogram, the distribution of log(Wage) is typically more symmetric and less right-skewed. The log transformation compresses very large wage values more than small ones, which shortens the right tail and makes the distribution easier to analyze.
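The effect of the transformation on skewness can be checked numerically instead of only by eye. The sketch below defines a small moment-based `skewness()` helper (written here for illustration, not taken from a package) and applies it to a made-up right-skewed wage vector before and after taking logs.

```r
# Hypothetical right-skewed wages for illustration only (not the quiz dataset)
wages <- c(38, 60, 75, 83, 90, 111, 125, 160, 240, 384)

# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(wages)        # skewness of raw wages
skew_log <- skewness(log(wages))   # skewness after the log transformation

print(c(raw = skew_raw, log = skew_log))
```

A value near zero indicates a roughly symmetric distribution. Running `skewness(df$Wage)` and `skewness(df$l_wage)` would give the corresponding comparison for the quiz data.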

2. Boxplot of log(Wage) by Gender

ggplot(df, aes(x = Gender, y = l_wage)) +
  geom_boxplot(fill = "pink") +
  labs(
    title = "Boxplot of Log Wage by Gender",
    x = "Gender",
    y = "log(Wage)"
  ) +
  theme_minimal()

Explanation:

The log transformation may reduce the visual influence of extreme high wages. The gender gap may still remain visible, but the distributions are often easier to compare after taking logs. Economists often prefer log(wage) because differences in logs can be interpreted approximately as percentage differences.
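The percentage interpretation follows from a first-order approximation of the logarithm. As a brief sketch, let $w_m$ and $w_f$ denote a man's and a woman's wage and let $g$ be the proportional gap (these symbols are introduced here for illustration):

```latex
\log w_m - \log w_f
  = \log\!\left(\frac{w_m}{w_f}\right)
  = \log(1 + g) \approx g,
\qquad g = \frac{w_m - w_f}{w_f}
```

The approximation is close when $g$ is small; for larger positive gaps, the log difference understates $g$.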

3. Mean of log wage by gender

log_means <- df %>%
  group_by(Gender) %>%
  summarise(
    mean_l_wage = mean(l_wage, na.rm = TRUE)
  )

kable(log_means, digits = 3, caption = "Mean Log Wage by Gender")

Mean Log Wage by Gender
Gender	mean_l_wage
Men	4.734
Women	4.483

4. Approximate percentage gap

# Mean log wage for men (Female == 0) and women (Female == 1)
mean_log_men <- mean(df$l_wage[df$Female == 0], na.rm = TRUE)
mean_log_women <- mean(df$l_wage[df$Female == 1], na.rm = TRUE)

approx_gap <- 100 * (mean_log_men - mean_log_women)
approx_gap

[1] 25.06425

Explanation:

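The approximate percentage wage gap is calculated as 100 × (mean log wage of men − mean log wage of women). The result is about 25.06, so men's wages are roughly 25% higher than women's on average. Strictly, a difference in mean logs compares geometric means, and the approximation becomes less accurate as the gap grows.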
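Because the log difference here is fairly large, it may be worth converting it to an exact percentage. The sketch below takes the difference in mean log wages implied by the output above (25.06425 / 100) and applies `exp()`; under this conversion the gap between geometric mean wages is about 28.5%, compared with the log-point approximation of about 25.1%.

```r
# Difference in mean log wages implied by the output above (men minus women)
log_gap <- 0.2506425

approx_pct <- 100 * log_gap              # log-point approximation
exact_pct  <- 100 * (exp(log_gap) - 1)   # exact gap between geometric means

cat(sprintf("Approximate: %.1f%%, exact: %.1f%%\n", approx_pct, exact_pct))
```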

Part 3: Exploring Confounders

1. Education levels by gender

educ_table <- table(df$Gender, df$Educ)
educ_table

       
          1   2   3   4
  Men   108  77  72  59
  Women  88  57  33   6

kable(educ_table, caption = "Education Levels by Gender")

Education Levels by Gender
	1	2	3	4
Men	108	77	72	59
Women	88	57	33	6

prop_educ <- prop.table(educ_table, margin = 1)
round(prop_educ, 3)

       
            1     2     3     4
  Men   0.342 0.244 0.228 0.187
  Women 0.478 0.310 0.179 0.033

kable(round(prop_educ, 3), caption = "Proportion of Education Levels within Each Gender")

Proportion of Education Levels within Each Gender
	1	2	3	4
Men	0.342	0.244	0.228	0.187
Women	0.478	0.310	0.179	0.033

Explanation:

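This table shows the distribution of education levels separately for men and women. Level 1 is the most common education level in both groups (34.2% of men and 47.8% of women). The shares differ most at level 4, which includes 18.7% of men but only 3.3% of women. If higher codes indicate more education, men are more concentrated at the top level, which could account for part of the raw wage gap.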
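The most common level in each group can also be extracted programmatically. The following sketch rebuilds the count table from the figures printed above (so it runs without the Excel file) and uses `which.max()` on each row; the object names `counts`, `row_props`, and `modal_level` are chosen here for illustration.

```r
# Counts copied from the education-by-gender table above
counts <- matrix(
  c(108, 77, 72, 59,
     88, 57, 33,  6),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Men", "Women"), Educ = c("1", "2", "3", "4"))
)

# Row proportions: share of each education level within each gender
row_props <- prop.table(counts, margin = 1)

# Most common (modal) education level for each gender
modal_level <- colnames(counts)[apply(counts, 1, which.max)]
names(modal_level) <- rownames(counts)

print(round(row_props, 3))
print(modal_level)
```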

2. Part-time work by gender

parttime_stats <- df %>%
  group_by(Gender) %>%
  summarise(
    prop_parttime = mean(as.numeric(as.character(Parttime)) == 1, na.rm = TRUE)
  )

kable(parttime_stats, digits = 3, caption = "Proportion of Part-Time Workers by Gender")

Proportion of Part-Time Workers by Gender
Gender	prop_parttime
Men	0.225
Women	0.560

Explanation:

Women are far more likely than men to work part-time (56.0% vs. 22.5%). This may help explain part of the observed wage gap, because part-time jobs may pay lower wages on average or may be concentrated in lower-paying sectors.
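One way to see how part-time status could account for part of the gap is to compare wages within each part-time group. The sketch below uses a tiny made-up data frame (`toy`, not the quiz dataset) in which the overall gap is much larger than the gap among full-time or among part-time workers.

```r
# Hypothetical toy data for illustration only (not the quiz dataset)
toy <- data.frame(
  Gender   = c("Men", "Men", "Men", "Men", "Women", "Women", "Women", "Women"),
  Parttime = c(0, 0, 0, 1, 0, 1, 1, 1),
  Wage     = c(130, 120, 140, 90, 125, 85, 95, 90)
)

# Overall mean wage by gender, ignoring part-time status
overall <- tapply(toy$Wage, toy$Gender, mean)

# Mean wage by gender within each part-time group (columns "0" and "1")
within <- tapply(toy$Wage, list(toy$Gender, toy$Parttime), mean)

overall_gap <- overall[["Men"]] - overall[["Women"]]
within_gap  <- within["Men", ] - within["Women", ]

print(overall_gap)
print(within_gap)
```

In this toy example most of the overall gap comes from women being over-represented among part-time workers. Whether the same holds in the quiz data would need to be checked by running the same comparison on `df`.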

3. Age distribution by gender

age_stats <- df %>%
  group_by(Gender) %>%
  summarise(
    mean_age = mean(Age, na.rm = TRUE),
    median_age = median(Age, na.rm = TRUE)
  )

kable(age_stats, digits = 2, caption = "Age Statistics by Gender")

Age Statistics by Gender
Gender	mean_age	median_age
Men	40.05	39
Women	39.94	39

Explanation:

Men and women have nearly identical mean ages (40.05 vs. 39.94) and the same median age (39), so age is unlikely to explain much of the wage gap. A noticeable age difference could have explained part of the gap, because earnings often increase with work experience.

Part 4: Interpretation

Why use log(wage) instead of wage?

Economists use log(wage) because wages are usually right-skewed, and the log transformation makes the distribution more symmetric. In addition, differences in log wages can be interpreted approximately as percentage differences, which are easier to understand when analyzing wage gaps.

Is the raw wage gap the same as discrimination?

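No, the raw wage gap is not the same as discrimination. It may also reflect differences in worker characteristics such as education, age, and part-time work. In this dataset, men and women differ noticeably in education and part-time work, while their ages are very similar. Therefore, the observed gap includes both possible discrimination and differences in worker characteristics.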
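A common way to separate these components is to compare the gender coefficient in a regression with and without controls. The sketch below is only an illustration on simulated data: the data-generating process (a direct gender effect of -0.10 and a part-time penalty of -0.30 on log wages) is an assumption made here, and only the part-time rates are borrowed from the table in Part 3.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not the quiz dataset)
n <- 5000

# Simulate gender and part-time status; part-time is more common among women
female   <- rbinom(n, 1, 0.4)
parttime <- rbinom(n, 1, ifelse(female == 1, 0.56, 0.225))

# Assumed data-generating process: -0.10 direct gender effect and
# -0.30 part-time penalty on log wages, plus random noise
l_wage <- 4.8 - 0.10 * female - 0.30 * parttime + rnorm(n, sd = 0.4)

# Raw gap: regress log wage on gender only
raw_fit <- lm(l_wage ~ female)

# Adjusted gap: also control for part-time status
adj_fit <- lm(l_wage ~ female + parttime)

raw_gap <- coef(raw_fit)[["female"]]
adj_gap <- coef(adj_fit)[["female"]]
print(c(raw = raw_gap, adjusted = adj_gap))
```

In this simulation the raw coefficient mixes the direct effect with the part-time composition effect, while the adjusted coefficient is closer to the assumed -0.10. Even an adjusted gap is not proof of discrimination, because characteristics that are not observed in the data may still differ between groups.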

Conclusion

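In this analysis, I compared wages using both raw and log values. Men earn about 27.81 dollars more per hour on average, which corresponds to a gap of roughly 25% in log terms. Log wages provide a more balanced distribution. Differences in education and part-time work may explain part of the wage gap, while age is nearly identical across genders, so the gap should not be interpreted only as discrimination.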