Quiz2 - Ali Yigit Ozdemir

# Load packages

library(tidyverse)

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(ggplot2)
library(knitr)

wage_data <- read_excel("Wage_GenderDS.xlsx")

wage_data <- wage_data |>
  mutate(
    gender = ifelse(Female == 1, "Women", "Men"),
    l_wage = log(Wage)
  )

glimpse(wage_data)

Rows: 500
Columns: 8
$ Observation <dbl> 119, 2, 41, 65, 246, 254, 74, 12, 9, 237, 79, 294, 182, 25…
$ Wage        <dbl> 32, 34, 37, 38, 38, 38, 39, 40, 42, 43, 44, 45, 46, 46, 47…
$ Female      <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ Age         <dbl> 31, 42, 31, 33, 21, 28, 31, 28, 25, 25, 44, 25, 31, 42, 38…
$ Educ        <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1…
$ Parttime    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ gender      <chr> "Women", "Women", "Women", "Women", "Women", "Men", "Women…
$ l_wage      <dbl> 3.465736, 3.526361, 3.610918, 3.637586, 3.637586, 3.637586…

summary(wage_data)

  Observation         Wage           Female           Age       
 Min.   :  1.0   Min.   : 32.0   Min.   :0.000   Min.   :20.00  
 1st Qu.:125.8   1st Qu.: 72.0   1st Qu.:0.000   1st Qu.:32.00  
 Median :250.5   Median :100.0   Median :0.000   Median :39.00  
 Mean   :250.5   Mean   :114.9   Mean   :0.368   Mean   :40.01  
 3rd Qu.:375.2   3rd Qu.:144.0   3rd Qu.:1.000   3rd Qu.:47.00  
 Max.   :500.0   Max.   :384.0   Max.   :1.000   Max.   :70.00  
      Educ          Parttime        gender              l_wage     
 Min.   :1.000   Min.   :0.000   Length:500         Min.   :3.466  
 1st Qu.:1.000   1st Qu.:0.000   Class :character   1st Qu.:4.277  
 Median :2.000   Median :0.000   Mode  :character   Median :4.605  
 Mean   :2.078   Mean   :0.348                      Mean   :4.641  
 3rd Qu.:3.000   3rd Qu.:1.000                      3rd Qu.:4.970  
 Max.   :4.000   Max.   :1.000                      Max.   :5.951

Part 1: Distribution of Wage

1. Histogram of Wage

ggplot(wage_data, aes(x = Wage)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Histogram of Raw Hourly Wage",
    x = "Hourly Wage",
    y = "Count"
  ) +
  theme_minimal()

This histogram shows the distribution of raw hourly wages.

The summary statistics point to a right-skewed distribution: the mean (114.9) is above the median (100.0), and the maximum (384) lies far above the third quartile (144). A small number of individuals earn much higher wages than the rest.
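One way to check the skew numerically is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below does this on simulated log-normal data rather than on `Wage_GenderDS.xlsx`; the helper `skewness_coef()` and the simulation parameters are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (positive = right-skewed).
skewness_coef <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated wages (assumption): log-normal draws, not the quiz data.
set.seed(123)
sim_wage <- rlnorm(500, meanlog = 4.6, sdlog = 0.5)

skew_raw <- skewness_coef(sim_wage)
mean_above_median <- mean(sim_wage) > median(sim_wage)

cat("Skewness:", round(skew_raw, 2), "\n")
cat("Mean above median:", mean_above_median, "\n")
```

With the real data loaded, `skewness_coef(wage_data$Wage)` would give the corresponding figure for the quiz sample.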

2. Boxplot of Wage by Gender

ggplot(wage_data, aes(x = gender, y = Wage)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Boxplot of Wage by Gender",
    x = "Gender",
    y = "Hourly Wage"
  ) +
  theme_minimal()

These side-by-side boxplots compare wages for men and women in terms of the median, the interquartile range, and the outliers.

Men have the higher median wage (111.0 versus 83.5 for women) and the wider spread (standard deviation 57.34 versus 46.31), as the summary statistics in the next step confirm.

3. Summary statistics of Wage by Gender

wage_summary <- wage_data |>
  group_by(gender) |>
  summarise(
    mean_wage = mean(Wage, na.rm = TRUE),
    median_wage = median(Wage, na.rm = TRUE),
    sd_wage = sd(Wage, na.rm = TRUE),
    min_wage = min(Wage, na.rm = TRUE),
    max_wage = max(Wage, na.rm = TRUE)
  )

kable(wage_summary, digits = 2, caption = "Summary Statistics of Wage by Gender")

Summary Statistics of Wage by Gender
gender	mean_wage	median_wage	sd_wage	min_wage	max_wage
Men	125.13	111.0	57.34	38	384
Women	97.33	83.5	46.31	32	364

4. Raw wage gap in dollars

mean_men <- wage_data |>
  filter(Female == 0) |>
  summarise(mean_wage = mean(Wage, na.rm = TRUE)) |>
  pull(mean_wage)

mean_women <- wage_data |>
  filter(Female == 1) |>
  summarise(mean_wage = mean(Wage, na.rm = TRUE)) |>
  pull(mean_wage)

raw_gap <- mean_men - mean_women

cat("Mean wage (Men):", round(mean_men, 2), "\n")

Mean wage (Men): 125.13

cat("Mean wage (Women):", round(mean_women, 2), "\n")

Mean wage (Women): 97.33

cat("Raw wage gap (Men - Women):", round(raw_gap, 2), "\n")

Raw wage gap (Men - Women): 27.81

The raw wage gap is calculated as the difference between the average wage of men and the average wage of women.

A positive value means men earn more on average in dollar terms.

Part 2: Log Transformation

1. Create log(Wage)

summary(wage_data$l_wage)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.466   4.277   4.605   4.641   4.970   5.951

2. Histogram of log(Wage)

ggplot(wage_data, aes(x = l_wage)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Histogram of Log Wage",
    x = "log(Wage)",
    y = "Count"
  ) +
  theme_minimal()

Compared with the histogram of raw Wage, the distribution of log(Wage) is more symmetric and less right-skewed: the mean (4.641) and median (4.605) of log(Wage) are close together, whereas the raw mean (114.9) sits well above the raw median (100.0).

This happens because taking logs compresses very large wage values more than small ones.
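A small worked example of this compression, using made-up wage values: each doubling of the wage adds the same amount, log(2) ≈ 0.693, on the log scale, so gaps at the top of the distribution shrink relative to gaps at the bottom.

```r
# Illustrative wages (assumption): each value is double the previous one.
wages <- c(50, 100, 200, 400)

raw_steps <- diff(wages)       # 50, 100, 200: steps grow on the raw scale
log_steps <- diff(log(wages))  # every step equals log(2) on the log scale

print(raw_steps)
print(round(log_steps, 3))
```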

3. Boxplot of log(Wage) by Gender

ggplot(wage_data, aes(x = gender, y = l_wage)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Boxplot of Log Wage by Gender",
    x = "Gender",
    y = "log(Wage)"
  ) +
  theme_minimal()

This boxplot compares log wages for men and women.

The visible gap may still remain, but the log transformation often makes the comparison easier because it reduces the influence of extreme wage values.

Economists often prefer log(wage) for two reasons: it makes the wage distribution less skewed, and differences in logs can be interpreted approximately as percentage differences.

4. Mean of log(Wage) by Gender

log_summary <- wage_data |>
  group_by(gender) |>
  summarise(
    mean_l_wage = mean(l_wage, na.rm = TRUE)
  )

kable(log_summary, digits = 4, caption = "Mean Log Wage by Gender")

Mean Log Wage by Gender
gender	mean_l_wage
Men	4.7336
Women	4.4830

5. Approximate percentage gap

mean_log_men <- wage_data |>
  filter(Female == 0) |>
  summarise(mean_l_wage = mean(l_wage, na.rm = TRUE)) |>
  pull(mean_l_wage)

mean_log_women <- wage_data |>
  filter(Female == 1) |>
  summarise(mean_l_wage = mean(l_wage, na.rm = TRUE)) |>
  pull(mean_l_wage)

approx_pct_gap <- 100 * (mean_log_men - mean_log_women)

cat("Mean log wage (Men):", round(mean_log_men, 4), "\n")

Mean log wage (Men): 4.7336

cat("Mean log wage (Women):", round(mean_log_women, 4), "\n")

Mean log wage (Women): 4.483

cat("Approximate percentage gap:", round(approx_pct_gap, 2), "%\n")

Approximate percentage gap: 25.06 %

The difference in mean log wages, multiplied by 100, can be read as an approximate percentage gap.

The positive value of 25.06 means that men's wages are roughly 25% higher than women's on average.
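The log-difference approximation is most accurate for small gaps. As a hedged check, the sketch below converts the log gap into an exact percentage with `exp(d) - 1`, using the rounded group means reported in the "Mean Log Wage by Gender" table. Strictly speaking, this is the gap between geometric mean wages, not arithmetic means, and the object names are illustrative.

```r
# Rounded mean log wages copied from the "Mean Log Wage by Gender" table.
tab_log_men   <- 4.7336
tab_log_women <- 4.4830

log_gap    <- tab_log_men - tab_log_women  # difference in mean logs
approx_pct <- 100 * log_gap                # log-point approximation
exact_pct  <- 100 * (exp(log_gap) - 1)     # exact ratio of geometric means

cat("Approximate gap:", round(approx_pct, 2), "%\n")
cat("Exact gap:", round(exact_pct, 2), "%\n")
```

For a gap of this size the exact figure comes out at about 28.5%, a few percentage points above the log-point approximation.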

Part 3: Exploring Confounders

1. Education levels by Gender

educ_table <- table(wage_data$gender, wage_data$Educ)
educ_table

       
          1   2   3   4
  Men   108  77  72  59
  Women  88  57  33   6

kable(as.data.frame.matrix(educ_table), caption = "Education Levels by Gender")

Education Levels by Gender
	1	2	3	4
Men	108	77	72	59
Women	88	57	33	6

This table shows how education levels are distributed for men and women.

Education level 1 is the most frequent level for both men (108 of 316) and women (88 of 184). Women are far less represented at level 4 (6 observations, against 59 for men).
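Because the two groups differ in size, row shares are easier to compare than raw counts. A minimal sketch, rebuilding the counts from the "Education Levels by Gender" table as a matrix (the object names are illustrative) and applying base R's `prop.table()`:

```r
# Counts copied from the "Education Levels by Gender" table.
educ_counts <- matrix(
  c(108, 77, 72, 59,
     88, 57, 33,  6),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Men", "Women"), Educ = c("1", "2", "3", "4"))
)

# Share of each education level within each gender (rows sum to 1).
educ_share <- prop.table(educ_counts, margin = 1)
print(round(educ_share, 3))
```

With the data loaded, `prop.table(educ_table, margin = 1)` should give the same shares directly.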

2. Part-time work by Gender

parttime_summary <- wage_data |>
  group_by(gender) |>
  summarise(
    proportion_parttime = mean(Parttime == 1, na.rm = TRUE)
  )

kable(parttime_summary, digits = 3, caption = "Proportion of Part-Time Workers by Gender")

Proportion of Part-Time Workers by Gender
gender	proportion_parttime
Men	0.225
Women	0.560

This table shows the proportion of part-time workers among men and women.

Women work part-time far more often than men (56.0% versus 22.5%). This may help explain part of the observed wage gap because part-time jobs may pay less on average.

3. Age distribution by Gender

age_summary <- wage_data |>
  group_by(gender) |>
  summarise(
    mean_age = mean(Age, na.rm = TRUE),
    median_age = median(Age, na.rm = TRUE)
  )

kable(age_summary, digits = 2, caption = "Mean and Median Age by Gender")

Mean and Median Age by Gender
gender	mean_age	median_age
Men	40.05	39
Women	39.94	39

This table compares the mean and median age of men and women.

The two groups are almost the same age (mean 40.05 versus 39.94, median 39 for both), so age is unlikely to explain much of the wage gap, even though age is often related to work experience.

Part 4: Interpretation

1. Why use log(wage) instead of wage?

Answer:

Economists often use log(wage) instead of raw wage for two main reasons.

First, wage distributions are usually right-skewed, so the log transformation makes the data more symmetric and easier to analyze.

Second, differences in log wages are approximately percentage differences, which makes the gender wage gap easier to interpret.

2. Is the raw wage gap the same as discrimination?

Answer:

No, the raw wage gap is not necessarily the same as discrimination.

The raw gap only shows the average difference in wages between men and women.

Other factors such as education, age, and part-time work may also affect wages.

Therefore, the entire raw wage gap cannot automatically be interpreted as discrimination.
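A constructed example of how composition alone can produce a raw gap: the sketch below uses the part-time shares from the "Proportion of Part-Time Workers by Gender" table, but the two wage levels are hypothetical values chosen for illustration and are set to be identical for men and women within each job type.

```r
# Part-time shares taken from the "Proportion of Part-Time Workers by Gender" table.
share_pt <- c(Men = 0.225, Women = 0.560)

# Hypothetical wages (assumption): identical for both genders within each job type.
wage_full <- 130
wage_part <- 80

# Group mean = weighted average of full-time and part-time wages.
group_mean <- (1 - share_pt) * wage_full + share_pt * wage_part
composition_gap <- unname(group_mean["Men"] - group_mean["Women"])

print(group_mean)
cat("Gap from composition alone:", composition_gap, "\n")
```

In this constructed case the raw gap is 16.75 even though men and women are paid the same within each job type, which is why the raw gap by itself cannot be read as discrimination.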

Final Results

cat("Raw wage gap (Men - Women):", round(raw_gap, 2), "\n")

Raw wage gap (Men - Women): 27.81

cat("Approximate percentage log wage gap:", round(approx_pct_gap, 2), "%\n")

Approximate percentage log wage gap: 25.06 %

Conclusion:

The raw wage distribution and the log wage distribution help us understand the gender wage gap from two different perspectives.

The raw wage gap shows the difference in average wages in dollar terms, while the log wage gap gives an approximate percentage difference.

Differences in education and part-time work may explain part of the observed gap, while mean and median age are almost identical across the two groups. The raw difference should therefore not be interpreted as pure discrimination without further analysis.

Task	How AI Helped
Fixing R/Quarto code structure	Helped correct small coding mistakes and ensured the code runs properly
Writing short explanations	Suggested simple and short explanations in a student-friendly style
Improving formatting	Helped organize the document in a clearer and more readable way