Ayberk_KOCAKIR_QUİZ_2

Author

Ayberk KOCAKIR

library(readxl)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
data <- read_excel("ECON465_DataScience/data/Wage_GenderDS.xlsx")
glimpse(data)
Rows: 500
Columns: 6
$ Observation <dbl> 119, 2, 41, 65, 246, 254, 74, 12, 9, 237, 79, 294, 182, 25…
$ Wage        <dbl> 32, 34, 37, 38, 38, 38, 39, 40, 42, 43, 44, 45, 46, 46, 47…
$ Female      <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ Age         <dbl> 31, 42, 31, 33, 21, 28, 31, 28, 25, 25, 44, 25, 31, 42, 38…
$ Educ        <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1…
$ Parttime    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
wage_data <- data |>
  select(Wage, Female, Age, Educ, Parttime) |>
  filter(!is.na(Wage),
         !is.na(Female),
         !is.na(Age),
         !is.na(Educ),
         !is.na(Parttime))

Part 1: Distribution of Wage

1-Histogram of Wage

library(ggplot2)

wage_data |>
  ggplot(aes(x = Wage)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Histogram of Wage",
    x = "Wage",
    y = "Count"
  ) +
  theme_minimal()

The distribution of Wage is right-skewed. This means most people in the dataset earn lower wages, and only a smaller number of people earn higher wages. In the histogram, most bars are concentrated on the left side, while the right side extends further with fewer observations. This creates a long right tail, so the distribution is not symmetric.
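A quick numeric check of right skew is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below uses the rounded group means and medians reported in the summary statistics of this report; the object name `reported` is only illustrative.

```r
# Rounded values copied from the summary table in this report
# (Female = 0 is men, Female = 1 is women).
reported <- data.frame(
  Female = c(0, 1),
  mean   = c(125, 97.3),
  median = c(111, 83.5)
)

# A mean above the median is the usual sign of a right-skewed distribution.
reported$mean_over_median <- reported$mean / reported$median
reported$right_skewed <- reported$mean > reported$median

print(reported)
```

In both groups the mean exceeds the median, which is consistent with the long right tail seen in the histogram.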

2-Boxplot of Wage by Gender

wage_data |>
  ggplot(aes(x = factor(Female), y = Wage)) +
  geom_boxplot(fill = "steelblue") +
  labs(
    title = "Boxplot of Wage by Gender",
    x = "Female (0 = Men, 1 = Women)",
    y = "Wage"
  ) +
  theme_minimal()

The boxplot shows the distribution of wages for men (Female = 0) and women (Female = 1). The median wage for men is higher than for women, so the typical man in the sample earns more than the typical woman. Wages are also more spread out among men: the summary statistics in section 3 report a standard deviation of 57.3 for men versus 46.3 for women. Both groups have high outliers, but there appear to be more extreme high values for men, which suggests that very high wages are more common among men.
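The spread comparison read off the boxplot can be checked numerically with the interquartile range per group. The following is a minimal sketch on a small hypothetical data frame (`toy` and its values are made up for illustration and are not the quiz data).

```r
# Hypothetical toy data, NOT the values from Wage_GenderDS.xlsx.
toy <- data.frame(
  Female = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  Wage   = c(60, 90, 110, 150, 200, 50, 70, 85, 100, 130)
)

# Interquartile range (Q3 - Q1) of Wage within each gender group.
iqr_by_gender <- tapply(toy$Wage, toy$Female, IQR)
print(iqr_by_gender)
```

On the real data, the same call with `wage_data` in place of `toy` would give the box heights shown in the plot.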

3-Summary statistics

wage_summary <- wage_data |>
  group_by(Female) |>
  summarise(
    mean = mean(Wage),
    median = median(Wage),
    sd = sd(Wage),
    min = min(Wage),
    max = max(Wage)
  )

wage_summary
# A tibble: 2 × 6
  Female  mean median    sd   min   max
   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1      0 125.   111    57.3    38   384
2      1  97.3   83.5  46.3    32   364
mean_men <- wage_data |>
  filter(Female == 0) |>
  summarise(mean_wage = mean(Wage)) |>
  pull(mean_wage)

mean_women <- wage_data |>
  filter(Female == 1) |>
  summarise(mean_wage = mean(Wage)) |>
  pull(mean_wage)

raw_gap <- mean_men - mean_women
raw_gap
[1] 27.80682

The summary statistics show that the mean and median wage for men are higher than for women. The standard deviation is also higher for men, indicating that wages are more dispersed among men. In addition, the maximum wage is higher for men, which is consistent with the presence of more extreme high values in the male group.

The raw wage gap, calculated as the mean wage of men minus the mean wage of women, is approximately 27.81 dollars. On average, men in this sample earn about 27.81 dollars more than women.

Part 2: Log Transformation

4-Create log(Wage)

# ggplot2 is already loaded above, so only the new column is needed here
wage_data <- wage_data |>
  mutate(l_wage = log(Wage))

wage_data |>
  ggplot(aes(x = l_wage)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Histogram of log(Wage)",
    x = "log(Wage)",
    y = "Count"
  ) +
  theme_minimal()

Compared to the raw Wage histogram, the distribution of log(Wage) is more symmetric and closer to a normal shape. In the original Wage distribution, there was a strong right skew with a long right tail. After applying the log transformation, this skewness is reduced, and the distribution becomes more balanced. This happens because the log transformation reduces the impact of very high wage values.
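The effect of the log transformation on skewness can be illustrated with simulated data. This is a sketch under stated assumptions: the wages are drawn from a log-normal distribution with arbitrary parameters, and `skewness()` is a small helper defined here, not a base R function.

```r
# Simulated wages, NOT the quiz data: a log-normal sample is right-skewed by construction.
set.seed(465)
sim_wage <- rlnorm(2000, meanlog = 4.5, sdlog = 0.5)

# Moment-based sample skewness (0 for a perfectly symmetric sample).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(sim_wage)
skew_log <- skewness(log(sim_wage))

print(c(raw = skew_raw, log = skew_log))
```

The raw sample has clearly positive skewness, while the logged sample is close to zero. Applying `skewness(wage_data$Wage)` and `skewness(wage_data$l_wage)` would make the same comparison for the real data.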

5-Boxplot of l_wage by Gender

wage_data |>
  ggplot(aes(x = factor(Female), y = l_wage)) +
  geom_boxplot(fill = "steelblue") +
  labs(
    title = "Boxplot of log(Wage) by Gender",
    x = "Female (0 = Men, 1 = Women)",
    y = "log(Wage)"
  ) +
  theme_minimal()

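The boxplot of log(Wage) shows that the wage gap between men and women still exists, since the median for men is higher than for women. However, compared to the raw Wage boxplot, the difference between the groups appears smaller. This is partly a matter of scale: the log transformation compresses very high wage values, so the upper tails and outliers are pulled closer to the rest of the data, and the distance between the medians now reflects a proportional difference rather than a difference in dollars.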

6-Approximate percentage gap

mean_log_men <- wage_data |>
  filter(Female == 0) |>
  summarise(mean = mean(l_wage)) |>
  pull(mean)

mean_log_women <- wage_data |>
  filter(Female == 1) |>
  summarise(mean = mean(l_wage)) |>
  pull(mean)

percentage_gap <- 100 * (mean_log_men - mean_log_women)

percentage_gap
[1] 25.06425

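The approximate percentage wage gap is 25.06%, calculated as 100 times the difference between the mean log wages of men and women. This indicates that, on average, men earn roughly 25% more than women. The log-difference approximation understates larger gaps: the exact ratio implied by this difference is exp(0.2506) ≈ 1.285, or about 28.5%, measured in geometric-mean wages.

The percentage gap and the raw wage gap are not directly comparable, because one is in percent and the other in dollars. Expressed relative to the women's mean wage of 97.3, the raw gap of 27.81 corresponds to about 28.6%. The log-based figure is less sensitive to extreme high wages, since the log transformation compresses the right tail.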
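The relationship between a difference in mean logs and a percentage difference can be made concrete with a short calculation. The sketch below starts from the log difference reported above (0.2506425); the variable names are illustrative.

```r
# Difference in mean log wages reported above (men minus women), in log points.
log_gap <- 0.2506425

# Approximation used in the text: 100 * difference in logs.
approx_pct <- 100 * log_gap

# Exact percentage difference between the geometric-mean wages.
exact_pct <- 100 * (exp(log_gap) - 1)

print(round(c(approx = approx_pct, exact = exact_pct), 2))
```

The approximation is close for small gaps and drifts further from the exact value as the gap grows, which is why the two numbers differ by a few percentage points here.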

Part 3: Exploring Confounders

7-Education levels by gender

table(wage_data$Educ, wage_data$Female)
   
      0   1
  1 108  88
  2  77  57
  3  72  33
  4  59   6
wage_data |>
  count(Female, Educ)
# A tibble: 8 × 3
  Female  Educ     n
   <dbl> <dbl> <int>
1      0     1   108
2      0     2    77
3      0     3    72
4      0     4    59
5      1     1    88
6      1     2    57
7      1     3    33
8      1     4     6

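The most common education level among women is level 1 (88 of 184, about 47.8%).

The most common education level among men is also level 1 (108 of 316, about 34.2%). The most common level hides a difference at the top, however: 59 of 316 men (about 18.7%) are at level 4, compared with only 6 of 184 women (about 3.3%).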
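Because there are more men than women in the sample, raw counts are easier to compare as within-gender shares. The sketch below rebuilds the counts from the table above and converts them to column proportions; the object names are illustrative.

```r
# Counts copied from the Educ-by-Female table above
# (rows: education level 1-4; columns: Female = 0 men, Female = 1 women).
educ_counts <- matrix(
  c(108, 77, 72, 59,
     88, 57, 33,  6),
  nrow = 4,
  dimnames = list(Educ = c("1", "2", "3", "4"), Female = c("0", "1"))
)

# margin = 2 divides each column by its total, giving within-gender shares.
educ_share <- prop.table(educ_counts, margin = 2)
print(round(100 * educ_share, 1))
```

With the full data loaded, `prop.table(table(wage_data$Educ, wage_data$Female), margin = 2)` would produce the same shares directly.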

8-Part‑time work by gender

wage_data |>
  group_by(Female) |>
  summarise(parttime_rate = mean(Parttime == 1))
# A tibble: 2 × 2
  Female parttime_rate
   <dbl>         <dbl>
1      0         0.225
2      1         0.560

The proportion of part-time workers is higher among women. Specifically, about 22.47% of men work part-time, while this proportion is approximately 55.98% for women.

Differences in part-time work can affect the observed wage gap because part-time jobs usually pay less than full-time jobs. Since a higher proportion of women work part-time, their average wages tend to be lower, which can increase the observed wage gap.
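The composition effect described above can be illustrated with a small calculation. This is a sketch under a deliberately artificial assumption: men and women are paid exactly the same within full-time and within part-time work. The wage levels (120 and 80) are hypothetical; only the part-time rates come from this report.

```r
# Part-time rates reported above (rounded).
parttime_rate <- c(men = 0.2247, women = 0.5598)

# HYPOTHETICAL wages, identical for both genders, chosen only for illustration.
wage_fulltime <- 120
wage_parttime <- 80

# Group mean = weighted average of full-time and part-time pay.
mean_wage <- (1 - parttime_rate) * wage_fulltime + parttime_rate * wage_parttime

# Gap that arises from composition alone.
composition_gap <- unname(mean_wage["men"] - mean_wage["women"])
print(mean_wage)
print(composition_gap)
```

Even with equal pay inside each category, the difference in part-time rates alone produces a gap in the overall means, which is why part-time status is a plausible confounder.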

9-Age distribution

wage_data |>
  group_by(Female) |>
  summarise(
    mean_age = mean(Age),
    median_age = median(Age)
  )
# A tibble: 2 × 3
  Female mean_age median_age
   <dbl>    <dbl>      <dbl>
1      0     40.1         39
2      1     39.9         39

The mean and median ages of men and women are very similar: the mean age is approximately 40.05 for men and 39.94 for women, and the median age is 39 for both groups. Because the two groups are so close in typical age, age is unlikely to explain much of the wage gap. Note that this compares only the mean and median, not the full age distributions.

Part 4: Interpretation

10-Why use log(wage)?

Economists often use log(wage) instead of raw wages for two main reasons. First, log transformation reduces skewness and makes the distribution more symmetric by reducing the impact of very high wage values. Second, differences in log wages can be interpreted as approximate percentage differences, which makes it easier to compare wage gaps between groups.

11-Is the raw wage gap the same as discrimination?

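The raw wage gap is not the same as discrimination. It simply shows the difference in average wages between men and women, without accounting for other factors that affect wages. In this dataset, women are much more likely to work part-time (about 56% versus 22%) and far fewer women than men are at education level 4 (6 versus 59), so part of the raw gap may reflect these differences rather than unequal pay for comparable workers. Assessing discrimination would require comparing men and women with similar age, education, and part-time status, for example with a regression that controls for these variables.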
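One common way to separate a raw gap from an adjusted gap is to compare regressions with and without controls. The sketch below is not part of the quiz analysis: it uses simulated data, and every parameter value (sample size, effect sizes, noise) is an assumption chosen for illustration.

```r
# Simulated data, NOT the quiz data. All parameter values are assumptions.
set.seed(465)
n <- 5000
female <- rbinom(n, 1, 0.4)
# Women are more likely to work part-time (rates loosely based on Part 3).
parttime <- rbinom(n, 1, ifelse(female == 1, 0.56, 0.225))
# Assumed true effects: -0.10 for female, -0.30 for part-time, in log points.
l_wage <- 4.8 - 0.10 * female - 0.30 * parttime + rnorm(n, sd = 0.3)
sim <- data.frame(l_wage, female, parttime)

# Raw gap: no controls.
raw_fit <- lm(l_wage ~ female, data = sim)
# Adjusted gap: holds part-time status fixed.
adj_fit <- lm(l_wage ~ female + parttime, data = sim)

raw_coef <- unname(coef(raw_fit)["female"])
adj_coef <- unname(coef(adj_fit)["female"])
print(c(raw = raw_coef, adjusted = adj_coef))
```

In the simulation the raw coefficient is roughly twice as negative as the adjusted one, because it also absorbs the part-time effect. On the real data a comparable model might be `lm(l_wage ~ Female + Age + factor(Educ) + Parttime, data = wage_data)`, although even an adjusted gap would not by itself prove or rule out discrimination.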

AI Use Log

Task                 How AI was used
Data preparation     Helped with basic data selection and cleaning steps
Log transformation   Assisted in creating the log(wage) variable
Visualization        Helped generate histogram and boxplot code
Debugging            Fixed coding errors and path issues
Writing support      Improved clarity and flow of explanations
Interpretation       No direct use; interpretations are my own