quiz2_econ465

Author

ozge yilmaz

library(readxl)
library(dplyr)
library(ggplot2)
data <- read_excel("Wage_GenderDS.xlsx")

Part 1: Distribution of Wage

  1. Histogram of Wage
wage_histogram <- data |>
  ggplot(aes(x = Wage)) +
  geom_histogram(binwidth = 1, fill = "blue") +
  labs(title = "Histogram of Wage", x = "Wage", y = "Count") +
  theme_minimal()
wage_histogram

The distribution is right-skewed: most wages are concentrated at lower levels, with a long tail of high wages.

  2. Boxplot of Wage by Gender
wage_data <- data |>
  mutate(Gender = if_else(Female == 0, "Men", "Women"))

wage_boxplot <- wage_data |>
  ggplot(aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Wage by Gender", x = "Gender", y = "Wage") +
  theme_minimal()

wage_boxplot

Men's median wage is higher than women's: the men's median is above 100, while the women's is below 100. Both groups have outliers at the high end, and the men's outliers extend higher than the women's.

  3. Summary statistics
summary_wage <- data |>
  group_by(Female) |>
  summarize(
    mean_wage   = mean(Wage, na.rm = TRUE),
    median_wage = median(Wage, na.rm = TRUE),
    sd_wage     = sd(Wage, na.rm = TRUE),
    min_wage    = min(Wage, na.rm = TRUE),
    max_wage    = max(Wage, na.rm = TRUE))

summary_wage
# A tibble: 2 × 6
  Female mean_wage median_wage sd_wage min_wage max_wage
   <dbl>     <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
1      0     125.        111      57.3       38      384
2      1      97.3        83.5    46.3       32      364
mean_men <- data |>
  filter(Female == 0) |>
  summarize(mean_wage = mean(Wage, na.rm = TRUE))

mean_women <- data |>
  filter(Female == 1) |>
  summarize(mean_wage = mean(Wage, na.rm = TRUE))

raw_wage_gap <- mean_men$mean_wage - mean_women$mean_wage
raw_wage_gap
[1] 27.80682

Part 2: Log Transformation

  1. Create log(Wage)
data <- data |>
  mutate(l_wage = log(Wage))


lwage_histogram <- data |>
  ggplot(aes(x = l_wage)) +
  geom_histogram(binwidth = 0.5, fill = "pink") +
  labs(
    title = "Histogram of Log(Wage)",
    x = "log(Wage)",
    y = "Count"
  ) +
  theme_minimal()

lwage_histogram

The distribution of log(Wage) is more symmetric and closer to a normal distribution than the distribution of the raw wage.

  2. Boxplot of l_wage by Gender
data_gender <- data |>
  mutate(Gender = if_else(Female == 0, "Men", "Women"))

lwage_boxplot <- data_gender |>
  ggplot(aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Log(Wage) by Gender",
    x = "Gender",
    y = "log(Wage)"
  ) +
  theme_minimal()

lwage_boxplot


The log transformation compresses the very high wages, so the boxplots are easier to read and the two groups are easier to compare.

  3. Approximate percentage gap
mean_lwage_men <- data |>
  filter(Female == 0) |>
  summarize(mean_lwage = mean(l_wage, na.rm = TRUE))

mean_lwage_women <- data |>
  filter(Female == 1) |>
  summarize(mean_lwage = mean(l_wage, na.rm = TRUE))

approx_pct_gap <- 100 * (mean_lwage_men$mean_lwage - mean_lwage_women$mean_lwage)
approx_pct_gap
[1] 25.06425
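
The figure above treats the difference in mean log wages as a percentage, which is only an approximation. As a minimal sketch, the exact percentage implied by a log difference d is 100 * (exp(d) - 1). The value of `log_gap` is typed in from the output above; the names `log_gap`, `pct_above_women`, and `pct_below_men` are illustrative and not part of the original analysis. Because the inputs are means of log wages, the result compares geometric means, not arithmetic means.

```r
# Difference in mean log wages (men minus women), copied from the output above
log_gap <- 25.06425 / 100

# Exact percentage by which men's (geometric mean) wage exceeds women's
pct_above_women <- 100 * (exp(log_gap) - 1)

# The same gap expressed relative to men: how far women's wage falls below men's
pct_below_men <- 100 * (1 - exp(-log_gap))

round(c(pct_above_women = pct_above_women, pct_below_men = pct_below_men), 2)
```

The approximate figure of about 25% lies between the two exact values, which is typical: the log difference splits the difference between the two possible reference groups.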

Part 3: Exploring Confounders

  1. Education levels by gender

    educ_table <- data |>
      group_by(Female, Educ) |>
      summarize(count = n()) |>
      arrange(Female, Educ)
    educ_table
    # A tibble: 8 × 3
    # Groups:   Female [2]
      Female  Educ count
       <dbl> <dbl> <int>
    1      0     1   108
    2      0     2    77
    3      0     3    72
    4      0     4    59
    5      1     1    88
    6      1     2    57
    7      1     3    33
    8      1     4     6
    most_common_educ_women <- data |>
      filter(Female == 1) |>
      group_by(Educ) |>
      summarize(count = n()) |>
      arrange(desc(count))
    
    most_common_educ_women
    # A tibble: 4 × 2
       Educ count
      <dbl> <int>
    1     1    88
    2     2    57
    3     3    33
    4     4     6
    most_common_educ_men <- data |>
      filter(Female == 0) |>
      group_by(Educ) |>
      summarize(count = n()) |>
      arrange(desc(count))
    
    most_common_educ_men
    # A tibble: 4 × 2
       Educ count
      <dbl> <int>
    1     1   108
    2     2    77
    3     3    72
    4     4    59

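    Among women, education level 1 is the most common (88 of 184). Among men, level 1 is also the most common (108 of 316). The groups differ most at the top: 59 men but only 6 women have education level 4.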

  2. Part‑time work by gender

    parttime_table <- data |>
      group_by(Female) |>
      summarize(
        parttime_rate = mean(Parttime == 1, na.rm = TRUE)
      )
    
    parttime_table
    # A tibble: 2 × 2
      Female parttime_rate
       <dbl>         <dbl>
    1      0         0.225
    2      1         0.560

    Women are much more likely than men to work part-time (56.0% versus 22.5%). This may affect the observed wage gap because part-time jobs are often associated with lower pay and fewer hours.

  3. Age distribution

    age_summary <- data |>
      group_by(Female) |>
      summarize(
        mean_age = mean(Age, na.rm = TRUE),
        median_age = median(Age, na.rm = TRUE)
      )
    
    age_summary
    # A tibble: 2 × 3
      Female mean_age median_age
       <dbl>    <dbl>      <dbl>
    1      0     40.1         39
    2      1     39.9         39

    The age distributions are very similar (mean about 40 and median 39 in both groups), so age is unlikely to explain much of the wage gap.
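
Returning to the education counts in item 1: because the sample contains more men than women, raw counts are hard to compare across genders. A minimal sketch of converting them to within-gender shares is given here. The counts are typed in by hand from `educ_table` above, and `educ_counts` and `educ_shares` are illustrative names that do not appear in the original analysis.

```r
library(dplyr)

# Counts copied from educ_table above (Female: 0 = men, 1 = women)
educ_counts <- data.frame(
  Female = rep(c(0, 1), each = 4),
  Educ   = rep(1:4, times = 2),
  count  = c(108, 77, 72, 59, 88, 57, 33, 6)
)

# Share of each education level within each gender
educ_shares <- educ_counts |>
  group_by(Female) |>
  mutate(share = count / sum(count)) |>
  ungroup()

educ_shares
```

With the full dataset loaded, the same shares could be obtained by applying the `group_by(Female) |> mutate(share = count / sum(count))` step directly to `educ_table`.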

Part 4: Interpretation (Short Answer)

  1. Why use log(wage) instead of wage?

Economists often use log(wage) instead of raw wage for two reasons. First, wages are usually right-skewed, and the log transformation makes the distribution more symmetric. Second, differences in log wages can be interpreted approximately as percentage differences, which is very useful when analysing wage gaps.
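
The percentage interpretation follows from a standard approximation, sketched here. The symbols are introduced only for illustration: w_m and w_f stand for a man's and a woman's wage, and g is the proportional gap relative to the woman's wage.

```latex
\[
\ln w_m - \ln w_f
  = \ln\!\left(\frac{w_m}{w_f}\right)
  = \ln(1 + g)
  \approx g
  \quad \text{for small } g,
  \qquad g = \frac{w_m - w_f}{w_f}.
\]
```

The approximation is close for small gaps and understates g as the gap grows, since ln(1 + g) < g for every g > 0.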

  2. Is the raw wage gap the same as discrimination?

No, the raw wage gap is not the same as discrimination. Part of the gap may be explained by other factors such as education and part-time work. In this dataset, women are much more likely to work part-time, so the entire raw wage gap cannot automatically be attributed to discrimination.
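
One common way to separate the raw gap from the part explained by observed factors is to compare a regression of log wage on gender alone with one that also controls for the confounders. This is a minimal sketch on simulated data, not on the quiz dataset: every number in the data-generating step is an assumption chosen for illustration (only the part-time rates echo the table in Part 3), and `sim`, `raw_fit`, and `adj_fit` are hypothetical names.

```r
# Simulated (hypothetical) data: NOT the quiz dataset
set.seed(465)
n <- 5000
sim <- data.frame(
  Female = rbinom(n, 1, 0.4),
  Age    = sample(20:60, n, replace = TRUE),
  Educ   = sample(1:4, n, replace = TRUE)
)

# Assumption: women work part-time more often, mimicking the rates in Part 3
sim$Parttime <- rbinom(n, 1, ifelse(sim$Female == 1, 0.56, 0.225))

# Assumed data-generating process: the true gender coefficient is -0.10
sim$l_wage <- 4.5 - 0.10 * sim$Female - 0.20 * sim$Parttime +
  0.15 * sim$Educ + 0.005 * sim$Age + rnorm(n, sd = 0.3)

# Raw gap: gender only
raw_fit <- lm(l_wage ~ Female, data = sim)

# Adjusted gap: gender plus the observed confounders
adj_fit <- lm(l_wage ~ Female + Parttime + factor(Educ) + Age, data = sim)

coef(raw_fit)["Female"]
coef(adj_fit)["Female"]
```

In this simulation the raw coefficient on `Female` is more negative than the adjusted one, because it also picks up the lower pay attached to part-time work. Even the adjusted gap is not a direct measure of discrimination: it only removes the influence of the factors included in the model.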