quiz2_econ465

Author

ozge yilmaz

library(readxl)
library(dplyr)

Warning: package 'dplyr' was built under R version 4.5.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.5.2

data <- read_excel("Wage_GenderDS.xlsx")

Part 1: Distribution of Wage

Histogram of Wage

wage_histogram <- data |>
  ggplot(aes(x = Wage)) +
  geom_histogram(binwidth = 1, fill = "blue") +
  labs(title = "Histogram of Wage", x = "Wage", y = "Count") +
  theme_minimal()

wage_histogram

The distribution is right-skewed: most observations sit at lower wage levels, with a long tail of high wages.
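A quick numerical check of the direction of skew is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below uses a small hypothetical vector `wages` (not values from the dataset) purely to illustrate the idea; the same two lines could be applied to `data$Wage`.

```r
# Hypothetical wage values for illustration only (not taken from the dataset).
wages <- c(38, 60, 75, 90, 105, 130, 180, 384)

# A mean above the median points to a long right tail.
mean_minus_median <- mean(wages) - median(wages)

# Moment-based sample skewness: positive means right-skewed, negative left-skewed.
centered <- wages - mean(wages)
skewness <- mean(centered^3) / mean(centered^2)^1.5

mean_minus_median
skewness
```

A positive value for both quantities indicates a right tail, a negative value a left tail.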

Boxplot of Wage by Gender

# Add a readable gender label for plotting
wage_data <- data |>
  mutate(Gender = if_else(Female == 0, "Men", "Women"))

wage_boxplot <- wage_data |>
  ggplot(aes(x = Gender, y = Wage, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Wage by Gender", x = "Gender", y = "Wage") +
  theme_minimal()

wage_boxplot

Men's median wage is higher than women's: men's median is above 100, while women's is below 100. Both groups have high outliers, and men's outliers reach higher values than women's.

Summary statistics

summary_wage <- data |>
  group_by(Female) |>
  summarize(
    mean_wage   = mean(Wage, na.rm = TRUE),
    median_wage = median(Wage, na.rm = TRUE),
    sd_wage     = sd(Wage, na.rm = TRUE),
    min_wage    = min(Wage, na.rm = TRUE),
    max_wage    = max(Wage, na.rm = TRUE))

summary_wage

# A tibble: 2 × 6
  Female mean_wage median_wage sd_wage min_wage max_wage
   <dbl>     <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
1      0     125.        111      57.3       38      384
2      1      97.3        83.5    46.3       32      364

mean_men <- data |>
  filter(Female == 0) |>
  summarize(mean_wage = mean(Wage, na.rm = TRUE))

mean_women <- data |>
  filter(Female == 1) |>
  summarize(mean_wage = mean(Wage, na.rm = TRUE))

raw_wage_gap <- mean_men$mean_wage - mean_women$mean_wage
raw_wage_gap

[1] 27.80682
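The gap can also be computed in a single `summarize()` call and expressed relative to men's mean wage. This is only a sketch: `toy` is a hypothetical four-row data frame standing in for `data`, and `gap_pct_of_men` is an illustrative name. With the values reported above, a gap of about 27.8 on a men's mean of about 125 comes to roughly 22% of men's mean wage.

```r
library(dplyr)

# Hypothetical toy data for illustration; the real analysis would use `data`.
toy <- tibble(
  Female = c(0, 0, 1, 1),
  Wage   = c(100, 140, 80, 100)
)

# Compute both group means, the raw gap, and the gap as a share of men's mean.
gap_summary <- toy |>
  summarize(
    mean_men       = mean(Wage[Female == 0], na.rm = TRUE),
    mean_women     = mean(Wage[Female == 1], na.rm = TRUE),
    raw_gap        = mean_men - mean_women,
    gap_pct_of_men = 100 * raw_gap / mean_men
  )

gap_summary
```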

Part 2: Log Transformation

Create log(Wage)

data <- data |>
  mutate(l_wage = log(Wage))


lwage_histogram <- data |>
  ggplot(aes(x = l_wage)) +
  geom_histogram(binwidth = 0.5, fill = "pink") +
  labs(
    title = "Histogram of Log(Wage)",
    x = "log(Wage)",
    y = "Count"
  ) +
  theme_minimal()

lwage_histogram

Compared with the raw wage histogram, log(Wage) looks much more symmetric and closer to a normal distribution.

Boxplot of l_wage by Gender

data_gender <- data |>
  mutate(Gender = if_else(Female == 0, "Men", "Women"))

lwage_boxplot <- data_gender |>
  ggplot(aes(x = Gender, y = l_wage, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Log(Wage) by Gender",
    x = "Gender",
    y = "log(Wage)"
  ) +
  theme_minimal()

lwage_boxplot

On the log scale the extreme high wages are compressed, so the boxes are less squeezed and the two groups are easier to compare.

Approximate percentage gap

mean_lwage_men <- data |>
  filter(Female == 0) |>
  summarize(mean_lwage = mean(l_wage, na.rm = TRUE))

mean_lwage_women <- data |>
  filter(Female == 1) |>
  summarize(mean_lwage = mean(l_wage, na.rm = TRUE))

approx_pct_gap <- 100 * (mean_lwage_men$mean_lwage - mean_lwage_women$mean_lwage)
approx_pct_gap

[1] 25.06425
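The difference in mean logs is only an approximation to a percentage gap, and the approximation gets looser as the gap grows. The sketch below converts the reported log difference into exact percentages with `exp()`. Note that this compares geometric means of wages, not arithmetic means, so it will not match the raw gap computed earlier. The variable names are illustrative.

```r
# Difference in mean log wages reported above (men minus women), in log points.
log_gap <- 25.06425 / 100

# Exact percentage by which men's (geometric) mean wage exceeds women's.
men_above_women_pct <- 100 * (exp(log_gap) - 1)

# Exact percentage by which women's (geometric) mean wage falls short of men's.
women_below_men_pct <- 100 * (1 - exp(-log_gap))

men_above_women_pct
women_below_men_pct
```

The first figure is roughly 28.5%, somewhat larger than the 25.1% approximation; the approximation sits between the two exact figures.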

Part 3: Exploring Confounders

Education levels by gender

educ_table <- data |>
  group_by(Female, Educ) |>
  summarize(count = n()) |>
  arrange(Female, Educ)

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by Female and Educ.
ℹ Output is grouped by Female.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(Female, Educ))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

educ_table

# A tibble: 8 × 3
# Groups:   Female [2]
  Female  Educ count
   <dbl> <dbl> <int>
1      0     1   108
2      0     2    77
3      0     3    72
4      0     4    59
5      1     1    88
6      1     2    57
7      1     3    33
8      1     4     6

most_common_educ_women <- data |>
  filter(Female == 1) |>
  group_by(Educ) |>
  summarize(count = n()) |>
  arrange(desc(count))

most_common_educ_women

# A tibble: 4 × 2
   Educ count
  <dbl> <int>
1     1    88
2     2    57
3     3    33
4     4     6

most_common_educ_men <- data |>
  filter(Female == 0) |>
  group_by(Educ) |>
  summarize(count = n()) |>
  arrange(desc(count))

most_common_educ_men

# A tibble: 4 × 2
   Educ count
  <dbl> <int>
1     1   108
2     2    77
3     3    72
4     4    59

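Among women, education level 1 is the most common (88 of 184). Among men, level 1 is also the most common (108 of 316). The groups differ at the top, however: 59 men but only 6 women are at level 4.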
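Because there are more men than women in the sample, raw counts are hard to compare across groups. A minimal sketch of converting the counts to within-gender shares follows; `educ_counts` simply re-enters the counts printed above, and `share` is an illustrative column name.

```r
library(dplyr)

# Counts copied from the education table above.
educ_counts <- tibble(
  Female = rep(c(0, 1), each = 4),
  Educ   = rep(1:4, times = 2),
  count  = c(108, 77, 72, 59, 88, 57, 33, 6)
)

# Share of each education level within each gender.
educ_shares <- educ_counts |>
  group_by(Female) |>
  mutate(share = count / sum(count)) |>
  ungroup()

educ_shares
```

Level 4 accounts for 59 of 316 men (about 19%) but only 6 of 184 women (about 3%), so education is a plausible confounder of the wage gap.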

Part-time work by gender

parttime_table <- data |>
  group_by(Female) |>
  summarize(
    parttime_rate = mean(Parttime == 1, na.rm = TRUE)
  )

parttime_table

# A tibble: 2 × 2
  Female parttime_rate
   <dbl>         <dbl>
1      0         0.225
2      1         0.560

Women are much more likely than men to work part-time (56.0% versus 22.5%). This may affect the observed wage gap, because part-time jobs are often associated with lower pay and fewer hours.
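One way to see how part-time status can inflate a raw gap is to compare men and women within each part-time group. The sketch below uses a hypothetical data frame `toy` whose numbers are constructed to show the mechanism; they say nothing about the actual dataset. The same `group_by(Parttime)` pattern could be applied to `data`.

```r
library(dplyr)

# Hypothetical toy data: women are concentrated in part-time work.
toy <- tibble(
  Female   = c(0, 0, 0, 0, 1, 1, 1, 1),
  Parttime = c(0, 0, 0, 1, 0, 1, 1, 1),
  Wage     = c(120, 130, 110, 80, 115, 78, 82, 74)
)

# Overall gap, ignoring part-time status.
overall_gap <- mean(toy$Wage[toy$Female == 0]) - mean(toy$Wage[toy$Female == 1])

# Gap within each part-time group.
gap_by_parttime <- toy |>
  group_by(Parttime) |>
  summarize(
    mean_men   = mean(Wage[Female == 0]),
    mean_women = mean(Wage[Female == 1]),
    gap        = mean_men - mean_women
  )

overall_gap
gap_by_parttime
```

In this toy example the gap within each group is much smaller than the overall gap, because most of the overall difference comes from the different mix of full-time and part-time workers.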

Age distribution

age_summary <- data |>
  group_by(Female) |>
  summarize(
    mean_age = mean(Age, na.rm = TRUE),
    median_age = median(Age, na.rm = TRUE)
  )

age_summary

# A tibble: 2 × 3
  Female mean_age median_age
   <dbl>    <dbl>      <dbl>
1      0     40.1         39
2      1     39.9         39

The age distributions are very similar (mean about 40 and median 39 in both groups), so age is unlikely to explain much of the wage gap.

Part 4: Interpretation (Short Answer)

Why use log(wage) instead of wage?

Economists often use log(wage) instead of raw wage for two reasons. First, wages are usually right-skewed, and the log transformation makes the distribution more symmetric. Second, differences in log wages can be interpreted approximately as percentage differences, which is very useful when analysing wage gaps.

Is the raw wage gap the same as discrimination?

No, the raw wage gap is not the same as discrimination. Part of the gap may be explained by other factors such as education and part-time work. In this dataset, women are much more likely to work part-time, so the entire raw wage gap cannot automatically be attributed to discrimination.