This vignette explores gender-based wage differences using a dataset sourced from Kaggle. It demonstrates how to use readr, dplyr, tidyr, and ggplot2 to group, summarize, reshape, and visualize salary trends. Specifically, it analyzes average BasePay and Bonus by Gender and highlights the job titles with the largest wage gaps.
library(tidyverse)  # loads readr, dplyr, tidyr, and ggplot2
url <- "https://raw.githubusercontent.com/tcgraham-data/data-607-tidyverse-project/refs/heads/main/Glassdoor%20Gender%20Pay%20Gap.csv"
df <- read_csv(url)
The dataset was collected from Glassdoor as of 2020 and records income for various job titles by gender. Because many studies have reported that women are paid less than men in the same job titles, this dataset is useful for examining how large the gender-based pay gap is.
Quick data overview:
glimpse(df)
## Rows: 1,000
## Columns: 9
## $ JobTitle <chr> "Graphic Designer", "Software Engineer", "Warehouse Associat…
## $ Gender <chr> "Female", "Male", "Female", "Male", "Male", "Female", "Femal…
## $ Age <dbl> 18, 21, 19, 20, 26, 20, 20, 18, 33, 35, 24, 18, 19, 30, 35, …
## $ PerfEval <dbl> 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ Education <chr> "College", "College", "PhD", "Masters", "Masters", "PhD", "C…
## $ Dept <chr> "Operations", "Management", "Administration", "Sales", "Engi…
## $ Seniority <dbl> 2, 5, 5, 4, 5, 4, 4, 5, 5, 5, 5, 3, 3, 5, 4, 3, 5, 5, 5, 5, …
## $ BasePay <dbl> 42363, 108476, 90208, 108080, 99464, 70890, 67585, 97523, 11…
## $ Bonus <dbl> 9938, 11128, 9268, 10154, 9319, 10126, 10541, 10240, 9836, 9…
summary(df)
## JobTitle Gender Age PerfEval
## Length:1000 Length:1000 Min. :18.00 Min. :1.000
## Class :character Class :character 1st Qu.:29.00 1st Qu.:2.000
## Mode :character Mode :character Median :41.00 Median :3.000
## Mean :41.39 Mean :3.037
## 3rd Qu.:54.25 3rd Qu.:4.000
## Max. :65.00 Max. :5.000
## Education Dept Seniority BasePay
## Length:1000 Length:1000 Min. :1.000 Min. : 34208
## Class :character Class :character 1st Qu.:2.000 1st Qu.: 76850
## Mode :character Mode :character Median :3.000 Median : 93328
## Mean :2.971 Mean : 94473
## 3rd Qu.:4.000 3rd Qu.:111558
## Max. :5.000 Max. :179726
## Bonus
## Min. : 1703
## 1st Qu.: 4850
## Median : 6507
## Mean : 6467
## 3rd Qu.: 8026
## Max. :11293
Average Base Pay and Bonus by Gender:
df %>%
group_by(Gender) %>%
summarise(
avg_base = mean(BasePay, na.rm = TRUE),
avg_bonus = mean(Bonus, na.rm = TRUE),
count = n()
)
## # A tibble: 2 × 4
## Gender avg_base avg_bonus count
## <chr> <dbl> <dbl> <int>
## 1 Female 89943. 6474. 468
## 2 Male 98458. 6461. 532
This tells us that, on average, men's base pay is about $8,500 higher than women's ($98,458 versus $89,943), while average bonuses are nearly identical.
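The size of the overall gap can be computed directly from the summary table. The following is a minimal sketch that uses the rounded group means printed above rather than the full dataset; the object names `by_gender` and `gap_overall` and the columns `abs_gap`, `pct_gap`, and `bonus_gap` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(tibble)

# Rounded group means copied from the summary output above
by_gender <- tribble(
  ~Gender,  ~avg_base, ~avg_bonus,
  "Female", 89943,     6474,
  "Male",   98458,     6461
)

female <- filter(by_gender, Gender == "Female")
male   <- filter(by_gender, Gender == "Male")

gap_overall <- tibble(
  abs_gap   = male$avg_base - female$avg_base,   # gap in dollars
  pct_gap   = abs_gap / female$avg_base * 100,   # gap relative to women's average
  bonus_gap = male$avg_bonus - female$avg_bonus  # negative means women's bonus is higher
)
gap_overall
```

With these rounded inputs, the base pay gap is $8,515, or roughly 9.5% of women's average base pay, and the bonus gap is $13 in women's favor. Choosing women's average as the denominator is an assumption made here to match the percentage-gap calculation used later in the vignette.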
Wage Gap by Job Title:
wage_gap <- df %>%
group_by(JobTitle, Gender) %>%
summarise(
avg_base = mean(BasePay, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(names_from = Gender, values_from = avg_base) %>%
mutate(gap = Male - Female) %>%
arrange(desc(gap))
head(wage_gap, 10)
## # A tibble: 10 × 4
## JobTitle Female Male gap
## <chr> <dbl> <dbl> <dbl>
## 1 Software Engineer 94701 106371. 11670.
## 2 Marketing Associate 76119. 81882. 5763.
## 3 Driver 86868. 91953. 5085.
## 4 Sales Associate 91894. 94663. 2769.
## 5 IT 90476. 91022. 546.
## 6 Financial Analyst 95458. 94607. -851.
## 7 Manager 127252. 124849. -2403.
## 8 Graphic Designer 92243. 89596. -2647.
## 9 Warehouse Associate 92428. 86553. -5875.
## 10 Data Scientist 95705. 89223. -6482.
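Because the dataset contains only ten job titles, this table shows all of them, ordered from the titles where men earn substantially more to those where women earn substantially more. The direction of the gap is evenly split (five titles favor men and five favor women), but the largest gap favoring men (about $11,670 for Software Engineer) is nearly twice the largest gap favoring women (about $6,482 for Data Scientist).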
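The reshaping step in the pipeline above is easier to follow on a toy example. The sketch below uses made-up values (job titles "A" and "B" are hypothetical) to show how `pivot_wider()` turns one row per job title and gender into one row per job title, which is what makes the `Male - Female` subtraction possible.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long-format data: one row per (JobTitle, Gender) pair; values are made up
long <- tribble(
  ~JobTitle, ~Gender,  ~avg_base,
  "A",       "Female", 100,
  "A",       "Male",   110,
  "B",       "Female", 90,
  "B",       "Male",   85
)

# Spread Gender into columns, then compute and sort by the gap
wide <- long %>%
  pivot_wider(names_from = Gender, values_from = avg_base) %>%
  mutate(gap = Male - Female) %>%
  arrange(desc(gap))
wide
```

If a job title had observations for only one gender, `pivot_wider()` would fill the missing column with `NA`, so `gap` would also be `NA` for that row. That is why the later steps filter on `!is.na(gap)`.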
Visualization of Largest Wage Gaps:
wage_gap %>%
filter(!is.na(gap)) %>%
slice_max(abs(gap), n = 10) %>%
ggplot(aes(x = reorder(JobTitle, gap), y = gap)) +
geom_col(fill = "tomato") +
coord_flip() +
labs(
title = "Top 10 Job Titles by Gender Wage Gap",
x = "Job Title",
y = "Base Pay Gap (Male - Female)"
)
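A possible variation on the chart above colors each bar by the direction of the gap, which makes it easier to see at a glance which titles favor men and which favor women. This is only a sketch: it rebuilds the gaps from the rounded values in the job-title table, and the names `gaps`, `plot_data`, and `direction` are illustrative additions.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Base pay gaps (Male - Female), copied (rounded) from the job-title table above
gaps <- tribble(
  ~JobTitle,             ~gap,
  "Software Engineer",   11670,
  "Marketing Associate",  5763,
  "Driver",               5085,
  "Sales Associate",      2769,
  "IT",                    546,
  "Financial Analyst",    -851,
  "Manager",             -2403,
  "Graphic Designer",    -2647,
  "Warehouse Associate", -5875,
  "Data Scientist",      -6482
)

# Label each title by which gender has the higher average base pay
plot_data <- gaps %>%
  mutate(direction = if_else(gap > 0, "Men earn more", "Women earn more"))

p <- ggplot(plot_data, aes(x = reorder(JobTitle, gap), y = gap, fill = direction)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Gender Wage Gap by Job Title",
    x = "Job Title",
    y = "Base Pay Gap (Male - Female)",
    fill = NULL
  )
p
```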
This cluster of data is interesting to look at, but it fails to tell the
entire story. Let us look at the percentage of job titles where men earn
more than women:
gap_summary <- wage_gap %>%
filter(!is.na(Male), !is.na(Female)) %>%
mutate(gap = Male - Female)
total_jobs <- nrow(gap_summary)
jobs_favor_men <- gap_summary %>%
filter(gap > 0) %>%
nrow()
percent_favor_men <- round((jobs_favor_men / total_jobs) * 100, 1)
cat("Percentage of job titles where men earn more:", percent_favor_men, "%\n")
## Percentage of job titles where men earn more: 50 %
Next, consider the percentage of job titles where men earn at least 5% more than women:
gap_summary_pct <- gap_summary %>%
mutate(pct_gap = (Male - Female) / Female)
jobs_favor_men_5pct <- gap_summary_pct %>%
filter(pct_gap >= 0.05) %>%
nrow()
percent_favor_men_5pct <- round((jobs_favor_men_5pct / total_jobs) * 100, 1)
cat("Percentage of job titles where men earn at least 5% more than women:", percent_favor_men_5pct, "%\n")
## Percentage of job titles where men earn at least 5% more than women: 30 %
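The same 5% threshold can be applied in both directions to see how the job titles split. The sketch below is an illustrative extension, not part of the original analysis: it rebuilds the per-title averages from the rounded values in the job-title table, and the names `gap_tbl`, `bucket`, and `direction_summary` are assumptions introduced here.

```r
library(dplyr)
library(tibble)

# Average base pay by job title, copied (rounded) from the table above
gap_tbl <- tribble(
  ~JobTitle,             ~Female, ~Male,
  "Software Engineer",    94701,  106371,
  "Marketing Associate",  76119,   81882,
  "Driver",               86868,   91953,
  "Sales Associate",      91894,   94663,
  "IT",                   90476,   91022,
  "Financial Analyst",    95458,   94607,
  "Manager",             127252,  124849,
  "Graphic Designer",     92243,   89596,
  "Warehouse Associate",  92428,   86553,
  "Data Scientist",       95705,   89223
)

# Classify each title by the relative gap, using women's pay as the denominator
direction_summary <- gap_tbl %>%
  mutate(
    pct_gap = (Male - Female) / Female,
    bucket = case_when(
      pct_gap >=  0.05 ~ "Men earn 5%+ more",
      pct_gap <= -0.05 ~ "Women earn 5%+ more",
      TRUE             ~ "Within 5%"
    )
  ) %>%
  count(bucket)
direction_summary
```

With these rounded inputs, three titles fall in the "Men earn 5%+ more" bucket, two in "Women earn 5%+ more", and five are within 5% either way. Note that using `Female` as the denominator in both directions is a simplification; a gap of a given dollar amount is a slightly different percentage depending on which group's pay it is measured against.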
This analysis shows that gender-based wage disparities appear across a range of job titles. Although half of the job titles show women earning more on average, overall average base pay is higher for men. Specifically, 50% of job titles in this dataset show higher average base pay for men, and 30% of job titles show men earning at least 5% more than women in the same role.
Although visualizations of extreme gaps offer helpful context, these summary statistics indicate that pay differences are not isolated to a handful of high-paying positions. The findings support ongoing concerns about equitable compensation and highlight the need for continued transparency and organizational accountability in addressing wage inequality.