The data for the analysis is taken from the story behind The Economic Guide To Picking A College Major (https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).
recent_grads <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-10-16/recent-grads.csv")
Tidying the data for analysis.
majors_processed <- recent_grads %>%
arrange(desc(Median)) %>%
mutate(Major = str_to_title(Major),
Major = fct_reorder(Major, Median))
recent_grads %>%
ggplot(aes(Median))+
geom_histogram()+
xlab("Median Salaries")+
ylab("Total Number of people")
We can see from the histogram that salaries are skewed towards left with a median of $36000. There is also an outlier making much more than typical college majors. Lets breakdown salaries by college majors to see which one makes most money.
by_major_category <- majors_processed %>%
filter(!is.na(Total)) %>%
group_by(Major_category) %>%
summarize(Men = sum(Men),
Women = sum(Women),
Total = sum(Total),
MedianSalary = sum(Median * Sample_size) / sum(Sample_size)) %>%
mutate(ShareWomen = Women / Total) %>%
arrange(desc(ShareWomen))
majors_processed %>%
filter(Sample_size >= 100) %>%
head(20) %>%
ggplot(aes(Major, Median, color = Major_category)) +
geom_point() +
geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
expand_limits(y = 0) +
scale_y_continuous(labels = dollar_format()) +
coord_flip() +
labs(title = "What are the highest-earning majors?",
subtitle = "Top 20 majors with at least 100 graduates. surveyed.\n Bars represent the 25th to 75th percentile.",
x = "",
y = "Median salary of gradates" )
Majority of high paying jobs are STEM jobs with engineering majors paying consistently more than other majors with all top 5 paying majors being engineering. Lets aggregate the data by category to find out which categories are better off on average compared to others.
majors_processed %>%
mutate(Major_category = fct_reorder(Major_category, Median)) %>%
ggplot(aes(Major_category, Median, fill = Major_category)) +
geom_boxplot() +
scale_y_continuous(labels = dollar_format()) +
expand_limits(y = 0) +
coord_flip() +
theme(legend.position = "none")
Inline with our expectation, engineering on average pays much more compared to other categories whereas Psychology and Social work pays lowest.
majors_processed %>%
arrange(desc(Total)) %>%
head(20) %>%
mutate(Major = fct_reorder(Major, Total)) %>%
gather(Gender, Number, Men, Women) %>%
ggplot(aes(Major, Number, fill = Gender)) +
geom_col() +
coord_flip()+
labs(title = "Majors broken down by gender",
subtitle = "Colors represent ratio of genders",
)+
scale_y_continuous(name="Total Students", labels = comma)
The most popular major is psychology which also happens to be the lowest-paying job and is dominated by women.
Is it just psychology or is it general trend that women dominated majors pay less? Lets find out.
by_major_category %>%
mutate(Major_category = fct_lump(Major_category, 6)) %>%
ggplot(aes(ShareWomen, MedianSalary, color = Major_category)) +
geom_point() +
geom_smooth(method = "lm") +
geom_text_repel(aes(label = Major_category), force = .2) +
expand_limits(y = 0)+
xlab("Ratio of women")+
ylab("Median Salary")
It looks like there is a negative correlation between ratio of women in jobs and the typical salary commanded by that major.
library(plotly )
g <- majors_processed %>%
mutate(Major_category = fct_lump(Major_category, 5)) %>%
ggplot(aes(ShareWomen, Median, color = Major_category, label = Major)) +
geom_point() +
geom_smooth(aes(group = 1), method = "lm") +
scale_x_continuous(labels = percent_format()) +
scale_y_continuous(labels = dollar_format()) +
expand_limits(y = 0)+
xlab("Ratio of women")+
ylab("Median Salary")
ggplotly(g)
Playing with his graph interactively uncovers a few interesting insights.
Petroleum Engineering is an outlier both in terms of ratio of women as well as the median salary commanded. However it has a caveat of a small sample size of 36 observations.
There is definitely a negative correlation between ratio of women in a major and typical salary commanded by that profession.
Engineering pays much more than other types of jobs but very few women opt for it.
Education is on the lower end of spectrum in terms of salary and contains a higher proportion of women.
We are regressing the median salary against ratio of women, without taking other factors into account to get an overall picture.
majors_processed %>%
select(Major, Total, ShareWomen, Sample_size, Median) %>%
lm(Median ~ ShareWomen, data = ., weights = Sample_size) %>%
summary()
##
## Call:
## lm(formula = Median ~ ShareWomen, data = ., weights = Sample_size)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -260500 -61042 -13899 33262 865081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52073 1436 36.255 <2e-16 ***
## ShareWomen -23650 2403 -9.842 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123000 on 170 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.363, Adjusted R-squared: 0.3592
## F-statistic: 96.87 on 1 and 170 DF, p-value: < 2.2e-16
The P value is significant, so the effect between ratio of women and median salary is unlikely to occur just by chance alone. It means if we change the ratio of women from 0% to 100% for a major, the expected salary decreases by $23650 which is quite significant. Put another way, the expected salary decreases by ~$230 on average for every 1% increase in ratio of women for a major.
Is this effect due to Simpson’s Paradox or it persists within categories as well?
majors_processed %>%
select(Major, Major_category, Total, ShareWomen, Sample_size, Median) %>%
add_count(Major_category) %>%
filter(n >= 10) %>%
nest(-Major_category) %>%
mutate(model = map(data, ~ lm(Median ~ ShareWomen, data = ., weights = Sample_size)),
tidied = map(model, tidy)) %>%
unnest(tidied) %>%
filter(term == "ShareWomen") %>%
arrange(estimate) %>%
mutate(fdr = p.adjust(p.value, method = "fdr")) %>%
select(!c(data,model)) %>%
print.data.frame()
## Major_category term estimate std.error statistic
## 1 Biology & Life Science ShareWomen -43735.061 20982.134 -2.0843953
## 2 Engineering ShareWomen -33912.267 15417.620 -2.1995786
## 3 Computers & Mathematics ShareWomen -28693.834 18552.119 -1.5466608
## 4 Business ShareWomen -28171.187 9810.325 -2.8715855
## 5 Agriculture & Natural Resources ShareWomen -16262.567 5975.033 -2.7217535
## 6 Physical Sciences ShareWomen -12820.443 13348.584 -0.9604347
## 7 Education ShareWomen -1995.791 3083.572 -0.6472337
## 8 Humanities & Liberal Arts ShareWomen -1813.510 4128.104 -0.4393081
## 9 Health ShareWomen 54721.050 23426.631 2.3358480
## p.value fdr
## 1 0.05916032 0.10648857
## 2 0.03658007 0.09367468
## 3 0.15634844 0.23452266
## 4 0.01519835 0.09367468
## 5 0.02969027 0.09367468
## 6 0.36496151 0.46923622
## 7 0.52795143 0.59394536
## 8 0.66765212 0.66765212
## 9 0.04163319 0.09367468
Since we are comparing multiple P values, there is a chance of one of them being higher than some thershold (typically chosen to be 0.05). There are multiple methods to control for it such as bonferroni correction but it is very conservative so we using False discovery rate to control for it instead.
The effect exists within categories as well, majors with a higher proportion of women has a lower median salary compared to professions with higher male ratio.
Determine whether the effect between salary and proportion of gender in a profession is indicative of gender wage gap or it can be explained by neutral factors such as years of experience in industry, college,rank within organization etc.