The data for the analysis is taken from the story behind The Economic Guide To Picking A College Major (https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).

recent_grads <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-10-16/recent-grads.csv")

Tidying the data for analysis.

majors_processed <- recent_grads %>%
  arrange(desc(Median)) %>%
  mutate(Major = str_to_title(Major),
         Major = fct_reorder(Major, Median))

Distribution of Salaries

recent_grads %>%
  ggplot(aes(Median))+
  geom_histogram()+
  xlab("Median Salaries")+
  ylab("Total Number of people")

We can see from the histogram that salaries are skewed towards left with a median of $36000. There is also an outlier making much more than typical college majors. Lets breakdown salaries by college majors to see which one makes most money.

by_major_category <- majors_processed %>%
  filter(!is.na(Total)) %>%
  group_by(Major_category) %>%
  summarize(Men = sum(Men),
            Women = sum(Women),
            Total = sum(Total),
            MedianSalary = sum(Median * Sample_size) / sum(Sample_size)) %>%
  mutate(ShareWomen = Women / Total) %>%
  arrange(desc(ShareWomen))

What are Highest Earning majors?

majors_processed %>%
  filter(Sample_size >= 100) %>%
  head(20) %>%
  ggplot(aes(Major, Median, color = Major_category)) +
  geom_point() +
  geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
  expand_limits(y = 0) +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip() +
  labs(title = "What are the highest-earning majors?",
       subtitle = "Top 20 majors with at least 100 graduates. surveyed.\n Bars represent the 25th to 75th percentile.",
       x = "",
       y = "Median salary of gradates" )

Majority of high paying jobs are STEM jobs with engineering majors paying consistently more than other majors with all top 5 paying majors being engineering. Lets aggregate the data by category to find out which categories are better off on average compared to others.

What categories of majors make more money than others?

  majors_processed %>%
  mutate(Major_category = fct_reorder(Major_category, Median)) %>%
  ggplot(aes(Major_category, Median, fill = Major_category)) +
  geom_boxplot() +
  scale_y_continuous(labels = dollar_format()) +
  expand_limits(y = 0) +
  coord_flip() +
  theme(legend.position = "none")

Inline with our expectation, engineering on average pays much more compared to other categories whereas Psychology and Social work pays lowest.

How do typical earnings relate to gender?

majors_processed %>%
  arrange(desc(Total)) %>%
  head(20) %>%
  mutate(Major = fct_reorder(Major, Total)) %>%
  gather(Gender, Number, Men, Women) %>%
  ggplot(aes(Major, Number, fill = Gender)) +
  geom_col() +
  coord_flip()+
   labs(title = "Majors broken down by gender",
       subtitle = "Colors represent ratio of genders",
      
   )+
   scale_y_continuous(name="Total Students", labels = comma)

The most popular major is psychology which also happens to be the lowest-paying job and is dominated by women.

Is it just psychology or is it general trend that women dominated majors pay less? Lets find out.

Relationship between median salary and ratio of women in a profession

by_major_category %>%
  mutate(Major_category = fct_lump(Major_category, 6)) %>%
  ggplot(aes(ShareWomen, MedianSalary, color = Major_category)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_text_repel(aes(label = Major_category), force = .2) +
  expand_limits(y = 0)+
  xlab("Ratio of women")+
  ylab("Median Salary")

It looks like there is a negative correlation between ratio of women in jobs and the typical salary commanded by that major.

Interactive scatter plot between median salary and ratio of women in a profession

library(plotly )
g <- majors_processed %>%
  mutate(Major_category = fct_lump(Major_category, 5)) %>%
  ggplot(aes(ShareWomen, Median, color = Major_category, label = Major)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = dollar_format()) +
  expand_limits(y = 0)+
  xlab("Ratio of women")+
  ylab("Median Salary")

ggplotly(g)

Playing with his graph interactively uncovers a few interesting insights.

  1. Petroleum Engineering is an outlier both in terms of ratio of women as well as the median salary commanded. However it has a caveat of a small sample size of 36 observations.

  2. There is definitely a negative correlation between ratio of women in a major and typical salary commanded by that profession.

  3. Engineering pays much more than other types of jobs but very few women opt for it.

  4. Education is on the lower end of spectrum in terms of salary and contains a higher proportion of women.

Quantifying the effect between ratio of women and median salary.

We are regressing the median salary against ratio of women, without taking other factors into account to get an overall picture.

majors_processed %>%
  select(Major, Total, ShareWomen, Sample_size, Median) %>%
  lm(Median ~ ShareWomen, data = ., weights = Sample_size) %>%
  summary()
## 
## Call:
## lm(formula = Median ~ ShareWomen, data = ., weights = Sample_size)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -260500  -61042  -13899   33262  865081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    52073       1436  36.255   <2e-16 ***
## ShareWomen    -23650       2403  -9.842   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123000 on 170 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.363,  Adjusted R-squared:  0.3592 
## F-statistic: 96.87 on 1 and 170 DF,  p-value: < 2.2e-16

The P value is significant, so the effect between ratio of women and median salary is unlikely to occur just by chance alone. It means if we change the ratio of women from 0% to 100% for a major, the expected salary decreases by $23650 which is quite significant. Put another way, the expected salary decreases by ~$230 on average for every 1% increase in ratio of women for a major.

Is this effect due to Simpson’s Paradox or it persists within categories as well?

Stratification by Major

majors_processed %>%
  select(Major, Major_category, Total, ShareWomen, Sample_size, Median) %>%
  add_count(Major_category) %>%
  filter(n >= 10) %>%
  nest(-Major_category) %>%
  mutate(model = map(data, ~ lm(Median ~ ShareWomen, data = ., weights = Sample_size)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  filter(term == "ShareWomen") %>%
  arrange(estimate) %>%
  mutate(fdr = p.adjust(p.value, method = "fdr")) %>%
select(!c(data,model)) %>%
  print.data.frame()
##                    Major_category       term   estimate std.error  statistic
## 1          Biology & Life Science ShareWomen -43735.061 20982.134 -2.0843953
## 2                     Engineering ShareWomen -33912.267 15417.620 -2.1995786
## 3         Computers & Mathematics ShareWomen -28693.834 18552.119 -1.5466608
## 4                        Business ShareWomen -28171.187  9810.325 -2.8715855
## 5 Agriculture & Natural Resources ShareWomen -16262.567  5975.033 -2.7217535
## 6               Physical Sciences ShareWomen -12820.443 13348.584 -0.9604347
## 7                       Education ShareWomen  -1995.791  3083.572 -0.6472337
## 8       Humanities & Liberal Arts ShareWomen  -1813.510  4128.104 -0.4393081
## 9                          Health ShareWomen  54721.050 23426.631  2.3358480
##      p.value        fdr
## 1 0.05916032 0.10648857
## 2 0.03658007 0.09367468
## 3 0.15634844 0.23452266
## 4 0.01519835 0.09367468
## 5 0.02969027 0.09367468
## 6 0.36496151 0.46923622
## 7 0.52795143 0.59394536
## 8 0.66765212 0.66765212
## 9 0.04163319 0.09367468

Since we are comparing multiple P values, there is a chance of one of them being higher than some thershold (typically chosen to be 0.05). There are multiple methods to control for it such as bonferroni correction but it is very conservative so we using False discovery rate to control for it instead.

The effect exists within categories as well, majors with a higher proportion of women has a lower median salary compared to professions with higher male ratio.

Future work

Determine whether the effect between salary and proportion of gender in a profession is indicative of gender wage gap or it can be explained by neutral factors such as years of experience in industry, college,rank within organization etc.