Title:Economic Guide to Picking a College Major - A tidy analysis source of data :fivethirtyeight – Data Summary ———————— Values
Name recent_grads Number of rows 173
Number of columns 21
_______________________
Column type frequency:
character 2
numeric 19
________________________
Group variables None

Objective of the project:

This is the code behind an analysis of “College Major picking and Income.”

Why is this information important?

Gives future college students an idea of what jobs are making the most money

How does gender play a role in the earnings gap and picking major in collage?

Sheds light on the gender gap in earnings between jobs

What type of data are we looking at?

Number of individuals graduating with a specific major

Number of women and men graduating within each specific major

Median income for all the jobs as well as the share of women for each job

What are trending Major Category vs total number of graduates?

Trends of common majors and common major category

Key findings:

Men dominated share of workers for Computer science and most business degrees

Women dominated share of workers for elementary education, nursing, and graphic design

Psychology was by far the most common major

Graphic Design was the least common major Engineering was a top earner

Education was on the lower end of the income spectrum

Negative correlation between median earnings and share of women

Data Exploring

#install.packages("fivethirtyeight")
#install.packages("skimr")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(scales)
## Warning: package 'scales' was built under R version 4.1.3
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(fivethirtyeight)
## Warning: package 'fivethirtyeight' was built under R version 4.1.3
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
library(skimr)
## Warning: package 'skimr' was built under R version 4.1.3
theme_set(theme_light())
recent_grads <-college_recent_grads

#Data summary

recent_grads <-college_recent_grads
skim(recent_grads)
Data summary
Name recent_grads
Number of rows 173
Number of columns 21
_______________________
Column type frequency:
character 2
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
major 0 1 5 65 0 173 0
major_category 0 1 4 35 0 16 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rank 0 1.00 87.00 50.08 1 44.00 87.00 130.00 173.00 ▇▇▇▇▇
major_code 0 1.00 3879.82 1687.75 1100 2403.00 3608.00 5503.00 6403.00 ▃▇▅▃▇
total 1 0.99 39370.08 63483.49 124 4549.75 15104.00 38909.75 393735.00 ▇▁▁▁▁
sample_size 0 1.00 356.08 618.36 2 39.00 130.00 338.00 4212.00 ▇▁▁▁▁
men 1 0.99 16723.41 28122.43 119 2177.50 5434.00 14631.00 173809.00 ▇▁▁▁▁
women 1 0.99 22646.67 41057.33 0 1778.25 8386.50 22553.75 307087.00 ▇▁▁▁▁
sharewomen 1 0.99 0.52 0.23 0 0.34 0.53 0.70 0.97 ▂▆▆▇▃
employed 0 1.00 31192.76 50675.00 0 3608.00 11797.00 31433.00 307933.00 ▇▁▁▁▁
employed_fulltime 0 1.00 26029.31 42869.66 111 3154.00 10048.00 25147.00 251540.00 ▇▁▁▁▁
employed_parttime 0 1.00 8832.40 14648.18 0 1030.00 3299.00 9948.00 115172.00 ▇▁▁▁▁
employed_fulltime_yearround 0 1.00 19694.43 33160.94 111 2453.00 7413.00 16891.00 199897.00 ▇▁▁▁▁
unemployed 0 1.00 2416.33 4112.80 0 304.00 893.00 2393.00 28169.00 ▇▁▁▁▁
unemployment_rate 0 1.00 0.07 0.03 0 0.05 0.07 0.09 0.18 ▂▇▆▁▁
p25th 0 1.00 29501.45 9166.01 18500 24000.00 27000.00 33000.00 95000.00 ▇▂▁▁▁
median 0 1.00 40151.45 11470.18 22000 33000.00 36000.00 45000.00 110000.00 ▇▅▁▁▁
p75th 0 1.00 51494.22 14906.28 22000 42000.00 47000.00 60000.00 125000.00 ▅▇▂▁▁
college_jobs 0 1.00 12322.64 21299.87 0 1675.00 4390.00 14444.00 151643.00 ▇▁▁▁▁
non_college_jobs 0 1.00 13284.50 23789.66 0 1591.00 4595.00 11783.00 148395.00 ▇▁▁▁▁
low_wage_jobs 0 1.00 3859.02 6945.00 0 340.00 1231.00 3466.00 48207.00 ▇▁▁▁▁
#summary(recent_grads)

#What are trending Major Category vs total number of graduates?

#Similarly, I can just group by the Majors column to get a bar plot for the most common majors. But, since we have 173 majors, I will only look at the top 20 most common majors.

## List of 1
##  $ legend.postion: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

What categories of majors make more money than others?

recent_grads %>%
  mutate(major_category = fct_reorder(major_category, median)) %>%
  ggplot(aes(major_category, median, fill = major_category)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y = 0) +
  coord_flip() +
  labs( x = "",
        y = "Total # of graduates",title = "What categories of majors make more money than others?") +
  theme(legend.position = "none")

#What is the top earning Major Category based on the Median Salary? There are two plots I can use to answer this question i.e. a bar plot and a box plot. In the EDA phase, I can play around with both plots but while reporting I would drop one of them.

For the bar plot, I would group by the Major Category, and apply the median aggregate function on Median Salary. This would get me the top median salaries per major category.

recent_grads %>%
  group_by(major_category) %>%
  summarise()
## # A tibble: 16 x 1
##    major_category                     
##    <chr>                              
##  1 Agriculture & Natural Resources    
##  2 Arts                               
##  3 Biology & Life Science             
##  4 Business                           
##  5 Communications & Journalism        
##  6 Computers & Mathematics            
##  7 Education                          
##  8 Engineering                        
##  9 Health                             
## 10 Humanities & Liberal Arts          
## 11 Industrial Arts & Consumer Services
## 12 Interdisciplinary                  
## 13 Law & Public Policy                
## 14 Physical Sciences                  
## 15 Psychology & Social Work           
## 16 Social Science
theme_set(theme_light())

#What are the highest earning majors?
As discussed in previous section, there is a lot of variation in Sample Sizes. Hence, I will filter on Sample Size > 100 and sort by Median in descending order to get only the top 20 majors.

#Arrange by Gender

recent_grads %>%
  filter(!is.na(total)) %>%
  group_by(major_category) %>%
  summarize(men = sum(men),
            women = sum(women),
            total = sum(total),
            MedianSalary = sum(median * sample_size) / sum(sample_size)) %>%
  mutate(ShareWomen = women / total) %>%
  gather(gender,number,men,women)%>%
  arrange(desc(ShareWomen))
## # A tibble: 32 x 6
##    major_category               total MedianSalary ShareWomen gender number
##    <chr>                        <int>        <dbl>      <dbl> <chr>   <int>
##  1 Health                      463230       43694.      0.837 men     75517
##  2 Health                      463230       43694.      0.837 women  387713
##  3 Education                   559129       32364.      0.815 men    103526
##  4 Education                   559129       32364.      0.815 women  455603
##  5 Psychology & Social Work    481007       31234.      0.796 men     98115
##  6 Psychology & Social Work    481007       31234.      0.796 women  382892
##  7 Interdisciplinary            12296       35000       0.771 men      2817
##  8 Interdisciplinary            12296       35000       0.771 women    9479
##  9 Communications & Journalism 392601       34738.      0.664 men    131921
## 10 Communications & Journalism 392601       34738.      0.664 women  260680
## # ... with 22 more rows

#How does gender breakdown relate to typical earnings?

## List of 1
##  $ legend.postion: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

#What is a ShareWomen and Average Salary Correlation?

majors_processed <- recent_grads %>%
  arrange(desc(median)) %>%
  mutate(major = str_to_title(major),
         major = fct_reorder(major, median))
by_major_category <- majors_processed %>%
  filter(!is.na(total)) %>%
  group_by(major_category) %>%
  summarize(men = sum(men),
            women = sum(women),
            total = sum(total),
            MedianSalary = sum(median * sample_size) / sum(sample_size)) %>%
  mutate(Sharewomen = women / total) %>%
  arrange(desc(Sharewomen))
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
g <- majors_processed %>%
  mutate(major_category = fct_lump(major_category, 4)) %>%
  ggplot(aes(sharewomen, median, color = major_category, size = sample_size, label = major)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = dollar_format()) +
  expand_limits(y = 0)
ggplotly(g)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).