Title:Economic Guide to Picking a College Major - A tidy analysis source of data :fivethirtyeight – Data Summary ———————— Values
Name recent_grads Number of rows 173
Number of columns 21
_______________________
Column type frequency:
character 2
numeric 19
________________________
Group variables None
Objective of the project:
This is the code behind an analysis of “College Major picking and Income.”
Why is this information important?
Gives future college students an idea of what jobs are making the most money
How does gender play a role in the earnings gap and picking major in collage?
Sheds light on the gender gap in earnings between jobs
What type of data are we looking at?
Number of individuals graduating with a specific major
Number of women and men graduating within each specific major
Median income for all the jobs as well as the share of women for each job
What are trending Major Category vs total number of graduates?
Trends of common majors and common major category
Key findings:
Men dominated share of workers for Computer science and most business degrees
Women dominated share of workers for elementary education, nursing, and graphic design
Psychology was by far the most common major
Graphic Design was the least common major Engineering was a top earner
Education was on the lower end of the income spectrum
Negative correlation between median earnings and share of women
#install.packages("fivethirtyeight")
#install.packages("skimr")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(scales)
## Warning: package 'scales' was built under R version 4.1.3
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(fivethirtyeight)
## Warning: package 'fivethirtyeight' was built under R version 4.1.3
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
library(skimr)
## Warning: package 'skimr' was built under R version 4.1.3
theme_set(theme_light())
recent_grads <-college_recent_grads
#Data summary
recent_grads <-college_recent_grads
skim(recent_grads)
| Name | recent_grads |
| Number of rows | 173 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| major | 0 | 1 | 5 | 65 | 0 | 173 | 0 |
| major_category | 0 | 1 | 4 | 35 | 0 | 16 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rank | 0 | 1.00 | 87.00 | 50.08 | 1 | 44.00 | 87.00 | 130.00 | 173.00 | ▇▇▇▇▇ |
| major_code | 0 | 1.00 | 3879.82 | 1687.75 | 1100 | 2403.00 | 3608.00 | 5503.00 | 6403.00 | ▃▇▅▃▇ |
| total | 1 | 0.99 | 39370.08 | 63483.49 | 124 | 4549.75 | 15104.00 | 38909.75 | 393735.00 | ▇▁▁▁▁ |
| sample_size | 0 | 1.00 | 356.08 | 618.36 | 2 | 39.00 | 130.00 | 338.00 | 4212.00 | ▇▁▁▁▁ |
| men | 1 | 0.99 | 16723.41 | 28122.43 | 119 | 2177.50 | 5434.00 | 14631.00 | 173809.00 | ▇▁▁▁▁ |
| women | 1 | 0.99 | 22646.67 | 41057.33 | 0 | 1778.25 | 8386.50 | 22553.75 | 307087.00 | ▇▁▁▁▁ |
| sharewomen | 1 | 0.99 | 0.52 | 0.23 | 0 | 0.34 | 0.53 | 0.70 | 0.97 | ▂▆▆▇▃ |
| employed | 0 | 1.00 | 31192.76 | 50675.00 | 0 | 3608.00 | 11797.00 | 31433.00 | 307933.00 | ▇▁▁▁▁ |
| employed_fulltime | 0 | 1.00 | 26029.31 | 42869.66 | 111 | 3154.00 | 10048.00 | 25147.00 | 251540.00 | ▇▁▁▁▁ |
| employed_parttime | 0 | 1.00 | 8832.40 | 14648.18 | 0 | 1030.00 | 3299.00 | 9948.00 | 115172.00 | ▇▁▁▁▁ |
| employed_fulltime_yearround | 0 | 1.00 | 19694.43 | 33160.94 | 111 | 2453.00 | 7413.00 | 16891.00 | 199897.00 | ▇▁▁▁▁ |
| unemployed | 0 | 1.00 | 2416.33 | 4112.80 | 0 | 304.00 | 893.00 | 2393.00 | 28169.00 | ▇▁▁▁▁ |
| unemployment_rate | 0 | 1.00 | 0.07 | 0.03 | 0 | 0.05 | 0.07 | 0.09 | 0.18 | ▂▇▆▁▁ |
| p25th | 0 | 1.00 | 29501.45 | 9166.01 | 18500 | 24000.00 | 27000.00 | 33000.00 | 95000.00 | ▇▂▁▁▁ |
| median | 0 | 1.00 | 40151.45 | 11470.18 | 22000 | 33000.00 | 36000.00 | 45000.00 | 110000.00 | ▇▅▁▁▁ |
| p75th | 0 | 1.00 | 51494.22 | 14906.28 | 22000 | 42000.00 | 47000.00 | 60000.00 | 125000.00 | ▅▇▂▁▁ |
| college_jobs | 0 | 1.00 | 12322.64 | 21299.87 | 0 | 1675.00 | 4390.00 | 14444.00 | 151643.00 | ▇▁▁▁▁ |
| non_college_jobs | 0 | 1.00 | 13284.50 | 23789.66 | 0 | 1591.00 | 4595.00 | 11783.00 | 148395.00 | ▇▁▁▁▁ |
| low_wage_jobs | 0 | 1.00 | 3859.02 | 6945.00 | 0 | 340.00 | 1231.00 | 3466.00 | 48207.00 | ▇▁▁▁▁ |
#summary(recent_grads)
#What are trending Major Category vs total number of graduates?
#Similarly, I can just group by the Majors column to get a bar plot for the most common majors. But, since we have 173 majors, I will only look at the top 20 most common majors.
## List of 1
## $ legend.postion: chr "none"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
recent_grads %>%
mutate(major_category = fct_reorder(major_category, median)) %>%
ggplot(aes(major_category, median, fill = major_category)) +
geom_boxplot() +
scale_y_continuous(labels = scales::dollar_format()) +
expand_limits(y = 0) +
coord_flip() +
labs( x = "",
y = "Total # of graduates",title = "What categories of majors make more money than others?") +
theme(legend.position = "none")
#What is the top earning Major Category based on the Median Salary? There are two plots I can use to answer this question i.e. a bar plot and a box plot. In the EDA phase, I can play around with both plots but while reporting I would drop one of them.
For the bar plot, I would group by the Major Category, and apply the median aggregate function on Median Salary. This would get me the top median salaries per major category.
recent_grads %>%
group_by(major_category) %>%
summarise()
## # A tibble: 16 x 1
## major_category
## <chr>
## 1 Agriculture & Natural Resources
## 2 Arts
## 3 Biology & Life Science
## 4 Business
## 5 Communications & Journalism
## 6 Computers & Mathematics
## 7 Education
## 8 Engineering
## 9 Health
## 10 Humanities & Liberal Arts
## 11 Industrial Arts & Consumer Services
## 12 Interdisciplinary
## 13 Law & Public Policy
## 14 Physical Sciences
## 15 Psychology & Social Work
## 16 Social Science
theme_set(theme_light())
#What are the highest earning majors?
As discussed in previous section, there is a lot of variation in Sample Sizes. Hence, I will filter on Sample Size > 100 and sort by Median in descending order to get only the top 20 majors.
#Arrange by Gender
recent_grads %>%
filter(!is.na(total)) %>%
group_by(major_category) %>%
summarize(men = sum(men),
women = sum(women),
total = sum(total),
MedianSalary = sum(median * sample_size) / sum(sample_size)) %>%
mutate(ShareWomen = women / total) %>%
gather(gender,number,men,women)%>%
arrange(desc(ShareWomen))
## # A tibble: 32 x 6
## major_category total MedianSalary ShareWomen gender number
## <chr> <int> <dbl> <dbl> <chr> <int>
## 1 Health 463230 43694. 0.837 men 75517
## 2 Health 463230 43694. 0.837 women 387713
## 3 Education 559129 32364. 0.815 men 103526
## 4 Education 559129 32364. 0.815 women 455603
## 5 Psychology & Social Work 481007 31234. 0.796 men 98115
## 6 Psychology & Social Work 481007 31234. 0.796 women 382892
## 7 Interdisciplinary 12296 35000 0.771 men 2817
## 8 Interdisciplinary 12296 35000 0.771 women 9479
## 9 Communications & Journalism 392601 34738. 0.664 men 131921
## 10 Communications & Journalism 392601 34738. 0.664 women 260680
## # ... with 22 more rows
#How does gender breakdown relate to typical earnings?
## List of 1
## $ legend.postion: chr "none"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
#What is a ShareWomen and Average Salary Correlation?
majors_processed <- recent_grads %>%
arrange(desc(median)) %>%
mutate(major = str_to_title(major),
major = fct_reorder(major, median))
by_major_category <- majors_processed %>%
filter(!is.na(total)) %>%
group_by(major_category) %>%
summarize(men = sum(men),
women = sum(women),
total = sum(total),
MedianSalary = sum(median * sample_size) / sum(sample_size)) %>%
mutate(Sharewomen = women / total) %>%
arrange(desc(Sharewomen))
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
g <- majors_processed %>%
mutate(major_category = fct_lump(major_category, 4)) %>%
ggplot(aes(sharewomen, median, color = major_category, size = sample_size, label = major)) +
geom_point() +
geom_smooth(aes(group = 1), method = "lm") +
scale_x_continuous(labels = percent_format()) +
scale_y_continuous(labels = dollar_format()) +
expand_limits(y = 0)
ggplotly(g)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).