Title:Economic Guide to Picking a College Major - A tidy analysis source of data :fivethirtyeight – Data Summary ———————— Values
Name recent_grads Number of rows 173
Number of columns 21
_______________________
Column type frequency:
character 2
numeric 19
________________________
Group variables None

Women dominated share of workers for elementary education, nursing, and graphic design

Data Exploring

#install.packages("fivethirtyeight")
#install.packages("skimr")
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(scales)

## Warning: package 'scales' was built under R version 4.1.3

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(fivethirtyeight)

## Warning: package 'fivethirtyeight' was built under R version 4.1.3

## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')

library(skimr)

## Warning: package 'skimr' was built under R version 4.1.3

theme_set(theme_light())
recent_grads <-college_recent_grads

#Data summary

recent_grads <-college_recent_grads
skim(recent_grads)

Data summary
Name	recent_grads
Number of rows	173
Number of columns	21
_______________________
Column type frequency:
character	2
numeric	19
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
major	0	1	5	65	0	173	0
major_category	0	1	4	35	0	16	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	0	1.00	87.00	50.08	1	44.00	87.00	130.00	173.00	▇▇▇▇▇
major_code	0	1.00	3879.82	1687.75	1100	2403.00	3608.00	5503.00	6403.00	▃▇▅▃▇
total	1	0.99	39370.08	63483.49	124	4549.75	15104.00	38909.75	393735.00	▇▁▁▁▁
sample_size	0	1.00	356.08	618.36	2	39.00	130.00	338.00	4212.00	▇▁▁▁▁
men	1	0.99	16723.41	28122.43	119	2177.50	5434.00	14631.00	173809.00	▇▁▁▁▁
women	1	0.99	22646.67	41057.33	0	1778.25	8386.50	22553.75	307087.00	▇▁▁▁▁
sharewomen	1	0.99	0.52	0.23	0	0.34	0.53	0.70	0.97	▂▆▆▇▃
employed	0	1.00	31192.76	50675.00	0	3608.00	11797.00	31433.00	307933.00	▇▁▁▁▁
employed_fulltime	0	1.00	26029.31	42869.66	111	3154.00	10048.00	25147.00	251540.00	▇▁▁▁▁
employed_parttime	0	1.00	8832.40	14648.18	0	1030.00	3299.00	9948.00	115172.00	▇▁▁▁▁
employed_fulltime_yearround	0	1.00	19694.43	33160.94	111	2453.00	7413.00	16891.00	199897.00	▇▁▁▁▁
unemployed	0	1.00	2416.33	4112.80	0	304.00	893.00	2393.00	28169.00	▇▁▁▁▁
unemployment_rate	0	1.00	0.07	0.03	0	0.05	0.07	0.09	0.18	▂▇▆▁▁
p25th	0	1.00	29501.45	9166.01	18500	24000.00	27000.00	33000.00	95000.00	▇▂▁▁▁
median	0	1.00	40151.45	11470.18	22000	33000.00	36000.00	45000.00	110000.00	▇▅▁▁▁
p75th	0	1.00	51494.22	14906.28	22000	42000.00	47000.00	60000.00	125000.00	▅▇▂▁▁
college_jobs	0	1.00	12322.64	21299.87	0	1675.00	4390.00	14444.00	151643.00	▇▁▁▁▁
non_college_jobs	0	1.00	13284.50	23789.66	0	1591.00	4595.00	11783.00	148395.00	▇▁▁▁▁
low_wage_jobs	0	1.00	3859.02	6945.00	0	340.00	1231.00	3466.00	48207.00	▇▁▁▁▁

#summary(recent_grads)

#What are trending Major Category vs total number of graduates?

#Similarly, I can just group by the Majors column to get a bar plot for the most common majors. But, since we have 173 majors, I will only look at the top 20 most common majors.

## List of 1
##  $ legend.postion: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

What categories of majors make more money than others?

recent_grads %>%
  mutate(major_category = fct_reorder(major_category, median)) %>%
  ggplot(aes(major_category, median, fill = major_category)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y = 0) +
  coord_flip() +
  labs( x = "",
        y = "Total # of graduates",title = "What categories of majors make more money than others?") +
  theme(legend.position = "none")

#What is the top earning Major Category based on the Median Salary? There are two plots I can use to answer this question i.e. a bar plot and a box plot. In the EDA phase, I can play around with both plots but while reporting I would drop one of them.

For the bar plot, I would group by the Major Category, and apply the median aggregate function on Median Salary. This would get me the top median salaries per major category.

recent_grads %>%
  group_by(major_category) %>%
  summarise()

## # A tibble: 16 x 1
##    major_category                     
##    <chr>                              
##  1 Agriculture & Natural Resources    
##  2 Arts                               
##  3 Biology & Life Science             
##  4 Business                           
##  5 Communications & Journalism        
##  6 Computers & Mathematics            
##  7 Education                          
##  8 Engineering                        
##  9 Health                             
## 10 Humanities & Liberal Arts          
## 11 Industrial Arts & Consumer Services
## 12 Interdisciplinary                  
## 13 Law & Public Policy                
## 14 Physical Sciences                  
## 15 Psychology & Social Work           
## 16 Social Science

theme_set(theme_light())

#What are the highest earning majors?
As discussed in previous section, there is a lot of variation in Sample Sizes. Hence, I will filter on Sample Size > 100 and sort by Median in descending order to get only the top 20 majors.

#Arrange by Gender

recent_grads %>%
  filter(!is.na(total)) %>%
  group_by(major_category) %>%
  summarize(men = sum(men),
            women = sum(women),
            total = sum(total),
            MedianSalary = sum(median * sample_size) / sum(sample_size)) %>%
  mutate(ShareWomen = women / total) %>%
  gather(gender,number,men,women)%>%
  arrange(desc(ShareWomen))

## # A tibble: 32 x 6
##    major_category               total MedianSalary ShareWomen gender number
##    <chr>                        <int>        <dbl>      <dbl> <chr>   <int>
##  1 Health                      463230       43694.      0.837 men     75517
##  2 Health                      463230       43694.      0.837 women  387713
##  3 Education                   559129       32364.      0.815 men    103526
##  4 Education                   559129       32364.      0.815 women  455603
##  5 Psychology & Social Work    481007       31234.      0.796 men     98115
##  6 Psychology & Social Work    481007       31234.      0.796 women  382892
##  7 Interdisciplinary            12296       35000       0.771 men      2817
##  8 Interdisciplinary            12296       35000       0.771 women    9479
##  9 Communications & Journalism 392601       34738.      0.664 men    131921
## 10 Communications & Journalism 392601       34738.      0.664 women  260680
## # ... with 22 more rows

#How does gender breakdown relate to typical earnings?

## List of 1
##  $ legend.postion: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

#What is a ShareWomen and Average Salary Correlation?

majors_processed <- recent_grads %>%
  arrange(desc(median)) %>%
  mutate(major = str_to_title(major),
         major = fct_reorder(major, median))

by_major_category <- majors_processed %>%
  filter(!is.na(total)) %>%
  group_by(major_category) %>%
  summarize(men = sum(men),
            women = sum(women),
            total = sum(total),
            MedianSalary = sum(median * sample_size) / sum(sample_size)) %>%
  mutate(Sharewomen = women / total) %>%
  arrange(desc(Sharewomen))

library(plotly)

## Warning: package 'plotly' was built under R version 4.1.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

g <- majors_processed %>%
  mutate(major_category = fct_lump(major_category, 4)) %>%
  ggplot(aes(sharewomen, median, color = major_category, size = sample_size, label = major)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = dollar_format()) +
  expand_limits(y = 0)
ggplotly(g)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

final project

sileshi

4/28/2022

Data Exploring

What categories of majors make more money than others?