In this project, we explore the college_recent_grads dataset from the fivethirtyeight package.

Start by loading the packages: tidyverse, scales, and fivethirtyeight.
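The loading step can be sketched as follows. This assumes all three packages are already installed (for example with install.packages()).

```r
# Load the packages used throughout this analysis.
# Assumes they are already installed.
library(tidyverse)        # dplyr, ggplot2, and related packages
library(scales)           # percent() for formatting proportions
library(fivethirtyeight)  # provides the college_recent_grads data
```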

Here are some questions we might want to answer with these data:

• Which major has the lowest unemployment rate?

• Which major has the highest percentage of women?

• How do the distributions of median income compare across major categories?

• Do women tend to choose majors with lower or higher earnings?

glimpse(college_recent_grads)
## Rows: 173
## Columns: 21
## $ rank                        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
## $ major_code                  <int> 2419, 2416, 2415, 2417, 2405, 2418, 6202, …
## $ major                       <chr> "Petroleum Engineering", "Mining And Miner…
## $ major_category              <chr> "Engineering", "Engineering", "Engineering…
## $ total                       <int> 2339, 756, 856, 1258, 32260, 2573, 3777, 1…
## $ sample_size                 <int> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, …
## $ men                         <int> 2057, 679, 725, 1123, 21239, 2200, 2110, 8…
## $ women                       <int> 282, 77, 131, 135, 11021, 373, 1667, 960, …
## $ sharewomen                  <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132…
## $ employed                    <int> 1976, 640, 648, 758, 25694, 1857, 2912, 15…
## $ employed_fulltime           <int> 1849, 556, 558, 1069, 23170, 2038, 2924, 1…
## $ employed_parttime           <int> 270, 170, 133, 150, 5180, 264, 296, 553, 1…
## $ employed_fulltime_yearround <int> 1207, 388, 340, 692, 16697, 1449, 2482, 82…
## $ unemployed                  <int> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, …
## $ unemployment_rate           <dbl> 0.018380527, 0.117241379, 0.024096386, 0.0…
## $ p25th                       <dbl> 95000, 55000, 50000, 43000, 50000, 50000, …
## $ median                      <dbl> 110000, 75000, 73000, 70000, 65000, 65000,…
## $ p75th                       <dbl> 125000, 90000, 105000, 80000, 75000, 10200…
## $ college_jobs                <int> 1534, 350, 456, 529, 18314, 1142, 1768, 97…
## $ non_college_jobs            <int> 364, 257, 176, 102, 4440, 657, 314, 500, 1…
## $ low_wage_jobs               <int> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3…

Which major has the lowest unemployment rate?

To answer this question, all we need to do is sort the data. We use the arrange() function to sort the rows by the unemployment_rate variable in ascending order.

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate) %>%
  # format the unemployment rate as a percentage
  mutate(unemployment_rate = percent(unemployment_rate))
## # A tibble: 173 × 3
##     rank major                                      unemployment_rate
##    <int> <chr>                                      <chr>            
##  1    53 Mathematics And Computer Science           0.00000%         
##  2    74 Military Technologies                      0.00000%         
##  3    84 Botany                                     0.00000%         
##  4   113 Soil Science                               0.00000%         
##  5   121 Educational Administration And Supervision 0.00000%         
##  6    15 Engineering Mechanics Physics And Science  0.63343%         
##  7    20 Court Reporting                            1.16897%         
##  8   120 Mathematics Teacher Education              1.62028%         
##  9     1 Petroleum Engineering                      1.83805%         
## 10    65 General Agriculture                        1.96425%         
## # ℹ 163 more rows
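The five decimal places in the output above appear because percent() picks its precision automatically. As a small sketch, the accuracy argument of scales::percent() can be set to round the display to two decimal places. The three input values are taken from the table above and are used here only for illustration.

```r
library(scales)

# A few unemployment rates from the table above, expressed as proportions
rates <- c(0, 0.0063343, 0.018380527)

# accuracy = 0.01 rounds the displayed percentage to two decimal places
formatted <- percent(rates, accuracy = 0.01)
formatted
# "0.00%" "0.63%" "1.84%"
```

The same argument could be added to the mutate() call in the pipeline if a shorter display is preferred.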

Which major has the highest percentage of women?

To answer this question, we need to arrange the data in descending order of sharewomen. From the output below, we see that Early Childhood Education is the major with the highest proportion of women (about 96.9%).

college_recent_grads %>%
  arrange(desc(sharewomen)) %>%
  select(major, total, sharewomen) %>%
  top_n(3) #show the top 3 majors
## Selecting by sharewomen
## # A tibble: 3 × 3
##   major                                         total sharewomen
##   <chr>                                         <int>      <dbl>
## 1 Early Childhood Education                     37589      0.969
## 2 Communication Disorders Sciences And Services 38279      0.968
## 3 Medical Assisting Services                    11123      0.928
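In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). The sketch below is one possible alternative to the pipeline above, not the method used in this analysis. It should select the same three majors.

```r
library(dplyr)
library(fivethirtyeight)

# slice_max() keeps the n rows with the largest values of a column,
# so no separate arrange() step is needed
top_women <- college_recent_grads %>%
  select(major, total, sharewomen) %>%
  slice_max(sharewomen, n = 3)

top_women
```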

How do the distributions of median income compare across major categories?

A percentile is the value below which a given percentage of the observations in a group falls. For example, the p25th and p75th variables give the 25th and 75th percentiles of earnings for each major.
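As a small illustration of percentiles, the sketch below applies quantile() to the median earnings of the first six majors shown in the glimpse() output. The vector name earnings is only for this example, and the result depends on R's default interpolation method.

```r
# Median earnings of the first six majors shown in the glimpse() output
earnings <- c(110000, 75000, 73000, 70000, 65000, 65000)

# 25th, 50th, and 75th percentiles (R's default interpolation method)
quartiles <- quantile(earnings, probs = c(0.25, 0.50, 0.75))
quartiles
```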

To answer this question, we need to group the data by major_category. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions. So first, we need to visualize the data. We use the ggplot() function to do this. The first argument is the data frame, and the next argument gives the mapping of the variables of the data to the aesthetic elements of the plot.

ggplot(data = college_recent_grads, mapping = aes(x = median)) +
  geom_histogram(binwidth = 5000) +
  ggtitle("Histogram of Median Salaries for Recent College Graduates")

We can also calculate the summary statistics for this distribution.

college_recent_grads %>%
  summarise(min = min(median), max = max(median),
            mean = mean(median), med = median(median), sd = sd(median),
            q1 = quantile(median, probs = 0.25),
            q3 = quantile(median, probs = 0.75))
## # A tibble: 1 × 7
##     min    max   mean   med     sd    q1    q3
##   <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 22000 110000 40151. 36000 11470. 33000 45000

The histogram of median salaries for recent college graduates shows a right-skewed distribution: most majors have relatively low median salaries, and a few majors have much higher ones.

For this distribution, the median is a more robust measure of central tendency than the mean due to the right-skewed nature of the data.

The maximum salary is much higher than the other values, which indicates potential outliers. Therefore, the median and IQR describe the center and spread of this distribution of median salaries better than the mean and standard deviation.
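The outlier claim can be checked with simple arithmetic on the summary table. The sketch below uses the common 1.5 × IQR rule of thumb, which is an assumption added here for illustration and is not part of the original analysis.

```r
# Quartiles and maximum taken from the summary table above
q1 <- 33000
q3 <- 45000
max_median <- 110000

iqr <- q3 - q1                  # interquartile range: 12000
upper_fence <- q3 + 1.5 * iqr   # 45000 + 18000 = 63000

# TRUE: the maximum lies well above the upper fence
max_median > upper_fence
```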

Plot the distribution of median income using a histogram, faceted by major_category.

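# fill must be mapped inside aes(); a binwidth of 5000 matches the earlier histogram
# (100000 is wider than the whole range of the data); the legend duplicates the facet labels
ggplot(college_recent_grads, mapping = aes(x = median, fill = major_category)) +
  geom_histogram(binwidth = 5000, show.legend = FALSE) +
  facet_wrap(~major_category, nrow = 1)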

From the histogram above and the table below, we see that the Engineering category has the highest median salaries.

Which major category has the highest typical median income?

college_recent_grads %>% 
  group_by(major_category) %>% 
  summarise(typical_median_income = median(median)) %>%
  arrange(desc(typical_median_income))
## # A tibble: 16 × 2
##    major_category                      typical_median_income
##    <chr>                                               <dbl>
##  1 Engineering                                         57000
##  2 Computers & Mathematics                             45000
##  3 Business                                            40000
##  4 Physical Sciences                                   39500
##  5 Social Science                                      38000
##  6 Biology & Life Science                              36300
##  7 Law & Public Policy                                 36000
##  8 Agriculture & Natural Resources                     35000
##  9 Communications & Journalism                         35000
## 10 Health                                              35000
## 11 Industrial Arts & Consumer Services                 35000
## 12 Interdisciplinary                                   35000
## 13 Education                                           32750
## 14 Humanities & Liberal Arts                           32000
## 15 Arts                                                30750
## 16 Psychology & Social Work                            30000

Which major category is the least popular in this sample?

college_recent_grads %>%
  count(major_category) %>%
  arrange(n)
## # A tibble: 16 × 2
##    major_category                          n
##    <chr>                               <int>
##  1 Interdisciplinary                       1
##  2 Communications & Journalism             4
##  3 Law & Public Policy                     5
##  4 Industrial Arts & Consumer Services     7
##  5 Arts                                    8
##  6 Psychology & Social Work                9
##  7 Social Science                          9
##  8 Agriculture & Natural Resources        10
##  9 Physical Sciences                      10
## 10 Computers & Mathematics                11
## 11 Health                                 12
## 12 Business                               13
## 13 Biology & Life Science                 14
## 14 Humanities & Liberal Arts              15
## 15 Education                              16
## 16 Engineering                            29

The table above counts the majors in each category. By this measure, the least popular categories are Interdisciplinary, Communications & Journalism, and Law & Public Policy.
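count() tallies majors, not students. As an alternative sketch, popularity could also be measured by the number of graduates in each category by summing the total variable. The name students_by_category is introduced here only for illustration, and the ranking it produces may differ from the table above.

```r
library(dplyr)
library(fivethirtyeight)

# Sum the number of graduates (total) across the majors in each category.
# na.rm = TRUE skips any major whose total is missing.
students_by_category <- college_recent_grads %>%
  group_by(major_category) %>%
  summarise(total_students = sum(total, na.rm = TRUE)) %>%
  arrange(total_students)

students_by_category
```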

Which STEM majors have median salaries below the median for all majors’ median earnings?

# a vector of the major categories considered STEM
stem_categories <- c("Biology & Life Science", "Computers & Mathematics",
                     "Engineering", "Physical Sciences")

# create a new variable in our data frame indicating whether a major is STEM or not
college_recent_grads <- college_recent_grads %>%
  mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem"))

# filter for STEM majors whose median earnings are less than the median
# of all majors' median earnings, which we found to be $36,000 earlier
college_recent_grads %>%
  filter(major_type == "stem", median < 36000) %>%
  select(major, median, p25th, p75th) %>%
  arrange(desc(median))
## # A tibble: 10 × 4
##    major                                 median p25th p75th
##    <chr>                                  <dbl> <dbl> <dbl>
##  1 Environmental Science                  35600 25000 40200
##  2 Multi-Disciplinary Or General Science  35000 24000 50000
##  3 Physiology                             35000 20000 50000
##  4 Communication Technologies             35000 25000 45000
##  5 Neuroscience                           35000 30000 44000
##  6 Atmospheric Sciences And Meteorology   35000 28000 50000
##  7 Miscellaneous Biology                  33500 23000 48000
##  8 Biology                                33400 24000 45000
##  9 Ecology                                33000 23000 42000
## 10 Zoology                                26000 20000 39000
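The $36,000 threshold above is typed in by hand. A minimal sketch of computing it from the data instead is shown below. The names overall_median and low_earning_stem are introduced here for illustration, and the result should match the ten majors listed above.

```r
library(dplyr)
library(fivethirtyeight)

stem_categories <- c("Biology & Life Science", "Computers & Mathematics",
                     "Engineering", "Physical Sciences")

# Median of all majors' median earnings, computed rather than typed in
overall_median <- median(college_recent_grads$median)

low_earning_stem <- college_recent_grads %>%
  filter(major_category %in% stem_categories, median < overall_median) %>%
  select(major, median, p25th, p75th) %>%
  arrange(desc(median))

low_earning_stem
```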

What types of majors do women tend to choose?

Create a scatterplot of median income vs. proportion of women in that major colored by whether the major is in a STEM field or not.

college_recent_grads %>%
  group_by(major) %>%
  summarise(median_income = median(median), proportion_women = mean(sharewomen), major_type = unique(major_type)) %>%
  ggplot(aes(x = median_income, y = proportion_women, color = major_type)) +
  geom_point() +
  labs(title = "Scatterplot of Median Income vs. Proportion of Women by Major",
       x = "Median Income",
       y = "Proportion of Women",
       color = "Major Type") +
  scale_color_manual(values = c("blue", "green"))

This scatterplot shows that majors with a smaller proportion of women tend to have higher median incomes and are predominantly STEM fields. In contrast, majors with higher proportions of women tend to have lower median incomes and are mostly non-STEM fields. This highlights a potential gender disparity in the distribution of earnings across different fields of study.
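One way to put a number on the pattern in the scatterplot is the correlation between the share of women and median earnings. This is a rough sketch that is not part of the original analysis. A negative value would be consistent with the relationship described above.

```r
library(fivethirtyeight)

# Pearson correlation between the share of women and median earnings.
# use = "complete.obs" drops majors with a missing sharewomen value.
r <- cor(college_recent_grads$sharewomen,
         college_recent_grads$median,
         use = "complete.obs")
r
```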

Reference:

https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html