In this project, we explore the college_recent_grads dataset from the fivethirtyeight package.
Start by loading the packages tidyverse, scales, and fivethirtyeight.
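A minimal sketch of that loading step follows. The check for missing packages is an addition for convenience, not part of the original analysis, and `pkgs` and `installed` are illustrative names.

```r
pkgs <- c("tidyverse", "scales", "fivethirtyeight")

# Check which of the three packages are installed before attaching them.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (all(installed)) {
  library(tidyverse)        # dplyr verbs, ggplot2, and the %>% pipe
  library(scales)           # percent() for formatting proportions
  library(fivethirtyeight)  # provides college_recent_grads
} else {
  message("Install first: ", paste(pkgs[!installed], collapse = ", "))
}
```

Any package reported as missing can be installed with install.packages() before rerunning the block.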
Here are some questions we might want to answer with these data:
• Which major has the lowest unemployment rate?
• Which major has the highest percentage of women?
• How do the distributions of median income compare across major categories?
• Do women tend to choose majors with lower or higher earnings?
glimpse(college_recent_grads)
## Rows: 173
## Columns: 21
## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
## $ major_code <int> 2419, 2416, 2415, 2417, 2405, 2418, 6202, …
## $ major <chr> "Petroleum Engineering", "Mining And Miner…
## $ major_category <chr> "Engineering", "Engineering", "Engineering…
## $ total <int> 2339, 756, 856, 1258, 32260, 2573, 3777, 1…
## $ sample_size <int> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, …
## $ men <int> 2057, 679, 725, 1123, 21239, 2200, 2110, 8…
## $ women <int> 282, 77, 131, 135, 11021, 373, 1667, 960, …
## $ sharewomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132…
## $ employed <int> 1976, 640, 648, 758, 25694, 1857, 2912, 15…
## $ employed_fulltime <int> 1849, 556, 558, 1069, 23170, 2038, 2924, 1…
## $ employed_parttime <int> 270, 170, 133, 150, 5180, 264, 296, 553, 1…
## $ employed_fulltime_yearround <int> 1207, 388, 340, 692, 16697, 1449, 2482, 82…
## $ unemployed <int> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, …
## $ unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096386, 0.0…
## $ p25th <dbl> 95000, 55000, 50000, 43000, 50000, 50000, …
## $ median <dbl> 110000, 75000, 73000, 70000, 65000, 65000,…
## $ p75th <dbl> 125000, 90000, 105000, 80000, 75000, 10200…
## $ college_jobs <int> 1534, 350, 456, 529, 18314, 1142, 1768, 97…
## $ non_college_jobs <int> 364, 257, 176, 102, 4440, 657, 314, 500, 1…
## $ low_wage_jobs <int> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3…
Which major has the lowest unemployment rate?
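To answer this question, all we need to do is sort the data. We use the arrange() function to sort the rows by the unemployment_rate variable, which puts the lowest rates first by default. The output below shows that five majors tie for the lowest rate, 0%.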
college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate) %>%
  # format the unemployment rate as a percentage
  mutate(unemployment_rate = percent(unemployment_rate))
## # A tibble: 173 × 3
## rank major unemployment_rate
## <int> <chr> <chr>
## 1 53 Mathematics And Computer Science 0.00000%
## 2 74 Military Technologies 0.00000%
## 3 84 Botany 0.00000%
## 4 113 Soil Science 0.00000%
## 5 121 Educational Administration And Supervision 0.00000%
## 6 15 Engineering Mechanics Physics And Science 0.63343%
## 7 20 Court Reporting 1.16897%
## 8 120 Mathematics Teacher Education 1.62028%
## 9 1 Petroleum Engineering 1.83805%
## 10 65 General Agriculture 1.96425%
## # ℹ 163 more rows
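The percentages above carry five decimal places because percent() picks its own precision by default. As a small sketch, the `accuracy` argument of scales::percent() can set the rounding explicitly. The three rates below are copied from the sorted output, and `rates` and `formatted` are illustrative names.

```r
library(scales)

# Three unemployment rates from the sorted output above, as proportions.
rates <- c(0, 0.0063343, 0.018380527)

# accuracy = 0.1 rounds each value to the nearest 0.1 percentage point.
formatted <- percent(rates, accuracy = 0.1)
print(formatted)
```

The same argument could be passed inside the mutate() call if shorter labels are preferred.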
Which major has the highest percentage of women?
To answer this question, we arrange the data in descending order of sharewomen. From the output below, we see that Early Childhood Education is the major with the highest proportion of women (about 97%).
college_recent_grads %>%
  arrange(desc(sharewomen)) %>%
  select(major, total, sharewomen) %>%
  top_n(3) # keep the 3 majors with the largest values in the last column, sharewomen
## Selecting by sharewomen
## # A tibble: 3 × 3
## major total sharewomen
## <chr> <int> <dbl>
## 1 Early Childhood Education 37589 0.969
## 2 Communication Disorders Sciences And Services 38279 0.968
## 3 Medical Assisting Services 11123 0.928
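The dplyr documentation marks top_n() as superseded by slice_max(), which names the ordering variable explicitly and returns the rows already sorted. The sketch below is an alternative, not the original code. It uses a four-row toy tibble whose values are copied from the outputs above, and `toy` and `top3` are illustrative names.

```r
library(dplyr)

# Toy subset: four majors with sharewomen values taken from the outputs above.
toy <- tibble(
  major = c("Petroleum Engineering",
            "Early Childhood Education",
            "Medical Assisting Services",
            "Communication Disorders Sciences And Services"),
  sharewomen = c(0.1205643, 0.969, 0.928, 0.968)
)

# Keep the three rows with the largest sharewomen, sorted from high to low.
top3 <- toy %>%
  slice_max(sharewomen, n = 3)
print(top3)
```

On the full data, the same call would replace both the arrange(desc(sharewomen)) and top_n(3) steps.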
How do the distributions of median income compare across major categories?
A percentile is the value below which a given percentage of the observations in a group fall. The p25th and p75th columns, for example, hold the 25th and 75th percentiles of earnings for each major.
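The definition can be checked numerically with base R's quantile(). This is a small sketch on the integers 1 to 100, chosen only because the arithmetic is easy to follow; `x` and `q` are illustrative names.

```r
# Illustrative data: the integers 1 to 100.
x <- 1:100

# The 25th and 75th percentiles (first and third quartiles).
q <- quantile(x, probs = c(0.25, 0.75))
print(q)

# Fraction of observations that fall below each percentile.
mean(x < q[["25%"]])
mean(x < q[["75%"]])
```

With small or tied samples the fractions only approximate the nominal percentages, because quantile() interpolates between neighbouring observations.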
To answer this question, we need to group the data by major_category. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions, so first we need to visualize the data. We use the ggplot() function to do this. The first argument is the data frame, and the next argument gives the mapping of the variables of the data to the aesthetic elements of the plot.
ggplot(data = college_recent_grads, mapping = aes(x = median)) +
  geom_histogram(binwidth = 5000) +
  ggtitle("Histogram of Median Salaries for Recent College Graduates")
We can also calculate the summary statistics for this distribution.
college_recent_grads %>%
  summarise(min = min(median), max = max(median),
            mean = mean(median), med = median(median), sd = sd(median),
            q1 = quantile(median, probs = 0.25),
            q3 = quantile(median, probs = 0.75))
## # A tibble: 1 × 7
## min max mean med sd q1 q3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22000 110000 40151. 36000 11470. 33000 45000
The histogram of median salaries for recent college graduates shows a right-skewed distribution: most majors have relatively low median salaries, and a few majors have substantially higher ones.
For this distribution, the median (36,000) is a more robust measure of central tendency than the mean (about 40,151), which the right tail pulls upward.
The maximum of 110,000 lies far above the third quartile of 45,000, which suggests potential outliers. The median and IQR therefore describe the center and spread of this distribution of median salaries better than the mean and standard deviation.
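A small base R sketch shows why the median and IQR are preferred here. The salary values are hypothetical, invented for illustration rather than taken from the dataset, and `salaries` and `salaries_extreme` are illustrative names.

```r
# Hypothetical salaries: four typical values and one very high value.
salaries <- c(30000, 33000, 36000, 45000, 110000)

mean(salaries)    # pulled upward by the single large value
median(salaries)  # depends only on the middle of the sorted data
IQR(salaries)     # spread of the middle half of the data

# Make the top value even more extreme: the mean moves,
# while the median and IQR stay exactly the same.
salaries_extreme <- replace(salaries, 5, 500000)
mean(salaries_extreme)
median(salaries_extreme)
IQR(salaries_extreme)
```

The mean changes with the size of the largest value, while the median and IQR do not, which is what "robust" means in the paragraph above.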
Plot the distribution of median income using a histogram, faceted by major_category.
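# fill must sit inside aes() to be mapped to a variable
ggplot(college_recent_grads, mapping = aes(x = median, fill = major_category)) +
  # reuse the 5000 binwidth from the overall histogram; 100000 would collapse each panel into one or two bars
  geom_histogram(binwidth = 5000) +
  facet_wrap(~major_category, nrow = 1)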
From the histogram above and the table below, we see that the Engineering major category has the highest typical median salary.
Which major category has the highest typical median income?
college_recent_grads %>%
  group_by(major_category) %>%
  summarise(typical_median_income = median(median)) %>%
  arrange(desc(typical_median_income))
## # A tibble: 16 × 2
## major_category typical_median_income
## <chr> <dbl>
## 1 Engineering 57000
## 2 Computers & Mathematics 45000
## 3 Business 40000
## 4 Physical Sciences 39500
## 5 Social Science 38000
## 6 Biology & Life Science 36300
## 7 Law & Public Policy 36000
## 8 Agriculture & Natural Resources 35000
## 9 Communications & Journalism 35000
## 10 Health 35000
## 11 Industrial Arts & Consumer Services 35000
## 12 Interdisciplinary 35000
## 13 Education 32750
## 14 Humanities & Liberal Arts 32000
## 15 Arts 30750
## 16 Psychology & Social Work 30000
Which major category is the least popular in this sample?
college_recent_grads %>%
  count(major_category) %>%
  arrange(n)
## # A tibble: 16 × 2
## major_category n
## <chr> <int>
## 1 Interdisciplinary 1
## 2 Communications & Journalism 4
## 3 Law & Public Policy 5
## 4 Industrial Arts & Consumer Services 7
## 5 Arts 8
## 6 Psychology & Social Work 9
## 7 Social Science 9
## 8 Agriculture & Natural Resources 10
## 9 Physical Sciences 10
## 10 Computers & Mathematics 11
## 11 Health 12
## 12 Business 13
## 13 Biology & Life Science 14
## 14 Humanities & Liberal Arts 15
## 15 Education 16
## 16 Engineering 29
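The table above shows that the categories with the fewest majors in this sample are Interdisciplinary (1), Communications & Journalism (4), and Law & Public Policy (5). Note that count() tallies the number of majors in each category, not the number of graduates.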
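If popularity is meant as the number of graduates rather than the number of majors, one possible approach is to sum the total column within each category. The sketch below uses a hypothetical toy tibble with made-up categories "A" and "B" to show how the two measures can disagree; `toy` and `by_category` are illustrative names.

```r
library(dplyr)

# Hypothetical data: category A has three small majors, B has one large major.
toy <- tibble(
  major_category = c("A", "A", "A", "B"),
  total          = c(100, 200, 300, 5000)
)

# Count the majors and sum the graduates within each category.
by_category <- toy %>%
  group_by(major_category) %>%
  summarise(n_majors = n(), graduates = sum(total)) %>%
  arrange(graduates)
print(by_category)
```

Here B has the fewest majors but the most graduates, so the ranking depends on which definition of popularity is used.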
Which STEM majors have median salaries less than the median for all majors’ median earnings?
# a vector of the major categories considered STEM
stem_categories <- c("Biology & Life Science", "Computers & Mathematics",
                     "Engineering", "Physical Sciences")
# create a new variable in the data frame indicating whether a major is STEM or not
college_recent_grads <- college_recent_grads %>%
  mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem"))
# filter for STEM majors whose median earnings are below the median
# of all majors’ median earnings, which we found earlier to be $36,000
college_recent_grads %>%
  filter(major_type == "stem", median < 36000) %>%
  select(major, median, p25th, p75th) %>%
  arrange(desc(median))
## # A tibble: 10 × 4
## major median p25th p75th
## <chr> <dbl> <dbl> <dbl>
## 1 Environmental Science 35600 25000 40200
## 2 Multi-Disciplinary Or General Science 35000 24000 50000
## 3 Physiology 35000 20000 50000
## 4 Communication Technologies 35000 25000 45000
## 5 Neuroscience 35000 30000 44000
## 6 Atmospheric Sciences And Meteorology 35000 28000 50000
## 7 Miscellaneous Biology 33500 23000 48000
## 8 Biology 33400 24000 45000
## 9 Ecology 33000 23000 42000
## 10 Zoology 26000 20000 39000
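The threshold of 36000 is typed in by hand in the filter above. A possible variation, sketched here in base R on a hypothetical five-row data frame, computes the threshold from the data so that it cannot drift out of sync. The majors "A" to "E" and their values are invented, and `toy`, `threshold`, and `below_stem` are illustrative names.

```r
# Hypothetical data with the same column names as college_recent_grads.
toy <- data.frame(
  major          = c("A", "B", "C", "D", "E"),
  major_category = c("Engineering", "Arts", "Physical Sciences",
                     "Business", "Engineering"),
  median         = c(60000, 30000, 35000, 40000, 50000),
  stringsAsFactors = FALSE
)

stem_categories <- c("Biology & Life Science", "Computers & Mathematics",
                     "Engineering", "Physical Sciences")

# Label each major as STEM or not, as in the analysis above.
toy$major_type <- ifelse(toy$major_category %in% stem_categories,
                         "stem", "not stem")

# Compute the overall median instead of hard-coding it.
threshold <- median(toy$median)

# STEM majors whose median earnings fall below the overall median.
below_stem <- toy[toy$major_type == "stem" & toy$median < threshold, ]
print(below_stem)
```

In the dplyr pipeline, the equivalent change would be to store median(college_recent_grads$median) in a variable and use it in place of 36000.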
What types of majors do women tend to major in?
Create a scatterplot of median income vs. the proportion of women in each major, colored by whether the major is in a STEM field or not.
college_recent_grads %>%
  group_by(major) %>%
  summarise(median_income = median(median),
            proportion_women = mean(sharewomen),
            major_type = unique(major_type)) %>%
  ggplot(aes(x = median_income, y = proportion_women, color = major_type)) +
  geom_point() +
  labs(title = "Scatterplot of Median Income vs. Proportion of Women by Major",
       x = "Median Income",
       y = "Proportion of Women",
       color = "Major Type") +
  scale_color_manual(values = c("blue", "green"))
This scatterplot shows that majors with a smaller proportion of women tend to have higher median incomes and are predominantly STEM fields. In contrast, majors with higher proportions of women tend to have lower median incomes and are mostly non-STEM fields. This highlights a potential gender disparity in the distribution of earnings across different fields of study.
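The visual trend could also be summarized with a correlation coefficient. The sketch below is one possible way to do that and is not part of the original analysis. `women_income_cor` is a hypothetical helper name, and the toy data frame is invented purely to exercise it.

```r
# Correlation between the share of women and median income,
# dropping rows where either value is missing.
women_income_cor <- function(df) {
  cor(df$sharewomen, df$median, use = "complete.obs")
}

# Hypothetical toy data with the same column names as college_recent_grads.
toy <- data.frame(
  sharewomen = c(0.10, 0.35, 0.60, NA, 0.95),
  median     = c(90000, 60000, 45000, 40000, 30000)
)

r <- women_income_cor(toy)
print(r)
```

Calling women_income_cor(college_recent_grads) would apply the same calculation to the real data. A negative value would agree with the pattern described above, although a correlation describes association only and says nothing about cause.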
Reference:
https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html