library(tidyverse)
library(knitr)
library(kableExtra)
library(stringr)
library(DT)
library(tools)
library(prettydoc)

Part 1 - Introduction

Which college majors provide the highest return on investment?

College is a pricey business. Yearly tuition fees at private four-year US colleges average $35,830, while out-of-state fees at public colleges average $26,290 per year (source: https://www.topuniversities.com/student-info/student-finance/how-much-does-it-cost-study-us). Once living and other expenses are factored in, going to college is a sizable investment, and one that deserves careful consideration before parting with your money.

Considering that it can take up to 30 years to pay off student debt, it is important to pick a major that leads to both employment and a salary large enough to repay that debt without struggling to make ends meet. This investigation aims to help prospective students choose a major wisely, based on unemployment rates and median salaries.

Part 2 - Data

Data Collection

The original data was collected by the U.S. Census Bureau as part of the 2010-2012 American Community Survey (ACS). FiveThirtyEight (https://fivethirtyeight.com) used this data to construct a new dataset that acted as the factual foundation for their “The Economic Guide To Picking A College Major” (https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major) story. This data project will utilize FiveThirtyEight’s version of the data.

The dataset can be found at the following location: https://github.com/fivethirtyeight/data/blob/master/college-majors/all-ages.csv

Cases

There are 173 cases in this study. Each case represents an undergraduate or graduate college major offered by a US college.

Variables

Major: Qualitative explanatory (independent) variable.

Unemployment Rate: Quantitative response variable.

Median Income: Quantitative response variable.

Type of study

This is an observational study examining college majors and their relation to employment and earnings.

Employment Data for All Graduates dataset.

grad_employment_csv <- 'https://raw.githubusercontent.com/stephen-haslett/data606/data606-final-project/final_data_project/all-ages.csv'
all_grads_employment <- read.csv(url(grad_employment_csv), header = TRUE)

# Convert Major titles to title case, and convert instances of the word 'and' to ampersands.
all_grads_employment$Major <- tolower(all_grads_employment$Major)
all_grads_employment$Major <- toTitleCase(all_grads_employment$Major)
all_grads_employment$Major <- as.character(gsub(" and "," & ", all_grads_employment$Major))

datatable(all_grads_employment)
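
The clean-up step above can be tried on a few sample strings without downloading the data. This is a minimal sketch: the helper name `clean_major` and the sample titles are assumptions made for illustration, not part of the original analysis.

```r
library(tools)

# Hypothetical helper that mirrors the clean-up applied to the Major column:
# lower-case the title, convert it to title case, then swap " and " for " & ".
clean_major <- function(major) {
  major <- tolower(major)
  major <- toTitleCase(major)
  gsub(" and ", " & ", major, fixed = TRUE)
}

# Illustrative titles written in the upper-case style of the raw file.
sample_majors <- c("COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "GENERAL AGRICULTURE")
cleaned_majors <- clean_major(sample_majors)
print(cleaned_majors)
```

Because `toTitleCase` leaves short words such as "and" in lower case, the final `gsub` call only needs to match the lower-case form.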

Part 3 - Exploratory data analysis

Unemployment rate summary for all majors.

summary(all_grads_employment$Unemployment_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04626 0.05472 0.05736 0.06904 0.15615

Highest Unemployment rates by major.

unemployment_by_major <- top_n(all_grads_employment, 20, Unemployment_rate)
unemployment_by_major_plot <- ggplot(unemployment_by_major, aes(x = reorder(Major, -Unemployment_rate), y = Unemployment_rate)) +
    geom_bar(stat = 'identity', fill = 'red') +
    labs(title = 'Highest Unemployment Rates by Major',
         x = 'Major',
         y = 'Unemployment Rate') +
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#FFFFFF'))

unemployment_by_major_plot

The above bar graph shows the 20 highest unemployment rates by major. The majority of majors with high unemployment rates are Liberal Arts related. However, it is surprising to see Computer Programming & Data Processing and Biomedical Engineering in the results. I believe we can attribute this to the fact that there are only 173 cases in the study. This assumption is further supported by the fact that the unemployment rates for the majors in the bar chart center around 0.08, with little fluctuation.

Lowest Unemployment rates by major.

lowest_unemployment_by_major <- top_n(all_grads_employment, -20, Unemployment_rate)
lowest_unemployment_by_major_plot <- ggplot(lowest_unemployment_by_major, aes(x = reorder(Major, Unemployment_rate), y = Unemployment_rate)) +
    geom_bar(stat = 'identity', fill = 'blue') +
    labs(title = 'Lowest Unemployment Rates by Major',
         x = 'Major',
         y = 'Unemployment Rate') +
    theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#FFFFFF'))

lowest_unemployment_by_major_plot

The above bar graph shows the 20 lowest unemployment rates by major. I am not surprised that most of the results are for STEM-related majors, with the exception of agriculture, education, and health-related majors. This makes sense, as the demand for these particular skill sets is fairly constant.

Median income summary

summary(all_grads_employment$Median)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35000   46000   53000   56816   65000  125000
median_income_plot <- ggplot(all_grads_employment, aes(x = Median)) +
    geom_histogram(binwidth = 10000, color = "#FFFFFF", fill = "#4CAF50") +
    labs(title = "Median Income for All Majors",
         x = "Median Income",
         y = 'Frequency') +
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#FFFFFF'))

median_income_plot

The distribution of median income is right-skewed. This can be attributed to the fact that fewer job opportunities exist as income increases. We can also observe from the above histogram that the most common median income across majors falls in the bin around $50,000.
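
The skew can also be checked numerically. The sketch below first compares the mean and median reported in the summary above (a mean above the median is a quick sign of right skew), then defines a small moment-based skewness function. The helper name `sample_skewness` and the `toy_incomes` vector are assumptions made for illustration; the toy values are not taken from the dataset.

```r
# Mean and median of the Median column, as reported in the summary above.
income_mean <- 56816
income_median <- 53000

# A mean that exceeds the median points to a right-skewed distribution.
mean_exceeds_median <- income_mean > income_median

# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative values only, chosen to have one large value on the right.
toy_incomes <- c(35000, 40000, 46000, 50000, 53000, 58000, 65000, 80000, 125000)
toy_skew <- sample_skewness(toy_incomes)

print(mean_exceeds_median)
print(toy_skew > 0)
```

With the real data, `sample_skewness(all_grads_employment$Median)` would give the corresponding value for the full set of majors.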

Highest Median income by major.

highest_income_by_major <- top_n(all_grads_employment, 20, Median)
highest_income_by_major_plot <- ggplot(highest_income_by_major, aes(x = reorder(Major, -Median), y = Median)) +
    geom_bar(stat = 'identity', fill = '#01BEFE') +
    labs(title = 'Highest Median Incomes by Major',
         x = 'Major',
         y = 'Median Income') +
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#FFFFFF'))

highest_income_by_major_plot

The above bar graph shows the 20 highest median salaries by major. It is no surprise that the top 20 median salaries are associated with Science and Engineering related majors. Based on this information, it would seem that STEM-related majors provide the highest return on investment.

Lowest Median income by major.

lowest_income_by_major <- top_n(all_grads_employment, -20, Median)
lowest_income_by_major_plot <- ggplot(lowest_income_by_major, aes(x = reorder(Major, Median), y = Median)) +
    geom_bar(stat = 'identity', fill = '#800080') +
    labs(title = 'Lowest Median Incomes by Major',
         x = 'Major',
         y = 'Median Income') +
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#FFFFFF'))

lowest_income_by_major_plot

The above bar chart shows the 20 lowest median salaries by major. Again, it is no surprise that the majority of low salaries are tied to Liberal Arts related majors. However, I was surprised that Neuroscience reported the lowest median income. It is possible that this is due to the small number of cases in the study (173), or to the region that the responses were gathered from (e.g. salaries in NYC are higher than those in Milwaukee).
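
The charts above look at unemployment and income separately. One rough way to view them together is to discount each major's median income by its unemployment rate. This is only a sketch under stated assumptions: the `Expected_income` column and its formula are a crude proxy introduced here, not a measure used in the original analysis, and the three rows are made-up values rather than rows from the dataset.

```r
# Illustrative rows only; the column names match those in all-ages.csv.
toy_majors <- data.frame(
  Major = c("Major A", "Major B", "Major C"),
  Unemployment_rate = c(0.04, 0.08, 0.06),
  Median = c(80000, 45000, 60000),
  stringsAsFactors = FALSE
)

# Assumed proxy: median income weighted by the share of graduates employed.
toy_majors$Expected_income <- toy_majors$Median * (1 - toy_majors$Unemployment_rate)

# Rank majors from the highest to the lowest proxy value.
ranked_majors <- toy_majors[order(toy_majors$Expected_income, decreasing = TRUE), ]
print(ranked_majors)
```

Applying the same two lines to `all_grads_employment` would rank the real majors, although the proxy ignores tuition costs and so is not a full return-on-investment figure.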

Part 4 - Inference

Hypothesis

H0: Incomes do not differ between majors.

HA: Incomes do differ between majors.

Satisfying conditions for inference:

  1. Independence of cases:

    Each observation is independent, as it is unlikely that the income for one major has an effect on that of another.

  2. The sample size is greater than 30:

    There are 173 cases in this study, so this condition has been met.

  3. The samples are random:

    The data for this study was collected by the U.S. Census Bureau as part of the 2010-2012 American Community Survey, which suggests that the dataset represents a random sampling of the US population. We can assume that this condition has been met.

  4. The data follows a normal distribution:

    The distribution is close to normal through its center, with points departing from the reference line at both ends of the Q-Q plot. We treat this condition as approximately met.

qqnorm(all_grads_employment$Median)
qqline(all_grads_employment$Median)
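
A Q-Q plot is a visual check, so a formal test can be a useful complement. The sketch below runs the Shapiro-Wilk test (`shapiro.test` from base R's stats package) on simulated data. The simulated vector, its parameters, and the seed are assumptions made so the example runs without the dataset; they are not results from this study.

```r
# Simulated, right-skewed incomes standing in for the Median column.
# 173 values are drawn to match the number of cases in the study.
set.seed(606)
simulated_income <- rlnorm(173, meanlog = log(53000), sdlog = 0.3)

# Shapiro-Wilk test: the null hypothesis is that the data are normal,
# so a small p-value is evidence against normality.
normality_check <- shapiro.test(simulated_income)
print(normality_check$statistic)
print(normality_check$p.value)

# The same test on the log scale, where skewed income data is often
# closer to normal.
log_normality_check <- shapiro.test(log(simulated_income))
print(log_normality_check$p.value)
```

With the real data, `shapiro.test(all_grads_employment$Median)` would apply the same check to the column plotted above.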

Having checked the conditions for inference, and given that median incomes range from $35,000 to $125,000 across the 173 majors, we reject the null hypothesis (incomes do not differ between majors) in favor of the alternative hypothesis (incomes do differ between majors). In other words, the major you decide to study appears to have a substantial influence on your potential earnings after graduation, so pick wisely.

Part 5 - Conclusion

From our initial research question: “Which college majors provide the highest return on investment?” we can conclude as follows:

The statistical analysis and hypothesis testing in this study suggest that the major a prospective student chooses is a significant determinant of both their chances of employment after graduation and their potential post-graduation earnings. When it comes to yearly income, STEM-related majors appear to lead to higher salaries, while Liberal Arts majors lead to lower salaries.

The same is true for the chances of employment after graduation. With the exception of agriculture, education, and health-related majors, the majors with low post-graduation unemployment rates are mostly STEM majors. The fact that agriculture, education, and health majors appear in these results is most likely because these careers are, and always will be, in demand. People will always get sick, and will always need to eat.

So, to directly answer our research question, “Which college majors provide the highest return on investment?”: STEM majors provide the highest return on investment.