With the rapid development of science and technology, STEM majors are becoming more popular than other majors. Given the growing demand for science-related jobs, I would like to study the relationship between median income and major category, grouping majors into STEM, business, and liberal arts.
I would also like to study, among recent graduates, the relationship between median income (Median) and the share of jobs that require a college degree (Jobindex), by major and by major category (STEM, business, and liberal arts).
Every year, the U.S. Census Bureau contacts over 3.5 million households across the country to participate in the American Community Survey (ACS). ACS is an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help determine how more than $675 billion in federal and state funds are distributed each year.
Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics. Public officials, planners, and entrepreneurs use this information to assess the past and plan the future. (https://www.census.gov/programs-surveys/acs/about.html)
All three main .csv datasets are from American Community Survey 2010-2012 Public Use Microdata Series. They contain basic earnings and labor force information.
Download data here: http://www.census.gov/programs-surveys/acs/data/pums.html
Documentation here: http://www.census.gov/programs-surveys/acs/technical-documentation/pums.html
The datasets I used were obtained from GitHub: https://github.com/fivethirtyeight/data/tree/master/college-majors.
# load data
library(tidyr)
library(dplyr)
library(tidyverse)
library(stringr)
library(ggplot2)
# all ages
temp1 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv", stringsAsFactors = FALSE)
# graduate students with ages>25
temp2 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv", stringsAsFactors = FALSE)
# recent graduates with ages<28
temp3 <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv", stringsAsFactors = FALSE)
# see the list of all 16 major categories
unique(temp1$Major_category)
## [1] "Agriculture & Natural Resources"
## [2] "Biology & Life Science"
## [3] "Engineering"
## [4] "Humanities & Liberal Arts"
## [5] "Communications & Journalism"
## [6] "Computers & Mathematics"
## [7] "Industrial Arts & Consumer Services"
## [8] "Education"
## [9] "Law & Public Policy"
## [10] "Interdisciplinary"
## [11] "Health"
## [12] "Social Science"
## [13] "Physical Sciences"
## [14] "Psychology & Social Work"
## [15] "Arts"
## [16] "Business"
# categorize STEM, business and liberal arts
STEM <- c("Agriculture & Natural Resources", "Biology & Life Science", "Health",
          "Computers & Mathematics", "Engineering", "Physical Sciences", "Social Science")
business <- c("Business")
liberalarts <- c("Communications & Journalism", "Education",
                 "Industrial Arts & Consumer Services", "Law & Public Policy",
                 "Psychology & Social Work", "Arts", "Humanities & Liberal Arts",
                 "Interdisciplinary")
# all_ages
all_ages <- temp1 %>%
  #drop_na() %>%
  mutate(Type = case_when(Major_category %in% liberalarts ~ "L",
                          Major_category %in% STEM ~ "S",
                          TRUE ~ "B")) %>%
  arrange(Major_category) %>%
  group_by(Major_category) %>%
  mutate(Mean_median = round(mean(Median), 0)) %>%
  select(Major_category, Type, Employed, Unemployed, Unemployment_rate,
         Median, P25th, P75th, Mean_median)
all_ages
# grad_students
grad_students <- temp2 %>%
  #drop_na() %>%
  mutate(Type = case_when(Major_category %in% liberalarts ~ "L",
                          Major_category %in% STEM ~ "S",
                          TRUE ~ "B")) %>%
  arrange(Major_category) %>%
  group_by(Major_category) %>%
  mutate(Mean_grad_median = round(mean(Grad_median), 0)) %>%
  select(Major_category, Type, Grad_employed, Grad_unemployed, Grad_unemployment_rate,
         Grad_median, Grad_P25, Grad_P75, Mean_grad_median)
grad_students
# recent_grads
recent_grads <- temp3 %>%
  #drop_na() %>%
  mutate(Type = case_when(Major_category %in% liberalarts ~ "L",
                          Major_category %in% STEM ~ "S",
                          TRUE ~ "B")) %>%
  arrange(Major_category) %>%
  group_by(Major_category) %>%
  mutate(Mean_median = round(mean(Median), 0)) %>%
  mutate(Jobindex = round(
    ifelse(Non_college_jobs == 0, 1,
           College_jobs / (Non_college_jobs + College_jobs)), 4)) %>%
  mutate(Mean_jobindex = round(mean(
    ifelse(Non_college_jobs == 0, 1,
           College_jobs / (Non_college_jobs + College_jobs))), 4)) %>%
  select(Rank, Major_category, Type, Employed, Unemployed, Unemployment_rate,
         Median, P25th, P75th, College_jobs, Non_college_jobs,
         Jobindex, Mean_jobindex, Mean_median)
recent_grads
What are the cases, and how many are there?
All_ages: There are 173 cases in total, with no age limit.
Grad_students: There are 173 cases in total, with ages over 25.
Recent_grads: There are 173 cases in total, with ages under 28. One row has some NA values, but they fall in columns I do not use, so I do not have to drop that row.
Each case represents one unique major (identified by its Major_code) offered by colleges in the United States.
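A quick way to confirm that missing values sit only in unused columns is to count NAs per column. The sketch below is illustrative only: the data frame toy and its values are hypothetical, not taken from the ACS files.

```r
# Toy data frame with one missing value (hypothetical, for illustration)
toy <- data.frame(
  Major  = c("A", "B", "C"),
  Total  = c(1200, NA, 800),
  Median = c(40000, 35000, 52000)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Columns that can be used without dropping any rows
names(na_counts)[na_counts == 0]
```

The same colSums(is.na(...)) call could be applied to temp3 to see which columns of the recent graduates data contain NA.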
I will be studying the median income (Median), the type of major (Type), and the share of jobs that require a college degree (Jobindex).
Median and Jobindex are numerical and Type is categorical.
My independent (explanatory) variables are Major_category (qualitative) and Jobindex (quantitative).
My response variable is the median income, Median. It is quantitative.
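To make the definition of Jobindex concrete, here is a minimal sketch that applies the same formula used for recent_grads to a toy data frame. The counts are hypothetical and chosen only to show the arithmetic, including the 0/0 case that is set to 1.

```r
# Toy counts (hypothetical, for illustration only; not taken from the ACS data)
toy <- data.frame(
  Major            = c("A", "B", "C"),
  College_jobs     = c(300, 50, 0),
  Non_college_jobs = c(100, 150, 0)
)

# Same rule as in the recent_grads pipeline: share of college jobs among
# all classified jobs, with the Non_college_jobs == 0 case set to 1
toy$Jobindex <- round(
  ifelse(toy$Non_college_jobs == 0, 1,
         toy$College_jobs / (toy$Non_college_jobs + toy$College_jobs)), 4)

print(toy)
```

Major A gets 300 / (100 + 300) = 0.75, major B gets 50 / (150 + 50) = 0.25, and major C, with no classified jobs at all, is assigned 1 by the ifelse rule.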
The type of study is observational. There is no experiment conducted and the data are collected from surveys.
The data are collected by the American Community Survey (ACS) from households all over the country.
Although there are only 173 cases, each case aggregates survey responses for one major, so I expect the analysis to generalize to the population.
However, the mix of students may differ, for example across age groups, states, or cities, which may lead to small differences when generalizing the results to the population.
This is an observational study, so it is not suitable for establishing causal links between the variables of interest.
Listed below are summary statistics and visualizations of the data.
## Major_category Type Employed Unemployed
## Length:173 Length:173 Min. : 1492 Min. : 0
## Class :character Class :character 1st Qu.: 17281 1st Qu.: 1101
## Mode :character Mode :character Median : 56564 Median : 3619
## Mean : 166162 Mean : 9725
## 3rd Qu.: 142879 3rd Qu.: 8862
## Max. :2354398 Max. :147261
## Unemployment_rate Median P25th P75th
## Min. :0.00000 Min. : 35000 Min. :24900 Min. : 45800
## 1st Qu.:0.04626 1st Qu.: 46000 1st Qu.:32000 1st Qu.: 70000
## Median :0.05472 Median : 53000 Median :36000 Median : 80000
## Mean :0.05736 Mean : 56816 Mean :38697 Mean : 82506
## 3rd Qu.:0.06904 3rd Qu.: 65000 3rd Qu.:42000 3rd Qu.: 95000
## Max. :0.15615 Max. :125000 Max. :78000 Max. :210000
## Mean_median
## Min. :43000
## 1st Qu.:46080
## Median :53222
## Mean :56816
## 3rd Qu.:62400
## Max. :77759
# Median Income from All Ages
ggplot(all_ages, aes(x = Median, fill = after_stat(count))) +
  geom_histogram(bins = 30, color = "black") +
  scale_fill_gradient(low = 'royalblue4', high = 'royalblue1') +
  ggtitle("Median Income from All Ages") +
  xlab("Median Salary") + ylab("Count")
# Mean of Median Income from All Ages by Major Category
ggplot(all_ages, aes(x = reorder(Major_category, Mean_median), y = Mean_median, fill = Mean_median)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.8) +
  coord_flip() +
  ggtitle("Mean of Median Income from All Ages by Major Category") +
  xlab("Major Category") + ylab("Median Salary") +
  scale_fill_gradient(low = 'royalblue4', high = 'royalblue1') +
  geom_text(aes(label = Mean_median), color = 'white', size = 3, vjust = 0.4, hjust = 1.1)
## Major_category Type Grad_employed Grad_unemployed
## Length:173 Length:173 Min. : 1008 Min. : 0
## Class :character Class :character 1st Qu.: 12659 1st Qu.: 453
## Mode :character Mode :character Median : 28930 Median : 1179
## Mean : 94037 Mean : 3506
## 3rd Qu.:109944 3rd Qu.: 3329
## Max. :915341 Max. :35718
## Grad_unemployment_rate Grad_median Grad_P25 Grad_P75
## Min. :0.00000 Min. : 47000 Min. :24500 Min. : 65000
## 1st Qu.:0.02607 1st Qu.: 65000 1st Qu.:45000 1st Qu.: 93000
## Median :0.03665 Median : 75000 Median :50000 Median :108000
## Mean :0.03934 Mean : 76756 Mean :52597 Mean :112087
## 3rd Qu.:0.04805 3rd Qu.: 90000 3rd Qu.:60000 3rd Qu.:130000
## Max. :0.13851 Max. :135000 Max. :85000 Max. :294000
## Mean_grad_median
## Min. :55000
## 1st Qu.:66571
## Median :80292
## Mean :76756
## 3rd Qu.:85545
## Max. :94328
# Median Income from Graduate Students
ggplot(grad_students, aes(x = Grad_median, fill = after_stat(count))) +
  geom_histogram(bins = 30, color = "black") +
  scale_fill_gradient(low = 'royalblue4', high = 'royalblue1') +
  ggtitle("Median Income from Grad Students") +
  xlab("Median Salary") + ylab("Count")
# Mean of Median Income from Grad Students by Major Category
ggplot(grad_students, aes(x = reorder(Major_category, Mean_grad_median),
                          y = Mean_grad_median, fill = Mean_grad_median)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.8) +
  coord_flip() +
  ggtitle("Mean of Median Income from Grad Students by Major Category") +
  xlab("Major Category") + ylab("Median Salary") +
  scale_fill_gradient(low = 'royalblue4', high = 'royalblue1') +
  geom_text(aes(label = Mean_grad_median), color = 'white', size = 3, vjust = 0.4, hjust = 1.1)
## Rank Major_category Type Employed
## Min. : 1 Length:173 Length:173 Min. : 0
## 1st Qu.: 44 Class :character Class :character 1st Qu.: 3608
## Median : 87 Mode :character Mode :character Median : 11797
## Mean : 87 Mean : 31193
## 3rd Qu.:130 3rd Qu.: 31433
## Max. :173 Max. :307933
## Unemployed Unemployment_rate Median P25th
## Min. : 0 Min. :0.00000 Min. : 22000 Min. :18500
## 1st Qu.: 304 1st Qu.:0.05031 1st Qu.: 33000 1st Qu.:24000
## Median : 893 Median :0.06796 Median : 36000 Median :27000
## Mean : 2416 Mean :0.06819 Mean : 40151 Mean :29501
## 3rd Qu.: 2393 3rd Qu.:0.08756 3rd Qu.: 45000 3rd Qu.:33000
## Max. :28169 Max. :0.17723 Max. :110000 Max. :95000
## P75th College_jobs Non_college_jobs Jobindex
## Min. : 22000 Min. : 0 Min. : 0 Min. :0.0708
## 1st Qu.: 42000 1st Qu.: 1675 1st Qu.: 1591 1st Qu.:0.3635
## Median : 47000 Median : 4390 Median : 4595 Median :0.4674
## Mean : 51494 Mean : 12323 Mean : 13284 Mean :0.5106
## 3rd Qu.: 60000 3rd Qu.: 14444 3rd Qu.: 11783 3rd Qu.:0.6963
## Max. :125000 Max. :151643 Max. :148395 Max. :1.0000
## Mean_jobindex Mean_median
## Min. :0.2973 Min. :30100
## 1st Qu.:0.3855 1st Qu.:33062
## Median :0.5161 Median :36900
## Mean :0.5106 Mean :40151
## 3rd Qu.:0.6799 3rd Qu.:42745
## Max. :0.7137 Max. :57383
# Median Income from Recent Graduates
ggplot(recent_grads, aes(x = Median, fill = after_stat(count))) +
  geom_histogram(bins = 30, color = "black") +
  scale_fill_gradient(low = 'royalblue4', high = 'royalblue1') +
  ggtitle("Median Income from Recent Grads") +
  xlab("Median Salary") + ylab("Count")
# Mean of Median Income from Recent Graduates by Major Category
ggplot(recent_grads, aes(x = reorder(Major_category, Mean_median), y = Mean_median, fill = Mean_median)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.8) +
  coord_flip() +
  ggtitle("Mean of Median Income from Recent Graduates by Major Category") +
  xlab("Major Category") + ylab("Median Salary") +
  scale_fill_gradient(low = 'royalblue4', high = 'royalblue1') +
  geom_text(aes(label = Mean_median), color = 'white', size = 3, vjust = 0.4, hjust = 1.1)
# Mean of Job Index from Recent Graduates by Major Category
ggplot(recent_grads, aes(x = reorder(Major_category, Mean_jobindex),
                         y = Mean_jobindex, fill = Mean_jobindex)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.8) +
  coord_flip() +
  ggtitle("Mean of Job Index from Recent Graduates by Major Category") +
  xlab("Major Category") + ylab("Job Index") +
  scale_fill_gradient(low = 'royalblue4', high = 'royalblue1') +
  geom_text(aes(label = Mean_jobindex), color = 'white', size = 3, vjust = 0.4, hjust = 1.1)
This study will investigate whether salary (median income) is independent of major category (STEM, business, and liberal arts) and of the share of jobs that require a college degree (Mean_jobindex) across major categories.
As all 173 observations are collected by surveys from all over the country, I assume that all of them are randomly collected and are independent of each other.
The sample size 173 is larger than 30, therefore it is considered sufficiently large for the following tests.
I will use a hypothesis test to investigate the relationship between median income and major categories.
H0: Median incomes for STEM, Business and Liberal Arts majors are the same.
H1: Median incomes for STEM, Business and Liberal Arts majors are different.
As shown in the exploratory section, the top three median incomes among graduate students belong to STEM major categories and the fourth to Business, while the top median income among recent graduates belongs to Engineering and is about $14k higher than the second, Business. Across the three bar charts, the lowest incomes mostly belong to Liberal Arts categories. It is therefore highly plausible that median income is related to major category. STEM majors tend to have higher salaries as recent graduates, and even higher ones as graduate students with more experience. Business majors tend to have high salaries as recent graduates, but their salary growth does not keep pace with their years of experience. Liberal Arts majors tend to have lower salaries than the other two groups.
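A comparison of mean incomes across three groups can be sketched with a one-way ANOVA. The example below uses base R's aov as a stand-in, which is an assumption on my part and not necessarily what the DATA606 inference function does internally. The incomes in toy are hypothetical and only illustrate the mechanics.

```r
# Hypothetical median incomes for three major types (not the ACS values)
toy <- data.frame(
  Type   = rep(c("B", "L", "S"), each = 5),
  Median = c(52000, 55000, 60000, 58000, 50000,   # Business
             40000, 42000, 45000, 38000, 41000,   # Liberal Arts
             65000, 70000, 62000, 75000, 68000)   # STEM
)

# One-way ANOVA: H0 says the three group means are equal
fit <- aov(Median ~ Type, data = toy)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

print(anova_table)
p_value < 0.05   # TRUE means H0 is rejected at the 5% level
```

A small p-value only says that at least one group mean differs. A follow-up such as TukeyHSD(fit) would be needed to see which pairs of groups differ.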
# for all_ages
boxplot(Median ~ Type, data = all_ages, col = "skyblue",
        main = "For All Ages",
        xlab = "Major Categories",
        ylab = "Median Income",
        names = c("Business", "Liberal Arts", "STEM"))
df <- all_ages %>%
  ungroup() %>%
  group_by(Type) %>%
  select(Type, Median) %>%
  data.frame()
df %>% group_by(Type) %>%
  summarize(min = min(Median),
            q1 = quantile(Median, 0.25),
            mean = mean(Median),
            q3 = quantile(Median, 0.75),
            max = max(Median))
#library(DATA606)
#inference(y= all_ages$Median, x=all_ages$Type, est="mean", type="ht", null=0,
#alternative="greater", method="theoretical")
Results from the inference function:
# for grad_students
boxplot(Grad_median ~ Type, data = grad_students, col = "skyblue",
        main = "For Graduate Students",
        xlab = "Major Categories",
        ylab = "Median Income",
        names = c("Business", "Liberal Arts", "STEM"))
df <- grad_students %>%
  ungroup() %>%
  group_by(Type) %>%
  select(Type, Grad_median) %>%
  data.frame()
df %>% group_by(Type) %>%
  summarize(min = min(Grad_median),
            q1 = quantile(Grad_median, 0.25),
            mean = mean(Grad_median),
            q3 = quantile(Grad_median, 0.75),
            max = max(Grad_median))
#library(DATA606)
#inference(y= grad_students$Grad_median, x=grad_students$Type, est="mean", type="ht", null=0,
#alternative="greater", method="theoretical")
Results from the inference function:
# for recent_grads
boxplot(Median ~ Type, data = recent_grads, col = "skyblue",
        main = "For Recent Graduates",
        xlab = "Major Categories",
        ylab = "Median Income",
        names = c("Business", "Liberal Arts", "STEM"))
df <- recent_grads %>%
  ungroup() %>%
  group_by(Type) %>%
  select(Type, Median) %>%
  data.frame()
df %>% group_by(Type) %>%
  summarize(min = min(Median),
            q1 = quantile(Median, 0.25),
            mean = mean(Median),
            q3 = quantile(Median, 0.75),
            max = max(Median))
#library(DATA606)
#inference(y= recent_grads$Median, x=recent_grads$Type, est="mean", type="ht", null=0,
#alternative="greater", method="theoretical")
Results from the inference function:
Wrap-Up
According to the boxplots and summary statistics above, STEM majors have the highest mean of median incomes and Liberal Arts majors have the lowest in all three datasets. With the inference function, the p-values are approximately zero, so we reject the null hypothesis. There is a clear relationship between median income and major categories.
Next, I will study independence for graduate status, that is, whether the type of job held (college or non-college) is independent of major among recent graduates. I use a chi-square test, which assesses whether two categorical variables are independent. Its p-value indicates whether the observed counts differ significantly from those expected under independence, which lets us decide whether to reject the null hypothesis and conclude our study.
recent_grads_test1 <- recent_grads %>%
  ungroup() %>%
  select(Type, College_jobs, Non_college_jobs) %>%
  arrange(College_jobs) %>%
  .[-1,] # skip the row with '0's to prevent error
recent_grads_test1
## Type College_jobs Non_college_jobs
## Length:172 Min. : 162 Min. : 50
## Class :character 1st Qu.: 1745 1st Qu.: 1594
## Mode :character Median : 4468 Median : 4604
## Mean : 12394 Mean : 13362
## 3rd Qu.: 14596 3rd Qu.: 11792
## Max. :151643 Max. :148395
##
## Pearson's Chi-squared test
##
## data: recent_grads_test1[, -1]
## X-squared = 729289, df = 171, p-value < 2.2e-16
Since the p-value is less than 0.05, we reject the null hypothesis that the choice of major does not affect the rate of job requirement of being a college graduate. We accept the alternative hypothesis that the choice of major does affect this rate: some majors lead more often to jobs with lower requirements (a college degree is not a must), and some lead more often to jobs with higher requirements (a college degree is a must).
(Jobindex is the rate of job requirement of being a college graduate for each major.)
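The mechanics of the test can be sketched on a small table of counts. The numbers below are hypothetical and only illustrate how chisq.test treats a majors-by-job-type table; the degrees of freedom follow (rows - 1) * (columns - 1), which is why the 172-row, 2-column table above gives df = 171.

```r
# Hypothetical job counts for three majors (illustrative only)
jobs <- matrix(c(300, 100,
                  50, 150,
                 120, 130),
               ncol = 2, byrow = TRUE,
               dimnames = list(Major = c("A", "B", "C"),
                               Job   = c("College_jobs", "Non_college_jobs")))

# Pearson's chi-squared test of independence between major and job type
res <- chisq.test(jobs)
res$parameter   # degrees of freedom: (3 - 1) * (2 - 1) = 2
res$p.value

# Expected counts under independence, for comparison with the observed table
round(res$expected, 1)
```

Because the raw counts are very large in the real data, even small differences between majors produce a large test statistic, so the size of X-squared by itself says little about how strongly majors differ.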
The job market has different requirements for candidates’ education levels, and some fields demand more education than others. Next, I will study the relationship between median income and Jobindex, since a higher education requirement may lead to higher salaries. In our data, the numbers of college jobs and non-college jobs are only available in the recent_grads table. I will use these data to fit a linear model of median income on the rate of job requirement for each major. We will also check whether the residuals of the model are approximately normally distributed with constant variance.
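recent_grads_test2 <- recent_grads %>%
  ungroup() %>%
  select(Type, Jobindex, Median, Mean_jobindex, Mean_median) %>%
  arrange(desc(Jobindex)) %>%
  .[-1,] # skip the row of 0-0-job data showing '1' as Jobindex
recent_grads_test2
# fit the linear model whose summary is shown below
m2 <- lm(Median ~ Jobindex, data = recent_grads_test2)
summary(m2)
##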
## Call:
## lm(formula = Median ~ Jobindex, data = recent_grads_test2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20037 -6555 -2255 5815 63309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29101 2256 12.902 < 2e-16 ***
## Jobindex 21763 4141 5.255 4.39e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10700 on 170 degrees of freedom
## Multiple R-squared: 0.1397, Adjusted R-squared: 0.1347
## F-statistic: 27.62 on 1 and 170 DF, p-value: 4.387e-07
ggplot(recent_grads_test2, aes(x = Jobindex, y = Median)) +
  ggtitle("Linear Model of Jobindex vs Median") +
  xlab("Job Index") + ylab("Median Salary") +
  geom_point(color = 'royalblue1') +
  geom_smooth(method = 'lm', formula = y ~ x)
plot(resid(m2) ~ jitter(recent_grads_test2$Jobindex), col = "royalblue2",
     xlab = 'Rate of Job Requirement of Being College Graduate',
     ylab = 'Residual of Median Income',
     main = 'Residual of Predicted Median Income vs Rate of Job Requirement')
abline(h = 0, col = "red", lty = 3)
Since the p-values are approximately equal to 0, the relationship between the job requirement of being a college graduate and median income is statistically significant. However, the relationship is weak, with \(R^2 = 0.1397\): only about 13.97% of the variability in median income is explained by the rate of job requirement of being a college graduate. Therefore, we cannot conclude that “jobs with higher education requirements provide higher salaries”.
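To see what the fitted line implies, the coefficients reported in the lm summary above can be plugged into the model equation, Median = 29101 + 21763 × Jobindex. The helper predict_median below is a hypothetical name introduced only for this illustration.

```r
# Coefficients copied from the lm() summary above
intercept <- 29101
slope     <- 21763

# Predicted median income for a given Jobindex (a value between 0 and 1)
predict_median <- function(jobindex) intercept + slope * jobindex

predict_median(0.5)            # 29101 + 21763 * 0.5 = 39982.5
predict_median(c(0.25, 0.75))  # 34541.75 and 45423.25
```

With a residual standard error of about 10700, the spread of individual majors around these predictions is comparable to the difference between the predictions at Jobindex 0.25 and 0.75, which is consistent with the low \(R^2\).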
We found that the choice of college major has a significant effect on median income. This effect is seen at all ages, as well as among recent graduates and graduate students. These findings show that most STEM majors have higher salaries than Business and Liberal Arts majors, which matches my initial expectation. Besides the rapid development of science and technology, higher salaries may be one of the reasons why STEM majors are becoming more popular among students. However, although there is a statistically significant relationship between the job requirement of being a college graduate and median income, it has little practical significance.
Trends in the number of students in each major category and their corresponding incomes can be tracked and measured by further surveys over time. By tracking our sample, we may study the effect of higher education on the rate of salary increase in each major category.
We may also look at the relationship between unemployment rate and major category using our current data, and between gender and median salary if we can obtain the raw data.