Introduction

Most friends around me have chosen either STEM or Business related majors instead of Liberal Arts. Also, Liberal Arts are always being considered as hard-to-get-a-job type of majors. The most popular dreams of kids are usually doctor, lawyer, or engineer. All these reasons raise my interest in investigating the relationship between major categories and unemployment rate. I therefore propose two research questions:

Is the average unemployment rate of STEM majors different than that of non-STEM majors?
Can unemployment rate be predicted by types of major type and median of income?

Data

Data collection

The data is from American Community Survey 2010-2012 Public Use Microdata Series.

The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help determine how more than $675 billion in federal and state funds are distributed each year.

Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics.

*Reference: https://www.census.gov/programs-surveys/acs/about.html

Data Source

The data is from the source below:

https://github.com/fivethirtyeight/data/tree/master/college-majors

Data Preparation

Load data

# load data
library(tidyverse)
library(stringr)
library(ggpubr)
library(DATA606)

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

raw_data <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv', stringsAsFactors = FALSE)
raw_data

Select distinct majors in column Major_Category, and create a character class storing all STEM majors in the raw dataset.

majors <- raw_data$Major_category %>%
  unique()

stem_majors <- c('Engineering','Physical Sciences','Computers & Mathematics',
                 'Agriculture & Natural Resources','Health','Social Science',
                 'Biology & Life Science')

create a new column named Is_STEM, if the major is STEM then STEM else Non-STEM. Select column as Major, Major_category, Is_STEM, Unemployment_rate as output cleaned data.

clean_data <- raw_data %>% 
  drop_na() %>%
  mutate(Is_STEM = 
           case_when(
             Major_category %in% stem_majors ~ 'STEM',
             str_detect(Major,'TECHNOLOG') ~ 'STEM',
             TRUE ~ 'Non-STEM')) %>%
  select(Major, 
         Major_category, 
         Is_STEM, 
         Median, 
         Unemployment_rate) %>%
  arrange(Major)
clean_data

Cases

There are 173 cases (observations) in total. After dropping the N/A value, there are 172 cases left in my cleaned dataset.

Each case represents the employment statistic of one major, and there are 172 majors in my cleaned dataset.

Variables

Dependent Variable

The response variable is Unemployment Rate. It’s quantitative

Independent Variable

The first independent variable is Is_STEM, which is qualitative;
The second independent variable is Median (median of income), which is quantitative.

Type of study

This is an observational study. There is no experiment done in this study but only surveys. The data of each respondent are being collected through the survey and tidied as a dataset.

Scope of Inference - generalizability

The population of interest in this study are recent college guaduates who are younger than 28 years old. The findings from this analysis can be generalized to the population as the data is collected by American Community Survey from target population all over the country. We assume the 173 observations are collected from different states and/or cities from the America and therefore this analysis can be generalized to the population.

Scope of Inference - causality

All data from this study is collected by survey, which means from observations, but not experients. Therefore these data cannot be used to establish causal links between the variables of interest.

Exploratory Data Analysis

The qualitative variable Is_STEM contains two values STEM and Non-STEM

ggplot(clean_data, aes(x = Is_STEM, fill = Is_STEM)) +
  geom_bar(color='black', fill = 'cyan3')+
  geom_text(stat='count', aes(label=..count..), vjust=0, hjust=1.5, color='white', face='bold')+
  coord_flip()+
  labs(title='Number of Cases by Type of Majors')+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position = "none")

## Warning: Ignoring unknown parameters: face

The summary of quantative variable median (median income) is as below:

summary(clean_data$Median)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   33000   36000   40077   45000  110000

p1 <- ggplot(clean_data,aes(x=factor(0),Median))+geom_boxplot(color='black', fill = 'cyan1')+
  coord_flip()+
  ggtitle('Boxplot: Median Income')+
   theme(axis.title.y=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    plot.title = element_text(hjust=0.5))

p2 <- ggplot(clean_data, aes(x=Median, fill=..count..))+
  geom_histogram(bins=30,color="black")+
  scale_fill_gradient(low = 'cyan4', high = 'cyan1')+
  ggtitle('Histogram: Median Income')+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position = "none")

ggarrange(p2,p1,nrow=2)

Other variables might be used in future analysis

data_stem_top_20 <- clean_data %>% 
  filter(Is_STEM == 'STEM') %>%
  top_n(20)

## Selecting by Unemployment_rate

ggplot(data_stem_top_20, aes(x = reorder(Major,Unemployment_rate), y = Unemployment_rate, fill = Unemployment_rate)) +
  geom_bar(stat = 'identity')+
  geom_text(aes(label=paste0(round(Unemployment_rate,4)*100,'%')),vjust=0.4, hjust=1, position = position_dodge(width = 1), color="white",size = 3)+
  coord_flip()+
  labs(title='STEM Major Unemployment Rate', subtitle='Top 20')+
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  xlab('STEM Major')+
  ylab('Unemployment Rate')+
  theme(legend.position = "none")+
  scale_fill_gradient(low = 'deeppink4', high = 'deeppink1')

data_stem_bottom_20 <- clean_data %>% 
  filter(Is_STEM == 'STEM') %>%
  top_n(-20)

## Selecting by Unemployment_rate

ggplot(data_stem_bottom_20, aes(x = reorder(Major,1-Unemployment_rate), y = 1-Unemployment_rate, fill = 1-Unemployment_rate)) +
  geom_bar(stat = 'identity')+
  geom_text(aes(label=paste0(round(1-Unemployment_rate,4)*100,'%')),vjust=0.4, hjust=1, position = position_dodge(width = 1), color="white",size = 3)+
  coord_flip()+
  labs(title='STEM Major \n Employment Rate', subtitle='Top 20')+
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  xlab('STEM Major')+
  ylab('Employment Rate')+
  theme(legend.position = "none")

data_non_stem_top_20 <- clean_data %>% 
  filter(Is_STEM == 'Non-STEM') %>%
  top_n(20)

## Selecting by Unemployment_rate

ggplot(data_non_stem_top_20, aes(x = reorder(Major,Unemployment_rate), y = Unemployment_rate, fill = Unemployment_rate)) +
  geom_bar(stat = 'identity')+
  geom_text(aes(label=paste0(round(Unemployment_rate,4)*100,'%')),vjust=0.4, hjust=1, position = position_dodge(width = 1), color="white",size = 3)+
  coord_flip()+
  labs(title='Non-STEM Major Unemployment Rate', subtitle='Top 20')+
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  xlab('Non-STEM Major')+
  ylab('Unemployment Rate')+
  theme(legend.position = "none")+
  scale_fill_gradient(low = 'deeppink4', high = 'deeppink1')

data_non_stem_bottom_20 <- clean_data %>% 
  filter(Is_STEM == 'Non-STEM') %>%
  top_n(-20)

## Selecting by Unemployment_rate

ggplot(data_non_stem_bottom_20, aes(x = reorder(Major,1-Unemployment_rate), y = 1-Unemployment_rate, fill = 1-Unemployment_rate)) +
  geom_bar(stat = 'identity')+
  geom_text(aes(label=paste0(round(1-Unemployment_rate,4)*100,'%')),vjust=0.4, hjust=1, position = position_dodge(width = 1), color="white",size = 3)+
  coord_flip()+
  labs(title='Non-STEM Employment Rate', subtitle='Top 20')+
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  xlab('Non-STEM Major')+
  ylab('Employment Rate')+
  theme(legend.position = "none")

Inference

Check Conditions

Randomness: The dataset is collected by American Community Survey for recent graduates with age<28 from our nation. The data is presumed to be randomly sampled from the candidates that meet the requirement.

Independence: I assume that each data point is independent of all others and all data points are then grouped by majors.

Chi-Square Tests

First, I will test the independence of unemployment rate with the 172 majors.

H0: All means of unemployment rates of the 172 majors are the same.

H1: One or more mean(s) of unemployment rates of the 172 majors are different.

unemp_rates <- raw_data %>% 
  select(Employed, Unemployed) %>% 
  arrange(Employed) %>% 
  .[-1,]

chisq.test(unemp_rates)

## 
##  Pearson's Chi-squared test
## 
## data:  unemp_rates
## X-squared = 29941, df = 171, p-value < 2.2e-16

The p-value we get from Chi-Squared Test is approximately zero. We therefore reject the null hypothesis and accept the alternative hypothesis that major choice does affects the employment status, hence the unemployment rate.

Hypothesis Test

I will use hypothesis test below to study the relationship between unemployment rate and major categories (STEM or Non-STEM majors).

H0: Mean of unemployment rates for STEM majors and Non-STEM majors are the same.

H1: Mean of unemployment rates for STEM majors and Non-STEM majors are different.

Theoretical:

inference(y = clean_data$Unemployment_rate, 
          x= clean_data$Is_STEM, 
          est = 'mean', 
          type = 'ht', 
          null = 0, 
          alternative = 'twosided', 
          method = 'theoretical',
          conflevel = 0.95)

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Non-STEM = 75, mean_Non-STEM = 0.072, sd_Non-STEM = 0.0278
## n_STEM = 97, mean_STEM = 0.065, sd_STEM = 0.032

## Observed difference between means (Non-STEM-STEM) = 0.007
## 
## H0: mu_Non-STEM - mu_STEM = 0 
## HA: mu_Non-STEM - mu_STEM != 0 
## Standard error = 0.005 
## Test statistic: Z =  1.531 
## p-value =  0.1256

by(clean_data$Unemployment_rate, clean_data$Is_STEM, mean)

## clean_data$Is_STEM: Non-STEM
## [1] 0.07196623
## -------------------------------------------------------- 
## clean_data$Is_STEM: STEM
## [1] 0.06497541

The box-plots shows the medians of unemployment rates of Non-STEM and STEM majors, also with their means of the distributions.

By the results of the hypothesis test, the difference on the unemployment rates of Non-STEM and STEM majors is about 0.7%. The mean unemployment rate of Non-STEM majors is 7.196623% and that of STEM majors is 6.497541%.

The p-value of this theoretical case is 0.1256 which is larger than the critical value 0.05. We do not have sufficient evidence to reject the null hypothesis. Therefore, we accept the null hypothesis, the mean of unemployment rates of Non-STEM and STEM majors are similar, and reject the alternative hypothesis.

One thing has to be aware of is, we accepted the null hypothesis here, but rejected the null hypothesis from the Chi-Squared test above. They show us that, although the average unemployment rates by majors are not all the same, they are similar after categorized into Non-STEM and STEM majors.

As a result, the answer to my first research questions is no. The average unemployment rate of STEM majors is very similar to that of non-STEM majors.

Simulation:

inference(y = clean_data$Unemployment_rate, 
          x= clean_data$Is_STEM, 
          est = 'mean', 
          type = 'ht', 
          null = 0, 
          alternative = 'twosided', 
          method = 'simulation',
          conflevel = 0.95)

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Non-STEM = 75, mean_Non-STEM = 0.072, sd_Non-STEM = 0.0278
## n_STEM = 97, mean_STEM = 0.065, sd_STEM = 0.032

## Observed difference between means (Non-STEM-STEM) = 0.007
## 
## H0: mu_Non-STEM - mu_STEM = 0 
## HA: mu_Non-STEM - mu_STEM != 0

## p-value =  0.1362

mean(c(0.142, 0.1284, 0.1342, 0.1352, 0.139, 0.1348, 0.1366, 0.138, 0.147, 0.1398))

## [1] 0.1375

The p-value generated by simulation would change every time running the code chunk. The p-value of simulation method is mostly around 0.13 to 0.14. I ran the codes 10 times and got 0.142, 0.1284, 0.1342, 0.1352, 0.139, 0.1348, 0.1366, 0.138, 0.147, 0.1398. The average of the 10 values is 0.1375.

The p-value of this simulation case is about 0.1375 which is larger than the critical value 0.05. We do not have sufficient evidence to reject the null hypothesis. Therefore, we accept the null hypothesis, the mean of unemployment rates of Non-STEM and STEM majors are similar.

The p-value results from theoretical method and simulation method are similar from this dataset, but the simulation average is a bit higher.

Confidence Interval

I will calculate the confidence interval of unemployment rate between Non-STEM and STEM majors below.

Theoretical:

inference(y=clean_data$Unemployment_rate, 
          x=clean_data$Is_STEM, 
          est="mean", 
          type = "ci", 
          conflevel=0.95, 
          null=0, 
          alternative = "twosided", 
          method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Non-STEM = 75, mean_Non-STEM = 0.072, sd_Non-STEM = 0.0278
## n_STEM = 97, mean_STEM = 0.065, sd_STEM = 0.032

## Observed difference between means (Non-STEM-STEM) = 0.007
## 
## Standard error = 0.0046 
## 95 % Confidence interval = ( -0.002 , 0.0159 )

The 95% confidence interval I got by using theoretical method for the difference between the unemployment rates of Non-STEM and STEM majors is (-0.002 , 0.0159).

Simulation:

inference(y=clean_data$Unemployment_rate, 
          x=clean_data$Is_STEM, 
          est="mean", 
          type = "ci", 
          conflevel=0.95, 
          null=0, 
          alternative = "twosided", 
          method = "simulation")

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Non-STEM = 75, mean_Non-STEM = 0.072, sd_Non-STEM = 0.0278
## n_STEM = 97, mean_STEM = 0.065, sd_STEM = 0.032

## Observed difference between means (Non-STEM-STEM) = 0.007

## 95 % Bootstrap interval = ( -0.002 , 0.0158 )

The five 95% confidence intervals I got by using simulation method for the difference between the unemployment rates of Non-STEM and STEM majors are ( -0.0018 , 0.016 ), ( -0.0019 , 0.0159 ), ( -0.0016 , 0.0158 ), ( -0.0018 , 0.0159 ), and ( -0.0021 , 0.016 ), which are very similar to our result from theoretical method.

Linear Model

Using recent_grads.csv:

By creating a linear model with Unemployment_rate, Is_STEM and Median income, I have all p-values > 0.05. Therefore, from this dataset, the answer to my second research question is no. Is_STEM and Median are not statistically significant predictors for Unemployment_rate. Unemployment rates cannot be predicted by Is_STEM and median of income from this recent_grads dataset.

model_data <- clean_data %>%
  mutate(Major_category = as.factor(Is_STEM)) 


model <- lm(Unemployment_rate ~Is_STEM + Median, model_data)

summary(model)

## 
## Call:
## lm(formula = Unemployment_rate ~ Is_STEM + Median, data = model_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.072181 -0.018676 -0.000175  0.019055  0.116896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.971e-02  8.426e-03   9.460   <2e-16 ***
## Is_STEMSTEM -4.985e-03  5.054e-03  -0.986    0.325    
## Median      -2.214e-07  2.193e-07  -1.010    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03023 on 169 degrees of freedom
## Multiple R-squared:  0.01905,    Adjusted R-squared:  0.007441 
## F-statistic: 1.641 on 2 and 169 DF,  p-value: 0.1969

As the linear model from recent_grads failed, I will use another dataset named all_ages to perform the same linear model test and investigate the relationship between Unemployment_rate, Is_STEM and Median. This dataset is also collected by ACS and is available from the same source.

Using all_ages.csv:

By creating a linear model with Unemployment_rate, Is_STEM and Median income, I have all p-values < 0.05. Therefore, from this dataset, Is_STEM and Median are statistically significant predictors for Unemployment_rate.

data2 <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv')

clean_data2 <- data2 %>% mutate(Is_STEM = 
                  case_when(
                      Major_category %in% stem_majors ~ 'STEM',
                      str_detect(Major,'TECHNOLOG') ~ 'STEM',
                      TRUE ~ 'Non-STEM')) 
  
  
model2 <- lm(Unemployment_rate ~ Is_STEM + Median, clean_data2)

summary(model2)

## 
## Call:
## lm(formula = Unemployment_rate ~ Is_STEM + Median, data = clean_data2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.061515 -0.010160  0.000775  0.009316  0.091269 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.652e-02  5.652e-03  13.540   <2e-16 ***
## Is_STEMSTEM -7.884e-03  3.216e-03  -2.451   0.0152 *  
## Median      -2.587e-07  1.087e-07  -2.380   0.0184 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01807 on 170 degrees of freedom
## Multiple R-squared:  0.1222, Adjusted R-squared:  0.1119 
## F-statistic: 11.83 on 2 and 170 DF,  p-value: 1.544e-05

To test if the p-values and parameter estimates above should be trusted, I will verify the conditions of this model below.

Linearity and Constant Variability

The residuals dispersed around y=0 and ranges mostly between (-0.05, 0.05). Therefore, both conditions are met.

Nearly Normal Residuals

The histogram shows a unimodal and nearly normal distribution. The normal probability plot of the residuals mostly lines on the normal line, only a few outliers line above and below the normal line at both ends. Therefore, this condition is met.

boxplot(model2$residuals ~ clean_data2$Is_STEM, col="cyan2")

plot(model2$residuals ~ jitter(clean_data2$Median), col="cyan2")
abline(h=0, lty=2)

hist(model2$residuals, breaks=30, col="cyan2")

qqnorm(model2$residuals, col="cyan2")
qqline(model2$residuals)

Conclusions

We find that choices in college majors has a significant effect on the unemployment rate of recent graduates. However, the significant effect is on individual majors, but not significant when categorizing in Non-STEM vs STEM majors. It may be because “Business” is categorized as Non-STEM major, and that may have pulled up the employment rate of other Non-STEM majors to a similar average of STEM majors.

Unemployment rates cannot be predicted by major type and median income for recent graduates (age<28), but it can be predicted for people of all ages. Some recent graduates may suffer a short period of time in looking for jobs, but some may already have offers from their internships or intervals. Students efforts put in job searching highly affected the result, which casuing difficulties to predict their unemployment rate. However, on the other side, unemployment rate can be predicted for people of all ages. The difficulties mentioned above can be generalized by people with a few years of experience, hence more stable employment status.

If the same sample can be tracked down through years, we may have some future researchs on the relationship between their education status (bachelor, master, PhD), median income, and unemployment rate.

Data 606 Data Project

Fan Xu

12/08/2019