DATA 606 Data Final Project

Whats it Worth: The Economic Value of College Education

Setup the Clean Environment

All done behind the scene. :)

Data Preparation

Load Libraries

# Load Libraries
library("DT")
library("knitr")
library("dplyr")
library("tidyr")
library("stringr")
library("psych")
library("ggplot2")
library("MASS")
library("car")

Load All Student Data

# load data

# All Students
all_age_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv"
all.age <- read.csv(all_age_url, sep=",",  header=T, stringsAsFactors = FALSE)
all.ages <- all.age %>% 
    tbl_df() %>% 
    arrange(Major_category)

head(all.age, 2)

##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
##                    Major_category  Total Employed
## 1 Agriculture & Natural Resources 128148    90245
## 2 Agriculture & Natural Resources  95326    76865
##   Employed_full_time_year_round Unemployed Unemployment_rate Median P25th
## 1                         74078       2423        0.02614711  50000 34000
## 2                         64240       2266        0.02863606  54000 36000
##   P75th
## 1 80000
## 2 80000

Load Graduate Student Data

# Grad Students
grad_age_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv"

grad.age <- read.csv(grad_age_url, sep=",",  header=T, stringsAsFactors = FALSE)
grad.age <- grad.age %>% 
    tbl_df() %>% 
    arrange(Major_category)

head(grad.age, 2)

## # A tibble: 2 x 22
##   Major~ Major   Major_~ Grad_~ Grad~ Grad~ Grad~ Grad~ Grad_~ Grad~ Grad~
##    <int> <chr>   <chr>    <int> <int> <int> <int> <int>  <dbl> <dbl> <int>
## 1   1101 AGRICU~ Agricu~  17488   386 13104 11207   473 0.0348 67000 41600
## 2   1100 GENERA~ Agricu~  44306   764 28930 23024   874 0.0293 68000 45000
## # ... with 11 more variables: Grad_P75 <dbl>, Nongrad_total <int>,
## #   Nongrad_employed <int>, Nongrad_full_time_year_round <int>,
## #   Nongrad_unemployed <int>, Nongrad_unemployment_rate <dbl>,
## #   Nongrad_median <dbl>, Nongrad_P25 <int>, Nongrad_P75 <dbl>, Grad_share
## #   <dbl>, Grad_premium <dbl>

Load Under Graduate Student Data

# UnderGrad Students
rcnt_grad_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"
rctgrad.age <- read.csv(rcnt_grad_url, sep=",",  header=T, stringsAsFactors = FALSE)
rctgrad.age <- rctgrad.age %>% 
    tbl_df() %>% 
    arrange(Major_category)

head(rctgrad.age, 2)

## # A tibble: 2 x 21
##    Rank Majo~ Major Total   Men Women Majo~ Share~ Samp~ Empl~ Full~ Part~
##   <int> <int> <chr> <int> <int> <int> <chr>  <dbl> <int> <int> <int> <int>
## 1    22  1104 FOOD~    NA    NA    NA Agri~ NA        36  3149  2558  1121
## 2    64  1101 AGRI~ 14240  9658  4582 Agri~  0.322   273 12323 11119  2196
## # ... with 9 more variables: Full_time_year_round <int>, Unemployed <int>,
## #   Unemployment_rate <dbl>, Median <int>, P25th <int>, P75th <int>,
## #   College_jobs <int>, Non_college_jobs <int>, Low_wage_jobs <int>

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

College Majors The Economic Guide to picking a College Major

I come from India. There is a joke in my country which ges like this, “Indian parents give their kids FULL freedom to select a career of their choice, as long as, it is an engineer, doctor or a lawyer!”

I have always been curious on how does selection of a college major influence a person’s success? I want to examine which fields can guarantee financial success by performing hypothesis testing after analyzing the employability and median incomes.

Cases

What are the cases, and how many are there?

All_ages: This data represents a case of both undergrads and grad students from 173 majors offered by colleges in USA. Grad Students: This data is subset of above and each case represents majoes offered from list of 173 majors offered by colleges in USA for grad students over 25+ years of age. Under Grad Students: This data is subset of above and each case represents majoes offered from list of 173 majors offered by colleges in USA for undergrad students under 28 years of age.

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you are comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Various Summaries

summary(all.age)

##    Major_code      Major           Major_category         Total        
##  Min.   :1100   Length:173         Length:173         Min.   :   2396  
##  1st Qu.:2403   Class :character   Class :character   1st Qu.:  24280  
##  Median :3608   Mode  :character   Mode  :character   Median :  75791  
##  Mean   :3880                                         Mean   : 230257  
##  3rd Qu.:5503                                         3rd Qu.: 205763  
##  Max.   :6403                                         Max.   :3123510  
##     Employed       Employed_full_time_year_round   Unemployed    
##  Min.   :   1492   Min.   :   1093               Min.   :     0  
##  1st Qu.:  17281   1st Qu.:  12722               1st Qu.:  1101  
##  Median :  56564   Median :  39613               Median :  3619  
##  Mean   : 166162   Mean   : 126308               Mean   :  9725  
##  3rd Qu.: 142879   3rd Qu.: 111025               3rd Qu.:  8862  
##  Max.   :2354398   Max.   :1939384               Max.   :147261  
##  Unemployment_rate     Median           P25th           P75th       
##  Min.   :0.00000   Min.   : 35000   Min.   :24900   Min.   : 45800  
##  1st Qu.:0.04626   1st Qu.: 46000   1st Qu.:32000   1st Qu.: 70000  
##  Median :0.05472   Median : 53000   Median :36000   Median : 80000  
##  Mean   :0.05736   Mean   : 56816   Mean   :38697   Mean   : 82506  
##  3rd Qu.:0.06904   3rd Qu.: 65000   3rd Qu.:42000   3rd Qu.: 95000  
##  Max.   :0.15615   Max.   :125000   Max.   :78000   Max.   :210000

summary(grad.age)

##    Major_code      Major           Major_category       Grad_total     
##  Min.   :1100   Length:173         Length:173         Min.   :   1542  
##  1st Qu.:2403   Class :character   Class :character   1st Qu.:  15284  
##  Median :3608   Mode  :character   Mode  :character   Median :  37872  
##  Mean   :3880                                         Mean   : 127672  
##  3rd Qu.:5503                                         3rd Qu.: 148255  
##  Max.   :6403                                         Max.   :1184158  
##  Grad_sample_size Grad_employed    Grad_full_time_year_round
##  Min.   :   22    Min.   :  1008   Min.   :   770           
##  1st Qu.:  314    1st Qu.: 12659   1st Qu.:  9894           
##  Median :  688    Median : 28930   Median : 22523           
##  Mean   : 2251    Mean   : 94037   Mean   : 72861           
##  3rd Qu.: 2528    3rd Qu.:109944   3rd Qu.: 80794           
##  Max.   :21994    Max.   :915341   Max.   :703347           
##  Grad_unemployed Grad_unemployment_rate  Grad_median        Grad_P25    
##  Min.   :    0   Min.   :0.00000        Min.   : 47000   Min.   :24500  
##  1st Qu.:  453   1st Qu.:0.02607        1st Qu.: 65000   1st Qu.:45000  
##  Median : 1179   Median :0.03665        Median : 75000   Median :50000  
##  Mean   : 3506   Mean   :0.03934        Mean   : 76756   Mean   :52597  
##  3rd Qu.: 3329   3rd Qu.:0.04805        3rd Qu.: 90000   3rd Qu.:60000  
##  Max.   :35718   Max.   :0.13851        Max.   :135000   Max.   :85000  
##     Grad_P75      Nongrad_total     Nongrad_employed 
##  Min.   : 65000   Min.   :   2232   Min.   :   1328  
##  1st Qu.: 93000   1st Qu.:  20564   1st Qu.:  15914  
##  Median :108000   Median :  68993   Median :  50092  
##  Mean   :112087   Mean   : 214720   Mean   : 154554  
##  3rd Qu.:130000   3rd Qu.: 184971   3rd Qu.: 129179  
##  Max.   :294000   Max.   :2996892   Max.   :2253649  
##  Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate
##  Min.   :    980              Min.   :     0     Min.   :0.00000          
##  1st Qu.:  11755              1st Qu.:   880     1st Qu.:0.04198          
##  Median :  38384              Median :  3157     Median :0.05103          
##  Mean   : 120737              Mean   :  8486     Mean   :0.05395          
##  3rd Qu.: 103629              3rd Qu.:  7409     3rd Qu.:0.06439          
##  Max.   :1882507              Max.   :136978     Max.   :0.16091          
##  Nongrad_median    Nongrad_P25     Nongrad_P75       Grad_share     
##  Min.   : 37000   Min.   :25000   Min.   : 48000   Min.   :0.09632  
##  1st Qu.: 48700   1st Qu.:34000   1st Qu.: 72000   1st Qu.:0.26757  
##  Median : 55000   Median :38000   Median : 80000   Median :0.39875  
##  Mean   : 58584   Mean   :40078   Mean   : 84333   Mean   :0.40059  
##  3rd Qu.: 65000   3rd Qu.:44000   3rd Qu.: 97000   3rd Qu.:0.49912  
##  Max.   :126000   Max.   :80000   Max.   :215000   Max.   :0.93117  
##   Grad_premium    
##  Min.   :-0.0250  
##  1st Qu.: 0.2308  
##  Median : 0.3208  
##  Mean   : 0.3285  
##  3rd Qu.: 0.4000  
##  Max.   : 1.6471

summary(rctgrad.age)

##       Rank       Major_code      Major               Total       
##  Min.   :  1   Min.   :1100   Length:173         Min.   :   124  
##  1st Qu.: 44   1st Qu.:2403   Class :character   1st Qu.:  4550  
##  Median : 87   Median :3608   Mode  :character   Median : 15104  
##  Mean   : 87   Mean   :3880                      Mean   : 39370  
##  3rd Qu.:130   3rd Qu.:5503                      3rd Qu.: 38910  
##  Max.   :173   Max.   :6403                      Max.   :393735  
##                                                  NA's   :1       
##       Men             Women        Major_category       ShareWomen    
##  Min.   :   119   Min.   :     0   Length:173         Min.   :0.0000  
##  1st Qu.:  2178   1st Qu.:  1778   Class :character   1st Qu.:0.3360  
##  Median :  5434   Median :  8386   Mode  :character   Median :0.5340  
##  Mean   : 16723   Mean   : 22647                      Mean   :0.5222  
##  3rd Qu.: 14631   3rd Qu.: 22554                      3rd Qu.:0.7033  
##  Max.   :173809   Max.   :307087                      Max.   :0.9690  
##  NA's   :1        NA's   :1                           NA's   :1       
##   Sample_size        Employed        Full_time        Part_time     
##  Min.   :   2.0   Min.   :     0   Min.   :   111   Min.   :     0  
##  1st Qu.:  39.0   1st Qu.:  3608   1st Qu.:  3154   1st Qu.:  1030  
##  Median : 130.0   Median : 11797   Median : 10048   Median :  3299  
##  Mean   : 356.1   Mean   : 31193   Mean   : 26029   Mean   :  8832  
##  3rd Qu.: 338.0   3rd Qu.: 31433   3rd Qu.: 25147   3rd Qu.:  9948  
##  Max.   :4212.0   Max.   :307933   Max.   :251540   Max.   :115172  
##                                                                     
##  Full_time_year_round   Unemployed    Unemployment_rate     Median      
##  Min.   :   111       Min.   :    0   Min.   :0.00000   Min.   : 22000  
##  1st Qu.:  2453       1st Qu.:  304   1st Qu.:0.05031   1st Qu.: 33000  
##  Median :  7413       Median :  893   Median :0.06796   Median : 36000  
##  Mean   : 19694       Mean   : 2416   Mean   :0.06819   Mean   : 40151  
##  3rd Qu.: 16891       3rd Qu.: 2393   3rd Qu.:0.08756   3rd Qu.: 45000  
##  Max.   :199897       Max.   :28169   Max.   :0.17723   Max.   :110000  
##                                                                         
##      P25th           P75th         College_jobs    Non_college_jobs
##  Min.   :18500   Min.   : 22000   Min.   :     0   Min.   :     0  
##  1st Qu.:24000   1st Qu.: 42000   1st Qu.:  1675   1st Qu.:  1591  
##  Median :27000   Median : 47000   Median :  4390   Median :  4595  
##  Mean   :29501   Mean   : 51494   Mean   : 12323   Mean   : 13284  
##  3rd Qu.:33000   3rd Qu.: 60000   3rd Qu.: 14444   3rd Qu.: 11783  
##  Max.   :95000   Max.   :125000   Max.   :151643   Max.   :148395  
##                                                                    
##  Low_wage_jobs  
##  Min.   :    0  
##  1st Qu.:  340  
##  Median : 1231  
##  Mean   : 3859  
##  3rd Qu.: 3466  
##  Max.   :48207  
##

All student median income

hist(all.age$Median, main = "All Student Median Income", xlab = "Median Incomes (USD)", col = "blue")

Combined unemployement analysis

combine.unemployment <- cbind(all.age$Unemployment_rate, rctgrad.age$Unemployment_rate, grad.age$Grad_unemployment_rate)

barplot(combine.unemployment/nrow(combine.unemployment), names.arg = c("All", "Recent Grad", "Grad Student"), xlab = "Unemployment Rate", col = heat.colors(nrow(combine.unemployment)))

The above graphs already give us a flavor of the data that graduate students clearly have much higher median income compared to students who recently completed under grads.

Exploring the data

unempl <- cbind(all.age$Unemployment_rate, rctgrad.age$Unemployment_rate, grad.age$Grad_unemployment_rate)

boxplot(unempl,names = c("All", "Recent Grad", "Grad Student"), ylab = "Unemployment Rate")

summary(rctgrad.age$Median)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   33000   36000   40151   45000  110000

hist(rctgrad.age$Median, main = "Histogram for Median Income Recent Grads", xlab = "Median Income by Major Recent Grads (USD)", col = "dark blue")

summary(grad.age$Grad_median)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   47000   65000   75000   76756   90000  135000

hist(grad.age$Grad_median, main = "Histogram for Median Income Grad Students", xlab = "Median Income by Major Grad Student (USD)", col = "dark blue")

medsal <- cbind(all.age$Median, rctgrad.age$Median, grad.age$Grad_median)

boxplot(medsal, names = c("All", "Recent Grad", "Grad Student"), ylab = "Median Salary USD")

As seen above the for graduates there are greater amount of outliers. So graduate degree helps with much greater salaries.

Chi-Squared Tests for Independance for Employment Status

For all Ages

all_age_contin <- all.age %>% dplyr::select(Major, Employed, Unemployed) # For user-freindliness we'll pull major, number employed, number unemployed. 
head(all_age_contin)

##                                   Major Employed Unemployed
## 1                   GENERAL AGRICULTURE    90245       2423
## 2 AGRICULTURE PRODUCTION AND MANAGEMENT    76865       2266
## 3                AGRICULTURAL ECONOMICS    26321        821
## 4                       ANIMAL SCIENCES    81177       3619
## 5                          FOOD SCIENCE    17281        894
## 6            PLANT SCIENCE AND AGRONOMY    63043       2070

chisq.test(all_age_contin[,-1]) #We remove the major names for the chi-squared test

## 
##  Pearson's Chi-squared test
## 
## data:  all_age_contin[, -1]
## X-squared = 96644, df = 172, p-value < 2.2e-16

Since the p-value is less than 0.05, we can reject the null hypothesis that the choice of major does not affects employment status, and we accept the alternative hypothesis that choice of major does affect employment status in the all ages category.

For Graduate Students

grd_st_contin <- grad.age %>% dplyr::select(Major, Grad_employed, Grad_unemployed)# For user-freindliness we'll pull major, number employed, number unemployed. 

head(grd_st_contin)

## # A tibble: 6 x 3
##   Major                                 Grad_employed Grad_unemployed
##   <chr>                                         <int>           <int>
## 1 AGRICULTURE PRODUCTION AND MANAGEMENT         13104             473
## 2 GENERAL AGRICULTURE                           28930             874
## 3 FORESTRY                                      16831             725
## 4 NATURAL RESOURCES MANAGEMENT                  23394             711
## 5 PLANT SCIENCE AND AGRONOMY                    22782             735
## 6 AGRICULTURAL ECONOMICS                        10592             216

chisq.test(grd_st_contin[,-1]) #We remove the major names for the chi-squared test

## 
##  Pearson's Chi-squared test
## 
## data:  grd_st_contin[, -1]
## X-squared = 62013, df = 172, p-value < 2.2e-16

As the p<0.05, we reject the null hypothesis and accept the alternative hypothesis that major choice at the grad level affects employment status.

For Recent Graduate Students

rct_gr_contin <- rctgrad.age %>% dplyr::select(Major,Employed,Unemployed) %>% filter(Major != "MILITARY TECHNOLOGIES" ) # military technology had 0 in both employed and unemployed columns, was excluded.
rct_gr_contin

## # A tibble: 172 x 3
##    Major                                 Employed Unemployed
##    <chr>                                    <int>      <int>
##  1 FOOD SCIENCE                              3149        338
##  2 AGRICULTURE PRODUCTION AND MANAGEMENT    12323        649
##  3 GENERAL AGRICULTURE                       8884        178
##  4 AGRICULTURAL ECONOMICS                    2174        182
##  5 NATURAL RESOURCES MANAGEMENT             11797        842
##  6 FORESTRY                                  3007        322
##  7 SOIL SCIENCE                               613          0
##  8 PLANT SCIENCE AND AGRONOMY                6594        314
##  9 ANIMAL SCIENCES                          17112        917
## 10 MISCELLANEOUS AGRICULTURE                 1290         82
## # ... with 162 more rows

chisq.test(rct_gr_contin[,-1]) #We remove the major names for the chi-squared test

## 
##  Pearson's Chi-squared test
## 
## data:  rct_gr_contin[, -1]
## X-squared = 29941, df = 171, p-value < 2.2e-16

Here too we have to reject the null hypothesis

Student’s T and Kolmogorov-Smirnov Tests - Median Salary

As we stated in the Data section, exploration of the median salary data in the appendix shows that quantitative analysis majors, the STEM majors, appear to have more earning potential than qualitative analysis majors, such as Liberal Arts. Since median salary is a numerical measurement, it is appropriate to use a Student’s t-test7 or a Kolmogorov-Smirnov8 test to compare similarity between data sets. The Student’s t-test is a parametric test that compare’s against the t distribution. The Kolmogorov-Smirnov is a non-parametric test, in that it does not assume the survey data is drawn from a population with a given distribution, instead it measures likelihood of similarity by comparing the biggest difference in to data set’s continuous probability distribution. Since the salary data has a right-skew across all attainment levels, adding a non-parametric test will increase the robustness of this analysis.

To make these comparisons, we must bare in mind that we have (14 major categories x 3 attainment levels) 42 categories that have to be combined in groups of 2 for a total of $C(42,2) = \frac{42!}{(2!*42!)} = 861$ combinations. This is prohibitively long given the time constraints for this project. Therefore, We will analyze 4 major categories from the all ages set to bring us to $C(4,2) = \frac{4!}{(2!*2!)} = 6$ combinations. These major categories are, Engineering, Physical Sciences, Liberal Arts, and Psychology & Social Work.

Engineering vs Physical Sciences

The Null hypothesis is that there is no difference between median salaries of Engineering majors and Physical Science Majors. Initial two-sided tests, that only check that the distributions are different, and not that one is greater or less than the other, showed significance in all cases. We show below the results of single sided tests to definitely say that median salary of one degree category is greater than the other.

all_ages_eng =  filter(all.age, grepl("ENGI",all.age$Major))
all_ages_sci =  filter(all.age, grepl("PHYS",all.age$Major))

boxplot(all_ages_eng$Median, all_ages_sci$Median, names = c("Engineering", "Physical Sciences"), ylab = "Median Salary USD")

t.test(all_ages_eng$Median, all_ages_sci$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  all_ages_eng$Median and all_ages_sci$Median
## t = 2.6464, df = 11.32, p-value = 0.01113
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  5147.67     Inf
## sample estimates:
## mean of x mean of y 
##  78730.77  62800.00

ks.test(all_ages_eng$Median, all_ages_sci$Median, alternative = "less") #KS test has opposite sign convention than t test

## Warning in ks.test(all_ages_eng$Median, all_ages_sci$Median, alternative =
## "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  all_ages_eng$Median and all_ages_sci$Median
## D^- = 0.46154, p-value = 0.0738
## alternative hypothesis: the CDF of x lies below that of y

Liberal Arts vs. Psycology and Social Work.

The Null Hypothesis and Alternative are similar to above, albeit with Liberal Arts and Psycology and Social as major categories.

all_ages_la=filter(all.age, grepl("ART",all.age$Major))
all_ages_psy=filter(all.age, grepl("PSYC",all.age$Major))

boxplot(all_ages_la$Median, all_ages_psy$Median, names = c("Liberal Arts", "Psycology & Social Work"), ylab = "Median Salary USD")

t.test(all_ages_la$Median, all_ages_psy$Median, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  all_ages_la$Median and all_ages_psy$Median
## t = -0.46875, df = 14.611, p-value = 0.6462
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8660.675  5544.008
## sample estimates:
## mean of x mean of y 
##  45441.67  47000.00

ks.test(all_ages_la$Median, all_ages_psy$Median, alternative = "two.sided") #KS test has opposite sign convention than t test

## Warning in ks.test(all_ages_la$Median, all_ages_psy$Median, alternative =
## "two.sided"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  all_ages_la$Median and all_ages_psy$Median
## D = 0.25, p-value = 0.9251
## alternative hypothesis: two-sided

As is, there is no statistical significant difference between Liberal Arts and Psycology & Social Work in either test. However, the Industrial and Organizational Psycology major is an outlier. Below we removed the outlier major and performed the analysis again.

psy_no_outl <- all_ages_psy %>% dplyr::select(Median) %>% filter(Median != max(Median)) #this removed the high outlier.

boxplot(all_ages_la$Median, psy_no_outl$Median, names = c("Liberal Arts", "Psycology & Social Work"), ylab = "Median Salary USD")

t.test(all_ages_la$Median, psy_no_outl$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  all_ages_la$Median and psy_no_outl$Median
## t = 0.21743, df = 16.627, p-value = 0.4153
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -4098.13      Inf
## sample estimates:
## mean of x mean of y 
##  45441.67  44857.14

ks.test(all_ages_la$Median, psy_no_outl$Median, alternative = "less") #KS test has opposite sign convention than t test

## Warning in ks.test(all_ages_la$Median, psy_no_outl$Median, alternative =
## "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  all_ages_la$Median and psy_no_outl$Median
## D^- = 0.083333, p-value = 0.9404
## alternative hypothesis: the CDF of x lies below that of y

shapiro.test(all_ages_la$Median) #Signifcance in KS test disagreed with the T-test. A Normality test is performed as a tie breaker.

## 
##  Shapiro-Wilk normality test
## 
## data:  all_ages_la$Median
## W = 0.79833, p-value = 0.008968

shapiro.test(psy_no_outl$Median) #Note that the T-test assumes a t-distribution which becomes a Normal distribution at high 'n'

## 
##  Shapiro-Wilk normality test
## 
## data:  psy_no_outl$Median
## W = 0.91269, p-value = 0.4148

Upon removing the outlier from the Psycology and Social Work set, we get a significant difference of p = 0.01653, but a non-significant difference of p = 0.2231. However the Shapiro-Wilk test for normality fails to reject the null hypothesis of that test, that the data comes from a Normally distributed population. Since the Student’s T-test tends to the Normal distribution for high N, we can trust the result of the student t-test.

The median salary of Liberal Arts majors is greater than Psycology and Social Work majors at the 95% confidence level, once the outlying major in Psycology and Social Work is removed.

Engineering vs Liberal Arts

This time we repeat the same Null and Alternative hypotheses with Engineering and Liberal Arts.

boxplot(all_ages_eng$Median, all_ages_la$Median, names = c("Engineering", "Liberal Arts"), ylab = "Median Salary USD")

t.test(all_ages_eng$Median, all_ages_la$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  all_ages_eng$Median and all_ages_la$Median
## t = 9.522, df = 35.797, p-value = 1.197e-11
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  27385.87      Inf
## sample estimates:
## mean of x mean of y 
##  78730.77  45441.67

ks.test(all_ages_eng$Median, all_ages_la$Median, alternative = "less") #KS test has opposite sign convention than t test

## Warning in ks.test(all_ages_eng$Median, all_ages_la$Median, alternative =
## "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  all_ages_eng$Median and all_ages_la$Median
## D^- = 0.91667, p-value = 1.017e-06
## alternative hypothesis: the CDF of x lies below that of y

The median salary of Engineering majors is higher than that of Liberal Arts majors at the 95% confidence level.

Engineering vs Psycology & Social Work

boxplot(all_ages_eng$Median, all_ages_psy$Median, names = c("Engineering", "Psycology & Social Work"), ylab = "Median Salary USD")

t.test(all_ages_eng$Median, all_ages_psy$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  all_ages_eng$Median and all_ages_psy$Median
## t = 8.2049, df = 23.827, p-value = 1.062e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  25112.32      Inf
## sample estimates:
## mean of x mean of y 
##  78730.77  47000.00

ks.test(all_ages_eng$Median, all_ages_psy$Median, alternative = "less") #KS test has opposite sign convention than t test

## Warning in ks.test(all_ages_eng$Median, all_ages_psy$Median, alternative =
## "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  all_ages_eng$Median and all_ages_psy$Median
## D^- = 0.92308, p-value = 2.967e-05
## alternative hypothesis: the CDF of x lies below that of y

Liberal Arts vs Physical Sciences

boxplot(all_ages_la$Median, all_ages_sci$Median, names = c("Liberal Arts", "Physical Sciences"), ylab = "Median Salary USD")

t.test(all_ages_la$Median, all_ages_sci$Median, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  all_ages_la$Median and all_ages_sci$Median
## t = -3.0526, df = 9.093, p-value = 0.006787
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -6946.71
## sample estimates:
## mean of x mean of y 
##  45441.67  62800.00

ks.test(all_ages_la$Median, all_ages_sci$Median, alternative = "greater") #KS test has opposite sign convention than t test

## Warning in ks.test(all_ages_la$Median, all_ages_sci$Median, alternative =
## "greater"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  all_ages_la$Median and all_ages_sci$Median
## D^+ = 0.70833, p-value = 0.008094
## alternative hypothesis: the CDF of x lies above that of y

Summary of T-tests an KS tests

In terms of median pay the ranking is as follows: 1. Engineering 2. Physical Science 3. Liberal Arts 4. Psycology and Social Work Additionally, the Industrial and Organizational Psycology Major is similar in pay to Physical Sciences.

Linear Model - Unemployment rate vs median salary

Job market pressure can have an impact on both median salary and unemployment rate. If a field has low demand but high supply this can depress the salary and increase the unemployment rate. Conversely, a high demand/low supply field will see increased salaries and decreased unemployment rates. Another effect to consider is that people in over-subscribed field may spend a greater time looking for a job, which would also decrease median salary as they may be unemployed or underemployed during the job hunt. This effect could show in the data as a correlation between unemployment rate and salary.

To test if there is a connection between unemployment rate and median salary, we will take the “all_ages” data set and create linear regression models. If the residuals of the model do not show the necessary behavior of Normal Distribution and Constant Variance, we will perform a Box-Cox transformation on the data to get an exponential factor to improve the model.

fit1<-lm(all.age$Median ~ all.age$Unemployment_rate)
summary(fit1)

## 
## Call:
## lm(formula = all.age$Median ~ all.age$Unemployment_rate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23370  -8995  -3272   8079  64676 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  70097       3380  20.738  < 2e-16 ***
## all.age$Unemployment_rate  -231551      55906  -4.142 5.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14060 on 171 degrees of freedom
## Multiple R-squared:  0.09117,    Adjusted R-squared:  0.08586 
## F-statistic: 17.15 on 1 and 171 DF,  p-value: 5.406e-05

ggplot(all.age, aes(x = Unemployment_rate, y = Median)) +
  geom_point(color = 'blue')+
  geom_smooth(method = "lm", formula = y~x)

hist(resid(fit1))

plot(fitted(fit1), resid(fit1))

myt <- boxcox(fit1)

myt_df <- as.data.frame(myt)
optimal_lambda = myt_df[which.max(myt$y),1]
optimal_lambda

## [1] -1.070707

fit2 <- lm(all.age$Median^optimal_lambda ~ all.age$Unemployment_rate)
summary(fit2)

## 
## Call:
## lm(formula = all.age$Median^optimal_lambda ~ all.age$Unemployment_rate)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.646e-06 -1.475e-06  2.614e-07  1.295e-06  5.141e-06 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.755e-06  4.666e-07  14.476  < 2e-16 ***
## all.age$Unemployment_rate 3.272e-05  7.718e-06   4.239 3.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.941e-06 on 171 degrees of freedom
## Multiple R-squared:  0.09509,    Adjusted R-squared:  0.0898 
## F-statistic: 17.97 on 1 and 171 DF,  p-value: 3.664e-05

hist(resid(fit2))

plot(fitted(fit2), resid(fit2))

qqnorm(resid(fit2))
qqline(resid(fit2))

all_ages <- all.ages %>% mutate(transMedian  = Median^optimal_lambda)
head(all_ages)

## # A tibble: 6 x 12
##   Major~ Major   Major~  Total Empl~ Emplo~ Unem~ Unemp~ Medi~ P25th P75th
##    <int> <chr>   <chr>   <int> <int>  <int> <int>  <dbl> <int> <int> <dbl>
## 1   1100 GENERA~ Agric~ 128148 90245  74078  2423 0.0261 50000 34000 80000
## 2   1101 AGRICU~ Agric~  95326 76865  64240  2266 0.0286 54000 36000 80000
## 3   1102 AGRICU~ Agric~  33955 26321  22810   821 0.0302 63000 40000 98000
## 4   1103 ANIMAL~ Agric~ 103549 81177  64937  3619 0.0427 46000 30000 72000
## 5   1104 FOOD S~ Agric~  24280 17281  12722   894 0.0492 62000 38500 90000
## 6   1105 PLANT ~ Agric~  79409 63043  51077  2070 0.0318 50000 35000 75000
## # ... with 1 more variable: transMedian <dbl>

ggplot(all_ages, aes(x = Unemployment_rate, y = transMedian)) +
  geom_point(color = 'blueviolet')+
  geom_smooth(method = "lm", formula = y~x)

all_ages_no_outlr <- all_ages %>% filter(Unemployment_rate != max(Unemployment_rate) & Unemployment_rate != 0)
fit3 <- lm(all_ages_no_outlr$transMedian ~ all_ages_no_outlr$Unemployment_rate)
summary(fit3)

## 
## Call:
## lm(formula = all_ages_no_outlr$transMedian ~ all_ages_no_outlr$Unemployment_rate)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.616e-06 -1.475e-06  1.990e-07  1.290e-06  5.153e-06 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         6.611e-06  5.382e-07  12.285  < 2e-16
## all_ages_no_outlr$Unemployment_rate 3.539e-05  8.999e-06   3.933 0.000122
##                                        
## (Intercept)                         ***
## all_ages_no_outlr$Unemployment_rate ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.949e-06 on 168 degrees of freedom
## Multiple R-squared:  0.08432,    Adjusted R-squared:  0.07887 
## F-statistic: 15.47 on 1 and 168 DF,  p-value: 0.0001225

hist(resid(fit3))

plot(fitted(fit3), resid(fit3))

qqnorm(resid(fit3))
qqline(resid(fit3))

ggplot(all_ages_no_outlr, aes(x = Unemployment_rate, y = transMedian)) +
  geom_point(color = 'firebrick')+
  geom_smooth(method = "lm", formula = y~x)

mjr_umploy <- all_ages  %>% dplyr::select(Major,Unemployment_rate) %>% arrange(Unemployment_rate)
head(mjr_umploy, 10)

## # A tibble: 10 x 2
##    Major                                      Unemployment_rate
##    <chr>                                                  <dbl>
##  1 EDUCATIONAL ADMINISTRATION AND SUPERVISION            0     
##  2 GEOLOGICAL AND GEOPHYSICAL ENGINEERING                0     
##  3 PHARMACOLOGY                                          0.0161
##  4 MATERIALS SCIENCE                                     0.0223
##  5 MATHEMATICS AND COMPUTER SCIENCE                      0.0249
##  6 GENERAL AGRICULTURE                                   0.0261
##  7 TREATMENT THERAPY PROFESSIONS                         0.0263
##  8 NURSING                                               0.0268
##  9 AGRICULTURE PRODUCTION AND MANAGEMENT                 0.0286
## 10 AGRICULTURAL ECONOMICS                                0.0302

tail(mjr_umploy, 10)

## # A tibble: 10 x 2
##    Major                                    Unemployment_rate
##    <chr>                                                <dbl>
##  1 ARCHITECTURE                                        0.0860
##  2 ASTRONOMY AND ASTROPHYSICS                          0.0860
##  3 SOCIAL PSYCHOLOGY                                   0.0873
##  4 COMPUTER PROGRAMMING AND DATA PROCESSING            0.0903
##  5 VISUAL AND PERFORMING ARTS                          0.0947
##  6 LIBRARY SCIENCE                                     0.0948
##  7 SCHOOL STUDENT COUNSELING                           0.102 
##  8 MILITARY TECHNOLOGIES                               0.102 
##  9 CLINICAL PSYCHOLOGY                                 0.103 
## 10 MISCELLANEOUS FINE ARTS                             0.156

mjr_salary <- all_ages  %>% dplyr::select(Major,Median) %>% arrange(Median)
head(mjr_salary, 10)

## # A tibble: 10 x 2
##    Major                                     Median
##    <chr>                                      <int>
##  1 NEUROSCIENCE                               35000
##  2 EARLY CHILDHOOD EDUCATION                  35300
##  3 STUDIO ARTS                                37600
##  4 HUMAN SERVICES AND COMMUNITY ORGANIZATION  38000
##  5 COUNSELING PSYCHOLOGY                      39000
##  6 VISUAL AND PERFORMING ARTS                 40000
##  7 ELEMENTARY EDUCATION                       40000
##  8 TEACHER EDUCATION: MULTIPLE LEVELS         40000
##  9 LIBRARY SCIENCE                            40000
## 10 COMPOSITION AND RHETORIC                   40000

tail(mjr_salary, 10)

## # A tibble: 10 x 2
##    Major                                               Median
##    <chr>                                                <int>
##  1 GEOLOGICAL AND GEOPHYSICAL ENGINEERING               85000
##  2 CHEMICAL ENGINEERING                                 86000
##  3 ELECTRICAL ENGINEERING                               88000
##  4 MATHEMATICS AND COMPUTER SCIENCE                     92000
##  5 MINING AND MINERAL ENGINEERING                       92000
##  6 NUCLEAR ENGINEERING                                  95000
##  7 METALLURGICAL ENGINEERING                            96000
##  8 NAVAL ARCHITECTURE AND MARINE ENGINEERING            97000
##  9 PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION 106000
## 10 PETROLEUM ENGINEERING                               125000

Initially the data had marginal behavior regarding the residuals. The Box cox transformation did make the residuals Normal and Homoskedacstic. In that regard the transformed model is fit to make predictions. Both the initial slope of the linear regression model of -231551 and the Box Cox exponent of -1.07 shows that unemployment rate and median salary are inversely related. That is low unemployment rates tend to have higher median salaries and high unemployment rates tend to lower salaries. This relationship is statistically significant, with a p-value of 0.0001225, even after influencing outliers were removed. However, the effect is weak with an $R^2$ of 0.08432 after outleirs are removed. This means that only about 8.432% of the variability of median salary can be explained by unemployment rate. We suggest to students who are researching the prospects of college majors is to treat underemployment rates and salary statistics separately. Do not just go off of advise like, “You’ll make a mint in this field” or “They’re hiring a lot of people in that field”. It does no good if a student accrues $100,000 in debt to be virtually guaranteed a job where they can’t pay the debt off, or they could pay it off if they get a job in that field, but the chances of that are small.

Conclusions

We find that choice in college major has a significant effect on median salary and unemployment rate. This effect is seen at all age levels. Higher salaries and lower unemployment tend to favor STEM majors. Gender balance of majors also plays a significant effect on median salary. These findings that STEM and Gender affect median salary seem to be interrelated as the STEM majors tend to be male majority. There is a statistically, but not necessarily practically, significance between unemployment rate and median pay.