All done behind the scene. :)
# Load Libraries
library("DT")
library("knitr")
library("dplyr")
library("tidyr")
library("stringr")
library("psych")
library("ggplot2")
library("MASS")
library("car")# load data
# All Students
all_age_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv"
all.age <- read.csv(all_age_url, sep=",", header=T, stringsAsFactors = FALSE)
all.ages <- all.age %>%
tbl_df() %>%
arrange(Major_category)
head(all.age, 2)## Major_code Major
## 1 1100 GENERAL AGRICULTURE
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## Major_category Total Employed
## 1 Agriculture & Natural Resources 128148 90245
## 2 Agriculture & Natural Resources 95326 76865
## Employed_full_time_year_round Unemployed Unemployment_rate Median P25th
## 1 74078 2423 0.02614711 50000 34000
## 2 64240 2266 0.02863606 54000 36000
## P75th
## 1 80000
## 2 80000
# Grad Students
grad_age_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv"
grad.age <- read.csv(grad_age_url, sep=",", header=T, stringsAsFactors = FALSE)
grad.age <- grad.age %>%
tbl_df() %>%
arrange(Major_category)
head(grad.age, 2)## # A tibble: 2 x 22
## Major~ Major Major_~ Grad_~ Grad~ Grad~ Grad~ Grad~ Grad_~ Grad~ Grad~
## <int> <chr> <chr> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
## 1 1101 AGRICU~ Agricu~ 17488 386 13104 11207 473 0.0348 67000 41600
## 2 1100 GENERA~ Agricu~ 44306 764 28930 23024 874 0.0293 68000 45000
## # ... with 11 more variables: Grad_P75 <dbl>, Nongrad_total <int>,
## # Nongrad_employed <int>, Nongrad_full_time_year_round <int>,
## # Nongrad_unemployed <int>, Nongrad_unemployment_rate <dbl>,
## # Nongrad_median <dbl>, Nongrad_P25 <int>, Nongrad_P75 <dbl>, Grad_share
## # <dbl>, Grad_premium <dbl>
# UnderGrad Students
rcnt_grad_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"
rctgrad.age <- read.csv(rcnt_grad_url, sep=",", header=T, stringsAsFactors = FALSE)
rctgrad.age <- rctgrad.age %>%
tbl_df() %>%
arrange(Major_category)
head(rctgrad.age, 2)## # A tibble: 2 x 21
## Rank Majo~ Major Total Men Women Majo~ Share~ Samp~ Empl~ Full~ Part~
## <int> <int> <chr> <int> <int> <int> <chr> <dbl> <int> <int> <int> <int>
## 1 22 1104 FOOD~ NA NA NA Agri~ NA 36 3149 2558 1121
## 2 64 1101 AGRI~ 14240 9658 4582 Agri~ 0.322 273 12323 11119 2196
## # ... with 9 more variables: Full_time_year_round <int>, Unemployed <int>,
## # Unemployment_rate <dbl>, Median <int>, P25th <int>, P75th <int>,
## # College_jobs <int>, Non_college_jobs <int>, Low_wage_jobs <int>
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
College Majors The Economic Guide to picking a College Major
I come from India. There is a joke in my country which ges like this, “Indian parents give their kids FULL freedom to select a career of their choice, as long as, it is an engineer, doctor or a lawyer!”
I have always been curious on how does selection of a college major influence a person’s success? I want to examine which fields can guarantee financial success by performing hypothesis testing after analyzing the employability and median incomes.
What are the cases, and how many are there?
All_ages: This data represents a case of both undergrads and grad students from 173 majors offered by colleges in USA. Grad Students: This data is subset of above and each case represents majoes offered from list of 173 majors offered by colleges in USA for grad students over 25+ years of age. Under Grad Students: This data is subset of above and each case represents majoes offered from list of 173 majors offered by colleges in USA for undergrad students under 28 years of age.
Provide summary statistics relevant to your research question. For example, if you are comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(all.age)## Major_code Major Major_category Total
## Min. :1100 Length:173 Length:173 Min. : 2396
## 1st Qu.:2403 Class :character Class :character 1st Qu.: 24280
## Median :3608 Mode :character Mode :character Median : 75791
## Mean :3880 Mean : 230257
## 3rd Qu.:5503 3rd Qu.: 205763
## Max. :6403 Max. :3123510
## Employed Employed_full_time_year_round Unemployed
## Min. : 1492 Min. : 1093 Min. : 0
## 1st Qu.: 17281 1st Qu.: 12722 1st Qu.: 1101
## Median : 56564 Median : 39613 Median : 3619
## Mean : 166162 Mean : 126308 Mean : 9725
## 3rd Qu.: 142879 3rd Qu.: 111025 3rd Qu.: 8862
## Max. :2354398 Max. :1939384 Max. :147261
## Unemployment_rate Median P25th P75th
## Min. :0.00000 Min. : 35000 Min. :24900 Min. : 45800
## 1st Qu.:0.04626 1st Qu.: 46000 1st Qu.:32000 1st Qu.: 70000
## Median :0.05472 Median : 53000 Median :36000 Median : 80000
## Mean :0.05736 Mean : 56816 Mean :38697 Mean : 82506
## 3rd Qu.:0.06904 3rd Qu.: 65000 3rd Qu.:42000 3rd Qu.: 95000
## Max. :0.15615 Max. :125000 Max. :78000 Max. :210000
summary(grad.age)## Major_code Major Major_category Grad_total
## Min. :1100 Length:173 Length:173 Min. : 1542
## 1st Qu.:2403 Class :character Class :character 1st Qu.: 15284
## Median :3608 Mode :character Mode :character Median : 37872
## Mean :3880 Mean : 127672
## 3rd Qu.:5503 3rd Qu.: 148255
## Max. :6403 Max. :1184158
## Grad_sample_size Grad_employed Grad_full_time_year_round
## Min. : 22 Min. : 1008 Min. : 770
## 1st Qu.: 314 1st Qu.: 12659 1st Qu.: 9894
## Median : 688 Median : 28930 Median : 22523
## Mean : 2251 Mean : 94037 Mean : 72861
## 3rd Qu.: 2528 3rd Qu.:109944 3rd Qu.: 80794
## Max. :21994 Max. :915341 Max. :703347
## Grad_unemployed Grad_unemployment_rate Grad_median Grad_P25
## Min. : 0 Min. :0.00000 Min. : 47000 Min. :24500
## 1st Qu.: 453 1st Qu.:0.02607 1st Qu.: 65000 1st Qu.:45000
## Median : 1179 Median :0.03665 Median : 75000 Median :50000
## Mean : 3506 Mean :0.03934 Mean : 76756 Mean :52597
## 3rd Qu.: 3329 3rd Qu.:0.04805 3rd Qu.: 90000 3rd Qu.:60000
## Max. :35718 Max. :0.13851 Max. :135000 Max. :85000
## Grad_P75 Nongrad_total Nongrad_employed
## Min. : 65000 Min. : 2232 Min. : 1328
## 1st Qu.: 93000 1st Qu.: 20564 1st Qu.: 15914
## Median :108000 Median : 68993 Median : 50092
## Mean :112087 Mean : 214720 Mean : 154554
## 3rd Qu.:130000 3rd Qu.: 184971 3rd Qu.: 129179
## Max. :294000 Max. :2996892 Max. :2253649
## Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate
## Min. : 980 Min. : 0 Min. :0.00000
## 1st Qu.: 11755 1st Qu.: 880 1st Qu.:0.04198
## Median : 38384 Median : 3157 Median :0.05103
## Mean : 120737 Mean : 8486 Mean :0.05395
## 3rd Qu.: 103629 3rd Qu.: 7409 3rd Qu.:0.06439
## Max. :1882507 Max. :136978 Max. :0.16091
## Nongrad_median Nongrad_P25 Nongrad_P75 Grad_share
## Min. : 37000 Min. :25000 Min. : 48000 Min. :0.09632
## 1st Qu.: 48700 1st Qu.:34000 1st Qu.: 72000 1st Qu.:0.26757
## Median : 55000 Median :38000 Median : 80000 Median :0.39875
## Mean : 58584 Mean :40078 Mean : 84333 Mean :0.40059
## 3rd Qu.: 65000 3rd Qu.:44000 3rd Qu.: 97000 3rd Qu.:0.49912
## Max. :126000 Max. :80000 Max. :215000 Max. :0.93117
## Grad_premium
## Min. :-0.0250
## 1st Qu.: 0.2308
## Median : 0.3208
## Mean : 0.3285
## 3rd Qu.: 0.4000
## Max. : 1.6471
summary(rctgrad.age)## Rank Major_code Major Total
## Min. : 1 Min. :1100 Length:173 Min. : 124
## 1st Qu.: 44 1st Qu.:2403 Class :character 1st Qu.: 4550
## Median : 87 Median :3608 Mode :character Median : 15104
## Mean : 87 Mean :3880 Mean : 39370
## 3rd Qu.:130 3rd Qu.:5503 3rd Qu.: 38910
## Max. :173 Max. :6403 Max. :393735
## NA's :1
## Men Women Major_category ShareWomen
## Min. : 119 Min. : 0 Length:173 Min. :0.0000
## 1st Qu.: 2178 1st Qu.: 1778 Class :character 1st Qu.:0.3360
## Median : 5434 Median : 8386 Mode :character Median :0.5340
## Mean : 16723 Mean : 22647 Mean :0.5222
## 3rd Qu.: 14631 3rd Qu.: 22554 3rd Qu.:0.7033
## Max. :173809 Max. :307087 Max. :0.9690
## NA's :1 NA's :1 NA's :1
## Sample_size Employed Full_time Part_time
## Min. : 2.0 Min. : 0 Min. : 111 Min. : 0
## 1st Qu.: 39.0 1st Qu.: 3608 1st Qu.: 3154 1st Qu.: 1030
## Median : 130.0 Median : 11797 Median : 10048 Median : 3299
## Mean : 356.1 Mean : 31193 Mean : 26029 Mean : 8832
## 3rd Qu.: 338.0 3rd Qu.: 31433 3rd Qu.: 25147 3rd Qu.: 9948
## Max. :4212.0 Max. :307933 Max. :251540 Max. :115172
##
## Full_time_year_round Unemployed Unemployment_rate Median
## Min. : 111 Min. : 0 Min. :0.00000 Min. : 22000
## 1st Qu.: 2453 1st Qu.: 304 1st Qu.:0.05031 1st Qu.: 33000
## Median : 7413 Median : 893 Median :0.06796 Median : 36000
## Mean : 19694 Mean : 2416 Mean :0.06819 Mean : 40151
## 3rd Qu.: 16891 3rd Qu.: 2393 3rd Qu.:0.08756 3rd Qu.: 45000
## Max. :199897 Max. :28169 Max. :0.17723 Max. :110000
##
## P25th P75th College_jobs Non_college_jobs
## Min. :18500 Min. : 22000 Min. : 0 Min. : 0
## 1st Qu.:24000 1st Qu.: 42000 1st Qu.: 1675 1st Qu.: 1591
## Median :27000 Median : 47000 Median : 4390 Median : 4595
## Mean :29501 Mean : 51494 Mean : 12323 Mean : 13284
## 3rd Qu.:33000 3rd Qu.: 60000 3rd Qu.: 14444 3rd Qu.: 11783
## Max. :95000 Max. :125000 Max. :151643 Max. :148395
##
## Low_wage_jobs
## Min. : 0
## 1st Qu.: 340
## Median : 1231
## Mean : 3859
## 3rd Qu.: 3466
## Max. :48207
##
hist(all.age$Median, main = "All Student Median Income", xlab = "Median Incomes (USD)", col = "blue")combine.unemployment <- cbind(all.age$Unemployment_rate, rctgrad.age$Unemployment_rate, grad.age$Grad_unemployment_rate)
barplot(combine.unemployment/nrow(combine.unemployment), names.arg = c("All", "Recent Grad", "Grad Student"), xlab = "Unemployment Rate", col = heat.colors(nrow(combine.unemployment)))The above graphs already give us a flavor of the data that graduate students clearly have much higher median income compared to students who recently completed under grads.
unempl <- cbind(all.age$Unemployment_rate, rctgrad.age$Unemployment_rate, grad.age$Grad_unemployment_rate)
boxplot(unempl,names = c("All", "Recent Grad", "Grad Student"), ylab = "Unemployment Rate")summary(rctgrad.age$Median)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 33000 36000 40151 45000 110000
hist(rctgrad.age$Median, main = "Histogram for Median Income Recent Grads", xlab = "Median Income by Major Recent Grads (USD)", col = "dark blue")summary(grad.age$Grad_median)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47000 65000 75000 76756 90000 135000
hist(grad.age$Grad_median, main = "Histogram for Median Income Grad Students", xlab = "Median Income by Major Grad Student (USD)", col = "dark blue")medsal <- cbind(all.age$Median, rctgrad.age$Median, grad.age$Grad_median)
boxplot(medsal, names = c("All", "Recent Grad", "Grad Student"), ylab = "Median Salary USD") As seen above the for graduates there are greater amount of outliers. So graduate degree helps with much greater salaries.
all_age_contin <- all.age %>% dplyr::select(Major, Employed, Unemployed) # For user-freindliness we'll pull major, number employed, number unemployed.
head(all_age_contin)## Major Employed Unemployed
## 1 GENERAL AGRICULTURE 90245 2423
## 2 AGRICULTURE PRODUCTION AND MANAGEMENT 76865 2266
## 3 AGRICULTURAL ECONOMICS 26321 821
## 4 ANIMAL SCIENCES 81177 3619
## 5 FOOD SCIENCE 17281 894
## 6 PLANT SCIENCE AND AGRONOMY 63043 2070
chisq.test(all_age_contin[,-1]) #We remove the major names for the chi-squared test##
## Pearson's Chi-squared test
##
## data: all_age_contin[, -1]
## X-squared = 96644, df = 172, p-value < 2.2e-16
Since the p-value is less than 0.05, we can reject the null hypothesis that the choice of major does not affects employment status, and we accept the alternative hypothesis that choice of major does affect employment status in the all ages category.
grd_st_contin <- grad.age %>% dplyr::select(Major, Grad_employed, Grad_unemployed)# For user-freindliness we'll pull major, number employed, number unemployed.
head(grd_st_contin)## # A tibble: 6 x 3
## Major Grad_employed Grad_unemployed
## <chr> <int> <int>
## 1 AGRICULTURE PRODUCTION AND MANAGEMENT 13104 473
## 2 GENERAL AGRICULTURE 28930 874
## 3 FORESTRY 16831 725
## 4 NATURAL RESOURCES MANAGEMENT 23394 711
## 5 PLANT SCIENCE AND AGRONOMY 22782 735
## 6 AGRICULTURAL ECONOMICS 10592 216
chisq.test(grd_st_contin[,-1]) #We remove the major names for the chi-squared test##
## Pearson's Chi-squared test
##
## data: grd_st_contin[, -1]
## X-squared = 62013, df = 172, p-value < 2.2e-16
As the p<0.05, we reject the null hypothesis and accept the alternative hypothesis that major choice at the grad level affects employment status.
rct_gr_contin <- rctgrad.age %>% dplyr::select(Major,Employed,Unemployed) %>% filter(Major != "MILITARY TECHNOLOGIES" ) # military technology had 0 in both employed and unemployed columns, was excluded.
rct_gr_contin## # A tibble: 172 x 3
## Major Employed Unemployed
## <chr> <int> <int>
## 1 FOOD SCIENCE 3149 338
## 2 AGRICULTURE PRODUCTION AND MANAGEMENT 12323 649
## 3 GENERAL AGRICULTURE 8884 178
## 4 AGRICULTURAL ECONOMICS 2174 182
## 5 NATURAL RESOURCES MANAGEMENT 11797 842
## 6 FORESTRY 3007 322
## 7 SOIL SCIENCE 613 0
## 8 PLANT SCIENCE AND AGRONOMY 6594 314
## 9 ANIMAL SCIENCES 17112 917
## 10 MISCELLANEOUS AGRICULTURE 1290 82
## # ... with 162 more rows
chisq.test(rct_gr_contin[,-1]) #We remove the major names for the chi-squared test##
## Pearson's Chi-squared test
##
## data: rct_gr_contin[, -1]
## X-squared = 29941, df = 171, p-value < 2.2e-16
Here too we have to reject the null hypothesis
As we stated in the Data section, exploration of the median salary data in the appendix shows that quantitative analysis majors, the STEM majors, appear to have more earning potential than qualitative analysis majors, such as Liberal Arts. Since median salary is a numerical measurement, it is appropriate to use a Student’s t-test7 or a Kolmogorov-Smirnov8 test to compare similarity between data sets. The Student’s t-test is a parametric test that compare’s against the t distribution. The Kolmogorov-Smirnov is a non-parametric test, in that it does not assume the survey data is drawn from a population with a given distribution, instead it measures likelihood of similarity by comparing the biggest difference in to data set’s continuous probability distribution. Since the salary data has a right-skew across all attainment levels, adding a non-parametric test will increase the robustness of this analysis.
To make these comparisons, we must bare in mind that we have (14 major categories x 3 attainment levels) 42 categories that have to be combined in groups of 2 for a total of \(C(42,2) = \frac{42!}{(2!*42!)} = 861\) combinations. This is prohibitively long given the time constraints for this project. Therefore, We will analyze 4 major categories from the all ages set to bring us to \(C(4,2) = \frac{4!}{(2!*2!)} = 6\) combinations. These major categories are, Engineering, Physical Sciences, Liberal Arts, and Psychology & Social Work.
The Null hypothesis is that there is no difference between median salaries of Engineering majors and Physical Science Majors. Initial two-sided tests, that only check that the distributions are different, and not that one is greater or less than the other, showed significance in all cases. We show below the results of single sided tests to definitely say that median salary of one degree category is greater than the other.
all_ages_eng = filter(all.age, grepl("ENGI",all.age$Major))
all_ages_sci = filter(all.age, grepl("PHYS",all.age$Major))
boxplot(all_ages_eng$Median, all_ages_sci$Median, names = c("Engineering", "Physical Sciences"), ylab = "Median Salary USD")t.test(all_ages_eng$Median, all_ages_sci$Median, alternative = "greater")##
## Welch Two Sample t-test
##
## data: all_ages_eng$Median and all_ages_sci$Median
## t = 2.6464, df = 11.32, p-value = 0.01113
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 5147.67 Inf
## sample estimates:
## mean of x mean of y
## 78730.77 62800.00
ks.test(all_ages_eng$Median, all_ages_sci$Median, alternative = "less") #KS test has opposite sign convention than t test## Warning in ks.test(all_ages_eng$Median, all_ages_sci$Median, alternative =
## "less"): cannot compute exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: all_ages_eng$Median and all_ages_sci$Median
## D^- = 0.46154, p-value = 0.0738
## alternative hypothesis: the CDF of x lies below that of y
This time we repeat the same Null and Alternative hypotheses with Engineering and Liberal Arts.
boxplot(all_ages_eng$Median, all_ages_la$Median, names = c("Engineering", "Liberal Arts"), ylab = "Median Salary USD")t.test(all_ages_eng$Median, all_ages_la$Median, alternative = "greater")##
## Welch Two Sample t-test
##
## data: all_ages_eng$Median and all_ages_la$Median
## t = 9.522, df = 35.797, p-value = 1.197e-11
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 27385.87 Inf
## sample estimates:
## mean of x mean of y
## 78730.77 45441.67
ks.test(all_ages_eng$Median, all_ages_la$Median, alternative = "less") #KS test has opposite sign convention than t test## Warning in ks.test(all_ages_eng$Median, all_ages_la$Median, alternative =
## "less"): cannot compute exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: all_ages_eng$Median and all_ages_la$Median
## D^- = 0.91667, p-value = 1.017e-06
## alternative hypothesis: the CDF of x lies below that of y
The median salary of Engineering majors is higher than that of Liberal Arts majors at the 95% confidence level.
boxplot(all_ages_la$Median, all_ages_sci$Median, names = c("Liberal Arts", "Physical Sciences"), ylab = "Median Salary USD")t.test(all_ages_la$Median, all_ages_sci$Median, alternative = "less")##
## Welch Two Sample t-test
##
## data: all_ages_la$Median and all_ages_sci$Median
## t = -3.0526, df = 9.093, p-value = 0.006787
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -6946.71
## sample estimates:
## mean of x mean of y
## 45441.67 62800.00
ks.test(all_ages_la$Median, all_ages_sci$Median, alternative = "greater") #KS test has opposite sign convention than t test## Warning in ks.test(all_ages_la$Median, all_ages_sci$Median, alternative =
## "greater"): cannot compute exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: all_ages_la$Median and all_ages_sci$Median
## D^+ = 0.70833, p-value = 0.008094
## alternative hypothesis: the CDF of x lies above that of y
In terms of median pay the ranking is as follows: 1. Engineering 2. Physical Science 3. Liberal Arts 4. Psycology and Social Work Additionally, the Industrial and Organizational Psycology Major is similar in pay to Physical Sciences.
Job market pressure can have an impact on both median salary and unemployment rate. If a field has low demand but high supply this can depress the salary and increase the unemployment rate. Conversely, a high demand/low supply field will see increased salaries and decreased unemployment rates. Another effect to consider is that people in over-subscribed field may spend a greater time looking for a job, which would also decrease median salary as they may be unemployed or underemployed during the job hunt. This effect could show in the data as a correlation between unemployment rate and salary.
To test if there is a connection between unemployment rate and median salary, we will take the “all_ages” data set and create linear regression models. If the residuals of the model do not show the necessary behavior of Normal Distribution and Constant Variance, we will perform a Box-Cox transformation on the data to get an exponential factor to improve the model.
fit1<-lm(all.age$Median ~ all.age$Unemployment_rate)
summary(fit1)##
## Call:
## lm(formula = all.age$Median ~ all.age$Unemployment_rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23370 -8995 -3272 8079 64676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70097 3380 20.738 < 2e-16 ***
## all.age$Unemployment_rate -231551 55906 -4.142 5.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14060 on 171 degrees of freedom
## Multiple R-squared: 0.09117, Adjusted R-squared: 0.08586
## F-statistic: 17.15 on 1 and 171 DF, p-value: 5.406e-05
ggplot(all.age, aes(x = Unemployment_rate, y = Median)) +
geom_point(color = 'blue')+
geom_smooth(method = "lm", formula = y~x)hist(resid(fit1))plot(fitted(fit1), resid(fit1))myt <- boxcox(fit1)myt_df <- as.data.frame(myt)
optimal_lambda = myt_df[which.max(myt$y),1]
optimal_lambda## [1] -1.070707
fit2 <- lm(all.age$Median^optimal_lambda ~ all.age$Unemployment_rate)
summary(fit2)##
## Call:
## lm(formula = all.age$Median^optimal_lambda ~ all.age$Unemployment_rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.646e-06 -1.475e-06 2.614e-07 1.295e-06 5.141e-06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.755e-06 4.666e-07 14.476 < 2e-16 ***
## all.age$Unemployment_rate 3.272e-05 7.718e-06 4.239 3.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.941e-06 on 171 degrees of freedom
## Multiple R-squared: 0.09509, Adjusted R-squared: 0.0898
## F-statistic: 17.97 on 1 and 171 DF, p-value: 3.664e-05
hist(resid(fit2))plot(fitted(fit2), resid(fit2))qqnorm(resid(fit2))
qqline(resid(fit2))all_ages <- all.ages %>% mutate(transMedian = Median^optimal_lambda)
head(all_ages)## # A tibble: 6 x 12
## Major~ Major Major~ Total Empl~ Emplo~ Unem~ Unemp~ Medi~ P25th P75th
## <int> <chr> <chr> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
## 1 1100 GENERA~ Agric~ 128148 90245 74078 2423 0.0261 50000 34000 80000
## 2 1101 AGRICU~ Agric~ 95326 76865 64240 2266 0.0286 54000 36000 80000
## 3 1102 AGRICU~ Agric~ 33955 26321 22810 821 0.0302 63000 40000 98000
## 4 1103 ANIMAL~ Agric~ 103549 81177 64937 3619 0.0427 46000 30000 72000
## 5 1104 FOOD S~ Agric~ 24280 17281 12722 894 0.0492 62000 38500 90000
## 6 1105 PLANT ~ Agric~ 79409 63043 51077 2070 0.0318 50000 35000 75000
## # ... with 1 more variable: transMedian <dbl>
ggplot(all_ages, aes(x = Unemployment_rate, y = transMedian)) +
geom_point(color = 'blueviolet')+
geom_smooth(method = "lm", formula = y~x)all_ages_no_outlr <- all_ages %>% filter(Unemployment_rate != max(Unemployment_rate) & Unemployment_rate != 0)
fit3 <- lm(all_ages_no_outlr$transMedian ~ all_ages_no_outlr$Unemployment_rate)
summary(fit3)##
## Call:
## lm(formula = all_ages_no_outlr$transMedian ~ all_ages_no_outlr$Unemployment_rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.616e-06 -1.475e-06 1.990e-07 1.290e-06 5.153e-06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.611e-06 5.382e-07 12.285 < 2e-16
## all_ages_no_outlr$Unemployment_rate 3.539e-05 8.999e-06 3.933 0.000122
##
## (Intercept) ***
## all_ages_no_outlr$Unemployment_rate ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.949e-06 on 168 degrees of freedom
## Multiple R-squared: 0.08432, Adjusted R-squared: 0.07887
## F-statistic: 15.47 on 1 and 168 DF, p-value: 0.0001225
hist(resid(fit3))plot(fitted(fit3), resid(fit3))qqnorm(resid(fit3))
qqline(resid(fit3))ggplot(all_ages_no_outlr, aes(x = Unemployment_rate, y = transMedian)) +
geom_point(color = 'firebrick')+
geom_smooth(method = "lm", formula = y~x)mjr_umploy <- all_ages %>% dplyr::select(Major,Unemployment_rate) %>% arrange(Unemployment_rate)
head(mjr_umploy, 10)## # A tibble: 10 x 2
## Major Unemployment_rate
## <chr> <dbl>
## 1 EDUCATIONAL ADMINISTRATION AND SUPERVISION 0
## 2 GEOLOGICAL AND GEOPHYSICAL ENGINEERING 0
## 3 PHARMACOLOGY 0.0161
## 4 MATERIALS SCIENCE 0.0223
## 5 MATHEMATICS AND COMPUTER SCIENCE 0.0249
## 6 GENERAL AGRICULTURE 0.0261
## 7 TREATMENT THERAPY PROFESSIONS 0.0263
## 8 NURSING 0.0268
## 9 AGRICULTURE PRODUCTION AND MANAGEMENT 0.0286
## 10 AGRICULTURAL ECONOMICS 0.0302
tail(mjr_umploy, 10)## # A tibble: 10 x 2
## Major Unemployment_rate
## <chr> <dbl>
## 1 ARCHITECTURE 0.0860
## 2 ASTRONOMY AND ASTROPHYSICS 0.0860
## 3 SOCIAL PSYCHOLOGY 0.0873
## 4 COMPUTER PROGRAMMING AND DATA PROCESSING 0.0903
## 5 VISUAL AND PERFORMING ARTS 0.0947
## 6 LIBRARY SCIENCE 0.0948
## 7 SCHOOL STUDENT COUNSELING 0.102
## 8 MILITARY TECHNOLOGIES 0.102
## 9 CLINICAL PSYCHOLOGY 0.103
## 10 MISCELLANEOUS FINE ARTS 0.156
mjr_salary <- all_ages %>% dplyr::select(Major,Median) %>% arrange(Median)
head(mjr_salary, 10)## # A tibble: 10 x 2
## Major Median
## <chr> <int>
## 1 NEUROSCIENCE 35000
## 2 EARLY CHILDHOOD EDUCATION 35300
## 3 STUDIO ARTS 37600
## 4 HUMAN SERVICES AND COMMUNITY ORGANIZATION 38000
## 5 COUNSELING PSYCHOLOGY 39000
## 6 VISUAL AND PERFORMING ARTS 40000
## 7 ELEMENTARY EDUCATION 40000
## 8 TEACHER EDUCATION: MULTIPLE LEVELS 40000
## 9 LIBRARY SCIENCE 40000
## 10 COMPOSITION AND RHETORIC 40000
tail(mjr_salary, 10)## # A tibble: 10 x 2
## Major Median
## <chr> <int>
## 1 GEOLOGICAL AND GEOPHYSICAL ENGINEERING 85000
## 2 CHEMICAL ENGINEERING 86000
## 3 ELECTRICAL ENGINEERING 88000
## 4 MATHEMATICS AND COMPUTER SCIENCE 92000
## 5 MINING AND MINERAL ENGINEERING 92000
## 6 NUCLEAR ENGINEERING 95000
## 7 METALLURGICAL ENGINEERING 96000
## 8 NAVAL ARCHITECTURE AND MARINE ENGINEERING 97000
## 9 PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION 106000
## 10 PETROLEUM ENGINEERING 125000
Initially the data had marginal behavior regarding the residuals. The Box cox transformation did make the residuals Normal and Homoskedacstic. In that regard the transformed model is fit to make predictions. Both the initial slope of the linear regression model of -231551 and the Box Cox exponent of -1.07 shows that unemployment rate and median salary are inversely related. That is low unemployment rates tend to have higher median salaries and high unemployment rates tend to lower salaries. This relationship is statistically significant, with a p-value of 0.0001225, even after influencing outliers were removed. However, the effect is weak with an \(R^2\) of 0.08432 after outleirs are removed. This means that only about 8.432% of the variability of median salary can be explained by unemployment rate. We suggest to students who are researching the prospects of college majors is to treat underemployment rates and salary statistics separately. Do not just go off of advise like, “You’ll make a mint in this field” or “They’re hiring a lot of people in that field”. It does no good if a student accrues $100,000 in debt to be virtually guaranteed a job where they can’t pay the debt off, or they could pay it off if they get a job in that field, but the chances of that are small.
We find that choice in college major has a significant effect on median salary and unemployment rate. This effect is seen at all age levels. Higher salaries and lower unemployment tend to favor STEM majors. Gender balance of majors also plays a significant effect on median salary. These findings that STEM and Gender affect median salary seem to be interrelated as the STEM majors tend to be male majority. There is a statistically, but not necessarily practically, significance between unemployment rate and median pay.