DATA606 Final Project

The Economic value of college accademic majors in the US.

Data Preparation

1. All Students (all ages)

allStudents <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/all-ages.csv', sep=",",  header=T, stringsAsFactors = FALSE) %>% tbl_df() %>% arrange(Major_category)

datatable(allStudents, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

2. Graduate Students attending graduate schools

gradStudents <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/grad-students.csv', sep=",",  header=T, stringsAsFactors = FALSE) %>% tbl_df() %>% arrange(Major_category)

datatable(gradStudents, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

3. Recent Graduates

recentGraduates <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/recent-grads.csv', sep=",",  header=T, stringsAsFactors = FALSE) %>% tbl_df() %>%  arrange(Major_category)

datatable(recentGraduates, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

What college majors offer the best employment opportunities and salaries in the US?

Cases

What are the cases, and how many are there?

dim(allStudents)

## [1] 173  11

dim(recentGraduates)

## [1] 173  21

dim(gradStudents)

## [1] 173  22

There are 173 cases and 11 variables (columns) in the `all-ages` data set, it contains basic earnings and labor force information for all ages. The `grad-students` (ages 25+) data set contains 173 cases and 22 variables, while the `recent-grads` (ages <28) data set contains 173 cases and 21 variables. It contains a more detailed breakdown, including by sex and by the type of job they got

Data collection

Describe the method of data collection.

This data set is a survey data collected by the American Community Survey 2010-2012 Public Use Microdata Series (PUMS).

I read a 2014 article: The Economic Guide To Picking A College Major by fivethirtyeight about how college majors affect employment and unemployment rates in the US, and decided to have a closer look on this using a readily available data set that has been collected to help explain the situation in an exploratory and analytical way.

Type of study

What type of study is this (observational/experiment)?

This is an observational study

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

This was pulled from the ACS website by Fivethirtyeigth and stored in their Github Repo after performing some categorising on the cases

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variables is College Major. It is a categorical variable

Explanatory

What is the explanatory variable(s), and what type is it (numerical/categorival)?

The explanatory variables include:

The counts of employed and unemployed degree holders and
The statistics of their income.

These variables are numerical

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

allStudents

str(allStudents)

## Classes 'tbl_df', 'tbl' and 'data.frame':    173 obs. of  11 variables:
##  $ Major_code                   : int  1100 1101 1102 1103 1104 1105 1106 1199 1302 1303 ...
##  $ Major                        : chr  "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_category               : chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
##  $ Total                        : int  128148 95326 33955 103549 24280 79409 6586 8549 69447 83188 ...
##  $ Employed                     : int  90245 76865 26321 81177 17281 63043 4926 6392 48228 65937 ...
##  $ Employed_full_time_year_round: int  74078 64240 22810 64937 12722 51077 4042 5074 39613 50595 ...
##  $ Unemployed                   : int  2423 2266 821 3619 894 2070 264 261 2144 3789 ...
##  $ Unemployment_rate            : num  0.0261 0.0286 0.0302 0.0427 0.0492 ...
##  $ Median                       : int  50000 54000 63000 46000 62000 50000 63000 52000 58000 52000 ...
##  $ P25th                        : int  34000 36000 40000 30000 38500 35000 39400 35000 40500 37100 ...
##  $ P75th                        : num  80000 80000 98000 72000 90000 75000 88000 75000 80000 75000 ...

summary(allStudents)

##    Major_code      Major           Major_category         Total        
##  Min.   :1100   Length:173         Length:173         Min.   :   2396  
##  1st Qu.:2403   Class :character   Class :character   1st Qu.:  24280  
##  Median :3608   Mode  :character   Mode  :character   Median :  75791  
##  Mean   :3880                                         Mean   : 230257  
##  3rd Qu.:5503                                         3rd Qu.: 205763  
##  Max.   :6403                                         Max.   :3123510  
##     Employed       Employed_full_time_year_round   Unemployed    
##  Min.   :   1492   Min.   :   1093               Min.   :     0  
##  1st Qu.:  17281   1st Qu.:  12722               1st Qu.:  1101  
##  Median :  56564   Median :  39613               Median :  3619  
##  Mean   : 166162   Mean   : 126308               Mean   :  9725  
##  3rd Qu.: 142879   3rd Qu.: 111025               3rd Qu.:  8862  
##  Max.   :2354398   Max.   :1939384               Max.   :147261  
##  Unemployment_rate     Median           P25th           P75th       
##  Min.   :0.00000   Min.   : 35000   Min.   :24900   Min.   : 45800  
##  1st Qu.:0.04626   1st Qu.: 46000   1st Qu.:32000   1st Qu.: 70000  
##  Median :0.05472   Median : 53000   Median :36000   Median : 80000  
##  Mean   :0.05736   Mean   : 56816   Mean   :38697   Mean   : 82506  
##  3rd Qu.:0.06904   3rd Qu.: 65000   3rd Qu.:42000   3rd Qu.: 95000  
##  Max.   :0.15615   Max.   :125000   Max.   :78000   Max.   :210000

gradStudents

str(gradStudents)

## Classes 'tbl_df', 'tbl' and 'data.frame':    173 obs. of  22 variables:
##  $ Major_code                  : int  1101 1100 1302 1303 1105 1102 1106 1103 1199 1104 ...
##  $ Major                       : chr  "AGRICULTURE PRODUCTION AND MANAGEMENT" "GENERAL AGRICULTURE" "FORESTRY" "NATURAL RESOURCES MANAGEMENT" ...
##  $ Major_category              : chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
##  $ Grad_total                  : int  17488 44306 24713 29357 30983 14800 3335 56807 5032 14521 ...
##  $ Grad_sample_size            : int  386 764 487 659 624 305 61 1335 98 266 ...
##  $ Grad_employed               : int  13104 28930 16831 23394 22782 10592 2284 47755 2758 10857 ...
##  $ Grad_full_time_year_round   : int  11207 23024 14102 19087 18312 8768 1641 39047 2276 8074 ...
##  $ Grad_unemployed             : int  473 874 725 711 735 216 34 596 261 370 ...
##  $ Grad_unemployment_rate      : num  0.0348 0.0293 0.0413 0.0295 0.0313 ...
##  $ Grad_median                 : num  67000 68000 78000 70000 67000 80000 65000 70300 54000 72000 ...
##  $ Grad_P25                    : int  41600 45000 52000 50000 45000 53000 50000 48000 45000 50000 ...
##  $ Grad_P75                    : num  100000 104000 110000 100000 100000 120000 91000 104000 81000 110000 ...
##  $ Nongrad_total               : int  89169 123984 67649 77101 76190 33049 6242 94910 8092 22853 ...
##  $ Nongrad_employed            : int  71781 86631 46815 60690 60241 25557 4654 74896 5978 16298 ...
##  $ Nongrad_full_time_year_round: int  61335 72409 39048 48256 49506 22496 3917 61629 4707 12431 ...
##  $ Nongrad_unemployed          : int  1869 2352 1885 3413 1899 734 264 3101 239 681 ...
##  $ Nongrad_unemployment_rate   : num  0.0254 0.0264 0.0387 0.0532 0.0306 ...
##  $ Nongrad_median              : num  55000 50000 59000 53000 50000 63000 65000 48000 55000 63000 ...
##  $ Nongrad_P25                 : int  38000 34000 42000 38000 35000 40000 41000 32000 39000 40000 ...
##  $ Nongrad_P75                 : num  80000 80000 80000 75000 75000 99000 89000 75000 78000 92000 ...
##  $ Grad_share                  : num  0.164 0.263 0.268 0.276 0.289 ...
##  $ Grad_premium                : num  0.218 0.36 0.322 0.321 0.34 ...

summary(gradStudents)

##    Major_code      Major           Major_category       Grad_total     
##  Min.   :1100   Length:173         Length:173         Min.   :   1542  
##  1st Qu.:2403   Class :character   Class :character   1st Qu.:  15284  
##  Median :3608   Mode  :character   Mode  :character   Median :  37872  
##  Mean   :3880                                         Mean   : 127672  
##  3rd Qu.:5503                                         3rd Qu.: 148255  
##  Max.   :6403                                         Max.   :1184158  
##  Grad_sample_size Grad_employed    Grad_full_time_year_round
##  Min.   :   22    Min.   :  1008   Min.   :   770           
##  1st Qu.:  314    1st Qu.: 12659   1st Qu.:  9894           
##  Median :  688    Median : 28930   Median : 22523           
##  Mean   : 2251    Mean   : 94037   Mean   : 72861           
##  3rd Qu.: 2528    3rd Qu.:109944   3rd Qu.: 80794           
##  Max.   :21994    Max.   :915341   Max.   :703347           
##  Grad_unemployed Grad_unemployment_rate  Grad_median        Grad_P25    
##  Min.   :    0   Min.   :0.00000        Min.   : 47000   Min.   :24500  
##  1st Qu.:  453   1st Qu.:0.02607        1st Qu.: 65000   1st Qu.:45000  
##  Median : 1179   Median :0.03665        Median : 75000   Median :50000  
##  Mean   : 3506   Mean   :0.03934        Mean   : 76756   Mean   :52597  
##  3rd Qu.: 3329   3rd Qu.:0.04805        3rd Qu.: 90000   3rd Qu.:60000  
##  Max.   :35718   Max.   :0.13851        Max.   :135000   Max.   :85000  
##     Grad_P75      Nongrad_total     Nongrad_employed 
##  Min.   : 65000   Min.   :   2232   Min.   :   1328  
##  1st Qu.: 93000   1st Qu.:  20564   1st Qu.:  15914  
##  Median :108000   Median :  68993   Median :  50092  
##  Mean   :112087   Mean   : 214720   Mean   : 154554  
##  3rd Qu.:130000   3rd Qu.: 184971   3rd Qu.: 129179  
##  Max.   :294000   Max.   :2996892   Max.   :2253649  
##  Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate
##  Min.   :    980              Min.   :     0     Min.   :0.00000          
##  1st Qu.:  11755              1st Qu.:   880     1st Qu.:0.04198          
##  Median :  38384              Median :  3157     Median :0.05103          
##  Mean   : 120737              Mean   :  8486     Mean   :0.05395          
##  3rd Qu.: 103629              3rd Qu.:  7409     3rd Qu.:0.06439          
##  Max.   :1882507              Max.   :136978     Max.   :0.16091          
##  Nongrad_median    Nongrad_P25     Nongrad_P75       Grad_share     
##  Min.   : 37000   Min.   :25000   Min.   : 48000   Min.   :0.09632  
##  1st Qu.: 48700   1st Qu.:34000   1st Qu.: 72000   1st Qu.:0.26757  
##  Median : 55000   Median :38000   Median : 80000   Median :0.39875  
##  Mean   : 58584   Mean   :40078   Mean   : 84333   Mean   :0.40059  
##  3rd Qu.: 65000   3rd Qu.:44000   3rd Qu.: 97000   3rd Qu.:0.49912  
##  Max.   :126000   Max.   :80000   Max.   :215000   Max.   :0.93117  
##   Grad_premium    
##  Min.   :-0.0250  
##  1st Qu.: 0.2308  
##  Median : 0.3208  
##  Mean   : 0.3285  
##  3rd Qu.: 0.4000  
##  Max.   : 1.6471

Accademic Majors Categories distribution

Let’s see how the Accademic Majors Categories are distributed by subsetting the Major_category variable from the `all-ages` data set:

categories <- as.data.frame(table(allStudents$Major_category))
names(categories) <- c('category', 'frequency')

ggplot(categories, aes(x= reorder(category, frequency), y=frequency)) + 
  geom_bar(stat = "identity", fill = "steelblue") +
 xlab("Accademic Major") + ylab("Frequency") +
  ggtitle("Accademic majors Categories vs Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x=element_blank(),
        plot.title = element_text(color="black", size=14,hjust = 0.5)) + coord_flip() + theme(legend.position="none")

So, it appears engineering students tend to be in a better position to land jobs easily as the number of jobs demanding their skills seem to be high

Salary distribution for all ages

ggplot(allStudents, aes(x=Median)) + 
  ggtitle("All ages' Median Salary") +
  geom_histogram(fill = "steelblue", color='white', binwidth = 10000) +
  theme(axis.title.x=element_blank(), 
        plot.title = element_text(color="black", size=14,hjust = 0.5))

The salary distribution is right skewed which is andicator that highly paying jobs are few and or the chances to land them is very slim.

Distribution of all unemployed students

all <- allStudents$Unemployment_rate
recent <- recentGraduates$Unemployment_rate
graduates <- gradStudents$Grad_unemployment_rate

allUnemployedRate <- cbind(all, recent, graduates)

barplot(allUnemployedRate/nrow(allUnemployedRate), names.arg = c('All Students', 'Recent Graduates', 'Graduate Students'), xlab = "Unemployment Rate", col = heat.colors(nrow(allUnemployedRate)))

Graduate students have better chances of landing jobs than recent/fresh graduates. It is obvious that organisations tend to go more for experienced workforce than otherwise.

Chi-Tests

Next, performing Chi-Squared Tests for the three data sets but first, I will have to extract the Accademic major, number employed, and number number unemployed from the data sets

All ages

majorsAllStatus <- allStudents %>% dplyr::select(Major, Employed, Unemployed)

datatable(majorsAllStatus, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

chisq.test(majorsAllStatus[,-1])

## 
##  Pearson's Chi-squared test
## 
## data:  majorsAllStatus[, -1]
## X-squared = 96644, df = 172, p-value < 2.2e-16

From the above, the p-value of `2.2e-16` is less than 0.05 which implies that for all ages:

\(H_0\): the choice of accademic major does not affect employment status does not stand
\(H_A\): the choice of accademic major affects employment status stands

For Graduates

gradMajorsStatus <- gradStudents %>% dplyr::select(Major, Grad_employed, Grad_unemployed)

datatable(gradMajorsStatus, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

chisq.test(gradMajorsStatus[,-1])

## 
##  Pearson's Chi-squared test
## 
## data:  gradMajorsStatus[, -1]
## X-squared = 62013, df = 172, p-value < 2.2e-16

From the above, the p-value of `2.2e-16` is less than 0.05 which implies that at the graduate level:

\(H_0\): the choice of accademic major does not affect employment status does not stand
\(H_A\): the choice of accademic major affects employment status stands

Recent Graduates

Here, I will filter out the `Military technology` major as it has 0 for both employed and unemployed variables

recentGradMajorsStatus <- recentGraduates %>% dplyr::select(Major, Employed, Unemployed) %>% filter(Major != "MILITARY TECHNOLOGIES" )

datatable(recentGradMajorsStatus, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

chisq.test(recentGradMajorsStatus[,-1])

## 
##  Pearson's Chi-squared test
## 
## data:  recentGradMajorsStatus[, -1]
## X-squared = 29941, df = 171, p-value < 2.2e-16

From this also:

\(H_0\): the choice of accademic major does not affect employment status does not stand
\(H_A\): the choice of accademic major affects employment status stands

T and Kolmogorov-Smirnov Tests

I will use these tests on Student’s Median Salary. From the exploratory analysis above, graduate students have better chances of landing jobs than recent/fresh graduates. It is obvious that organisations tend to go more for experienced workforce than otherwise. Also, students on STEM majors appear to have better opportunity to earn better than qualitative analysis majors. I will use the t-tests to compare median salaries between two majors since the median salaries is numerical, while the Kolmogorov-Smirnov measures likelihood of similarity by comparing the biggest difference in two data set’s continuous probability distribution. I will select at least 4 major categories from the data sets because trying to compare all the majors might be a daunting task as there are about 14 majors in each of the three data sets. I will then compare the categories in pairs.

Comparing the majors

Engineering and Physical Sciences

Here, the Null hypothesis is that there is no difference between median salaries of Engineering majors and Physical Science Majors. Initial two-sided tests, that only check that the distributions are different, and not that one is greater or less than the other, showed significance in all cases. We show below the results of single sided tests to definitely say that median salary of one degree category is greater than the other.

\(H_0\): the median salaries of Engineering majors and Physical Science Majors are the same
\(H_A\): the median salaries of Engineering majors and Physical Science Majors are not the same

I will filter out Engineering and Physical sciences majors by using ENGI and PHYS

allAgesEngineering =  filter(allStudents, grepl("ENGI",allStudents$Major))
allAgesScinces =  filter(allStudents, grepl("PHYS", allStudents$Major))

boxplot(allAgesEngineering$Median, allAgesScinces$Median, names = c("Engineering Majors", "Physical Sciences Majors"), ylab = "Median Salary (USD)")

t.test(allAgesEngineering$Median, allAgesScinces$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  allAgesEngineering$Median and allAgesScinces$Median
## t = 2.6464, df = 11.32, p-value = 0.01113
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  5147.67     Inf
## sample estimates:
## mean of x mean of y 
##  78730.77  62800.00

ks.test(allAgesEngineering$Median, allAgesScinces$Median, alternative = "less")

## Warning in ks.test(allAgesEngineering$Median, allAgesScinces$Median,
## alternative = "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  allAgesEngineering$Median and allAgesScinces$Median
## D^- = 0.46154, p-value = 0.0738
## alternative hypothesis: the CDF of x lies below that of y

Liberal Arts vs. Psycology and Social Work.

Here, the hypothesis used above is still applicable

allAgesLibArts <- filter(allStudents, grepl("ART",allStudents$Major))
allAgesPsyc  <- filter(allStudents, grepl("PSYC",allStudents$Major))

boxplot(allAgesLibArts$Median, allAgesPsyc$Median, names = c("Liberal Arts", "Psycology & Social Work"), ylab = "Median Salary (USD)")

The above box plots show that there are presence of one outlier in both accademic majors, though the one in the Liberal Arts is obviously higher than the one in the Psycology & Social Work. We will see if these outliers will have any effect on the test results later.

t.test(allAgesLibArts$Median, allAgesPsyc$Median, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  allAgesLibArts$Median and allAgesPsyc$Median
## t = -0.46875, df = 14.611, p-value = 0.6462
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8660.675  5544.008
## sample estimates:
## mean of x mean of y 
##  45441.67  47000.00

The KS test has opposite sign convention to the t test

ks.test(allAgesLibArts$Median, allAgesPsyc$Median, alternative = "two.sided")

## Warning in ks.test(allAgesLibArts$Median, allAgesPsyc$Median, alternative =
## "two.sided"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  allAgesLibArts$Median and allAgesPsyc$Median
## D = 0.25, p-value = 0.9251
## alternative hypothesis: two-sided

From these tests, I can’t establish if there is any statistical significant difference between the two majors tested above. Let’s see if the presence of those outliers observed earleir from the box plots has any effect on the test results.

subset(allAgesLibArts, Median == max(Median))

## # A tibble: 1 x 11
##   Major_code Major Major_category  Total Employed Employed_full_t~
##        <int> <chr> <chr>           <int>    <int>            <int>
## 1       5004 GEOL~ Physical Scie~ 107902    75698            59262
## # ... with 5 more variables: Unemployed <int>, Unemployment_rate <dbl>,
## #   Median <int>, P25th <int>, P75th <dbl>

subset(allAgesPsyc, Median == max(Median))

## # A tibble: 1 x 11
##   Major_code Major Major_category Total Employed Employed_full_t~
##        <int> <chr> <chr>          <int>    <int>            <int>
## 1       5205 INDU~ Psychology & ~ 17969    11878             8631
## # ... with 5 more variables: Unemployed <int>, Unemployment_rate <dbl>,
## #   Median <int>, P25th <int>, P75th <dbl>

So, the `INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY` major is the highest outlier in the two majors being compared. I will remove this major and then conduct the `t-test` and `ks-test` again to see if it makes any difference.

allAgesPsycNoOutlier  <- allAgesPsyc %>% filter(Median != max(Median)) #this removed the high outlier.

boxplot(allAgesPsycNoOutlier$Median, allAgesLibArts$Median, names = c("Psycology & Social Work", "Liberal Arts"), ylab = "Median Salary (USD)")

Conducting the tests again

t.test(allAgesLibArts$Median, allAgesPsycNoOutlier$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  allAgesLibArts$Median and allAgesPsycNoOutlier$Median
## t = 0.21743, df = 16.627, p-value = 0.4153
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -4098.13      Inf
## sample estimates:
## mean of x mean of y 
##  45441.67  44857.14

ks.test(allAgesLibArts$Median, allAgesPsycNoOutlier$Median, alternative = "less")

## Warning in ks.test(allAgesLibArts$Median, allAgesPsycNoOutlier$Median,
## alternative = "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  allAgesLibArts$Median and allAgesPsycNoOutlier$Median
## D^- = 0.083333, p-value = 0.9404
## alternative hypothesis: the CDF of x lies below that of y

The Shapiro-Wilk test tests the null hypothesis that a sample x1, …, xn came from a normally distributed population.

shapiro.test(allAgesLibArts$Median)

## 
##  Shapiro-Wilk normality test
## 
## data:  allAgesLibArts$Median
## W = 0.79833, p-value = 0.008968

shapiro.test(allAgesPsycNoOutlier$Median)

## 
##  Shapiro-Wilk normality test
## 
## data:  allAgesPsycNoOutlier$Median
## W = 0.91269, p-value = 0.4148

With the presence of the outlier, the statistical significance in p-values was 0.2231, while when it was removed, the significance difference became 0.01653. I was of the opinion that the data sets are normaly didstributed and as a matter of fact, the Shapiro-Wilk test for normality did not disaprove this. Since the Student’s T-test tends to the Normal distribution for high N, we can trust the result of the student t-test.

Engineering vs Liberal Arts majors

boxplot(allAgesEngineering$Median, allAgesLibArts$Median, names = c("Engineering", "Liberal Arts"), ylab = "Median Salary (USD)")

t.test(allAgesEngineering$Median, allAgesLibArts$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  allAgesEngineering$Median and allAgesLibArts$Median
## t = 9.522, df = 35.797, p-value = 1.197e-11
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  27385.87      Inf
## sample estimates:
## mean of x mean of y 
##  78730.77  45441.67

ks.test(allAgesEngineering$Median, allAgesScinces$Median, alternative = "less")

## Warning in ks.test(allAgesEngineering$Median, allAgesScinces$Median,
## alternative = "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  allAgesEngineering$Median and allAgesScinces$Median
## D^- = 0.46154, p-value = 0.0738
## alternative hypothesis: the CDF of x lies below that of y

From the above, at 95% confidence Interval, Engineering majors boast higher median salary than Liberal Arts majors.

Engineering and Psycology & Social Work

boxplot(allAgesEngineering$Median, allAgesPsyc$Median, names = c("Engineering", "Psycology & Social Work"), ylab = "Median Salary (USD)")

t.test(allAgesEngineering$Median, allAgesPsyc$Median, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  allAgesEngineering$Median and allAgesPsyc$Median
## t = 8.2049, df = 23.827, p-value = 1.062e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  25112.32      Inf
## sample estimates:
## mean of x mean of y 
##  78730.77  47000.00

ks.test(allAgesEngineering$Median, allAgesPsyc$Median, alternative = "less")

## Warning in ks.test(allAgesEngineering$Median, allAgesPsyc$Median,
## alternative = "less"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  allAgesEngineering$Median and allAgesPsyc$Median
## D^- = 0.92308, p-value = 2.967e-05
## alternative hypothesis: the CDF of x lies below that of y

Liberal Arts vs Physical Sciences

boxplot(allAgesLibArts$Median, allAgesScinces$Median, names = c("Liberal Arts", "Physical Sciences"), ylab = "Median Salary USD")

t.test(allAgesLibArts$Median, allAgesScinces$Median, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  allAgesLibArts$Median and allAgesScinces$Median
## t = -3.0526, df = 9.093, p-value = 0.006787
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -6946.71
## sample estimates:
## mean of x mean of y 
##  45441.67  62800.00

ks.test(allAgesLibArts$Median, allAgesScinces$Median, alternative = "greater")

## Warning in ks.test(allAgesLibArts$Median, allAgesScinces$Median,
## alternative = "greater"): cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  allAgesLibArts$Median and allAgesScinces$Median
## D^+ = 0.70833, p-value = 0.008094
## alternative hypothesis: the CDF of x lies above that of y

Tests summary

From the tesets results, Engineering is the major that fetches the highest median salry. This is followed by Physical Science (which has the same median salary as Industrial and Organizational Psycology), Liberal Arts, Psycology & Social Work.

Linear Regression Model

For the linear regression model, I will consider the Unemployment rate and the median salary for the data sets to see if there is any correlation between unemployment rate and median salary because I have the belief that the pressure of unemployment rate can have an impact on salary because people will tend to accept what they can be offered due to low demand in their skill-set. But I want to substantiate this with data. If the residuals of the model do not show the necessary behavior of Normal Distribution and Constant Variance, we will perform a Box-Cox transformation on the data to get an exponential factor to improve the model.

First fit

allAgeUnEmp_fit <-lm(allStudents$Median ~ allStudents$Unemployment_rate)

summary(allAgeUnEmp_fit)

## 
## Call:
## lm(formula = allStudents$Median ~ allStudents$Unemployment_rate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23370  -8995  -3272   8079  64676 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      70097       3380  20.738  < 2e-16 ***
## allStudents$Unemployment_rate  -231551      55906  -4.142 5.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14060 on 171 degrees of freedom
## Multiple R-squared:  0.09117,    Adjusted R-squared:  0.08586 
## F-statistic: 17.15 on 1 and 171 DF,  p-value: 5.406e-05

ggplot(allStudents, aes(x = Unemployment_rate, y = Median)) + geom_point(color = "#440154") + geom_smooth(method = "lm", formula = y~x)

hist(allAgeUnEmp_fit$residuals, xlab = "Residuals", main = "Histogram of Linear model residuals")

plot(fitted(allAgeUnEmp_fit), residuals(allAgeUnEmp_fit))

It appears the residuals do not show characteristics of Normal Distribution and Constant Variance. Let’s try to improve the linear model through an exponential factor that will be gotten by performing a Box-Cox transformation on the linear model.

boxcox_ExpM <- boxcox(allAgeUnEmp_fit)

optimised_M <- boxcox_ExpM %>% as.data.frame() %>% .[which.max(.$y),1]

optimised_M

## [1] -1.070707

Re-fitting the model

allAgeUnEmp_refit <- lm(allStudents$Median^optimised_M ~ allStudents$Unemployment_rate)

summary(allAgeUnEmp_refit)

## 
## Call:
## lm(formula = allStudents$Median^optimised_M ~ allStudents$Unemployment_rate)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.646e-06 -1.475e-06  2.614e-07  1.295e-06  5.141e-06 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   6.755e-06  4.666e-07  14.476  < 2e-16 ***
## allStudents$Unemployment_rate 3.272e-05  7.718e-06   4.239 3.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.941e-06 on 171 degrees of freedom
## Multiple R-squared:  0.09509,    Adjusted R-squared:  0.0898 
## F-statistic: 17.97 on 1 and 171 DF,  p-value: 3.664e-05

hist(allAgeUnEmp_refit$residuals, xlab = "Residuals of the Refited model", main = "Histogram of Residuals of the Refited model")

plot(fitted(allAgeUnEmp_refit), residuals(allAgeUnEmp_refit))

qqnorm(residuals(allAgeUnEmp_refit))
qqline(residuals(allAgeUnEmp_refit))

allStudents_A <- allStudents %>% mutate(newMedian  = Median^optimised_M)

datatable(allStudents_A, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

ggplot(allStudents_A, aes(x = Unemployment_rate, y = newMedian)) +
  geom_point(color = '#440154')+
  geom_smooth(method = "lm", formula = y~x)

Dealing with Outliers

Removing the highest unemployment rate outlier in `allStudents_A`

allStudents_O <- allStudents_A %>% filter(Unemployment_rate > 0 & Unemployment_rate != max(Unemployment_rate))

allStudents_o_refit <- lm(allStudents_O$newMedian ~ allStudents_O$Unemployment_rate)

summary(allStudents_o_refit)

## 
## Call:
## lm(formula = allStudents_O$newMedian ~ allStudents_O$Unemployment_rate)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.616e-06 -1.475e-06  1.990e-07  1.290e-06  5.153e-06 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.611e-06  5.382e-07  12.285  < 2e-16 ***
## allStudents_O$Unemployment_rate 3.539e-05  8.999e-06   3.933 0.000122 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.949e-06 on 168 degrees of freedom
## Multiple R-squared:  0.08432,    Adjusted R-squared:  0.07887 
## F-statistic: 15.47 on 1 and 168 DF,  p-value: 0.0001225

hist(residuals(allStudents_o_refit))

plot(fitted(allStudents_o_refit), residuals(allStudents_o_refit))

qqnorm(resid(allStudents_o_refit))
qqline(resid(allStudents_o_refit))

ggplot(allStudents_O, aes(x = Unemployment_rate, y = newMedian)) +
  geom_point(color = '#440154')+
  geom_smooth(method = "lm", formula = y~x)

Meadian salary & Unemployment Rate

allStudentsUnEmp <- allStudents %>% dplyr::select(Major, Unemployment_rate) %>% arrange(Unemployment_rate)

datatable(allStudentsUnEmp, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

salaries <- allStudents  %>% dplyr::select(Major,Median) %>% arrange(Median)

datatable(salaries, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#440154', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Conclusion

From these analyses, it is obvious that students who took any course among the STEM majors are more likely to easily get employed and even enjoy higher salaries and there is a high tendency of this to tilt towards the male majority. This is almost how it is in Nigeria.

At the first fit, a slope of -231551 was achieved, while the Box Cox approach was used to ensure the residuals are normal and have homogeneity in variance so that it can be used for predictions. A Box Cox exponent factor of -1.071 was achieved and this shows that salary and unemployment are inversely proportional. Further efforts involved removing influencing outliers after which a p-value of 0.0001225 was achieved with an R2 of 0.08432, showing that unemployment rate can explain only about 8.432% of the variability in median salary. This means that. Therefore, it is very important that when seeking admission, prospective students should consider the impact their intended majors will have on their chances of landing deserving jobs as quick as possible after graduation.

References

The Economic Guide To Picking A College Major

As More Attend College, Majors Become More Career-Focused

FiveThirtyEight Github Repo: college majors

How to Interpret a Regression Model with Low R-squared and Low P values