The Economic Value of College Majors in the US

Data Preparation

1. All Students

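library(dplyr)    # %>%, as_tibble(), arrange()
library(DT)       # datatable(), JS()
library(ggplot2)  # plots in the summary statistics section

# Read the all-ages data and sort the rows by major category
allStudents <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/all-ages.csv', header = TRUE, stringsAsFactors = FALSE) %>% as_tibble() %>% arrange(Major_category)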
datatable(allStudents, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

2. Graduate Students attending graduate schools

# Read the grad-students data and sort the rows by major category
gradStudents <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/grad-students.csv', header = TRUE, stringsAsFactors = FALSE) %>% as_tibble() %>% arrange(Major_category)
datatable(gradStudents, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

3. Recent Graduates

# Read the recent-grads data and sort the rows by major category
recentGraduates <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/recent-grads.csv', header = TRUE, stringsAsFactors = FALSE) %>% as_tibble() %>% arrange(Major_category)
datatable(recentGraduates, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
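
The three loading steps above repeat the same read-then-sort pattern. A minimal sketch of how that pattern could be factored into one helper, assuming only base R: `readMajorsCsv` is a hypothetical name that is not part of the original analysis, and the two-row CSV file is made up purely so the helper can be exercised offline.

```r
# Hypothetical helper (not in the original analysis): read a majors CSV from a
# local path or URL and sort its rows by Major_category.
readMajorsCsv <- function(path) {
  df <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
  df <- df[order(df$Major_category), , drop = FALSE]
  rownames(df) <- NULL  # renumber rows after sorting
  df
}

# Made-up two-row file, used only to show the helper running.
demoFile <- tempfile(fileext = ".csv")
writeLines(c("Major_code,Major,Major_category",
             "2,DEMO MAJOR B,Zeta",
             "1,DEMO MAJOR A,Alpha"), demoFile)
demo <- readMajorsCsv(demoFile)
print(demo)
```

With the real data, the same call would take one of the three CSV URLs used above in place of `demoFile`.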

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Which college majors offer the best employment opportunities and salaries in the US?

Cases

What are the cases, and how many are there?

dim(allStudents)
## [1] 173  11
dim(recentGraduates)
## [1] 173  21
dim(gradStudents)
## [1] 173  22
Each data set has 173 cases, one per college major. The all-ages data set has 11 variables (columns) and contains basic earnings and labor force information for degree holders of all ages. The grad-students data set (ages 25+) has 22 variables. The recent-grads data set (ages <28) has 21 variables and contains a more detailed breakdown, including by sex and by the type of job graduates got.
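
Because each case is a major, the data sets can be lined up by `Major_code` before any per-major comparison. The sketch below assumes that an inner join with base R's `merge()` is an acceptable way to do this. `allDemo`, `gradDemo`, and `byMajor` are illustrative names, and the values are the rounded ones that `str()` prints for the first five rows of the all-ages and grad-students data in the summary statistics section. The first five rows of the two data sets list different majors in a different order, so only the shared codes survive the join.

```r
# First five rows of each data set, copied from the str() output
allDemo <- data.frame(
  Major_code = c(1100, 1101, 1102, 1103, 1104),
  Unemployment_rate = c(0.0261, 0.0286, 0.0302, 0.0427, 0.0492)
)
gradDemo <- data.frame(
  Major_code = c(1101, 1100, 1302, 1303, 1105),
  Grad_unemployment_rate = c(0.0348, 0.0293, 0.0413, 0.0295, 0.0313)
)

# Inner join on the shared key: keeps only majors present in both tables
byMajor <- merge(allDemo, gradDemo, by = "Major_code")
print(byMajor)
```

On the full data, the same call with `allStudents` and `gradStudents` would return one row per shared major code.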

Data collection

Describe the method of data collection.

The data are survey data from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series (PUMS).
After reading FiveThirtyEight's 2014 article "The Economic Guide To Picking A College Major", about how college majors affect employment and unemployment rates in the US, I decided to take a closer look at the question using this readily available data set, in an exploratory and analytical way.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data were pulled from the ACS website by FiveThirtyEight, who categorised the cases and stored the result in their GitHub repository.

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is College Major. It is a categorical variable.

Explanatory

What is the explanatory variable(s), and what type is it (numerical/categorical)?

The explanatory variables include:

  1. The counts of employed and unemployed degree holders and
  2. The statistics of their income.

These variables are numerical.
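
The employment counts and the unemployment rate appear to be linked: dividing `Unemployed` by `Employed` plus `Unemployed` reproduces the rates that `str(allStudents)` prints for the first five majors. The sketch below checks this on those five rows only, so treating it as the definition used for every row is an assumption.

```r
# Counts copied from the first five rows of str(allStudents)
employed   <- c(90245, 76865, 26321, 81177, 17281)
unemployed <- c(2423, 2266, 821, 3619, 894)

# Share of the labor force (employed + unemployed) that is unemployed
unemploymentRate <- unemployed / (employed + unemployed)
print(round(unemploymentRate, 4))
# Matches the printed Unemployment_rate values: 0.0261 0.0286 0.0302 0.0427 0.0492
```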

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

allStudents

str(allStudents)
## Classes 'tbl_df', 'tbl' and 'data.frame':    173 obs. of  11 variables:
##  $ Major_code                   : int  1100 1101 1102 1103 1104 1105 1106 1199 1302 1303 ...
##  $ Major                        : chr  "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
##  $ Major_category               : chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
##  $ Total                        : int  128148 95326 33955 103549 24280 79409 6586 8549 69447 83188 ...
##  $ Employed                     : int  90245 76865 26321 81177 17281 63043 4926 6392 48228 65937 ...
##  $ Employed_full_time_year_round: int  74078 64240 22810 64937 12722 51077 4042 5074 39613 50595 ...
##  $ Unemployed                   : int  2423 2266 821 3619 894 2070 264 261 2144 3789 ...
##  $ Unemployment_rate            : num  0.0261 0.0286 0.0302 0.0427 0.0492 ...
##  $ Median                       : int  50000 54000 63000 46000 62000 50000 63000 52000 58000 52000 ...
##  $ P25th                        : int  34000 36000 40000 30000 38500 35000 39400 35000 40500 37100 ...
##  $ P75th                        : num  80000 80000 98000 72000 90000 75000 88000 75000 80000 75000 ...
summary(allStudents)
##    Major_code      Major           Major_category         Total        
##  Min.   :1100   Length:173         Length:173         Min.   :   2396  
##  1st Qu.:2403   Class :character   Class :character   1st Qu.:  24280  
##  Median :3608   Mode  :character   Mode  :character   Median :  75791  
##  Mean   :3880                                         Mean   : 230257  
##  3rd Qu.:5503                                         3rd Qu.: 205763  
##  Max.   :6403                                         Max.   :3123510  
##     Employed       Employed_full_time_year_round   Unemployed    
##  Min.   :   1492   Min.   :   1093               Min.   :     0  
##  1st Qu.:  17281   1st Qu.:  12722               1st Qu.:  1101  
##  Median :  56564   Median :  39613               Median :  3619  
##  Mean   : 166162   Mean   : 126308               Mean   :  9725  
##  3rd Qu.: 142879   3rd Qu.: 111025               3rd Qu.:  8862  
##  Max.   :2354398   Max.   :1939384               Max.   :147261  
##  Unemployment_rate     Median           P25th           P75th       
##  Min.   :0.00000   Min.   : 35000   Min.   :24900   Min.   : 45800  
##  1st Qu.:0.04626   1st Qu.: 46000   1st Qu.:32000   1st Qu.: 70000  
##  Median :0.05472   Median : 53000   Median :36000   Median : 80000  
##  Mean   :0.05736   Mean   : 56816   Mean   :38697   Mean   : 82506  
##  3rd Qu.:0.06904   3rd Qu.: 65000   3rd Qu.:42000   3rd Qu.: 95000  
##  Max.   :0.15615   Max.   :125000   Max.   :78000   Max.   :210000

gradStudents

str(gradStudents)
## Classes 'tbl_df', 'tbl' and 'data.frame':    173 obs. of  22 variables:
##  $ Major_code                  : int  1101 1100 1302 1303 1105 1102 1106 1103 1199 1104 ...
##  $ Major                       : chr  "AGRICULTURE PRODUCTION AND MANAGEMENT" "GENERAL AGRICULTURE" "FORESTRY" "NATURAL RESOURCES MANAGEMENT" ...
##  $ Major_category              : chr  "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
##  $ Grad_total                  : int  17488 44306 24713 29357 30983 14800 3335 56807 5032 14521 ...
##  $ Grad_sample_size            : int  386 764 487 659 624 305 61 1335 98 266 ...
##  $ Grad_employed               : int  13104 28930 16831 23394 22782 10592 2284 47755 2758 10857 ...
##  $ Grad_full_time_year_round   : int  11207 23024 14102 19087 18312 8768 1641 39047 2276 8074 ...
##  $ Grad_unemployed             : int  473 874 725 711 735 216 34 596 261 370 ...
##  $ Grad_unemployment_rate      : num  0.0348 0.0293 0.0413 0.0295 0.0313 ...
##  $ Grad_median                 : num  67000 68000 78000 70000 67000 80000 65000 70300 54000 72000 ...
##  $ Grad_P25                    : int  41600 45000 52000 50000 45000 53000 50000 48000 45000 50000 ...
##  $ Grad_P75                    : num  100000 104000 110000 100000 100000 120000 91000 104000 81000 110000 ...
##  $ Nongrad_total               : int  89169 123984 67649 77101 76190 33049 6242 94910 8092 22853 ...
##  $ Nongrad_employed            : int  71781 86631 46815 60690 60241 25557 4654 74896 5978 16298 ...
##  $ Nongrad_full_time_year_round: int  61335 72409 39048 48256 49506 22496 3917 61629 4707 12431 ...
##  $ Nongrad_unemployed          : int  1869 2352 1885 3413 1899 734 264 3101 239 681 ...
##  $ Nongrad_unemployment_rate   : num  0.0254 0.0264 0.0387 0.0532 0.0306 ...
##  $ Nongrad_median              : num  55000 50000 59000 53000 50000 63000 65000 48000 55000 63000 ...
##  $ Nongrad_P25                 : int  38000 34000 42000 38000 35000 40000 41000 32000 39000 40000 ...
##  $ Nongrad_P75                 : num  80000 80000 80000 75000 75000 99000 89000 75000 78000 92000 ...
##  $ Grad_share                  : num  0.164 0.263 0.268 0.276 0.289 ...
##  $ Grad_premium                : num  0.218 0.36 0.322 0.321 0.34 ...
summary(gradStudents)
##    Major_code      Major           Major_category       Grad_total     
##  Min.   :1100   Length:173         Length:173         Min.   :   1542  
##  1st Qu.:2403   Class :character   Class :character   1st Qu.:  15284  
##  Median :3608   Mode  :character   Mode  :character   Median :  37872  
##  Mean   :3880                                         Mean   : 127672  
##  3rd Qu.:5503                                         3rd Qu.: 148255  
##  Max.   :6403                                         Max.   :1184158  
##  Grad_sample_size Grad_employed    Grad_full_time_year_round
##  Min.   :   22    Min.   :  1008   Min.   :   770           
##  1st Qu.:  314    1st Qu.: 12659   1st Qu.:  9894           
##  Median :  688    Median : 28930   Median : 22523           
##  Mean   : 2251    Mean   : 94037   Mean   : 72861           
##  3rd Qu.: 2528    3rd Qu.:109944   3rd Qu.: 80794           
##  Max.   :21994    Max.   :915341   Max.   :703347           
##  Grad_unemployed Grad_unemployment_rate  Grad_median        Grad_P25    
##  Min.   :    0   Min.   :0.00000        Min.   : 47000   Min.   :24500  
##  1st Qu.:  453   1st Qu.:0.02607        1st Qu.: 65000   1st Qu.:45000  
##  Median : 1179   Median :0.03665        Median : 75000   Median :50000  
##  Mean   : 3506   Mean   :0.03934        Mean   : 76756   Mean   :52597  
##  3rd Qu.: 3329   3rd Qu.:0.04805        3rd Qu.: 90000   3rd Qu.:60000  
##  Max.   :35718   Max.   :0.13851        Max.   :135000   Max.   :85000  
##     Grad_P75      Nongrad_total     Nongrad_employed 
##  Min.   : 65000   Min.   :   2232   Min.   :   1328  
##  1st Qu.: 93000   1st Qu.:  20564   1st Qu.:  15914  
##  Median :108000   Median :  68993   Median :  50092  
##  Mean   :112087   Mean   : 214720   Mean   : 154554  
##  3rd Qu.:130000   3rd Qu.: 184971   3rd Qu.: 129179  
##  Max.   :294000   Max.   :2996892   Max.   :2253649  
##  Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate
##  Min.   :    980              Min.   :     0     Min.   :0.00000          
##  1st Qu.:  11755              1st Qu.:   880     1st Qu.:0.04198          
##  Median :  38384              Median :  3157     Median :0.05103          
##  Mean   : 120737              Mean   :  8486     Mean   :0.05395          
##  3rd Qu.: 103629              3rd Qu.:  7409     3rd Qu.:0.06439          
##  Max.   :1882507              Max.   :136978     Max.   :0.16091          
##  Nongrad_median    Nongrad_P25     Nongrad_P75       Grad_share     
##  Min.   : 37000   Min.   :25000   Min.   : 48000   Min.   :0.09632  
##  1st Qu.: 48700   1st Qu.:34000   1st Qu.: 72000   1st Qu.:0.26757  
##  Median : 55000   Median :38000   Median : 80000   Median :0.39875  
##  Mean   : 58584   Mean   :40078   Mean   : 84333   Mean   :0.40059  
##  3rd Qu.: 65000   3rd Qu.:44000   3rd Qu.: 97000   3rd Qu.:0.49912  
##  Max.   :126000   Max.   :80000   Max.   :215000   Max.   :0.93117  
##   Grad_premium    
##  Min.   :-0.0250  
##  1st Qu.: 0.2308  
##  Median : 0.3208  
##  Mean   : 0.3285  
##  3rd Qu.: 0.4000  
##  Max.   : 1.6471
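
The two derived columns at the end of the grad-students data can be reproduced from the other columns, at least for the first five rows printed by `str(gradStudents)`. The sketch below assumes that `Grad_share` is the graduate-degree share of all degree holders in a major, and that `Grad_premium` is the relative gap between the graduate and non-graduate median salaries. Both readings match the printed values for those rows but are not confirmed by the data source.

```r
# Values copied from the first five rows of str(gradStudents)
gradTotal     <- c(17488, 44306, 24713, 29357, 30983)
nongradTotal  <- c(89169, 123984, 67649, 77101, 76190)
gradMedian    <- c(67000, 68000, 78000, 70000, 67000)
nongradMedian <- c(55000, 50000, 59000, 53000, 50000)

# Assumed definitions, checked against the printed columns
gradShare   <- gradTotal / (gradTotal + nongradTotal)
gradPremium <- (gradMedian - nongradMedian) / nongradMedian

print(round(gradShare, 3))    # printed Grad_share:   0.164 0.263 0.268 0.276 0.289
print(round(gradPremium, 3))  # printed Grad_premium: 0.218 0.36 0.322 0.321 0.34
```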

Academic Major Categories Distribution

Let’s see how the majors are distributed across academic major categories by tabulating the Major_category variable from the all-ages data set:

# Count how many majors fall into each major category
categories <- as.data.frame(table(allStudents$Major_category))
names(categories) <- c('category', 'frequency')
# Horizontal bars, ordered by the number of majors in each category
ggplot(categories, aes(x = reorder(category, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  xlab("Major category") + ylab("Number of majors") +
  ggtitle("Number of Majors per Academic Major Category") +
  coord_flip() +
  theme(plot.title = element_text(color = "black", size = 14, hjust = 0.5))

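So, engineering is the category with the most distinct majors. This chart counts majors per category, not jobs or graduates, so on its own it does not show whether engineering students land jobs more easily; that question is better answered by the employment and salary variables.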

Salary distribution for all ages

ggplot(allStudents, aes(x=Median)) + 
  ggtitle("All ages' Median Salary") +
  geom_histogram(fill = "steelblue", color='white', binwidth = 10000) +
  theme(axis.title.x=element_blank(), 
        plot.title = element_text(color="black", size=14,hjust = 0.5))

The distribution of median salaries across majors is right-skewed, which indicates that only a few majors lead to very high median salaries.
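
A quick numerical sign of right skew is a mean that exceeds the median, which is what `summary(allStudents)` reports for the full Median column (56816 against 53000). The sketch below applies the same check to the first ten values that `str(allStudents)` prints. It is a rule of thumb rather than a formal test, and `medianSalary` is an illustrative name.

```r
# First ten Median values, copied from str(allStudents)
medianSalary <- c(50000, 54000, 63000, 46000, 62000,
                  50000, 63000, 52000, 58000, 52000)

# A mean above the median suggests a longer right tail
salaryCentre <- c(mean = mean(medianSalary), median = median(medianSalary))
print(salaryCentre)
print(salaryCentre[["mean"]] > salaryCentre[["median"]])
```

On the full data, `allStudents$Median` would take the place of `medianSalary`.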

Mean unemployment rate by data set

# Average the per-major unemployment rates within each data set
meanUnemploymentRate <- c(
  'All Students'      = mean(allStudents$Unemployment_rate),
  'Recent Graduates'  = mean(recentGraduates$Unemployment_rate),
  'Graduate Students' = mean(gradStudents$Grad_unemployment_rate)
)

# One bar per data set; the bar height is the mean rate across majors
barplot(meanUnemploymentRate, ylab = "Mean unemployment rate", col = "steelblue")

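On average across majors, holders of graduate degrees have a lower unemployment rate (about 3.9%, against about 5.7% for degree holders of all ages) than recent graduates. This is consistent with organisations preferring a more experienced or more highly qualified workforce, but because the data are observational the comparison cannot establish the cause.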