College Graduates Salary Analysis

Choosing which college to attend and what to study is a difficult decision for any student. Where you go to college, the type of degree you pursue, and the major you choose can all affect your future salary potential. I always encourage people to pursue what they are passionate about, but understanding your likely earning power after graduation is crucial to making a sound financial decision when selecting a college and a major. I hope this exploration of post-college salaries will help students make that decision and set realistic expectations for life after graduation.

General Project Plan

1. Choose data sources
2. Acquire the datasets
3. Transform and prepare the data
4. Analyze the data
5. Draw conclusions

1. Data Sources

We obtained data from two different sources:
1. Wall Street Journal
2. U.S. Department of Education

2. Acquiring Data

#Load necessary packages
library(tidyverse)
library(DT)
library(jsonlite)
library(Rmisc)

2.1 Read and parse JSON format files from U.S. Department of Education API

# Accumulate the results of API pages 0 to 100 into one data frame
df_api <- NULL
for (i in 0:100) {
  url_api <- paste0("https://api.data.gov/ed/collegescorecard/v1/schools?fields=school.name,latest.earnings.6_yrs_after_entry.mean_earnings.lowest_tercile,latest.earnings.6_yrs_after_entry.mean_earnings.middle_tercile,latest.earnings.6_yrs_after_entry.mean_earnings.highest_tercile&page=",
                    i, "&api_key=gHHgffzGEVcJZU17NEZzdgkCJ50xqfJ0y3S0gQBH")
  ed_api <- fromJSON(url_api)
  df_api <- rbind(df_api, ed_api$results)
}
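
The loop above grows df_api with rbind() on every iteration and always requests 101 pages. A minimal sketch of an alternative follows, assuming a hypothetical helper `collect_pages()` and an offline stand-in fetch function (`fake_fetch()`, not part of the original project): pages are stored in a list, the loop stops at the first empty page, and the rows are bound once at the end. In real use the fetch function would wrap fromJSON() on the page URL and return its `results` element.

```r
# Hypothetical helper: collect paginated results and bind them once at the end.
# fetch_page is any function that takes a page number and returns a data frame
# (or NULL / zero rows when there is nothing left).
collect_pages <- function(fetch_page, max_pages = 101) {
  pages <- list()
  for (page in seq_len(max_pages) - 1) {  # pages are numbered from 0
    res <- fetch_page(page)
    if (is.null(res) || NROW(res) == 0) break  # stop at the first empty page
    pages[[length(pages) + 1]] <- res
  }
  do.call(rbind, pages)
}

# Offline stand-in for the API call, for illustration only (synthetic rows).
fake_fetch <- function(page) {
  if (page > 2) return(NULL)
  data.frame(school.name = paste0("School ", page), stringsAsFactors = FALSE)
}

demo_api <- collect_pages(fake_fetch)
print(demo_api)
```

Binding once avoids copying the accumulated data frame on every iteration, which matters more as the number of pages grows.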

2.2 Read the three WSJ CSV files, which were uploaded to GitHub

# salaries by degree
url_deg <- "https://raw.githubusercontent.com/qixing810/CUNYSPS-DataScience/master/DS607/Project/final%20project/degrees-that-pay-back.csv"
df_deg <- read_csv(url_deg,
                   col_names = c("major", "start_med_slry", "mid_car_slry",
                                 "percent_chng", "mid_car_10th", "mid_car_25th",
                                 "mid_car_75th", "mid_car_90th"),  # rename the column names
                   skip = 1)  # names specified, skip header

# salaries by college type
url_col <- "https://raw.githubusercontent.com/qixing810/CUNYSPS-DataScience/master/DS607/Project/final%20project/salaries-by-college-type.csv"
df_col <- read_csv(url_col,
                   col_names = c("school_name", "school_type", "start_med_slry",
                                 "mid_car_slry", "mid_car_10th", "mid_car_25th",
                                 "mid_car_75th", "mid_car_90th"),
                   skip = 1)

# salaries by region
url_reg <- "https://raw.githubusercontent.com/qixing810/CUNYSPS-DataScience/master/DS607/Project/final%20project/salaries-by-region.csv"
df_reg <- read_csv(url_reg,
                   col_names = c("school_name", "region", "start_med_slry",
                                 "mid_car_slry", "mid_car_10th", "mid_car_25th",
                                 "mid_car_75th", "mid_car_90th"),
                   skip = 1)

3. Data Tidying and Wrangling

3.1 Data Structure

First we looked at the structure of each dataset.

glimpse(df_api)

## Observations: 2,020
## Variables: 4
## $ school.name                                                     <chr> ...
## $ latest.earnings.6_yrs_after_entry.mean_earnings.lowest_tercile  <dbl> ...
## $ latest.earnings.6_yrs_after_entry.mean_earnings.middle_tercile  <dbl> ...
## $ latest.earnings.6_yrs_after_entry.mean_earnings.highest_tercile <dbl> ...

glimpse(df_deg)

## Observations: 50
## Variables: 8
## $ major          <chr> "Accounting", "Aerospace Engineering", "Agricul...
## $ start_med_slry <chr> "$46,000.00", "$57,700.00", "$42,600.00", "$36,...
## $ mid_car_slry   <chr> "$77,100.00", "$101,000.00", "$71,900.00", "$61...
## $ percent_chng   <dbl> 67.6, 75.0, 68.8, 67.1, 84.6, 81.3, 67.0, 67.7,...
## $ mid_car_10th   <chr> "$42,200.00", "$64,300.00", "$36,300.00", "$33,...
## $ mid_car_25th   <chr> "$56,100.00", "$82,100.00", "$52,100.00", "$45,...
## $ mid_car_75th   <chr> "$108,000.00", "$127,000.00", "$96,300.00", "$8...
## $ mid_car_90th   <chr> "$152,000.00", "$161,000.00", "$150,000.00", "$...

glimpse(df_col)

## Observations: 269
## Variables: 8
## $ school_name    <chr> "Massachusetts Institute of Technology (MIT)", ...
## $ school_type    <chr> "Engineering", "Engineering", "Engineering", "E...
## $ start_med_slry <chr> "$72,200.00", "$75,500.00", "$71,800.00", "$62,...
## $ mid_car_slry   <chr> "$126,000.00", "$123,000.00", "$122,000.00", "$...
## $ mid_car_10th   <chr> "$76,800.00", "N/A", "N/A", "$66,800.00", "N/A"...
## $ mid_car_25th   <chr> "$99,200.00", "$104,000.00", "$96,000.00", "$94...
## $ mid_car_75th   <chr> "$168,000.00", "$161,000.00", "$180,000.00", "$...
## $ mid_car_90th   <chr> "$220,000.00", "N/A", "N/A", "$190,000.00", "N/...

glimpse(df_reg)

## Observations: 320
## Variables: 8
## $ school_name    <chr> "Stanford University", "California Institute of...
## $ region         <chr> "California", "California", "California", "Cali...
## $ start_med_slry <chr> "$70,400.00", "$75,500.00", "$71,800.00", "$59,...
## $ mid_car_slry   <chr> "$129,000.00", "$123,000.00", "$122,000.00", "$...
## $ mid_car_10th   <chr> "$68,400.00", "N/A", "N/A", "$59,500.00", "N/A"...
## $ mid_car_25th   <chr> "$93,100.00", "$104,000.00", "$96,000.00", "$81...
## $ mid_car_75th   <chr> "$184,000.00", "$161,000.00", "$180,000.00", "$...
## $ mid_car_90th   <chr> "$257,000.00", "N/A", "N/A", "$201,000.00", "N/...

3.2 Data Transformation

3.2.1 Data Types

We removed the “$” signs and thousands separators from the salary columns and converted them from character to numeric in each dataset. We also renamed the columns of the df_api dataset.

#salaries by degree
df_deg$start_med_slry <- parse_number(df_deg$start_med_slry)
df_deg$mid_car_slry <- parse_number(df_deg$mid_car_slry)
df_deg$mid_car_10th <- parse_number(df_deg$mid_car_10th)
df_deg$mid_car_25th <- parse_number(df_deg$mid_car_25th)
df_deg$mid_car_75th <- parse_number(df_deg$mid_car_75th)
df_deg$mid_car_90th <- parse_number(df_deg$mid_car_90th)
datatable(df_deg)

#salaries by college type
df_col$start_med_slry <- parse_number(df_col$start_med_slry)
df_col$mid_car_slry <- parse_number(df_col$mid_car_slry)
df_col$mid_car_10th <- parse_number(df_col$mid_car_10th)
df_col$mid_car_25th <- parse_number(df_col$mid_car_25th)
df_col$mid_car_75th <- parse_number(df_col$mid_car_75th)
df_col$mid_car_90th <- parse_number(df_col$mid_car_90th)
datatable(df_col)

#salaries by region
df_reg$start_med_slry <- parse_number(df_reg$start_med_slry)
df_reg$mid_car_slry <- parse_number(df_reg$mid_car_slry)
df_reg$mid_car_10th <- parse_number(df_reg$mid_car_10th)
df_reg$mid_car_25th <- parse_number(df_reg$mid_car_25th)
df_reg$mid_car_75th <- parse_number(df_reg$mid_car_75th)
df_reg$mid_car_90th <- parse_number(df_reg$mid_car_90th)
datatable(df_reg)
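
The repeated parse_number() calls above could be condensed with dplyr's across(). The sketch below is an illustration on a small made-up data frame (`toy`), not the authors' code; it assumes dplyr 1.0 or later and passes "N/A" as a missing-value marker so that those cells become NA.

```r
library(dplyr)
library(readr)

# Toy input in the same shape as the WSJ files (illustrative rows only)
toy <- data.frame(
  major          = c("Accounting", "Aerospace Engineering"),
  start_med_slry = c("$46,000.00", "$57,700.00"),
  mid_car_10th   = c("$42,200.00", "N/A"),
  stringsAsFactors = FALSE
)

# Convert every salary column in one step; "N/A" is treated as missing
toy_clean <- toy %>%
  mutate(across(c(start_med_slry, mid_car_10th),
                ~ parse_number(.x, na = c("", "NA", "N/A"))))

str(toy_clean)
```

The same mutate() call could be applied to df_deg, df_col, and df_reg by listing their salary columns.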

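# Rename the columns by position. Note that the fields requested from the API
# are mean earnings 6 years after entry for the lowest, middle, and highest
# terciles, in that order.
names(df_api) <- c("school", "median_6yr", "median_8yr", "median_10yr")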
head(df_api)

##                                             school median_6yr median_8yr
## 1                    New York Theological Seminary         NA         NA
## 2                             Carver Bible College         NA         NA
## 3                       Strayer University-Florida      26600      43400
## 4                      Strayer University-Delaware      26600      43400
## 5 European Academy of Cosmetology and Hairdressing         NA         NA
## 6                            Tenaj Salon Institute         NA         NA
##   median_10yr
## 1          NA
## 2          NA
## 3       63600
## 4       63600
## 5          NA
## 6          NA
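
Renaming by position depends on the column order returned by the API. A hedged alternative is to rename by the original field names with dplyr::rename(). The new names below (`mean_6yr_low`, `mean_6yr_mid`, `mean_6yr_high`) are hypothetical labels chosen to mirror the tercile fields in the request URL; the one-row data frame is only a stand-in for df_api.

```r
library(dplyr)

# One-row stand-in for df_api, using the field names requested from the API
toy_api <- data.frame(
  school.name = "Strayer University-Florida",
  latest.earnings.6_yrs_after_entry.mean_earnings.lowest_tercile  = 26600,
  latest.earnings.6_yrs_after_entry.mean_earnings.middle_tercile  = 43400,
  latest.earnings.6_yrs_after_entry.mean_earnings.highest_tercile = 63600,
  stringsAsFactors = FALSE
)

# rename(new = old): each new name is tied to a specific source column
toy_named <- toy_api %>%
  rename(
    school        = school.name,
    mean_6yr_low  = latest.earnings.6_yrs_after_entry.mean_earnings.lowest_tercile,
    mean_6yr_mid  = latest.earnings.6_yrs_after_entry.mean_earnings.middle_tercile,
    mean_6yr_high = latest.earnings.6_yrs_after_entry.mean_earnings.highest_tercile
  )

names(toy_named)
```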

4. Exploratory Analysis

4.1 Median salary analysis of the Department of Education dataset

# Median of each earnings column, ignoring missing values
df_api_median <- df_api %>%
  select(median_6yr, median_8yr, median_10yr) %>%
  summarise_all(median, na.rm = TRUE)
datatable(df_api_median)

4.2 Starting vs Mid-career salaries distribution

# Reshape from wide to long: one row per school and career stage
start_vs_mid <- df_reg %>%
  select(start_med_slry, mid_car_slry) %>%
  gather(career, salary) %>%
  mutate(career = as_factor(career))  # levels in order of appearance
head(start_vs_mid)

## # A tibble: 6 x 2
##   career         salary
##   <fct>           <dbl>
## 1 start_med_slry  70400
## 2 start_med_slry  75500
## 3 start_med_slry  71800
## 4 start_med_slry  59900
## 5 start_med_slry  51900
## 6 start_med_slry  57200
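
For readers on newer versions of tidyr, the same wide-to-long reshape can be sketched with pivot_longer(), which supersedes gather(). This is an illustration on a two-row stand-in (`toy_reg`), not the authors' code; the factor levels are set explicitly so that the starting salary comes first.

```r
library(tidyr)

# Two-row stand-in for df_reg after the salary columns were made numeric
toy_reg <- data.frame(
  start_med_slry = c(70400, 75500),
  mid_car_slry   = c(129000, 123000)
)

# One row per (school, career stage) pair
toy_long <- pivot_longer(
  toy_reg,
  cols      = c(start_med_slry, mid_car_slry),
  names_to  = "career",
  values_to = "salary"
)

# Fix the level order explicitly: starting salary first, mid-career second
toy_long$career <- factor(toy_long$career,
                          levels = c("start_med_slry", "mid_car_slry"))

print(toy_long)
```

Unlike gather(), pivot_longer() emits the rows school by school, so the row order differs from the output shown above even though the content is the same.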

We plotted the data as a histogram to compare the distributions of starting and mid-career salaries.

ggplot(start_vs_mid, aes(salary, fill=career))+
  geom_histogram(position ="dodge")

Median of starting and mid-career salary

start_med<- start_vs_mid%>%
  filter(career=="start_med_slry")%>%
  summarize(median(salary))
start_med

##   median(salary)
## 1          45100

mid_med<- start_vs_mid%>%
  filter(career=="mid_car_slry")%>%
  summarize(median(salary))
mid_med

##   median(salary)
## 1          82700

The distribution of starting median salaries is concentrated at the lower end of the salary range and is right-skewed. The median across schools is $45,100. By mid-career, the distribution of median salaries becomes more dispersed and the median rises to $82,700.
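
As a quick worked calculation, the relative increase from the starting median to the mid-career median can be computed directly from the two values reported above:

```r
# Medians reported in the output above
start_median <- 45100
mid_median   <- 82700

# Relative increase from starting to mid-career median, in percent
pct_increase <- (mid_median - start_median) / start_median * 100
round(pct_increase, 1)
```

The mid-career median is roughly 83% higher than the starting median.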

4.3 Correlation between starting and mid-career salaries

The scatter plot below shows a strong positive correlation between starting salaries and mid-career salaries.

ggplot(df_reg, aes(start_med_slry,mid_car_slry))+
  geom_point()+
  geom_smooth()
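
The strength of the relationship seen in the plot could also be quantified with a Pearson correlation coefficient. The sketch below uses a hypothetical helper, `salary_cor()`, and a small synthetic data frame (exactly linear by construction, with one missing value) purely to show the call; applying it to df_reg would give the actual figure, which is not reported here.

```r
# Hypothetical helper: Pearson correlation between the two salary columns,
# dropping rows where either value is missing
salary_cor <- function(df) {
  cor(df$start_med_slry, df$mid_car_slry, use = "complete.obs")
}

# Synthetic data, exactly linear (mid = 1.8 * start) where both are present
toy <- data.frame(
  start_med_slry = c(40000, 45000, 50000, NA, 70000),
  mid_car_slry   = c(72000, 81000, 90000, 95000, 126000)
)

r <- salary_cor(toy)
r
```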

4.4 Salary by College Location Analysis

This section uses the salaries-by-region dataset to look at the distribution of salaries across regions.

First, let us look at a summary of starting salaries in the df_reg dataset:

summary(df_reg$start_med_slry)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34500   42000   45100   46253   48900   75500

Let’s see if there is any difference in starting or mid-career salary by region.

p1<-ggplot(df_reg,aes(x=region,y=start_med_slry))+
         geom_boxplot(aes(fill=region))+
         ggtitle('Starting Median Salaries by Region')+
         xlab('College Location')+
         ylab('Salary ')

p2<-ggplot(df_reg,aes(x=region,y=mid_car_slry))+
         geom_boxplot(aes(fill=region))+
         ggtitle('Mid-Career Median Salaries by Region')+
         xlab('College Location')+
         ylab('Salary ')

multiplot(p1, p2, layout=matrix(c(1, 2), 2, 1, byrow=TRUE))

We can see that California and the Northeastern region appear to have higher starting and mid-career salaries.
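
The visual comparison could be backed by a table of medians per region. Here is a minimal sketch, assuming a small stand-in data frame (`toy_reg`) in which the Northeastern values are synthetic; running the same pipeline on df_reg would give the real figures.

```r
library(dplyr)

# Stand-in for df_reg; the Northeastern rows are synthetic
toy_reg <- data.frame(
  region         = c("California", "California", "California",
                     "Northeastern", "Northeastern"),
  start_med_slry = c(70400, 75500, 71800, 50000, 60000),
  stringsAsFactors = FALSE
)

# Median starting salary per region, highest first
region_medians <- toy_reg %>%
  group_by(region) %>%
  summarise(median_start = median(start_med_slry, na.rm = TRUE)) %>%
  arrange(desc(median_start))

print(region_medians)
```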

4.5 Salary by College Type Analysis

4.5.1 Number of schools by type

First, we will look at which school types are present and count the schools of each type.

ggplot(df_col, aes(school_type)) +
  geom_bar()

From the plot above, we noticed that most of the schools are state schools.

4.5.2 Starting vs mid-career salaries analysis by college type

df_col_type_slry <- df_col %>%
  select(school_type, start_med_slry, mid_car_slry) %>%
  gather(timeline, salary, start_med_slry:mid_car_slry) %>%
  mutate(timeline = as_factor(timeline))  # levels in order of appearance

ggplot(df_col_type_slry, aes(reorder(school_type, salary), salary, fill = timeline)) +
  scale_color_manual(values = c("blue", "pink")) +
  geom_boxplot(alpha = 0.5) +
  scale_fill_manual(values = c("blue", "pink")) +
  theme(legend.position = "right") +
  xlab('School type')

Both Engineering and Ivy League schools have higher starting and mid-career median salaries than the other school types.

4.5.3 Statistical Analysis

Let’s fit a linear regression model of starting median salary on school type.

summary(lm(formula = start_med_slry ~ school_type, data=df_col))

## 
## Call:
## lm(formula = start_med_slry ~ school_type, data = df_col)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12857.9  -3026.3   -526.3   2373.7  16442.1 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                59058       1046  56.466   <2e-16 ***
## school_typeIvy League       1417       1921   0.738    0.461    
## school_typeLiberal Arts   -13311       1239 -10.740   <2e-16 ***
## school_typeParty          -13343       1460  -9.136   <2e-16 ***
## school_typeState          -14932       1101 -13.559   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4559 on 264 degrees of freedom
## Multiple R-squared:  0.5021, Adjusted R-squared:  0.4946 
## F-statistic: 66.56 on 4 and 264 DF,  p-value: < 2.2e-16

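The intercept, $59,058, is the estimated mean starting median salary for the reference category, Engineering schools. Ivy League schools are estimated to start $1,417 higher, but that difference is not statistically significant (p = 0.461). Liberal Arts, Party, and State schools start roughly $13,300 to $14,900 lower than Engineering schools, and those differences are highly significant. School type alone explains about 50% of the variance in starting salary (R-squared = 0.5021).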
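
One way to read the coefficient table: with a single categorical predictor, the intercept is the fitted mean for the reference level (Engineering, the school type with no coefficient of its own), and each remaining coefficient is an offset from it. The short sketch below reconstructs the fitted mean for each school type from the estimates printed above.

```r
# Estimates copied from the regression output above
coefs <- c("(Intercept)"  = 59058,
           "Ivy League"   = 1417,
           "Liberal Arts" = -13311,
           "Party"        = -13343,
           "State"        = -14932)

# Fitted mean per school type: intercept for the reference level,
# intercept + offset for every other level
intercept   <- unname(coefs["(Intercept)"])
group_means <- c(Engineering = intercept, intercept + coefs[-1])

group_means
```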

4.5.4 Top 10 schools by type

Top 10 schools by median starting salary

df_start_top10 <- df_col %>%
  select(school_name, school_type, start_med_slry) %>%
  arrange(desc(start_med_slry)) %>%
  top_n(10, start_med_slry)  # rank explicitly by starting salary

ggplot(df_start_top10, aes(reorder(school_name, start_med_slry), start_med_slry,fill = school_type)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = start_med_slry), hjust = 1.1, color = 'gray30') +
  xlab(NULL) +
  coord_flip()

In the top 10 list, we can see that Engineering and Ivy League schools have the highest median starting salaries.

4.6 Salary by Major Analysis

4.6.1 Starting median salary by major

ggplot(df_deg, aes(x = reorder(major, start_med_slry), start_med_slry)) +
  geom_col(fill = "blue", alpha = 0.5) +
  geom_col(aes(x = reorder(major, mid_car_slry), mid_car_slry), alpha = 0.2) +
  geom_text(aes(label = start_med_slry), size = 3, hjust = 1.1) +
  xlab(NULL) +
  coord_flip() +
  ggtitle("Starting Salary by Majors")

From the plot above, we can see that Physician Assistant has the highest median starting salary. Most of the top ten majors by median starting salary are engineering or computer science related.

4.6.2 Starting and mid-career median salary by major

ggplot(df_deg, aes(x = reorder(major, mid_car_slry), mid_car_slry)) +
  geom_col(fill = 'pink',alpha = 0.5) +
  geom_col(aes(x = reorder(major, mid_car_slry), start_med_slry), alpha = 0.4) +
  geom_text(aes(label = mid_car_slry), size = 3, hjust = 1.1) +
  scale_fill_manual(values = c('blue', 'pink')) +
  xlab(NULL) +
  coord_flip() +
  ggtitle("Mid-career Salary by Major")

From the plot above, we can see that the majors with the highest median starting salaries also have high mid-career salaries.

5. Conclusion

Choosing which college to attend and which career to pursue is always a difficult decision, and many factors deserve consideration, including post-graduation salary. We found that salaries differed considerably by school type and degree, but not much by location. Based on our study, students who graduated with a STEM major from Ivy League or Engineering schools located in California or the Northeast have the highest salaries.

607 Final Project

Weijian Cai, Qixing Li

May 6, 2019