For a student, choosing which college to attend and what to study is a difficult decision. Where you go to college, what type of degree you pursue, and the major you choose can all affect your future salary potential. I always encourage people to pursue what they are passionate about, but understanding your likely earning power after graduation is crucial to making a sound financial decision when selecting a college and a major. I hope this exploration of post-college salaries will help students make that decision and set better expectations for life after graduation.
1. Choose and decide data sources
2. Acquire dataset
3. Data transformation and preparation
4. Analyze the data
5. Conclusions
We obtained data from two different sources:
1. Wall Street Journal
2. U.S. Department of Education
#Load necessary packages
library(tidyverse)
library(DT)
library(jsonlite)
library(Rmisc)
2.1 Read and parse JSON format files from U.S. Department of Education API
df_api = NULL
for(i in 0:100) {
url_api <-paste0("https://api.data.gov/ed/collegescorecard/v1/schools?fields=school.name,latest.earnings.6_yrs_after_entry.mean_earnings.lowest_tercile,latest.earnings.6_yrs_after_entry.mean_earnings.middle_tercile,latest.earnings.6_yrs_after_entry.mean_earnings.highest_tercile&page=",
i,"&api_key=gHHgffzGEVcJZU17NEZzdgkCJ50xqfJ0y3S0gQBH")
ed_api <- fromJSON(url_api)
df_api <- rbind(df_api,ed_api$results)
}
2.2 Read three WSJ CSV files which were uploaded to GitHub.
# salaries by degree
url_deg <- "https://raw.githubusercontent.com/qixing810/CUNYSPS-DataScience/master/DS607/Project/final%20project/degrees-that-pay-back.csv"
df_deg <- read_csv(url_deg,
col_names = c("major", "start_med_slry", "mid_car_slry",
"percent_chng", "mid_car_10th", "mid_car_25th",
"mid_car_75th", "mid_car_90th"), # rename the column names
skip = 1) # names specified, skip header
# salaries by college type
url_col <- "https://raw.githubusercontent.com/qixing810/CUNYSPS-DataScience/master/DS607/Project/final%20project/salaries-by-college-type.csv"
df_col <- read_csv(url_col,
col_names = c("school_name", "school_type", "start_med_slry",
"mid_car_slry", "mid_car_10th", "mid_car_25th",
"mid_car_75th", "mid_car_90th"),
skip = 1)
# salaries by region
url_reg <- "https://raw.githubusercontent.com/qixing810/CUNYSPS-DataScience/master/DS607/Project/final%20project/salaries-by-region.csv"
df_reg <- read_csv(url_reg,
col_names = c("school_name", "region", "start_med_slry",
"mid_car_slry", "mid_car_10th", "mid_car_25th",
"mid_car_75th", "mid_car_90th"),
skip = 1)
First we looked at the structure of each dataset.
glimpse(df_api)
## Observations: 2,020
## Variables: 4
## $ school.name <chr> ...
## $ latest.earnings.6_yrs_after_entry.mean_earnings.lowest_tercile <dbl> ...
## $ latest.earnings.6_yrs_after_entry.mean_earnings.middle_tercile <dbl> ...
## $ latest.earnings.6_yrs_after_entry.mean_earnings.highest_tercile <dbl> ...
glimpse(df_deg)
## Observations: 50
## Variables: 8
## $ major <chr> "Accounting", "Aerospace Engineering", "Agricul...
## $ start_med_slry <chr> "$46,000.00", "$57,700.00", "$42,600.00", "$36,...
## $ mid_car_slry <chr> "$77,100.00", "$101,000.00", "$71,900.00", "$61...
## $ percent_chng <dbl> 67.6, 75.0, 68.8, 67.1, 84.6, 81.3, 67.0, 67.7,...
## $ mid_car_10th <chr> "$42,200.00", "$64,300.00", "$36,300.00", "$33,...
## $ mid_car_25th <chr> "$56,100.00", "$82,100.00", "$52,100.00", "$45,...
## $ mid_car_75th <chr> "$108,000.00", "$127,000.00", "$96,300.00", "$8...
## $ mid_car_90th <chr> "$152,000.00", "$161,000.00", "$150,000.00", "$...
glimpse(df_col)
## Observations: 269
## Variables: 8
## $ school_name <chr> "Massachusetts Institute of Technology (MIT)", ...
## $ school_type <chr> "Engineering", "Engineering", "Engineering", "E...
## $ start_med_slry <chr> "$72,200.00", "$75,500.00", "$71,800.00", "$62,...
## $ mid_car_slry <chr> "$126,000.00", "$123,000.00", "$122,000.00", "$...
## $ mid_car_10th <chr> "$76,800.00", "N/A", "N/A", "$66,800.00", "N/A"...
## $ mid_car_25th <chr> "$99,200.00", "$104,000.00", "$96,000.00", "$94...
## $ mid_car_75th <chr> "$168,000.00", "$161,000.00", "$180,000.00", "$...
## $ mid_car_90th <chr> "$220,000.00", "N/A", "N/A", "$190,000.00", "N/...
glimpse(df_reg)
## Observations: 320
## Variables: 8
## $ school_name <chr> "Stanford University", "California Institute of...
## $ region <chr> "California", "California", "California", "Cali...
## $ start_med_slry <chr> "$70,400.00", "$75,500.00", "$71,800.00", "$59,...
## $ mid_car_slry <chr> "$129,000.00", "$123,000.00", "$122,000.00", "$...
## $ mid_car_10th <chr> "$68,400.00", "N/A", "N/A", "$59,500.00", "N/A"...
## $ mid_car_25th <chr> "$93,100.00", "$104,000.00", "$96,000.00", "$81...
## $ mid_car_75th <chr> "$184,000.00", "$161,000.00", "$180,000.00", "$...
## $ mid_car_90th <chr> "$257,000.00", "N/A", "N/A", "$201,000.00", "N/...
In each WSJ dataset, we removed the “$” and the thousands separators from the salary values and converted the columns from character to numeric; entries recorded as “N/A” become missing values (NA). We also renamed the columns in the df_api dataset.
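As a minimal sketch of what this cleaning step does, the base R helper below strips the dollar sign and thousands separators and converts the result to numeric. The function name `clean_salary` and the sample values are illustrative assumptions and are not part of the original analysis, which relies on `parse_number()` from readr instead.

```r
# Hypothetical helper (not from the original analysis): convert salary
# strings such as "$46,000.00" to numeric values using base R only.
clean_salary <- function(x) {
  x <- gsub("[$,]", "", x)      # drop "$" and thousands separators
  x[x %in% "N/A"] <- NA         # treat the literal "N/A" as missing
  as.numeric(x)
}

# Illustrative values in the same format as the WSJ files
raw <- c("$46,000.00", "$101,000.00", "N/A")
cleaned <- clean_salary(raw)
cleaned
```

Handling "N/A" explicitly avoids a coercion warning from `as.numeric()` and makes the missing values deliberate rather than accidental.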
#salaries by degree
df_deg$start_med_slry <- parse_number(df_deg$start_med_slry)
df_deg$mid_car_slry <- parse_number(df_deg$mid_car_slry)
df_deg$mid_car_10th <- parse_number(df_deg$mid_car_10th)
df_deg$mid_car_25th <- parse_number(df_deg$mid_car_25th)
df_deg$mid_car_75th <- parse_number(df_deg$mid_car_75th)
df_deg$mid_car_90th <- parse_number(df_deg$mid_car_90th)
datatable(df_deg)
#salaries by college type
df_col$start_med_slry <- parse_number(df_col$start_med_slry)
df_col$mid_car_slry <- parse_number(df_col$mid_car_slry)
df_col$mid_car_10th <- parse_number(df_col$mid_car_10th)
df_col$mid_car_25th <- parse_number(df_col$mid_car_25th)
df_col$mid_car_75th <- parse_number(df_col$mid_car_75th)
df_col$mid_car_90th <- parse_number(df_col$mid_car_90th)
datatable(df_col)
#salaries by region
df_reg$start_med_slry <- parse_number(df_reg$start_med_slry)
df_reg$mid_car_slry <- parse_number(df_reg$mid_car_slry)
df_reg$mid_car_10th <- parse_number(df_reg$mid_car_10th)
df_reg$mid_car_25th <- parse_number(df_reg$mid_car_25th)
df_reg$mid_car_75th <- parse_number(df_reg$mid_car_75th)
df_reg$mid_car_90th <- parse_number(df_reg$mid_car_90th)
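datatable(df_reg)
# Rename the API columns. Note: the requested fields are mean earnings 6 years
# after entry for the lowest, middle, and highest income terciles, so the
# 6yr/8yr/10yr labels below do not match the underlying fields.
names(df_api) <- c("school", "median_6yr", "median_8yr", "median_10yr")
head(df_api)
## school median_6yr median_8yr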
## 1 New York Theological Seminary NA NA
## 2 Carver Bible College NA NA
## 3 Strayer University-Florida 26600 43400
## 4 Strayer University-Delaware 26600 43400
## 5 European Academy of Cosmetology and Hairdressing NA NA
## 6 Tenaj Salon Institute NA NA
## median_10yr
## 1 NA
## 2 NA
## 3 63600
## 4 63600
## 5 NA
## 6 NA
df_api_median <- df_api %>%
  select(median_6yr, median_8yr, median_10yr) %>%
  summarise_all(median, na.rm = TRUE) # funs() is deprecated; pass the function directly
datatable(df_api_median)
#data transformation from wide to long
start_vs_mid <- df_reg%>%
select(start_med_slry, mid_car_slry)%>%
gather(career, salary)%>%
mutate(career=as_factor(career)) # as_factor() keeps levels in order of appearance
head(start_vs_mid)
## # A tibble: 6 x 2
## career salary
## <fct> <dbl>
## 1 start_med_slry 70400
## 2 start_med_slry 75500
## 3 start_med_slry 71800
## 4 start_med_slry 59900
## 5 start_med_slry 51900
## 6 start_med_slry 57200
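To make the wide-to-long step concrete, here is a small base R sketch of the same reshaping on made-up data. The data frame `wide` and its values are assumptions for illustration only; the analysis itself uses `gather()` on `df_reg`.

```r
# Toy data (illustrative values, not the WSJ data): one row per school,
# one column per career stage.
wide <- data.frame(start_med_slry = c(50000, 45000),
                   mid_car_slry   = c(90000, 80000))

# stack() produces one row per salary value, with a factor column that
# records which original column the value came from.
long <- stack(wide)
names(long) <- c("salary", "career")
long <- long[, c("career", "salary")]
long
```

The long layout has one salary per row, which is what allows a single plot to split or colour the values by career stage.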
We plotted the data as a histogram to compare the distributions of starting and mid-career salaries.
ggplot(start_vs_mid, aes(salary, fill=career))+
geom_histogram(position ="dodge")
Median of starting and mid-career salary
start_med<- start_vs_mid%>%
filter(career=="start_med_slry")%>%
summarize(median(salary))
start_med
## median(salary)
## 1 45100
mid_med<- start_vs_mid%>%
filter(career=="mid_car_slry")%>%
summarize(median(salary))
mid_med
## median(salary)
## 1 82700
The distribution of starting median salaries is concentrated at the lower end of the salary range and is right-skewed. Across schools, the median starting salary is about $45,100. By mid-career, the distribution of median salaries becomes more dispersed and the median rises to $82,700.
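One rough way to check a right skew numerically is to compare the mean with the median and to compute the sample skewness. The sketch below uses a made-up vector and a hypothetical `skewness` helper, neither of which comes from the original analysis.

```r
# Sample skewness: positive values indicate a longer right tail.
# Hypothetical helper, not part of the original analysis.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Illustrative right-skewed sample (made-up values)
salaries <- c(40000, 42000, 43000, 45000, 46000, 48000, 75000)

mean_above_median <- mean(salaries) > median(salaries)  # right tail pulls the mean up
skew <- skewness(salaries)
mean_above_median
skew
```

The summary of `df_reg$start_med_slry` reported in this analysis shows a mean (46,253) above the median (45,100), which is consistent with this kind of skew.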
The scatter plot below shows a strong positive correlation between starting salaries and mid-career salaries.
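A visual impression of correlation can be backed up with a number. The sketch below computes a Pearson correlation and a straight-line fit on made-up paired values; the vectors `start_slry` and `mid_slry` are assumptions for illustration.

```r
# Made-up paired salaries for five hypothetical schools
start_slry <- c(40000, 45000, 50000, 60000, 70000)
mid_slry   <- c(72000, 80000, 95000, 110000, 125000)

r <- cor(start_slry, mid_slry)     # Pearson correlation coefficient
fit <- lm(mid_slry ~ start_slry)   # straight-line summary of the trend

r
coef(fit)
```

On the real data, the equivalent call would be `cor(df_reg$start_med_slry, df_reg$mid_car_slry)`; values close to 1 would support the reading of the plot.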
ggplot(df_reg, aes(start_med_slry,mid_car_slry))+
geom_point()+
geom_smooth()
This is the salaries by college region dataset. We will look at the distribution across regions.
First, let us look at a summary of the starting salary in the df_reg dataset:
summary(df_reg$start_med_slry)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34500 42000 45100 46253 48900 75500
Let’s see if there is any difference in starting or mid-career salary by region.
p1<-ggplot(df_reg,aes(x=region,y=start_med_slry))+
geom_boxplot(aes(fill=region))+
ggtitle('Starting Median Salaries by Region')+
xlab('College Location')+
ylab('Salary ')
p2<-ggplot(df_reg,aes(x=region,y=mid_car_slry))+
geom_boxplot(aes(fill=region))+
ggtitle('Mid-Career Median Salaries by Region')+
xlab('College Location')+
ylab('Salary ')
multiplot(p1, p2, layout=matrix(c(1, 2), 2, 1, byrow=TRUE))
We can see that California and the Northeastern region appear to have higher starting and mid-career salaries.
First, we will look at how many school types there are and count the schools of each type.
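The comparison read off the boxplots can also be tabulated as a median per group. The sketch below does this with base R on made-up rows; the data frame `toy` is an assumption, and in the analysis the same call would be applied to `df_reg`.

```r
# Made-up example rows; in the analysis the data frame would be df_reg
toy <- data.frame(
  region = c("California", "California", "Midwestern", "Midwestern", "Southern"),
  start_med_slry = c(52000, 48000, 44000, 46000, 43000)
)

# Median starting salary per region, sorted from highest to lowest
by_region <- aggregate(start_med_slry ~ region, data = toy, FUN = median)
by_region <- by_region[order(-by_region$start_med_slry), ]
by_region
```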
ggplot(df_col, aes(school_type)) +
geom_bar()
From the plot above, we noticed that most of the schools are state schools.
df_col_type_slry <- df_col %>%
select(school_type, start_med_slry, mid_car_slry) %>%
gather(timeline, salary, start_med_slry:mid_car_slry) %>%
mutate(timeline = as_factor(timeline)) # as_factor() keeps levels in order of appearance
ggplot(df_col_type_slry, aes(reorder(school_type, salary), salary, fill = timeline)) +
scale_color_manual(values = c("blue", "pink")) +
geom_boxplot(alpha = 0.5) +
scale_fill_manual(values = c("blue", "pink")) +
theme(legend.position = "right") +
xlab('School type')
Both Engineering and Ivy League schools have higher starting and mid-career median salaries.
Let’s fit a linear regression of median starting salary on school type.
summary(lm(formula = start_med_slry ~ school_type, data=df_col))
##
## Call:
## lm(formula = start_med_slry ~ school_type, data = df_col)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12857.9 -3026.3 -526.3 2373.7 16442.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59058 1046 56.466 <2e-16 ***
## school_typeIvy League 1417 1921 0.738 0.461
## school_typeLiberal Arts -13311 1239 -10.740 <2e-16 ***
## school_typeParty -13343 1460 -9.136 <2e-16 ***
## school_typeState -14932 1101 -13.559 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4559 on 264 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.4946
## F-statistic: 66.56 on 4 and 264 DF, p-value: < 2.2e-16
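The intercept ($59,058) is the estimated mean starting median salary for the baseline school type, Engineering, which is the category omitted from the coefficient table. Ivy League schools are estimated to be $1,417 higher than Engineering schools, but this difference is not statistically significant (p = 0.461). Liberal Arts, Party, and State schools are all significantly lower than Engineering schools, by roughly $13,300 to $14,900.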
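To illustrate how coefficients for a categorical predictor should be read, the sketch below fits the same kind of model on made-up data and compares the coefficients with the group means. The data frame `toy` and its values are assumptions for illustration only.

```r
# Made-up data: three hypothetical school types, two schools each
toy <- data.frame(
  school_type = factor(c("Engineering", "Engineering", "Ivy League",
                         "Ivy League", "State", "State")),
  start_med_slry = c(58000, 60000, 59000, 63000, 44000, 46000)
)

fit <- lm(start_med_slry ~ school_type, data = toy)
coefs <- coef(fit)

# Group means for comparison
means <- tapply(toy$start_med_slry, toy$school_type, mean)

coefs
means
```

With R's default treatment coding, the intercept equals the mean of the first factor level (here Engineering), and each remaining coefficient is the difference between that group's mean and the baseline mean.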
Top 10 schools by median starting salary
df_start_top10 <- df_col %>%
select(school_name, school_type, start_med_slry) %>%
arrange(desc(start_med_slry)) %>%
top_n(10, start_med_slry) # rank explicitly by starting salary
ggplot(df_start_top10, aes(reorder(school_name, start_med_slry), start_med_slry,fill = school_type)) +
geom_col(alpha = 0.8) +
geom_text(aes(label = start_med_slry), hjust = 1.1, color = 'gray30') +
xlab(NULL) +
coord_flip()
In the top 10 list, we can see that Engineering and Ivy League schools have the highest median starting salaries.
ggplot(df_deg, aes(x = reorder(major, start_med_slry), start_med_slry)) +
geom_col(fill = "blue", alpha = 0.5) +
geom_col(aes(x = reorder(major, mid_car_slry), mid_car_slry), alpha = 0.2) +
geom_text(aes(label = start_med_slry), size = 3, hjust = 1.1) +
xlab(NULL) +
coord_flip() +
ggtitle("Starting Salary by Major")
From the plot above, we can see that Physician Assistant has the highest median starting salary. Most of the majors in the top ten by median starting salary are related to engineering or computer science.
ggplot(df_deg, aes(x = reorder(major, mid_car_slry), mid_car_slry)) +
geom_col(fill = 'pink',alpha = 0.5) +
geom_col(aes(x = reorder(major, mid_car_slry), start_med_slry), alpha = 0.4) +
geom_text(aes(label = mid_car_slry), size = 3, hjust = 1.1) +
scale_fill_manual(values = c('blue', 'pink')) +
xlab(NULL) +
coord_flip() +
ggtitle("Mid-career Salary by Major")
From the plot above, we can see that the majors with the highest median starting salaries generally also have high mid-career salaries.
Choosing which college to attend and which career to pursue is always a difficult decision, and many factors should be considered, including post-graduation salary. We found that salaries differed considerably by school type and degree, but not as much by location. Based on our study, students who graduate with STEM majors from Ivy League or Engineering schools located in California or the Northeast have the highest salaries.