Each year, millions of high school students apply to colleges and universities, visit campuses, and decide what they want to study for the next four years. While this is a very exciting time in a young adult’s life, it is also a critical decision that could have a major impact on their future financial success. Prospective college students should therefore make informed decisions about which college to attend and which curriculum to study.
This leads to the major research question of this study: “How do specific college factors impact starting salaries?” In this report, I analyze three factors that may impact starting salaries:
Type of college (e.g., Engineering school)
College Location
College Major
This section will discuss the data collection, what is in the data, and basic information about the studies that generated this data.
This study uses three different datasets that were collected from Kaggle, which were obtained from the Wall Street Journal (WSJ). WSJ gathers this information using Payscale Inc. The links to each dataset are provided below:
Salary by College Type - https://www.kaggle.com/wsj/college-salaries/version/1#salaries-by-college-type.csv
Salary by College Region - https://www.kaggle.com/wsj/college-salaries/version/1#salaries-by-region.csv
Salary by Degree - https://www.kaggle.com/wsj/college-salaries/version/1#degrees-that-pay-back.csv
All of these datasets were in CSV format. I could have pulled the data directly from the website, but I decided to place the CSV files in my GitHub repository and pull the data from there. I retrieved each of the CSV files using the read.csv() function. Next, I wanted to make sure that all the salary columns were numeric and did not contain any extraneous characters. As can be seen in the R code below, I removed the dollar sign, comma, and cents (“.00”) from the salary columns and converted the columns to a numeric data type.
library(stringr)
salary.by.college.type.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salaries_by_type.csv"),header=TRUE)
for (i in 3:ncol(salary.by.college.type.df)) {
salary.by.college.type.df[,i] <- as.character(gsub("\\$","",salary.by.college.type.df[,i]))
salary.by.college.type.df[,i] <- as.character(gsub("\\,","",salary.by.college.type.df[,i]))
salary.by.college.type.df[,i] <- as.character(gsub("\\.00","",salary.by.college.type.df[,i]))
salary.by.college.type.df[,i] <- as.numeric(salary.by.college.type.df[,i])
}
salary.by.region.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_college_location.csv"),header=TRUE)
for (i in 3:ncol(salary.by.region.df)) {
salary.by.region.df[,i] <- as.character(gsub("\\$","",salary.by.region.df[,i]))
salary.by.region.df[,i] <- as.character(gsub("\\,","",salary.by.region.df[,i]))
salary.by.region.df[,i] <- as.character(gsub("\\.00","",salary.by.region.df[,i]))
salary.by.region.df[,i] <- as.numeric(salary.by.region.df[,i])
}
salary.by.major.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_major.csv"),header=TRUE)
for (i in 2:ncol(salary.by.major.df)) {
salary.by.major.df[,i] <- as.character(gsub("\\$","",salary.by.major.df[,i]))
salary.by.major.df[,i] <- as.character(gsub("\\,","",salary.by.major.df[,i]))
salary.by.major.df[,i] <- as.character(gsub("\\.00","",salary.by.major.df[,i]))
salary.by.major.df[,i] <- as.numeric(salary.by.major.df[,i])
}
Each of the three datasets has a different number and type of cases. They are broken down below.
Salary by College Type: each case represents a different college/university. There are 269 cases in this dataset. For each college, the data also lists which type of school it is, along with salary averages.
nrow(salary.by.college.type.df)
[1] 269
Salary by College Region: each case represents a different college/university. There are 320 cases in this dataset. For each college, the data also lists the region in which it is located, along with salary averages.
nrow(salary.by.region.df)
[1] 320
Salary by Degree: each case represents a different major. There are 50 cases in this dataset. For each major, the data also lists salary averages.
nrow(salary.by.major.df)
[1] 50
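The three cleaning loops above repeat the same steps, so they could be folded into one helper. The following is a minimal sketch rather than the code used in this study: `clean_salary_columns()` and the toy data frame are hypothetical names introduced for illustration, and the sketch assumes every salary value is formatted like "$45,000.00".

```r
# Hypothetical helper: strip "$" and "," from the chosen columns and convert to numeric.
clean_salary_columns <- function(df, cols) {
  for (col in cols) {
    # as.numeric() parses a trailing ".00" on its own, so only "$" and "," are removed
    df[[col]] <- as.numeric(gsub("[$,]", "", as.character(df[[col]])))
  }
  df
}

# Toy data, made up for illustration (not taken from the WSJ files)
toy <- data.frame(
  School = c("A", "B"),
  Starting.Median.Salary = c("$45,000.00", "$52,500.00"),
  stringsAsFactors = FALSE
)

toy_clean <- clean_salary_columns(toy, cols = 2:ncol(toy))
str(toy_clean)
```

With a helper like this, each dataset would need a single call, with `cols` starting at 3 for the two college datasets and at 2 for the major dataset, matching the loops above.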
This study has one quantitative dependent variable: starting salary. However, each dataset supplies a different independent variable. The independent variables in this study are:
College Type
College Location
Type of Major
All of these studies are observational. By definition, an observational study is one in which the researcher has no control over the variables. This is exactly the case with these datasets, where the data is simply pulled from a database, with no control imposed by the researcher.
The population of interest for these datasets is members of the US workforce who hold a college degree. The information pulled from the population is then grouped by college type, location, and major. For example, if a person graduates from Princeton University with a degree in Chemical Engineering, this person’s salary information would be averaged into the Ivy League case for Salary by College Type, the Northeast case for Salary by College Location, and Chemical Engineering for Salary by Major. Since this study takes the starting salary information from recent graduates across the country, it is reasonable to generalize the results to that population. One thing the data does not report is the percentage of graduates who are unemployed when coming out of college, which would also factor into the choice of college/university.
Since these are observational studies, we are unable to establish causal relationships between the variables, as there may be other contributing factors to these results. However, we are analyzing a few different factors, so we can see if the data strongly suggests specific relationships.
The exploratory data analysis for each variable is presented below, one variable at a time.
The number of schools per type is shown below. This study has a total of five different types of schools.
summary(salary.by.college.type.df$School.Type)
 Engineering   Ivy League Liberal Arts        Party        State
          19            8           47           20          175
First, I wanted to look at the starting salary distribution across all the cases in the study. The data appears to be unimodal and symmetric.
salary.type <- salary.by.college.type.df$Starting.Median.Salary
hist(salary.type,main="Starting Salary Histogram",xlab="Starting Salary")
Next, I looked at the summary statistics for starting salary:
cat("Summary Statistics for Starting Salary:")
Summary Statistics for Starting Salary:
summary(salary.by.college.type.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34800   42000   44700   46068   48300   75500
Finally, I created a side-by-side box plot to compare distributions by college type. The graph shows the starting salary difference as a function of school type. We can see that graduates of Ivy League schools tend to have higher starting salaries.
boxplot(salary.by.college.type.df$Starting.Median.Salary ~ salary.by.college.type.df$School.Type,xlab='School Type',ylab='Salary (USD)',main='Starting Salary')
The number of schools per region is shown below. This study has a total of five different regions.
summary(salary.by.region.df$Region)
  California   Midwestern Northeastern     Southern      Western
          28           71          100           79           42
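First, I wanted to look at the starting salary distribution across all the cases in the study. The data appears to be unimodal and roughly symmetric. It can be argued that there is a slight right skew, since the summary statistics below show a mean above the median and a maximum far from the bulk of the data.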
salary.region <- salary.by.region.df$Starting.Median.Salary
hist(salary.region,main="Starting Salary Histogram",xlab="Starting Salary")
Next, I looked at the summary statistics for starting salary:
cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.region.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34500   42000   45100   46253   48900   75500
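One rough way to put a number on the shape of these distributions is the quartile (Bowley) skewness, which needs only the quartiles reported in the two summary() outputs above. This is a sketch under that limitation: it describes the central half of the data and ignores the tails, and `quartile_skew()` is a hypothetical helper name.

```r
# Quartiles copied from the two summary() outputs above
type_q   <- c(q1 = 42000, median = 44700, q3 = 48300)
region_q <- c(q1 = 42000, median = 45100, q3 = 48900)

# Bowley (quartile) skewness: 0 means the central half is symmetric,
# positive values mean the upper half of the box is longer than the lower half
quartile_skew <- function(q) {
  unname((q["q3"] + q["q1"] - 2 * q["median"]) / (q["q3"] - q["q1"]))
}

type_skew   <- round(quartile_skew(type_q), 3)    # 0.143
region_skew <- round(quartile_skew(region_q), 3)  # 0.101
print(c(college_type = type_skew, region = region_skew))
```

Both values are small and positive. Because the maximum (75500) lies much further from the median than the minimum does, a tail-sensitive measure computed on the raw data would likely show more asymmetry than this quartile-based one.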
Finally, I created a side-by-side box plot to compare distributions by college location. The graph shows the starting salary difference as a function of location. We can see that graduates of schools in the Northeast and California tend to have higher starting salaries.
boxplot(salary.by.region.df$Starting.Median.Salary ~ salary.by.region.df$Region,xlab='Region',ylab='Salary (USD)',main='Starting Salary')
A list of all the majors in the study is shown below. There are 50 majors in total.
salary.by.major.df$Undergraduate.Major
[1] Accounting
[2] Aerospace Engineering
[3] Agriculture
[4] Anthropology
[5] Architecture
[6] Art History
[7] Biology
[8] Business Management
[9] Chemical Engineering
[10] Chemistry
[11] Civil Engineering
[12] Communications
[13] Computer Engineering
[14] Computer Science
[15] Construction
[16] Criminal Justice
[17] Drama
[18] Economics
[19] Education
[20] Electrical Engineering
[21] English
[22] Film
[23] Finance
[24] Forestry
[25] Geography
[26] Geology
[27] Graphic Design
[28] Health Care Administration
[29] History
[30] Hospitality & Tourism
[31] Industrial Engineering
[32] Information Technology (IT)
[33] Interior Design
[34] International Relations
[35] Journalism
[36] Management Information Systems (MIS)
[37] Marketing
[38] Math
[39] Mechanical Engineering
[40] Music
[41] Nursing
[42] Nutrition
[43] Philosophy
[44] Physician Assistant
[45] Physics
[46] Political Science
[47] Psychology
[48] Religion
[49] Sociology
[50] Spanish
50 Levels: Accounting Aerospace Engineering Agriculture ... Spanish
Next, I looked at the summary statistics for median starting salary and mid-career salary across all majors. As expected, salaries are clearly higher for people later in their careers than for those just starting out.
cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.major.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34000   37050   40850   44310   49875   74300
cat("Median Mid Career Salary:")
Median Mid Career Salary:
summary(salary.by.major.df$Mid.Career.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  52000   60825   72000   74786   88750  107000
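The gap between the two summaries can be put in numbers using only the reported statistics. The sketch below copies the median and mean from the two summary() outputs above; the variable names are introduced here for illustration.

```r
# Median and mean copied from the two summary() outputs above
starting   <- c(median = 40850, mean = 44310)
mid_career <- c(median = 72000, mean = 74786)

# Absolute gap in USD. The gap in means equals the mean per-major growth,
# but the gap in medians is not necessarily the median per-major growth.
gap <- mid_career - starting

# Mid-career salary as a multiple of starting salary
ratio <- round(mid_career / starting, 2)

print(gap)    # median 31150, mean 30476
print(ratio)  # median 1.76, mean 1.69
```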
Finally, I created bar plots to compare the starting and mid-career salaries by major. The highest starting salary is for Physician Assistant, followed by six engineering majors. However, when looking at the mid-career salaries, Physician Assistant falls below all of the engineering majors.
library(ggplot2)
library(tidyverse)
library(dplyr)
library(forcats)
salary.by.major.df %>%
  mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Starting.Median.Salary)) %>%
  ggplot(aes(x=Undergraduate.Major,y=Starting.Median.Salary)) +
  geom_bar(stat="identity") + coord_flip() +
  xlab('Undergraduate Major') + ylab('Starting Salary (USD)')
salary.by.major.df %>%
  mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Mid.Career.Median.Salary)) %>%
  ggplot(aes(x=Undergraduate.Major,y=Mid.Career.Median.Salary)) +
  geom_bar(stat="identity") + coord_flip() +
  xlab('Undergraduate Major') + ylab('Mid Career Salary (USD)')
In order to highlight the potential for salary growth, I created a new column that simply measures the difference between mid-career salary and starting salary and ordered that as shown below. We can see that Economics majors have the highest growth, while Nursing majors have the smallest growth. This is useful information for students who want to see their potential salary growth when they reach the mid-point of their careers.
salary.by.major.df$difference <- salary.by.major.df$Mid.Career.Median.Salary - salary.by.major.df$Starting.Median.Salary
salary.by.major.df %>%
  mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,difference)) %>%
  ggplot(aes(x=Undergraduate.Major,y=difference)) +
  geom_bar(stat="identity") + coord_flip() +
  xlab('Undergraduate Major') + ylab('Growth in Salary (USD)')
The inference analysis for each variable is presented below, one variable at a time.
An ANOVA test was conducted in order to test if the starting median salary differed by College Type. The hypotheses are:
\(H_{0}: \mu_{\text{Eng}} = \mu_{\text{Ivy}} = \mu_{\text{Lib}} = \mu_{\text{Par}} = \mu_{\text{Sta}}\)
\(H_{a}: \text{At least one college type has a different mean starting salary}\)
As shown in the exploratory data analysis, the data is approximately normally distributed. We can also treat the observations in each group as independent of each other. Finally, we need to check if the variances across the groups are roughly equal:
type.group <- group_by(salary.by.college.type.df, School.Type)
summarise(type.group, variance = sd(Starting.Median.Salary)*sd(Starting.Median.Salary))
# A tibble: 5 x 2
  School.Type   variance
  <fct>            <dbl>
1 Engineering  61511462.
2 Ivy League   10359286.
3 Liberal Arts 19086892.
4 Party        13584500
5 State        18224937.
The group variances are all within the same order of magnitude. Therefore, I treat the conditions for inference as met and continue with the testing.
Below is the R output of the ANOVA test:
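The figures in the table above can be compared more directly by taking the ratio of the largest to the smallest group variance. This is a sketch using the reported variances only. The cutoff mentioned in the comments is a commonly quoted textbook convention, not something taken from this study.

```r
# Group variances copied from the tibble above
type_var <- c(Engineering = 61511462, `Ivy League` = 10359286,
              `Liberal Arts` = 19086892, Party = 13584500, State = 18224937)

# Ratio of largest to smallest variance, and the same ratio on the SD scale
var_ratio <- max(type_var) / min(type_var)
sd_ratio  <- sqrt(var_ratio)

# One commonly quoted rule of thumb treats an SD ratio below about 2
# (variance ratio below about 4) as "roughly equal".
print(round(c(var_ratio = var_ratio, sd_ratio = sd_ratio), 2))  # 5.94 and 2.44
```

With the full data frame available, a formal check such as `bartlett.test(Starting.Median.Salary ~ School.Type, data = salary.by.college.type.df)` from base R's stats package could complement this ratio.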
summary(aov(salary.by.college.type.df$Starting.Median.Salary ~ salary.by.college.type.df$School.Type))
                                       Df    Sum Sq   Mean Sq F value Pr(>F)
salary.by.college.type.df$School.Type   4 5.534e+09 1.383e+09   66.56 <2e-16 ***
Residuals                             264 5.487e+09 2.078e+07
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test generates an \(F\)-statistic, which is then used to compute the \(p\)-value. Here we can see that the \(p\)-value is approximately 0 (\(< 2.0 \times 10^{-16}\)). Therefore, we have strong evidence to reject the null in favor of the alternative, which means that starting salary is not equal for all college types.
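To make the link between the \(F\)-statistic and the \(p\)-value concrete, the sketch below recomputes both from the mean squares and degrees of freedom printed in the ANOVA table. The inputs are the rounded values shown in the output, so the result matches the table only to rounding precision.

```r
# Values copied from the ANOVA table above (rounded as printed)
ms_between <- 1.383e9   # Mean Sq for School.Type
ms_within  <- 2.078e7   # Mean Sq for Residuals
df_between <- 4
df_within  <- 264

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within

# The p-value is the upper tail of the F distribution at the observed statistic
p_value <- pf(f_stat, df1 = df_between, df2 = df_within, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value < 2e-16)
```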
Finally, I fit a linear regression model to gain more insight into how the groups compare. The intercept represents the mean starting salary for Engineering schools, which serve as the reference group. The remaining estimates are all negative, except for the Ivy League schools. This means that the Ivy League schools had the highest starting salaries, followed by the Engineering schools, although the difference between these two groups is not statistically significant (\(p = 0.461\)). These results agree with the box plot shown in the exploratory data analysis section.
summary(lm(formula = Starting.Median.Salary ~ School.Type, data=salary.by.college.type.df))

Call:
lm(formula = Starting.Median.Salary ~ School.Type, data = salary.by.college.type.df)

Residuals:
     Min       1Q   Median       3Q      Max
-12857.9  -3026.3   -526.3   2373.7  16442.1

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                59058       1046  56.466   <2e-16 ***
School.TypeIvy League       1417       1921   0.738    0.461
School.TypeLiberal Arts   -13311       1239 -10.740   <2e-16 ***
School.TypeParty          -13343       1460  -9.136   <2e-16 ***
School.TypeState          -14932       1101 -13.559   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4559 on 264 degrees of freedom
Multiple R-squared: 0.5021, Adjusted R-squared: 0.4946
F-statistic: 66.56 on 4 and 264 DF, p-value: < 2.2e-16
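Because School.Type is a categorical predictor, each coefficient is an offset from the reference group (Engineering), so the fitted group means can be recovered by adding each estimate to the intercept. The sketch below does this with the estimates printed above and, as a sanity check, combines them with the group counts from the exploratory section to recover the overall mean.

```r
# Estimates copied from the regression output above; Engineering is the reference group
intercept <- 59058
offsets <- c(Engineering = 0, `Ivy League` = 1417, `Liberal Arts` = -13311,
             Party = -13343, State = -14932)

# Fitted mean starting salary for each school type
group_means <- intercept + offsets
print(sort(group_means, decreasing = TRUE))

# Group sizes copied from the summary of School.Type shown earlier
n <- c(Engineering = 19, `Ivy League` = 8, `Liberal Arts` = 47, Party = 20, State = 175)

# The count-weighted average of the group means should match the overall mean (46068)
overall_mean <- weighted.mean(group_means, n)
print(round(overall_mean))
```

The weighted average comes out at about 46068, which agrees with the mean reported in the summary statistics for this dataset.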
An ANOVA test was conducted in order to test if the starting median salary differed by College Location. The hypotheses are:
\(H_{0}: \mu_{\text{Cal}} = \mu_{\text{Mid}} = \mu_{\text{NE}} = \mu_{\text{South}} = \mu_{\text{West}}\)
\(H_{a}: \text{At least one college location has a different mean starting salary}\)
As shown in the exploratory data analysis, the data is approximately normally distributed. We can also treat the observations in each group as independent of each other. Finally, we need to check if the variances across the groups are roughly equal:
location.group <- group_by(salary.by.region.df, Region)
summarise(location.group, variance = sd(Starting.Median.Salary)*sd(Starting.Median.Salary))
# A tibble: 5 x 2
  Region        variance
  <fct>            <dbl>
1 California   80677077.
2 Midwestern   25687062.
3 Northeastern 51251297.
4 Southern     31652993.
5 Western      15485645.
The group variances are all within the same order of magnitude. Therefore, I treat the conditions for inference as met and continue with the testing.
Below is the R output of the ANOVA test:
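When group variances differ by a noticeable factor, one option is to run Welch's one-way test alongside the classical ANOVA, since it does not assume equal variances. The sketch below is illustrative only: it uses simulated data with made-up group labels and parameters, not the salary data, and relies on `oneway.test()` from base R's stats package.

```r
set.seed(606)

# Simulated groups with unequal spreads (values are made up for illustration)
sim <- data.frame(
  group  = factor(rep(c("A", "B", "C"), times = c(30, 30, 30))),
  salary = c(rnorm(30, mean = 45000, sd = 3000),
             rnorm(30, mean = 46000, sd = 6000),
             rnorm(30, mean = 50000, sd = 9000))
)

# Welch's test (unequal variances) versus the classical equal-variance F test
welch   <- oneway.test(salary ~ group, data = sim, var.equal = FALSE)
classic <- oneway.test(salary ~ group, data = sim, var.equal = TRUE)

print(c(welch_p = welch$p.value, classic_p = classic$p.value))
```

If both tests lead to the same conclusion, the equal-variance assumption is unlikely to be driving the result.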
summary(aov(salary.by.region.df$Starting.Median.Salary ~ salary.by.region.df$Region))
                            Df    Sum Sq   Mean Sq F value   Pr(>F)
salary.by.region.df$Region   4 1.813e+09 453344384   11.75 6.59e-09 ***
Residuals                  315 1.215e+10  38584440
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we can see that the \(p\)-value is approximately 0 (\(6.59 \times 10^{-9}\)). Therefore, we have strong evidence to reject the null in favor of the alternative, which means that starting salary is not equal for each region.
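A small \(p\)-value says the group means differ, but not by how much. One simple effect-size measure is eta-squared, the share of total variation explained by the grouping variable, which can be computed from the sums of squares in the ANOVA table above. This is a sketch using the rounded values as printed.

```r
# Sums of squares copied from the ANOVA table above (rounded as printed)
ss_region    <- 1.813e9
ss_residuals <- 1.215e10

# Eta-squared: between-group sum of squares over total sum of squares
eta_sq <- ss_region / (ss_region + ss_residuals)
print(round(eta_sq, 3))  # 0.13
```

For a one-way design this is the same quantity as the Multiple R-squared in the regression output that follows (0.1298), so region explains roughly 13% of the variation in starting salary across schools.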
Finally, I fit a linear regression model to get a better look at how the groups compare. The intercept represents the mean starting salary for California schools, which serve as the reference group. The remaining estimates are all negative, which means that none of the other locations had salaries as high as California's. The Northeast schools had the smallest estimate in magnitude, which means they were the closest to California, and this difference is not statistically significant at the 0.05 level (\(p = 0.0571\)). These results agree with the box plot shown in the exploratory data analysis section.
summary(lm(formula = Starting.Median.Salary ~ Region, data=salary.by.region.df))

Call:
lm(formula = Starting.Median.Salary ~ Region, data = salary.by.region.df)

Residuals:
   Min     1Q Median     3Q    Max
-13032  -3943  -1123   2470  24468

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)           51032       1174  43.473  < 2e-16 ***
RegionMidwestern      -6807       1386  -4.911 1.46e-06 ***
RegionNortheastern    -2536       1328  -1.910   0.0571 .
RegionSouthern        -6511       1366  -4.766 2.88e-06 ***
RegionWestern         -6618       1516  -4.367 1.71e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6212 on 315 degrees of freedom
Multiple R-squared: 0.1298, Adjusted R-squared: 0.1188
F-statistic: 11.75 on 4 and 315 DF, p-value: 6.594e-09
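The coefficient table can also be read through confidence intervals. The sketch below builds approximate 95% intervals for each region's difference from California using the estimates, standard errors, and residual degrees of freedom printed above. These intervals are not adjusted for multiple comparisons, so they are a rough guide rather than a formal pairwise procedure.

```r
# Estimates and standard errors copied from the regression output above
est <- c(Midwestern = -6807, Northeastern = -2536, Southern = -6511, Western = -6618)
se  <- c(1386, 1328, 1366, 1516)

# Critical t value for a 95% interval with 315 residual degrees of freedom
t_crit <- qt(0.975, df = 315)

# Interval for each region's difference from California (the reference group)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci))

# An interval that contains 0 corresponds to a p-value above 0.05
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0
print(contains_zero)
```

Only the Northeastern interval contains zero, which matches its \(p\)-value of 0.0571 in the table.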
I did not perform any inference testing on the Salary by Major data, as the most important information was shown in the exploratory analysis section.
The major finding from this study is that starting salary is not equivalent for all college selection variables (i.e. type of college). This is supported by the following findings:
All of these findings were supported by statistical analyses seen in the Exploratory and Inference sections.
It is important to reiterate that these types of analyses could be very useful to future college students in deciding which major to select. While financial stability should not be the only reason for selecting a major, it may be a deciding factor between two majors.
In the future, it would be useful to merge these datasets by joining them on common variables. By doing this, we could identify which factors contribute the most to starting salary, rather than only showing that each variable has an impact.
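As a rough illustration of what such a merge could look like, the sketch below joins two toy data frames on a school-name key. Everything here is hypothetical: the column name `School.Name`, the school labels, and the salary values are made up for illustration, and the real files may use different column names.

```r
# Toy stand-ins for the college type and region datasets (all values made up)
type_toy <- data.frame(
  School.Name = c("School A", "School B", "School C"),
  School.Type = c("Engineering", "State", "State"),
  Starting.Median.Salary = c(58000, 44000, 43000),
  stringsAsFactors = FALSE
)
region_toy <- data.frame(
  School.Name = c("School A", "School B", "School D"),
  Region = c("California", "Southern", "Western"),
  stringsAsFactors = FALSE
)

# Inner join: keeps only schools that appear in both datasets
merged <- merge(type_toy, region_toy, by = "School.Name")
print(merged)

# With the real merged data, both factors could enter one model, for example:
# lm(Starting.Median.Salary ~ School.Type + Region, data = merged)
```

An inner join drops schools that are missing from either file, so the merged data would likely contain fewer than 269 cases. The major dataset has one row per major rather than per school, so it could not be joined this way without additional data.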