DATA 606 Final Project

Introduction

Each year, millions of high school students apply to colleges and universities, visit campuses, and decide what they want to study for the next four years. While this is an exciting time in a young adult’s life, it is also a critical decision that could have a major impact on their future financial success. Prospective college students should therefore make informed decisions about which college to attend and which curriculum to study.

This leads to the major research question of this study: “How do specific college factors impact starting salaries?” In this report, I analyze three factors that may affect starting salaries:

Type of college (e.g. Engineering school)
College Location
College Major

Data

This section will discuss the data collection, what is in the data, and basic information about the studies that generated this data.

Data Collection

This study uses three datasets downloaded from Kaggle. They were originally published by the Wall Street Journal (WSJ), which gathers this information using Payscale Inc. The links to each dataset are provided below:

Salary by College Type - https://www.kaggle.com/wsj/college-salaries/version/1#salaries-by-college-type.csv
Salary by College Region - https://www.kaggle.com/wsj/college-salaries/version/1#salaries-by-region.csv
Salary by Degree - https://www.kaggle.com/wsj/college-salaries/version/1#degrees-that-pay-back.csv

All of these datasets were in csv format. I could have pulled the data directly from the website, but I decided to place the csv files in my GitHub repository and pull the data from there. I retrieved each csv file using the read.csv() function. Next, I wanted to make sure that all the salary columns were numeric and did not contain any extraneous characters. As can be seen in the R code below, I removed the dollar sign, comma, and cents (“.00”) from the salary columns and converted the columns to a numeric data type.

# Strip "$", "," and ".00" from a salary column and convert it to numeric.
clean.salary <- function(x) {
  x <- gsub("[$,]", "", as.character(x))
  x <- gsub("\\.00", "", x)
  as.numeric(x)
}
# Read one csv file and clean every column from first.salary.col onward.
# stringsAsFactors=TRUE keeps the grouping columns as factors (the read.csv() default changed in R 4.0).
read.salary.data <- function(file, first.salary.col) {
  df <- read.csv(url(file), header=TRUE, stringsAsFactors=TRUE)
  salary.cols <- first.salary.col:ncol(df)
  df[salary.cols] <- lapply(df[salary.cols], clean.salary)
  df
}
# Columns 1-2 hold the school name and its type or region, so the salaries start at column 3.
salary.by.college.type.df <- read.salary.data("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salaries_by_type.csv", 3)
salary.by.region.df <- read.salary.data("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_college_location.csv", 3)
# Column 1 holds the major, so the salaries start at column 2.
salary.by.major.df <- read.salary.data("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_major.csv", 2)
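
As a quick sanity check on the cleaning step, the sketch below applies the same three substitutions to a few made-up salary strings. The function name `parse.currency` and the sample values are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper mirroring the cleaning step: strip "$", "," and ".00".
parse.currency <- function(x) {
  x <- gsub("\\$", "", as.character(x))
  x <- gsub(",", "", x)
  x <- gsub("\\.00", "", x)
  # Non-numeric entries (e.g. "N/A") become NA; the warning is suppressed here.
  suppressWarnings(as.numeric(x))
}

# Made-up sample values in the same format as the raw csv files.
sample.salaries <- c("$45,100.00", "$107,000.00", "N/A")
parsed <- parse.currency(sample.salaries)
print(parsed)

# Count how many entries could not be converted.
sum(is.na(parsed))
```

Counting the NA values after conversion is a cheap way to spot cells that were missing or formatted differently in the source files.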

Cases

The three datasets differ in the number and type of cases. They are broken down below.

Salary by College Type

Each case represents a different college/university. There are 269 cases in this dataset. For each college, the dataset also lists the type of school, along with salary averages.

nrow(salary.by.college.type.df)

[1] 269

Salary by College Location

Each case represents a different college/university. There are 320 cases in this dataset. For each college, the dataset also lists the region in which it is located, along with salary averages.

nrow(salary.by.region.df)

[1] 320

Salary by College Major

Each case represents a different major. There are 50 cases in this dataset. For each major, the dataset also lists salary averages.

nrow(salary.by.major.df)

[1] 50

Variables

This study has one quantitative dependent variable, which is starting salary. However, each dataset represents a different independent variable. The independent variables in this study are:

College Type
College Location
Type of Major

Type of Study

All of these studies are observational. By definition, an observational study is one in which the researcher has no control over the variables in the study. This is exactly the case with these datasets, where the data is simply pulled from a database, with no control imposed by the researcher.

Scope of Inference

Generalizability

The population of interest for these datasets is members of the US workforce who have a college degree. The information pulled from the population is then grouped by college type, location, and major. For example, if a person graduates from Princeton University with a degree in Chemical Engineering, this person’s salary information would be averaged into the Ivy League case for Salary by College Type, the Northeast case for Salary by College Location, and Chemical Engineering for Salary by Major. Since this study draws on starting salary information for recent graduates across the country, it is reasonable to generalize the results to that population. One area that this study does not address is the percentage of graduates who are unemployed when coming out of college. This would also factor into the choice of college/university.

Causality

Since these are observational studies, we are unable to establish causal relationships between the variables, as there may be other contributing factors to these results. However, we are analyzing a few different factors, so we can see if the data strongly suggests specific relationships.

Exploratory Data Analysis

The exploratory data analysis for each variable is presented below.

Salary by College Type

The number of schools per type is shown below. This study has a total of five different types of schools.

summary(salary.by.college.type.df$School.Type)

 Engineering   Ivy League Liberal Arts        Party        State 
          19            8           47           20          175

First, I wanted to look at the starting salary distribution across all the cases in the study. The data appears to be unimodal and symmetric.

salary.type <- salary.by.college.type.df$Starting.Median.Salary
hist(salary.type,main="Starting Salary Histogram",xlab="Starting Salary")

Next, I looked at the summary statistics for starting salary:

cat("Summary Statistics for Starting Salary:")

Summary Statistics for Starting Salary:

summary(salary.by.college.type.df$Starting.Median.Salary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34800   42000   44700   46068   48300   75500

Finally, I created a side-by-side box plot to compare distributions by college type. The graph shows how starting salary differs by school type. We can see that graduates of Ivy League and Engineering schools tend to have higher starting salaries.

boxplot(salary.by.college.type.df$Starting.Median.Salary ~ salary.by.college.type.df$School.Type,xlab='School Type',ylab='Salary (USD)',main='Starting Salary')

Salary by College Location

The number of schools per region is shown below. This study has a total of five different regions.

summary(salary.by.region.df$Region)

  California   Midwestern Northeastern     Southern      Western 
          28           71          100           79           42

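First, I wanted to look at the starting salary distribution across all the cases in the study. The data appears to be unimodal and roughly symmetric. It can be argued that there is a slight right skew, since in the summary statistics that follow the mean (46253) lies above the median (45100) and the maximum (75500) is far above the third quartile.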

salary.region <- salary.by.region.df$Starting.Median.Salary
hist(salary.region,main="Starting Salary Histogram",xlab="Starting Salary")

Next, I looked at the summary statistics for starting salary:

cat("Median Starting Salary:")

Median Starting Salary:

summary(salary.by.region.df$Starting.Median.Salary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34500   42000   45100   46253   48900   75500
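
A rough numerical check on the shape of a distribution is to compare the mean with the median and to compute the moment-based sample skewness. The sketch below uses a small made-up vector (`x` is hypothetical, not the actual salary column), and `sample.skewness` is an illustrative helper rather than a base R function. A positive value indicates a longer right tail.

```r
# Hypothetical salaries with a few large values in the right tail.
x <- c(38000, 41000, 42000, 44000, 45000, 46000, 48000, 52000, 61000, 75500)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample.skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^(3/2)
}

# Positive when the mean is pulled above the median.
mean(x) - median(x)
# Positive for right skew, negative for left skew.
sample.skewness(x)
```

The same two lines could be applied to salary.by.region.df$Starting.Median.Salary to support or revise the visual impression from the histogram.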

Finally, I created a side-by-side box plot to compare distributions by college location. The graph shows how starting salary differs by location. We can see that graduates of schools in California and the Northeast tend to have higher starting salaries.

boxplot(salary.by.region.df$Starting.Median.Salary ~ salary.by.region.df$Region,xlab='Region',ylab='Salary (USD)',main='Starting Salary')

Salary by College Major

A list of all the majors in the study is shown below. There are a total of 50 majors in this study.

salary.by.major.df$Undergraduate.Major

 [1] Accounting                          
 [2] Aerospace Engineering               
 [3] Agriculture                         
 [4] Anthropology                        
 [5] Architecture                        
 [6] Art History                         
 [7] Biology                             
 [8] Business Management                 
 [9] Chemical Engineering                
[10] Chemistry                           
[11] Civil Engineering                   
[12] Communications                      
[13] Computer Engineering                
[14] Computer Science                    
[15] Construction                        
[16] Criminal Justice                    
[17] Drama                               
[18] Economics                           
[19] Education                           
[20] Electrical Engineering              
[21] English                             
[22] Film                                
[23] Finance                             
[24] Forestry                            
[25] Geography                           
[26] Geology                             
[27] Graphic Design                      
[28] Health Care Administration          
[29] History                             
[30] Hospitality & Tourism               
[31] Industrial Engineering              
[32] Information Technology (IT)         
[33] Interior Design                     
[34] International Relations             
[35] Journalism                          
[36] Management Information Systems (MIS)
[37] Marketing                           
[38] Math                                
[39] Mechanical Engineering              
[40] Music                               
[41] Nursing                             
[42] Nutrition                           
[43] Philosophy                          
[44] Physician Assistant                 
[45] Physics                             
[46] Political Science                   
[47] Psychology                          
[48] Religion                            
[49] Sociology                           
[50] Spanish                             
50 Levels: Accounting Aerospace Engineering Agriculture ... Spanish

Next, I looked at the summary statistics for median starting salary and mid-career salary across all majors. As expected, salaries are clearly higher for people later in their careers than for those just starting out.

cat("Median Starting Salary:")

Median Starting Salary:

summary(salary.by.major.df$Starting.Median.Salary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34000   37050   40850   44310   49875   74300

cat("Median Mid Career Salary:")

Median Mid Career Salary:

summary(salary.by.major.df$Mid.Career.Median.Salary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  52000   60825   72000   74786   88750  107000

Finally, I created bar plots to compare the starting and mid-career salaries by major. The highest starting salary is for Physician Assistant, followed by six engineering majors. However, when looking at the mid-career salaries, Physician Assistant fell below all of the engineering majors.

library(ggplot2)
library(tidyverse)
library(dplyr)
library(forcats)
salary.by.major.df %>% mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Starting.Median.Salary)) %>% ggplot(aes(x=Undergraduate.Major,y=Starting.Median.Salary)) + geom_bar(stat="identity") + coord_flip() + xlab('Undergraduate Major') + ylab('Starting Salary (USD)')

salary.by.major.df %>% mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Mid.Career.Median.Salary)) %>% ggplot(aes(x=Undergraduate.Major,y=Mid.Career.Median.Salary)) + geom_bar(stat="identity") + coord_flip() + xlab('Undergraduate Major') + ylab('Mid Career Salary (USD)')

In order to highlight the potential for salary growth, I created a new column that simply measures the difference between mid-career salary and starting salary and ordered that as shown below. We can see that Economics majors have the highest growth, while Nursing majors have the smallest growth. This is useful information for students who want to see their potential salary growth when they reach the mid-point of their careers.

salary.by.major.df$difference <- salary.by.major.df$Mid.Career.Median.Salary - salary.by.major.df$Starting.Median.Salary
salary.by.major.df %>% mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,difference)) %>% ggplot(aes(x=Undergraduate.Major,y=difference)) + geom_bar(stat="identity") + coord_flip() + xlab('Undergraduate Major') + ylab('Growth in Salary (USD)')
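
Absolute growth tends to favor majors that already start high, so a relative measure can be a useful complement. The sketch below computes percentage growth on a made-up data frame. The three rows and their salary figures are illustrative assumptions; only the column names follow the dataset.

```r
# Hypothetical rows; the salary figures are invented for illustration.
toy.major.df <- data.frame(
  Undergraduate.Major = c("Major A", "Major B", "Major C"),
  Starting.Median.Salary = c(40000, 50000, 60000),
  Mid.Career.Median.Salary = c(70000, 80000, 75000)
)

# Absolute growth, computed the same way as the difference column.
toy.major.df$difference <- toy.major.df$Mid.Career.Median.Salary -
  toy.major.df$Starting.Median.Salary

# Relative growth as a percentage of the starting salary.
toy.major.df$percent.growth <- 100 * toy.major.df$difference /
  toy.major.df$Starting.Median.Salary

# Order from highest to lowest relative growth.
toy.major.df[order(-toy.major.df$percent.growth), ]
```

In this toy example, Major A and Major B grow by the same dollar amount, but Major A grows more in relative terms because it starts lower.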

Inference

The inference analysis for each variable is presented below.

Salary by College Type

An ANOVA test was conducted in order to test if the starting median salary differed by College Type. The hypotheses are:

\(H_{0}: \mu_{\text{Eng}} = \mu_{\text{Ivy}} = \mu_{\text{Lib}} = \mu_{\text{Par}} = \mu_{\text{Sta}}\)

\(H_{a}: \text{At least one college type has a different mean starting salary}\)

As shown in the exploratory data analysis, the starting salaries are approximately normally distributed. We can also treat the observations across groups as independent of each other. Finally, we need to check if the variances across the groups are roughly equal:

# Group the schools by type and compute the variance of starting salary within each group.
type.group <- group_by(salary.by.college.type.df, School.Type)
summarise(type.group, variance = var(Starting.Median.Salary))

# A tibble: 5 x 2
  School.Type   variance
  <fct>            <dbl>
1 Engineering  61511462.
2 Ivy League   10359286.
3 Liberal Arts 19086892.
4 Party        13584500 
5 State        18224937.

The variances are all within the same order of magnitude, although the Engineering variance is roughly six times the Ivy League variance. With that caveat, we treat the conditions for inference as met and continue with the testing.
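
To put a number on "roughly equal", one option is the ratio of the largest to the smallest group variance, computed here from the values in the table above. A common rule of thumb (an assumption brought in here, not a criterion stated in this report) treats a standard deviation ratio below about 2 as comfortable for ANOVA.

```r
# Group variances copied from the summarise() output above.
type.variance <- c(Engineering = 61511462, "Ivy League" = 10359286,
                   "Liberal Arts" = 19086892, Party = 13584500,
                   State = 18224937)

# Ratio of the largest to the smallest variance, and the same ratio on the sd scale.
variance.ratio <- max(type.variance) / min(type.variance)
sd.ratio <- sqrt(variance.ratio)
round(c(variance.ratio = variance.ratio, sd.ratio = sd.ratio), 2)
```

The variance ratio comes out at about 5.9 (about 2.4 on the standard deviation scale), driven by the Engineering group, so the equal-variance condition is worth treating with some caution.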

Below is the R output of the ANOVA test:

summary(aov(salary.by.college.type.df$Starting.Median.Salary ~ salary.by.college.type.df$School.Type))

                                       Df    Sum Sq   Mean Sq F value
salary.by.college.type.df$School.Type   4 5.534e+09 1.383e+09   66.56
Residuals                             264 5.487e+09 2.078e+07        
                                      Pr(>F)    
salary.by.college.type.df$School.Type <2e-16 ***
Residuals                                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test generates an \(F\)-statistic, which is then used to compute the \(p\)-value. Here we can see that the \(p\)-value is approximately 0 (\(< 2.0 \times 10^{-16}\)). Therefore, we have strong evidence to reject the null in favor of the alternative, which means that starting salary is not equal for all college types.
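
The link between the table entries, the \(F\)-statistic, and the \(p\)-value can be reproduced by hand. The sketch below copies the rounded mean squares and degrees of freedom from the ANOVA output above, so the result matches the reported value only up to rounding.

```r
# Values copied from the ANOVA table above.
ms.between <- 1.383e9    # Mean Sq for School.Type
ms.residual <- 2.078e7   # Mean Sq for Residuals
df.between <- 4
df.residual <- 264

# F is the ratio of the between-group to the within-group mean square.
f.stat <- ms.between / ms.residual

# The p-value is the upper-tail area of the F distribution at f.stat.
p.value <- pf(f.stat, df.between, df.residual, lower.tail = FALSE)

f.stat
p.value
```

R prints any p-value below machine precision as "<2e-16", which is why the table does not show an exact number.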

Finally, I looked at the linear regression model to gain more insight into how the groups compare. The intercept represents the mean starting salary for Engineering schools, which serve as the reference level. The remaining estimates are all negative, except for the Ivy League schools. This means that the Ivy League schools had the highest starting salaries, followed by the Engineering schools, although the difference between those two groups is not statistically significant (\(p = 0.461\)). These results agree with the box plot shown in the exploratory data analysis section.

summary(lm(formula = Starting.Median.Salary ~ School.Type, data=salary.by.college.type.df))


Call:
lm(formula = Starting.Median.Salary ~ School.Type, data = salary.by.college.type.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-12857.9  -3026.3   -526.3   2373.7  16442.1 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)                59058       1046  56.466   <2e-16 ***
School.TypeIvy League       1417       1921   0.738    0.461    
School.TypeLiberal Arts   -13311       1239 -10.740   <2e-16 ***
School.TypeParty          -13343       1460  -9.136   <2e-16 ***
School.TypeState          -14932       1101 -13.559   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4559 on 264 degrees of freedom
Multiple R-squared:  0.5021,    Adjusted R-squared:  0.4946 
F-statistic: 66.56 on 4 and 264 DF,  p-value: < 2.2e-16
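
Because School.Type is the only predictor, each group mean can be recovered as the intercept plus that group's coefficient. The sketch below does this with the estimates copied from the output above and the group sizes from the exploratory section, then checks that the size-weighted average agrees with the overall mean reported earlier (46068).

```r
# Coefficients copied from the regression output (Engineering is the reference level).
intercept <- 59058
offsets <- c(Engineering = 0, "Ivy League" = 1417, "Liberal Arts" = -13311,
             Party = -13343, State = -14932)

# Group sizes copied from the exploratory data analysis.
group.n <- c(Engineering = 19, "Ivy League" = 8, "Liberal Arts" = 47,
             Party = 20, State = 175)

# Each group mean is the intercept plus that group's coefficient.
group.means <- intercept + offsets
group.means

# The size-weighted average of the group means should recover the overall mean.
weighted.mean(group.means, group.n)
```

Reading the coefficients this way makes it easier to see that Liberal Arts, Party, and State schools sit close together, well below the Engineering and Ivy League groups.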

Salary by College Location

An ANOVA test was conducted in order to test if the starting median salary differed by College Location. The hypotheses are:

\(H_{0}: \mu_{\text{Cal}} = \mu_{\text{Mid}} = \mu_{\text{NE}} = \mu_{\text{South}} = \mu_{\text{West}}\)

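\(H_{a}: \text{At least one college location has a different mean starting salary}\)

As with college type, the starting salaries are treated as approximately normal and the observations as independent, so the remaining condition to check is whether the variances across the groups are roughly equal: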

# Group the schools by region and compute the variance of starting salary within each group.
location.group <- group_by(salary.by.region.df, Region)
summarise(location.group, variance = var(Starting.Median.Salary))

# A tibble: 5 x 2
  Region        variance
  <fct>            <dbl>
1 California   80677077.
2 Midwestern   25687062.
3 Northeastern 51251297.
4 Southern     31652993.
5 Western      15485645.

The variances are all within the same order of magnitude, although the California variance is roughly five times the Western variance. With that caveat, we treat the conditions for inference as met and continue with the testing.
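
Because the largest variance in the table is several times the smallest, one cautious option is to run Welch's one-way test alongside the classical ANOVA, since it does not assume equal variances. The sketch below uses simulated data: the data frame `toy.df`, its group labels, and all numbers are made up. `oneway.test()` is part of base R's stats package.

```r
set.seed(606)  # arbitrary seed so the simulated data is reproducible

# Simulated salaries for three hypothetical groups with unequal spread.
toy.df <- data.frame(
  salary = c(rnorm(30, mean = 51000, sd = 9000),
             rnorm(30, mean = 45000, sd = 5000),
             rnorm(30, mean = 44500, sd = 4000)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Classical one-way ANOVA, which assumes equal variances across groups.
classic <- oneway.test(salary ~ group, data = toy.df, var.equal = TRUE)

# Welch's version, which relaxes the equal-variance assumption.
welch <- oneway.test(salary ~ group, data = toy.df, var.equal = FALSE)

c(classic = classic$p.value, welch = welch$p.value)
```

If both tests lead to the same conclusion on the real data, the unequal variances are unlikely to be driving the result.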

Below is the R output of the ANOVA test:

summary(aov(salary.by.region.df$Starting.Median.Salary ~ salary.by.region.df$Region))

                            Df    Sum Sq   Mean Sq F value   Pr(>F)    
salary.by.region.df$Region   4 1.813e+09 453344384   11.75 6.59e-09 ***
Residuals                  315 1.215e+10  38584440                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we can see that the \(p\)-value is approximately 0 (\(6.59 \times 10^{-9}\)). Therefore, we have strong evidence to reject the null in favor of the alternative, which means that starting salary is not equal for each region.

Finally, I looked at the linear regression model to get a better look at how the groups compare. The intercept represents the mean starting salary for California schools, which serve as the reference level. The remaining estimates are all negative, which means that no other region had starting salaries as high as California's. The Northeastern estimate is the smallest in magnitude, which means it was the closest to California, and its difference from California is not significant at the 0.05 level (\(p = 0.0571\)). These results agree with the box plot shown in the exploratory data analysis section.

summary(lm(formula = Starting.Median.Salary ~ Region, data=salary.by.region.df))


Call:
lm(formula = Starting.Median.Salary ~ Region, data = salary.by.region.df)

Residuals:
   Min     1Q Median     3Q    Max 
-13032  -3943  -1123   2470  24468 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)           51032       1174  43.473  < 2e-16 ***
RegionMidwestern      -6807       1386  -4.911 1.46e-06 ***
RegionNortheastern    -2536       1328  -1.910   0.0571 .  
RegionSouthern        -6511       1366  -4.766 2.88e-06 ***
RegionWestern         -6618       1516  -4.367 1.71e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6212 on 315 degrees of freedom
Multiple R-squared:  0.1298,    Adjusted R-squared:  0.1188 
F-statistic: 11.75 on 4 and 315 DF,  p-value: 6.594e-09
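
The regression coefficients only compare each region with the reference level (California). To compare every pair of regions, one common follow-up is Tukey's honest significant difference test. The sketch below runs `TukeyHSD()` from base R on simulated data; `toy.df`, the region labels, and the numbers are made up for illustration.

```r
set.seed(2019)  # arbitrary seed so the simulated data is reproducible

# Simulated salaries for three hypothetical regions.
toy.df <- data.frame(
  salary = c(rnorm(25, mean = 51000, sd = 6000),
             rnorm(25, mean = 48500, sd = 6000),
             rnorm(25, mean = 44500, sd = 6000)),
  region = factor(rep(c("Region A", "Region B", "Region C"), each = 25))
)

fit <- aov(salary ~ region, data = toy.df)

# One row per pair of regions: difference in means, confidence bounds, adjusted p-value.
pairwise <- TukeyHSD(fit)$region
pairwise
```

With the five regions in the actual dataset this would produce ten pairwise comparisons, each with a p-value adjusted for multiple testing.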

Salary by College Major

I did not perform any inference testing on this group, as the most important information was shown in the exploratory analysis section.

Conclusion

The major finding from this study is that starting salary is not equivalent for all college selection variables (e.g. type of college). This is supported by the following findings:

Ivy League and Engineering schools tend to have higher starting salaries than the other college types
Graduates from schools located in California had the highest starting salary, followed by graduates from schools in the Northeast
It appears that the STEM majors had higher starting salaries than non-STEM majors. The greatest potential for growth in salary was Economics.

All of these findings were supported by statistical analyses seen in the Exploratory and Inference sections.

It is important to reiterate that these types of analyses could be very useful to future college students when selecting a major. While financial stability should not be the only reason for selecting a major, it may be a deciding factor between two majors.

In the future, it would be useful to merge these datasets by joining them on common variables. By doing this, we could identify which factors contribute the most to starting salary, rather than only showing that each variable has an impact.
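
As a sketch of that merge, the two school-level datasets could be joined on the school name so that both factors are available for each school. The column name `School.Name` and the toy rows below are assumptions made for illustration; the real files may label or spell schools differently, which would need reconciling first.

```r
# Hypothetical extracts of the two school-level datasets.
toy.type.df <- data.frame(
  School.Name = c("School A", "School B", "School C", "School D"),
  School.Type = c("Engineering", "State", "State", "Party"),
  Starting.Median.Salary = c(59000, 44000, 45500, 46000)
)
toy.region.df <- data.frame(
  School.Name = c("School A", "School B", "School C", "School E"),
  Region = c("California", "Southern", "Midwestern", "Western")
)

# Inner join: keeps only the schools that appear in both datasets.
merged.df <- merge(toy.type.df, toy.region.df, by = "School.Name")
merged.df

# With the full data, both factors could then enter one model, for example:
# lm(Starting.Median.Salary ~ School.Type + Region, data = merged.df)
```

Schools that appear in only one file (School D and School E here) drop out of an inner join, so the merged sample may be smaller than either original dataset.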

Ryan Gordon

May 15, 2019
