Data Preparation

For this project, I will use three different data frames to address my research questions. The R code below loads and cleans all of the data.

library(stringr)

# Remove the dollar sign, the comma, and the cents (".00") from a column,
# then convert it to numeric.
clean.salary <- function(x) {
  x <- gsub("\\$","",x)
  x <- gsub(",","",x)
  x <- gsub("\\.00","",x)
  as.numeric(x)
}

# stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0.0),
# which the factor summaries later in this report rely on.

# Salary by college type: the salary columns start at column 3
salary.by.college.type.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salaries_by_type.csv"),header=TRUE,stringsAsFactors=TRUE)
for (i in 3:ncol(salary.by.college.type.df)) {
  salary.by.college.type.df[,i] <- clean.salary(salary.by.college.type.df[,i])
}

# Salary by college location: the salary columns start at column 3
salary.by.region.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_college_location.csv"),header=TRUE,stringsAsFactors=TRUE)
for (i in 3:ncol(salary.by.region.df)) {
  salary.by.region.df[,i] <- clean.salary(salary.by.region.df[,i])
}

# Salary by major: the salary columns start at column 2
salary.by.major.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_major.csv"),header=TRUE,stringsAsFactors=TRUE)
for (i in 2:ncol(salary.by.major.df)) {
  salary.by.major.df[,i] <- clean.salary(salary.by.major.df[,i])
}

The initial data cleanup ensured that the salary columns were stored as numeric values. I removed the dollar sign, the comma, and the cents (“.00”) from each salary string, and then converted the columns to numeric.
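As a minimal sketch of this cleanup step in isolation, the three substitutions can be wrapped in a small helper and applied to a few strings. The helper name clean.salary and the sample values are assumptions made for illustration; they are not taken from the datasets.

```r
# Hypothetical helper (illustration only): strip "$", "," and ".00" from
# salary strings, then convert the result to numeric.
clean.salary <- function(x) {
  x <- gsub("\\$", "", x)    # drop the dollar sign
  x <- gsub(",", "", x)      # drop the thousands separator
  x <- gsub("\\.00", "", x)  # drop the cents
  as.numeric(x)
}

# Made-up sample values written in the format described above
raw <- c("$45,000.00", "$101,500.00", "$38,250.00")
cleaned <- clean.salary(raw)
print(cleaned)
```

With these sample strings, the helper returns the numeric values 45000, 101500, and 38250.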

Research question

My overarching research question is, “Which college factors affect starting and long-term salaries?” More specifically, I will look at the following research questions:

  1. Does the type of college (e.g., an engineering school) impact post-graduation salary?
  2. Does college location affect post-graduation salary?
  3. Is salary dependent on college major?

Cases

Each of the three datasets has a different number and type of cases. They are broken down by dataset below.

Salary by College Type

Each case represents a different college/university. There are 269 cases in this dataset. For each college, the dataset also lists the type of school, along with salary statistics.

nrow(salary.by.college.type.df)
[1] 269

Salary by College Location

Each case represents a different college/university. There are 320 cases in this dataset. For each college, the dataset also lists the region in which it is located, along with salary statistics.

nrow(salary.by.region.df)
[1] 320

Salary by College Major

Each case represents a different major. There are 50 cases in this dataset. For each major, the dataset also lists salary statistics.

nrow(salary.by.major.df)
[1] 50

Data collection

These datasets come from Kaggle, which sourced them from the Wall Street Journal; the underlying data are from Payscale, Inc. The data were in csv format, so I placed the csv files in my Github. I then read each file with the read.csv() function, using the link to its raw Github page. I could have retrieved the data directly from Kaggle’s website using the same method, but I wanted all of the data in one location in my Github.

Type of study

All of these studies are observational.

Data Source

As stated in the “Data collection” section, the datasets come from Kaggle, which sourced them from the Wall Street Journal; the underlying data are from Payscale, Inc. The links to the datasets I used are:

  1. https://www.kaggle.com/wsj/college-salaries/version/1#salaries-by-college-type.csv

  2. https://www.kaggle.com/wsj/college-salaries/version/1#salaries-by-region.csv

  3. https://www.kaggle.com/wsj/college-salaries/version/1#degrees-that-pay-back.csv

Dependent Variable

The dependent variable is salary, which is quantitative.

Independent Variable

Each of the data sets has a different independent variable. The independent variables in this study are:

  1. College Type

  2. College Location

  3. Type of Major

Another independent variable is length of career. For example, the median starting salaries should differ from the median mid-career salaries.
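To treat length of career as an explicit variable, the starting and mid-career columns can be stacked into a long format. The sketch below uses a toy data frame with made-up values; the column names School.Name, Career.Stage, and Median.Salary are assumptions made for illustration.

```r
# Toy data with made-up values (illustration only)
wide <- data.frame(
  School.Name = c("School A", "School B", "School C"),
  Starting.Median.Salary = c(40000, 45000, 50000),
  Mid.Career.Median.Salary = c(70000, 80000, 95000)
)

# Stack the two salary columns so that career stage becomes its own variable
long <- data.frame(
  School.Name = rep(wide$School.Name, times = 2),
  Career.Stage = rep(c("Starting", "Mid Career"), each = nrow(wide)),
  Median.Salary = c(wide$Starting.Median.Salary, wide$Mid.Career.Median.Salary)
)
print(long)
```

In this layout each school appears once per career stage, so Career.Stage can be used as a grouping variable in plots or tests.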

Relevant summary statistics

Each subsection below provides some of the relevant summary statistics for one dataset.

Salary by College Type

The number of schools per type is shown below. This study has a total of five different types of schools.

summary(salary.by.college.type.df$School.Type)
 Engineering   Ivy League Liberal Arts        Party        State 
          19            8           47           20          175 

Next, I looked at the summary statistics for median starting salary and median mid-career salary across all school types. As expected, salaries are clearly higher later in a career than at the start: the mean rises from $46,068 for starting salary to $83,932 for mid-career salary.

cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.college.type.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34800   42000   44700   46068   48300   75500 
cat("Median Mid Career Salary:")
Median Mid Career Salary:
summary(salary.by.college.type.df$Mid.Career.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  43900   74000   81600   83932   92200  134000 
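The summaries above pool all school types together. A per-type breakdown could be sketched with aggregate(), shown here on a toy data frame with made-up values rather than on the real data.

```r
# Toy data with made-up values (illustration only)
toy <- data.frame(
  School.Type = c("Engineering", "Engineering", "State", "State", "State"),
  Starting.Median.Salary = c(60000, 56000, 42000, 44000, 46000)
)

# Median of the starting salary within each school type
by.type <- aggregate(Starting.Median.Salary ~ School.Type, data = toy, FUN = median)
print(by.type)
```

Replacing toy with salary.by.college.type.df should give one row per school type, assuming the column names used elsewhere in this report.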

Finally, I created side-by-side box plots to compare distributions by college type. The first plot shows starting salary by school type, while the second shows mid-career salary. In both plots, graduates of Ivy League schools tend to earn more.

boxplot(salary.by.college.type.df$Starting.Median.Salary ~ salary.by.college.type.df$School.Type,xlab='School Type',ylab='Salary (USD)',main='Starting Salary')

boxplot(salary.by.college.type.df$Mid.Career.Median.Salary ~ salary.by.college.type.df$School.Type,xlab='School Type',ylab='Salary (USD)',main='Mid Career Salary')

In my project, I will use significance tests to see if there is any difference between the group means.
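One possible form of such a test is a one-way ANOVA, which compares the means of several groups at once. The sketch below runs aov() on simulated data; the group labels and values are made up, and the choice of ANOVA is an assumption rather than a decision already made for this project.

```r
# Simulated toy data (illustration only): three groups with different means
set.seed(1)
toy <- data.frame(
  School.Type = factor(rep(c("Type A", "Type B", "Type C"), each = 10)),
  Starting.Median.Salary = c(rnorm(10, mean = 45000, sd = 3000),
                             rnorm(10, mean = 50000, sd = 3000),
                             rnorm(10, mean = 60000, sd = 3000))
)

# One-way ANOVA: does the mean starting salary differ between the groups?
fit <- aov(Starting.Median.Salary ~ School.Type, data = toy)
anova.table <- summary(fit)[[1]]

# p-value of the F test for the School.Type term
p.value <- anova.table[["Pr(>F)"]][1]
print(anova.table)
```

A small p-value would suggest that at least one group mean differs from the others; it does not say which one, so a follow-up comparison would still be needed.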

Salary by College Location

The number of schools per region is shown below. This study has a total of five different regions.

summary(salary.by.region.df$Region)
  California   Midwestern Northeastern     Southern      Western 
          28           71          100           79           42 

Next, I looked at the summary statistics for median starting salary and median mid-career salary across all regions. As expected, salaries are clearly higher later in a career than at the start: the mean rises from $46,253 for starting salary to $83,934 for mid-career salary.

cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.region.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34500   42000   45100   46253   48900   75500 
cat("Median Mid Career Salary:")
Median Mid Career Salary:
summary(salary.by.region.df$Mid.Career.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  43900   73725   82700   83934   93250  134000 

Finally, I created side-by-side box plots to compare distributions by region. The first plot shows starting salary by region, while the second shows mid-career salary. In both plots, graduates of schools in California and the Northeast tend to earn more. However, the differences between regions are not as large as those between college types.

boxplot(salary.by.region.df$Starting.Median.Salary ~ salary.by.region.df$Region,xlab='Region',ylab='Salary (USD)',main='Starting Salary')

boxplot(salary.by.region.df$Mid.Career.Median.Salary ~ salary.by.region.df$Region,xlab='Region',ylab='Salary (USD)',main='Mid Career Salary')

In my project, I will use significance tests to see if there is any difference between the group means.
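If the salary distributions within regions turn out to be skewed, a rank-based test is one alternative to comparing means. The sketch below uses kruskal.test() on simulated data; this is a swapped-in technique shown for illustration, not a method this project has committed to, and the region labels and values are made up.

```r
# Simulated toy data (illustration only): right-skewed salaries in three groups
set.seed(2)
toy <- data.frame(
  Region = factor(rep(c("Region A", "Region B", "Region C"), each = 12)),
  Starting.Median.Salary = c(40000 + rexp(12, rate = 1 / 5000),
                             42000 + rexp(12, rate = 1 / 5000),
                             41000 + rexp(12, rate = 1 / 5000))
)

# Kruskal-Wallis rank sum test across the three groups
kw <- kruskal.test(Starting.Median.Salary ~ Region, data = toy)
print(kw)
```

The test statistic and p-value are available as kw$statistic and kw$p.value.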

Salary by College Major

A list of all the majors in the study is shown below. There are 50 majors in total.

salary.by.major.df$Undergraduate.Major
 [1] Accounting                          
 [2] Aerospace Engineering               
 [3] Agriculture                         
 [4] Anthropology                        
 [5] Architecture                        
 [6] Art History                         
 [7] Biology                             
 [8] Business Management                 
 [9] Chemical Engineering                
[10] Chemistry                           
[11] Civil Engineering                   
[12] Communications                      
[13] Computer Engineering                
[14] Computer Science                    
[15] Construction                        
[16] Criminal Justice                    
[17] Drama                               
[18] Economics                           
[19] Education                           
[20] Electrical Engineering              
[21] English                             
[22] Film                                
[23] Finance                             
[24] Forestry                            
[25] Geography                           
[26] Geology                             
[27] Graphic Design                      
[28] Health Care Administration          
[29] History                             
[30] Hospitality & Tourism               
[31] Industrial Engineering              
[32] Information Technology (IT)         
[33] Interior Design                     
[34] International Relations             
[35] Journalism                          
[36] Management Information Systems (MIS)
[37] Marketing                           
[38] Math                                
[39] Mechanical Engineering              
[40] Music                               
[41] Nursing                             
[42] Nutrition                           
[43] Philosophy                          
[44] Physician Assistant                 
[45] Physics                             
[46] Political Science                   
[47] Psychology                          
[48] Religion                            
[49] Sociology                           
[50] Spanish                             
50 Levels: Accounting Aerospace Engineering Agriculture ... Spanish

Next, I looked at the summary statistics for median starting salary and median mid-career salary across all majors. As expected, salaries are clearly higher later in a career than at the start: the mean rises from $44,310 for starting salary to $74,786 for mid-career salary.

cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.major.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34000   37050   40850   44310   49875   74300 
cat("Median Mid Career Salary:")
Median Mid Career Salary:
summary(salary.by.major.df$Mid.Career.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  52000   60825   72000   74786   88750  107000 

Finally, I created bar plots to compare the starting and mid-career salaries by major. The highest starting salary belongs to Physician Assistant, followed by six engineering majors. In the mid-career salaries, however, Physician Assistant falls below all of the engineering majors. It would be interesting to see which majors have the highest potential for salary growth.

library(ggplot2)
library(tidyverse)
library(dplyr)
library(forcats)
salary.by.major.df %>% mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Starting.Median.Salary)) %>% ggplot(aes(x=Undergraduate.Major,y=Starting.Median.Salary)) + geom_bar(stat="identity") + coord_flip() + xlab('Undergraduate Major') + ylab('Starting Salary (USD)')

salary.by.major.df %>% mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Mid.Career.Median.Salary)) %>% ggplot(aes(x=Undergraduate.Major,y=Mid.Career.Median.Salary)) + geom_bar(stat="identity") + coord_flip() + xlab('Undergraduate Major') + ylab('Mid Career Salary (USD)')
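The question of salary growth potential could be explored by computing the percent change from starting to mid-career salary and ranking the majors by it. The sketch below uses a toy data frame with made-up majors and values; the column name Growth.Percent is an assumption made for illustration.

```r
# Toy data with made-up values (illustration only)
toy <- data.frame(
  Undergraduate.Major = c("Major A", "Major B", "Major C"),
  Starting.Median.Salary = c(40000, 50000, 60000),
  Mid.Career.Median.Salary = c(80000, 75000, 96000)
)

# Percent growth from starting salary to mid-career salary
toy$Growth.Percent <- 100 *
  (toy$Mid.Career.Median.Salary - toy$Starting.Median.Salary) /
  toy$Starting.Median.Salary

# Rank the majors from highest to lowest growth
ranked <- toy[order(toy$Growth.Percent, decreasing = TRUE), ]
print(ranked)
```

In this toy example the major with the lowest starting salary has the highest growth, which is the kind of pattern the real data could be checked for.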