For this project, I will use three different data frames to address my research questions. Below is the R code used to collect all the data.
library(stringr)
salary.by.college.type.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salaries_by_type.csv"),header=TRUE)
for (i in 3:ncol(salary.by.college.type.df)) {
salary.by.college.type.df[,i] <- as.character(gsub("\\$","",salary.by.college.type.df[,i]))
salary.by.college.type.df[,i] <- as.character(gsub("\\,","",salary.by.college.type.df[,i]))
salary.by.college.type.df[,i] <- as.character(gsub("\\.00","",salary.by.college.type.df[,i]))
salary.by.college.type.df[,i] <- as.numeric(salary.by.college.type.df[,i])
}
salary.by.region.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_college_location.csv"),header=TRUE)
for (i in 3:ncol(salary.by.region.df)) {
salary.by.region.df[,i] <- as.character(gsub("\\$","",salary.by.region.df[,i]))
salary.by.region.df[,i] <- as.character(gsub("\\,","",salary.by.region.df[,i]))
salary.by.region.df[,i] <- as.character(gsub("\\.00","",salary.by.region.df[,i]))
salary.by.region.df[,i] <- as.numeric(salary.by.region.df[,i])
}
salary.by.major.df <- read.csv(url("https://raw.githubusercontent.com/rg563/DATA606/master/Project/salary_by_major.csv"),header=TRUE)
for (i in 2:ncol(salary.by.major.df)) {
salary.by.major.df[,i] <- as.character(gsub("\\$","",salary.by.major.df[,i]))
salary.by.major.df[,i] <- as.character(gsub("\\,","",salary.by.major.df[,i]))
salary.by.major.df[,i] <- as.character(gsub("\\.00","",salary.by.major.df[,i]))
salary.by.major.df[,i] <- as.numeric(salary.by.major.df[,i])
}
The initial data cleanup consisted of making sure the salary columns were stored as numeric values. I removed the dollar sign, comma, and cents (“.00”), and then converted the columns to numeric.
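The three cleanup loops above repeat the same steps, so they could be folded into one helper. The following is only a sketch of that idea, not the code used for this project: the names `clean_salary` and `clean_salary_columns` are hypothetical, and the toy data frame is made up so the example runs without a network connection. It assumes the salary strings look like "$45,400.00"; since `as.numeric()` already handles a trailing ".00", only the dollar sign and commas are stripped.

```r
# Hypothetical helper: strip "$" and "," from one column, then convert to numeric.
# Strings that are not numbers (for example "N/A") become NA.
clean_salary <- function(x) {
  x <- gsub("[$,]", "", as.character(x))
  suppressWarnings(as.numeric(x))
}

# Hypothetical helper: clean every column from `first_col` to the last one.
clean_salary_columns <- function(df, first_col) {
  cols <- first_col:ncol(df)
  df[cols] <- lapply(df[cols], clean_salary)
  df
}

# Made-up rows for illustration only (not taken from the Payscale data).
toy <- data.frame(
  School = c("School A", "School B"),
  Type   = c("State", "Party"),
  Start  = c("$45,400.00", "N/A"),
  stringsAsFactors = FALSE
)

cleaned <- clean_salary_columns(toy, first_col = 3)
str(cleaned)
```

With such a helper, each dataset would need a single call, for example `clean_salary_columns(salary.by.major.df, 2)`.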
My main, overarching research question is, “Which college factors affect starting and long-term salaries?” More specifically, I will look at the following research questions:
Each of the three datasets has a different number and type of cases. They are broken down in the following tabset.
Each case represents a different college/university. There are 269 cases in this dataset. For each college, the dataset also lists the type of school, along with salary averages.
nrow(salary.by.college.type.df)
[1] 269
Each case represents a different college/university. There are 320 cases in this dataset. For each college, the dataset also lists the region in which it is located, along with salary averages.
nrow(salary.by.region.df)
[1] 320
Each case represents a different major. There are 50 cases in this dataset. For each major, the dataset also lists salary averages.
nrow(salary.by.major.df)
[1] 50
These datasets were downloaded from Kaggle; they were originally published by the Wall Street Journal and are based on data from Payscale, Inc. The data came as CSV files, which I placed in my GitHub repository. I then loaded each file with the read.csv() function, using the link to its raw GitHub page. I could have pulled the data directly from Kaggle’s website using the same method, but I wanted the data on GitHub so it was all in one location.
All of these studies are observational.
As stated in the “Data collection” section, the datasets were downloaded from Kaggle; they were originally published by the Wall Street Journal and are based on data from Payscale, Inc. The links to the datasets I used are:
The dependent variable is salary, which is quantitative.
Each of the data sets has a different independent variable. The independent variables in this study are:
College Type
College Location
Type of Major
Another independent variable would be length of career. For example, the starting salaries should differ from the median salaries.
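One way to check whether starting and mid-career salaries differ is a paired t-test, since each school (or major) reports both figures. The sketch below is only an illustration of that approach, not the analysis carried out in this project; the salaries are simulated, and the variable names `starting` and `mid_career` are assumptions.

```r
set.seed(1)

# Simulated salaries for 30 hypothetical schools (not the Payscale data).
starting   <- rnorm(30, mean = 45000, sd = 5000)
mid_career <- starting + rnorm(30, mean = 35000, sd = 8000)

# Paired test: each school contributes one (starting, mid-career) pair.
result <- t.test(mid_career, starting, paired = TRUE)

result$estimate  # mean of the paired differences
result$p.value   # small values suggest the two salary levels differ
```

A paired test is used here because the two measurements come from the same case; an unpaired test would ignore that link.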
Each tab will provide some of the relevant summary statistics for each dataset.
The number of schools per type is shown below. This study has a total of five different types of schools.
summary(salary.by.college.type.df$School.Type)
 Engineering   Ivy League Liberal Arts        Party        State
          19            8           47           20          175
Next, I looked at the summary statistics for median starting salary and mid-career salary across all school types. As expected, salaries are clearly higher for people later in their careers than for those just starting out: the mean of the median salaries rises from $46,068 at the start to $83,932 at mid-career.
cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.college.type.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34800   42000   44700   46068   48300   75500
cat("Median Mid Career Salary:")
Median Mid Career Salary:
summary(salary.by.college.type.df$Mid.Career.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  43900   74000   81600   83932   92200  134000
Finally, I created side-by-side box plots to compare the salary distributions by college type. The first graph shows starting salary as a function of school type, while the second shows mid-career salary. In both plots, we can see that students from Ivy League schools tend to get paid better.
boxplot(salary.by.college.type.df$Starting.Median.Salary ~ salary.by.college.type.df$School.Type,xlab='School Type',ylab='Salary (USD)',main='Starting Salary')
boxplot(salary.by.college.type.df$Mid.Career.Median.Salary ~ salary.by.college.type.df$School.Type,xlab='School Type',ylab='Salary (USD)',main='Mid Career Salary')
In my project, I will use significance tests to see if there is any difference between the group means.
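A common significance test for comparing the means of more than two groups is a one-way ANOVA. The sketch below shows what that could look like; it is not necessarily the test that will be used in the project, and the data are simulated with made-up group means rather than taken from the Payscale files. Only the column names `School.Type` and `Starting.Median.Salary` mirror the real data frame.

```r
set.seed(42)

# Simulated data: three hypothetical school types, 20 schools each.
toy <- data.frame(
  School.Type = factor(rep(c("Engineering", "Liberal Arts", "State"), each = 20)),
  Starting.Median.Salary = c(
    rnorm(20, mean = 58000, sd = 4000),
    rnorm(20, mean = 46000, sd = 4000),
    rnorm(20, mean = 44000, sd = 4000)
  )
)

# One-way ANOVA: does mean starting salary differ across school types?
fit <- aov(Starting.Median.Salary ~ School.Type, data = toy)
anova_table <- summary(fit)[[1]]

# p-value for the School.Type term (first row of the table).
p_value <- anova_table[["Pr(>F)"]][1]
p_value
```

ANOVA assumes roughly normal residuals and similar variances across groups. Because the real groups are very unbalanced (8 Ivy League schools versus 175 State schools), a rank-based alternative such as `kruskal.test()` may be worth considering as well.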
The number of schools per region is shown below. This study has a total of five different regions.
summary(salary.by.region.df$Region)
  California   Midwestern Northeastern     Southern      Western
          28           71          100           79           42
Next, I looked at the summary statistics for median starting salary and mid-career salary across all regions. As expected, salaries are clearly higher for people later in their careers than for those just starting out: the mean of the median salaries rises from $46,253 at the start to $83,934 at mid-career.
cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.region.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34500   42000   45100   46253   48900   75500
cat("Median Mid Career Salary:")
Median Mid Career Salary:
summary(salary.by.region.df$Mid.Career.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  43900   73725   82700   83934   93250  134000
Finally, I created side-by-side box plots to compare the salary distributions by region. The first graph shows starting salary as a function of region, while the second shows mid-career salary. In both plots, we can see that students from schools in California and the Northeast tend to get paid better. However, the gap between regions is not as large as the gap between college types.
boxplot(salary.by.region.df$Starting.Median.Salary ~ salary.by.region.df$Region,xlab='Region',ylab='Salary (USD)',main='Starting Salary')
boxplot(salary.by.region.df$Mid.Career.Median.Salary ~ salary.by.region.df$Region,xlab='Region',ylab='Salary (USD)',main='Mid Career Salary')
In my project, I will use significance tests to see if there is any difference between the group means.
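The visual comparison of gaps between groups can also be put into numbers by computing the median per group and the spread between the highest and lowest group. The sketch below illustrates this with a tiny made-up data frame; the region labels "R1" to "R3" and all salary values are placeholders, not figures from the real dataset.

```r
# Made-up values for illustration only.
toy <- data.frame(
  Region = c("R1", "R1", "R2", "R2", "R3", "R3"),
  Starting.Median.Salary = c(40000, 44000, 46000, 50000, 43000, 45000),
  stringsAsFactors = FALSE
)

# Median starting salary within each region.
group_medians <- tapply(toy$Starting.Median.Salary, toy$Region, median)

# Distance between the highest and lowest group median.
spread <- diff(range(group_medians))

group_medians
spread
```

Running the same two lines on `salary.by.region.df` and on `salary.by.college.type.df` (grouping by `School.Type`) would give one spread per dataset, which could then be compared directly.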
A list of all the majors in the study is shown below. There are 50 majors in total.
salary.by.major.df$Undergraduate.Major
[1] Accounting
[2] Aerospace Engineering
[3] Agriculture
[4] Anthropology
[5] Architecture
[6] Art History
[7] Biology
[8] Business Management
[9] Chemical Engineering
[10] Chemistry
[11] Civil Engineering
[12] Communications
[13] Computer Engineering
[14] Computer Science
[15] Construction
[16] Criminal Justice
[17] Drama
[18] Economics
[19] Education
[20] Electrical Engineering
[21] English
[22] Film
[23] Finance
[24] Forestry
[25] Geography
[26] Geology
[27] Graphic Design
[28] Health Care Administration
[29] History
[30] Hospitality & Tourism
[31] Industrial Engineering
[32] Information Technology (IT)
[33] Interior Design
[34] International Relations
[35] Journalism
[36] Management Information Systems (MIS)
[37] Marketing
[38] Math
[39] Mechanical Engineering
[40] Music
[41] Nursing
[42] Nutrition
[43] Philosophy
[44] Physician Assistant
[45] Physics
[46] Political Science
[47] Psychology
[48] Religion
[49] Sociology
[50] Spanish
50 Levels: Accounting Aerospace Engineering Agriculture ... Spanish
Next, I looked at the summary statistics for median starting salary and mid-career salary across all majors. As expected, salaries are clearly higher for people later in their careers than for those just starting out: the mean of the median salaries rises from $44,310 at the start to $74,786 at mid-career.
cat("Median Starting Salary:")
Median Starting Salary:
summary(salary.by.major.df$Starting.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34000   37050   40850   44310   49875   74300
cat("Median Mid Career Salary:")
Median Mid Career Salary:
summary(salary.by.major.df$Mid.Career.Median.Salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  52000   60825   72000   74786   88750  107000
Finally, I created bar plots to compare the starting and mid-career salaries by major. The highest starting salary is for Physician Assistant, followed by six engineering majors. However, when looking at the mid-career salaries, Physician Assistant falls below all of the engineering majors. It would be interesting to look at which majors have the highest potential for salary growth.
library(ggplot2)
library(tidyverse)
library(dplyr)
library(forcats)
salary.by.major.df %>%
  mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Starting.Median.Salary)) %>%
  ggplot(aes(x=Undergraduate.Major,y=Starting.Median.Salary)) + geom_bar(stat="identity") + coord_flip() + xlab('Undergraduate Major') + ylab('Starting Salary (USD)')
salary.by.major.df %>%
  mutate(Undergraduate.Major = fct_reorder(Undergraduate.Major,Mid.Career.Median.Salary)) %>%
  ggplot(aes(x=Undergraduate.Major,y=Mid.Career.Median.Salary)) + geom_bar(stat="identity") + coord_flip() + xlab('Undergraduate Major') + ylab('Mid Career Salary (USD)')
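The question of which majors have the highest potential for salary growth could be explored by computing the percentage change from starting to mid-career salary and ranking the majors by it. The sketch below shows one way to do this in base R. The column `Growth.Pct` is a hypothetical name introduced here, and the three rows are made-up placeholders, not values from the dataset.

```r
# Made-up majors and salaries for illustration only.
toy <- data.frame(
  Undergraduate.Major      = c("Major A", "Major B", "Major C"),
  Starting.Median.Salary   = c(40000, 50000, 60000),
  Mid.Career.Median.Salary = c(80000, 75000, 96000),
  stringsAsFactors = FALSE
)

# Percentage growth from starting salary to mid-career salary.
toy$Growth.Pct <- 100 *
  (toy$Mid.Career.Median.Salary - toy$Starting.Median.Salary) /
  toy$Starting.Median.Salary

# Rank majors from highest to lowest growth.
ranked <- toy[order(-toy$Growth.Pct), ]
ranked
```

Applied to `salary.by.major.df`, the same calculation would show whether majors with modest starting salaries catch up over time.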