#Salaries for undergraduate majors
degree <- read.csv("degrees-that-pay-back.csv",stringsAsFactors = TRUE)
#Salaries by regions
sal_reg <- read.csv("salaries-by-region.csv",na.strings = "N/A",stringsAsFactors = TRUE)
#Salaries by school type
sal_type <- read.csv("salaries-by-college-type.csv",na.strings = "N/A",stringsAsFactors = TRUE)
Final Project: A Deep Dive into the Value of Education
ETR537
1. Introduction
The goal of this data study is to help people make smart decisions about which college to attend by looking at graduate incomes. It aims to identify the factors that affect pay growth, the differences between regions, and the differences between school types.
Choosing the right college is very difficult. There are many factors to consider, including how much money a graduate could earn in the future. Those earnings may differ depending on where you live and what kind of college you attend. This report examines graduate incomes to help people make education and career choices.
The data come from Payscale, Inc., whose figures underlie salary statistics published by the Wall Street Journal, and were downloaded from Kaggle (Salary Dataset). Three files are used:
Salary Increase by College Type
Salary by Region
Major Salary Increase
Research Questions
Which disciplines show the fastest salary growth over a career?
Which regions offer the highest salaries?
Students from which type of schools have higher starting salaries?
2. Data Wrangling and Pre-processing
Renaming variables
To enhance clarity and consistency, variables in each data-set were renamed. This step ensures a standardized approach to the subsequent analysis.
#load required libraries
library(tidyverse)
library(forcats)
library(scales)
library(gridExtra)
##### Renaming variables
#degrees-that-pay-back data
names(degree) <- c("College_Major","Starting_Median_Salary","Mid_Career_Median_Salary",
"Career_Percent_Growth","Percentile_10","Percentile_25","Percentile_75",
"Percentile_90")
#salaries-by-region
names(sal_reg) <- c("School_Name","Region","Starting_Median_Salary","Mid_Career_Median_Salary",
"Percentile_10","Percentile_25","Percentile_75","Percentile_90")
#salaries-by-college-type
names(sal_type) <- c("School_Name","School_Type","Starting_Median_Salary","Mid_Career_Median_Salary",
"Percentile_10","Percentile_25","Percentile_75","Percentile_90")
Removing Unnecessary Terms
I removed dollar signs and thousands separators from the salary values so that they could be converted to numeric format, which the subsequent analyses require.
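As a minimal sketch of that conversion on made-up values (the helper name `parse_salary` and the sample strings are hypothetical, not part of the project code), the same `gsub()` and `as.numeric()` pattern can be checked in isolation:

```r
# Hypothetical helper: strip "$" and "," and convert to numeric.
# gsub() coerces its input with as.character(), so factor columns
# (as produced by stringsAsFactors = TRUE) are handled as well.
parse_salary <- function(x) as.numeric(gsub("[$,]", "", x))

# Made-up raw values in the same style as the salary columns
raw <- c("$45,400.00", "$103,000.00")
parsed <- parse_salary(raw)
parsed
```

Checking the pattern on a few literal strings first makes it easier to spot values that would silently become NA after conversion.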
#degrees-that-pay-back data
degree <- degree %>%
mutate_at(vars(Starting_Median_Salary:Percentile_90), function(x) as.numeric(gsub('[\\$,]',"",x)))
#salaries-by-region
sal_reg <- sal_reg %>%
mutate_at(vars(Starting_Median_Salary:Percentile_90), function(x) as.numeric(gsub('[\\$,]',"",x)))
#salaries-by-college-type
sal_type <- sal_type %>%
mutate_at(vars(Starting_Median_Salary:Percentile_90), function(x) as.numeric(gsub('[\\$,]',"",x)))
Dealing with Missing Data
Missing values were counted for each data set, and rows containing them were then removed. These steps were necessary to ensure the robustness of the analysis.
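The following sketch, on a small hypothetical data frame (the school names, column values, and the name `toy` are invented for illustration), shows what counting and removing missing values does. One point worth noting is that `na.omit()` drops an entire row even when only a percentile column is missing, so later averages of other columns are computed on fewer rows.

```r
# Hypothetical data frame with gaps in two percentile columns only
toy <- data.frame(
  School_Name   = c("School A", "School B", "School C"),
  Starting      = c(45000, 52000, 48000),
  Percentile_10 = c(NA, 30000, 28000),
  Percentile_90 = c(NA, 150000, 140000)
)

# Per-column count of missing values
na_counts <- colSums(is.na(toy))

# na.omit() removes every row that has at least one NA
complete <- na.omit(toy)
rows_lost <- nrow(toy) - nrow(complete)

# The mean of a fully observed column changes once rows are dropped
mean_all      <- mean(toy$Starting)
mean_complete <- mean(complete$Starting)
```

If only the starting and mid-career columns are needed, an alternative under the same assumptions would be to keep all rows and restrict removal to the columns actually used.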
#degrees-that-pay-back data
apply(is.na(degree), 2, sum)
           College_Major   Starting_Median_Salary Mid_Career_Median_Salary
                       0                        0                        0
   Career_Percent_Growth            Percentile_10            Percentile_25
                       0                        0                        0
           Percentile_75            Percentile_90
                       0                        0
#salaries-by-region
apply(is.na(sal_reg), 2, sum)
             School_Name                   Region   Starting_Median_Salary
                       0                        0                        0
Mid_Career_Median_Salary            Percentile_10            Percentile_25
                       0                       47                        0
           Percentile_75            Percentile_90
                       0                       47
#salaries-by-college-type
apply(is.na(sal_type), 2, sum)
             School_Name              School_Type   Starting_Median_Salary
                       0                        0                        0
Mid_Career_Median_Salary            Percentile_10            Percentile_25
                       0                       38                        0
           Percentile_75            Percentile_90
                       0                       38
# Removing NA values
degree <- na.omit(degree)
sal_reg <- na.omit(sal_reg)
sal_type <- na.omit(sal_type)
3. Analyze
Descriptive Statistics
I identified the top 10 careers with the fastest growth based on career percentage growth. Notable examples include majors such as Mathematics, Philosophy, and International Relations, which exhibit remarkable career growth.
#careers with fastest growth
growth <- degree %>%
select(College_Major,Starting_Median_Salary,Mid_Career_Median_Salary,
Career_Percent_Growth) %>%
arrange(desc(Career_Percent_Growth)) %>%
top_n(10, Career_Percent_Growth)
growth
             College_Major Starting_Median_Salary Mid_Career_Median_Salary Career_Percent_Growth
1                     Math                  45400                    92400                 103.5
2               Philosophy                  39900                    81200                 103.5
3  International Relations                  40900                    80900                  97.8
4                Economics                  50100                    98600                  96.8
5                Marketing                  40800                    79600                  95.1
6                  Physics                  50300                    97300                  93.4
7        Political Science                  40800                    78200                  91.7
8                Chemistry                  42600                    79900                  87.6
9               Journalism                  35600                    66700                  87.4
10            Architecture                  41600                    76800                  84.6
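The growth figures above appear consistent with a simple percentage change from the starting to the mid-career median. That reading is an inference from the printed values, not a documented definition from the data provider. A short sketch that recomputes the first three rows (the function name `pct_growth` is hypothetical):

```r
# Assumed definition: percentage change from starting to mid-career median
pct_growth <- function(start, mid) round((mid - start) / start * 100, 1)

# Values copied from the first three rows of the table above
start_sal <- c(Math = 45400, Philosophy = 39900, `International Relations` = 40900)
mid_sal   <- c(Math = 92400, Philosophy = 81200, `International Relations` = 80900)

recomputed <- pct_growth(start_sal, mid_sal)
recomputed
```

Under this assumption the recomputed values can be compared directly with the Career_Percent_Growth column.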
An analysis of regions revealed average starting and mid-career salaries. California and the Northeastern region stood out for offering higher salaries compared to other regions.
#regions with highest salaries
reg <- sal_reg %>%
group_by(Region) %>%
summarise(Starting_Salary = mean(Starting_Median_Salary),
Mid_Career_Salary = mean(Mid_Career_Median_Salary))
reg
# A tibble: 5 × 3
  Region       Starting_Salary Mid_Career_Salary
  <fct>                  <dbl>             <dbl>
1 California            50073.            91718.
2 Midwestern            44461.            78178.
3 Northeastern          48679.            90929.
4 Southern              44946.            80018.
5 Western               44932.            79541.
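To compare regions on growth rather than level, the printed averages can be turned into a percentage change. The sketch below re-enters the rounded values from the table above by hand (the names `reg_avg` and `Pct_Growth` are introduced here for illustration), so the results are approximate and describe a ratio of averages, not an average of school-level growth.

```r
# Rounded regional averages copied from the table above
reg_avg <- data.frame(
  Region            = c("California", "Midwestern", "Northeastern", "Southern", "Western"),
  Starting_Salary   = c(50073, 44461, 48679, 44946, 44932),
  Mid_Career_Salary = c(91718, 78178, 90929, 80018, 79541)
)

# Percentage change from average starting to average mid-career salary
reg_avg$Pct_Growth <- round(
  (reg_avg$Mid_Career_Salary - reg_avg$Starting_Salary) /
    reg_avg$Starting_Salary * 100, 1)

# Regions ordered from highest to lowest growth
reg_avg[order(-reg_avg$Pct_Growth), ]
```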
A comparative analysis of starting salaries across different schools highlighted variations. Ivy League schools emerged with the highest starting salaries, followed by Engineering and Liberal Arts institutions.
#starting salary comparison for different types of schools
type <- sal_type %>%
group_by(School_Type) %>%
summarise(Starting_Salary = mean(Starting_Median_Salary)) %>%
arrange(Starting_Salary)
type
# A tibble: 5 × 2
  School_Type  Starting_Salary
  <fct>                  <dbl>
1 State                 44126.
2 Party                 45879.
3 Liberal Arts          46171.
4 Engineering           57440
5 Ivy League            60475
4. Data Visualizations
Top 10 Careers with Fastest Growth
This bar chart shows the top 10 careers with the fastest growth, ordered by the career percentage growth of each major.
# Top 10 careers with fastest growth
ggplot(data = growth,aes(reorder(College_Major,Career_Percent_Growth ),
Career_Percent_Growth,fill= College_Major)) +
geom_col() + coord_flip() +
theme_minimal() + theme(legend.position = "None") +
labs(x= "College Major", y = "Career Percentage Growth",
title = "Top 10 careers with fastest growth")
Starting and Mid-Career Salaries by Major
I generated two plots to illustrate the starting and mid-career salaries for the top-growing majors, so that salary trajectories can be compared across disciplines. In the left plot, the grey bars show mid-career pay as a point of reference. In the right plot, the darker bars show starting pay.
#starting and mid career salaries by major
start_plot <- ggplot(growth, aes(x = reorder(College_Major, Starting_Median_Salary), Starting_Median_Salary)) +
geom_col(fill = "green", alpha = 0.5) +
geom_col(aes(x = reorder(College_Major, Mid_Career_Median_Salary), Mid_Career_Median_Salary), alpha = 0.3) +
geom_text(aes(label = dollar(Starting_Median_Salary)), size = 3, hjust = 1.1) +
scale_y_continuous(labels = dollar) +
labs(x= NULL , y= "Starting Median Salary",title = "Starting salary") +
coord_flip()
mid_plot <- ggplot(growth, aes(x = reorder(College_Major, Mid_Career_Median_Salary), Mid_Career_Median_Salary)) +
geom_col(alpha = 0.5, fill = 'purple') +
geom_col(aes(x = reorder(College_Major, Mid_Career_Median_Salary), Starting_Median_Salary), alpha = 0.4) +
geom_text(aes(label = dollar(Mid_Career_Median_Salary)), size = 3, hjust = -0.1) +
scale_fill_manual(values = c('green', 'purple')) +
scale_y_reverse(labels = dollar) +
scale_x_discrete(position = 'top') +
labs(x= NULL , y= "Mid Career Median Salary",title = "Mid-career salary") +
coord_flip()
#arrange plot
grid.arrange(start_plot, mid_plot, nrow = 1)
The fields with the highest median starting salaries are engineering, computer science, and two health-related fields. What about long-term pay? Engineering again leads the pack, followed by economics and a few other STEM fields.
Salaries by Region
I then created a line plot of the average starting and mid-career salaries in each region. This visualization helps to identify regional variations in salary levels.
##regions with highest salaries
ggplot(data = reg) +
geom_path(aes(x = Region, y = Starting_Salary, group = 1, color = "Starting Salary"), size = 1) +
geom_point(aes(x = Region, y = Starting_Salary, color = "Starting Salary"), size = 2) +
geom_path(aes(x = Region, y = Mid_Career_Salary, group = 1, color = "Mid Career Salary"), size = 1) +
geom_point(aes(x = Region, y = Mid_Career_Salary, color = "Mid Career Salary"), size = 2) +
labs(y = "Average Salaries", title = "Starting and Mid-Career Salaries by Region") +
theme_minimal() +
scale_color_manual(name = "Salary Type", values = c("Starting Salary" = "red", "Mid Career Salary" = "blue")) +
guides(color = guide_legend(title = NULL))
#Percentile_75 of salary
ggplot(data = sal_reg,aes(Percentile_75,fill= Region)) +
geom_density(alpha = 0.4) + theme_minimal()+
xlab("75th Percentile of Salary")
California and the Northeastern region offer higher salaries than other regions in the US. Interestingly, salary growth from the starting median to the 75th percentile appears consistent across all regions (those who start off higher finish higher).
#college names and type from college data
cols <- colnames(sal_type) %in% c('School_Name', 'School_Type')
# merge sal_reg and sal_type
df_reg <- merge(x = sal_reg, y = sal_type[, cols],by = 'School_Name')
ggplot(df_reg, aes(Region, fill = School_Type)) +
geom_bar(position = 'dodge') +
scale_fill_brewer(palette = 'Set1') +
theme(legend.position = "bottom")
Starting Salaries by School Type
A boxplot shows the distribution of starting salaries for each type of school. State schools make up most of the institutions in the data set. As the preceding summary illustrates, several schools are listed under more than one type. Randolph-Macon College is the only liberal arts college that is also listed as a party school.
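Schools listed under more than one type can be found by counting how often each name occurs. The sketch below uses a hypothetical data frame (the school names and the variable names `schools` and `multi_type` are invented for illustration, and none of the rows come from the data set):

```r
# Hypothetical rows: "School A" appears under two school types
schools <- data.frame(
  School_Name = c("School A", "School B", "School A", "School C"),
  School_Type = c("Liberal Arts", "State", "Party", "Engineering"),
  stringsAsFactors = FALSE
)

# Count occurrences of each school name and keep names seen more than once
type_counts <- table(schools$School_Name)
multi_type  <- names(type_counts[type_counts > 1])

# Rows belonging to schools with more than one type
schools[schools$School_Name %in% multi_type, ]
```

Such duplicates matter when joining on School_Name, because `merge()` returns one row per matching pair, so a school with two types contributes two rows to the merged data.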
#starting salary comparison for different types of schools
ggplot(data = sal_type ,aes(School_Type,Starting_Median_Salary,
fill= School_Type)) +
geom_boxplot() + theme_minimal() + theme(legend.position = "None") +
labs(y = "Starting Salary",title = "Starting salaries for different schools")
A polar bar chart offers a clear overview of the distribution of school types, providing a visual understanding of the prevalence of each type.
#Distribution of School Types
sal_type %>%
group_by(School_Type) %>%
summarise(n = n()) %>%
ggplot(aes(x = "", y = n, fill = School_Type)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
theme_minimal() +
labs(x = NULL, y = NULL, fill = NULL, title = "Distribution of School Types") +
theme(axis.text = element_blank(), # Remove axis labels
axis.title = element_blank()) + # Remove axis titles
geom_text(aes(label = paste0(round(n/sum(n) * 100, 1), "%")), position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = "Blues")
Salary Distributions
Histograms and density plots were used to visualize the distributions of starting and mid-career salaries, showing the spread and central tendency of each. The distribution of starting median salaries is strongly right-skewed and concentrated toward the lower end of the income range. Although the highest starting median salary is $75,500, the median across schools is $45,100. By mid-career the distribution of median salaries is more spread out, and its median rises to $82,700.
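A quick numerical check for right skew is to compare the mean with the median: in a right-skewed distribution the long upper tail pulls the mean above the median. The sketch below uses hypothetical salary values chosen for illustration, not values from the data set.

```r
# Hypothetical starting salaries with a long upper tail
salaries <- c(38000, 41000, 43000, 45000, 46000, 48000, 52000, 61000, 75000)

# Compare the two measures of central tendency
center <- c(mean = mean(salaries), median = median(salaries))

# A mean above the median is consistent with right skew
right_skewed <- center[["mean"]] > center[["median"]]
center
```

The same comparison could be applied to Starting_Median_Salary and Mid_Career_Median_Salary as a complement to the plots.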
# select starting and mid-career salaries and reformat to long
sal_type %>%
select(Starting_Median_Salary, Mid_Career_Median_Salary) %>%
gather(timeline, salary) %>%
ggplot(aes(salary, fill = timeline)) +
geom_density(alpha = 0.2, color = NA) +
geom_histogram(aes(y = ..density..), alpha = 0.5, position = 'dodge') +
scale_fill_manual(values = c('green', 'purple')) + labs(fill="")+
scale_x_continuous(labels = dollar) + theme_minimal()+
theme(legend.position = "bottom",
axis.text.y = element_blank(), axis.ticks.y = element_blank())
Communication
Key Findings
Majors with Fastest Growth: Mathematics, Philosophy, and International Relations exhibit the fastest salary growth.
Regional Disparities: California and the Northeastern region offer the highest average salaries.
Impact of School Type: Ivy League schools have the highest starting salaries, with notable variations across different school types.
Actionable Insights
For individuals seeking higher salaries:
Consider majors with faster career growth.
Explore opportunities in regions like California and the Northeast.
Weigh the benefits of an Ivy League education.
Limitations
Acknowledging potential limitations, such as data biases and gaps in the data set, is crucial. Ethical considerations, including privacy and data usage, were taken into account throughout the analysis. The data set is freely available for access and use.
Conclusion
In summary, this data study offers insight into the variables affecting graduate income. Through visualizations and targeted research questions, the analysis helps people make well-informed judgments about their education and career choices. The results provide useful information about the relationship between college options and potential wages.