Final Project: Independent Data Intensive Research

ETR537

Author

Joe Owen

Published

December 10, 2023

1. Introduction

This data study aims to help people make informed decisions about which college to attend by examining graduate incomes. It looks for the factors that affect pay growth, differences between regions, and differences between school types.

Choosing the right college is very difficult. There are many factors to consider, including how much money a graduate could earn in the future. Those earnings may differ depending on the region and the kind of college you attend. This report looks into graduate incomes to support those choices.

The data originate from Payscale, Inc., which supplied the statistics reported by the Wall Street Journal, and were downloaded from Kaggle (Salary Dataset). Three files are used:

  • Salary Increase by College Type

  • Salary by Region

  • Major Salary Increase

Research Questions

  1. Which disciplines have the fastest-growing salaries over the course of a career?

  2. Which regions offer the highest salaries?

  3. Students from which type of schools have higher starting salaries?

#Salaries for undergraduate majors
degree <- read.csv("degrees-that-pay-back.csv",stringsAsFactors = TRUE)
#Salaries by regions
sal_reg <- read.csv("salaries-by-region.csv",na.strings = "N/A",stringsAsFactors = TRUE)
#Salaries by school type
sal_type <- read.csv("salaries-by-college-type.csv",na.strings = "N/A",stringsAsFactors = TRUE)

2. Data Wrangling and Pre-processing

Renaming variables

To improve clarity and consistency, the variables in each dataset were renamed so that the three tables share the same column names for the salary fields used in the subsequent analysis.

#load required libraries
library(tidyverse)
library(forcats)
library(scales)
library(gridExtra)

##### Renaming variables

#degrees-that-pay-back data
names(degree)     <- c("College_Major","Starting_Median_Salary","Mid_Career_Median_Salary",
                       "Career_Percent_Growth","Percentile_10","Percentile_25","Percentile_75",
                       "Percentile_90")
#salaries-by-region
names(sal_reg) <- c("School_Name","Region","Starting_Median_Salary","Mid_Career_Median_Salary",
                    "Percentile_10","Percentile_25","Percentile_75","Percentile_90")
#salaries-by-college-type
names(sal_type) <- c("School_Name","School_Type","Starting_Median_Salary","Mid_Career_Median_Salary",
                     "Percentile_10","Percentile_25","Percentile_75","Percentile_90")

Removing Unnecessary Terms

I removed dollar signs and thousands separators from the salary values and converted them to numeric format, which allows them to be used in calculations and plots.

#degrees-that-pay-back data
#strip "$" and "," from every salary column, then convert to numeric
degree <- degree %>% 
  mutate(across(Starting_Median_Salary:Percentile_90, ~ as.numeric(gsub('[\\$,]', "", .x))))
#salaries-by-region
sal_reg <- sal_reg %>% 
  mutate(across(Starting_Median_Salary:Percentile_90, ~ as.numeric(gsub('[\\$,]', "", .x))))
#salaries-by-college-type
sal_type <- sal_type %>% 
  mutate(across(Starting_Median_Salary:Percentile_90, ~ as.numeric(gsub('[\\$,]', "", .x))))
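The conversion step can be illustrated on a few values. The sketch below is only an illustration: `parse_dollars` is a hypothetical helper name that does not appear in the analysis, and the sample strings are made up to resemble the raw salary format.

```r
# Hypothetical helper (not part of the original analysis):
# strip "$" and "," and convert the remaining text to numeric
parse_dollars <- function(x) {
  # as.character() makes the factor-to-text step explicit; calling
  # as.numeric() directly on a factor would return level codes instead
  as.numeric(gsub("[$,]", "", as.character(x)))
}

# Made-up sample values in the same format as the raw salary columns
raw <- factor(c("$45,400.00", "$39,900.00", "$103,000.00"))
parsed <- parse_dollars(raw)
print(parsed)
```

Because the files were read with `stringsAsFactors = TRUE`, the salary columns arrive as factors, so the text has to be cleaned before the numeric conversion.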

Dealing with Missing Data

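Missing values were counted for each dataset. The only gaps are in the 10th and 90th percentile columns of the region and college-type data, and rows containing them were removed with na.omit(); no imputation was performed. Note that this also drops those schools from the starting and mid-career salary summaries below.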

#degrees-that-pay-back data
apply(is.na(degree), 2, sum)
           College_Major   Starting_Median_Salary Mid_Career_Median_Salary 
                       0                        0                        0 
   Career_Percent_Growth            Percentile_10            Percentile_25 
                       0                        0                        0 
           Percentile_75            Percentile_90 
                       0                        0 
#salaries-by-region
apply(is.na(sal_reg), 2, sum)
             School_Name                   Region   Starting_Median_Salary 
                       0                        0                        0 
Mid_Career_Median_Salary            Percentile_10            Percentile_25 
                       0                       47                        0 
           Percentile_75            Percentile_90 
                       0                       47 
#salaries-by-college-type
apply(is.na(sal_type), 2, sum)
             School_Name              School_Type   Starting_Median_Salary 
                       0                        0                        0 
Mid_Career_Median_Salary            Percentile_10            Percentile_25 
                       0                       38                        0 
           Percentile_75            Percentile_90 
                       0                       38 
# Removing NA values
degree <- na.omit(degree)
sal_reg <- na.omit(sal_reg)
sal_type <- na.omit(sal_type)
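Since `na.omit()` removes a whole row when any column is missing, it can change summaries of columns that were themselves complete. The following is a minimal sketch on a made-up data frame (the school names and values are invented for illustration) showing the effect.

```r
# Toy data frame (made-up values) with one gap in a percentile column
toy <- data.frame(
  School_Name = c("A", "B", "C"),
  Starting_Median_Salary = c(45000, 50000, 42000),
  Percentile_10 = c(30000, NA, 28000)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# na.omit() drops every row that has an NA in any column
complete_rows <- na.omit(toy)

# The starting-salary mean differs depending on whether school "B" is kept
mean_all_start <- mean(toy$Starting_Median_Salary)
mean_complete_start <- mean(complete_rows$Starting_Median_Salary)

print(na_counts)
print(c(all_rows = mean_all_start, complete_rows = mean_complete_start))
```

An alternative, if the percentile columns are not needed for a given summary, would be to keep all rows and drop missing values only in the columns actually used.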

3. Analyze

Descriptive Statistics

I identified the 10 majors with the fastest salary growth, ranked by the percentage change from starting to mid-career median salary. Math, Philosophy, and International Relations lead the list, each with growth close to or above 100%.

#majors with fastest growth
growth <- degree %>% 
  select(College_Major,Starting_Median_Salary,Mid_Career_Median_Salary,
         Career_Percent_Growth) %>%
  #keep the 10 largest values, sorted from highest to lowest
  slice_max(Career_Percent_Growth, n = 10)
growth
             College_Major Starting_Median_Salary Mid_Career_Median_Salary
1                     Math                  45400                    92400
2               Philosophy                  39900                    81200
3  International Relations                  40900                    80900
4                Economics                  50100                    98600
5                Marketing                  40800                    79600
6                  Physics                  50300                    97300
7        Political Science                  40800                    78200
8                Chemistry                  42600                    79900
9               Journalism                  35600                    66700
10            Architecture                  41600                    76800
   Career_Percent_Growth
1                  103.5
2                  103.5
3                   97.8
4                   96.8
5                   95.1
6                   93.4
7                   91.7
8                   87.6
9                   87.4
10                  84.6
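The growth column appears to be the percentage change from the starting to the mid-career median salary. That formula is my assumption rather than something documented in the dataset, but it reproduces the values in the table above, as this short check on three of the listed majors shows.

```r
# Starting and mid-career median salaries copied from the table above
start <- c(Math = 45400, Philosophy = 39900, Economics = 50100)
mid   <- c(Math = 92400, Philosophy = 81200, Economics = 98600)

# Assumed formula: percentage change from starting to mid-career salary
growth_pct <- round((mid - start) / start * 100, 1)
print(growth_pct)
```

The rounded results match the Career_Percent_Growth values of 103.5, 103.5, and 96.8 reported for these majors.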

Averaging each school's median salaries by region shows that California and the Northeastern region offer higher starting and mid-career salaries than the other regions.

#regions with highest salaries
reg <- sal_reg %>%
  group_by(Region) %>%
  summarise(Starting_Salary = mean(Starting_Median_Salary),
            Mid_Career_Salary = mean(Mid_Career_Median_Salary))
reg
# A tibble: 5 × 3
  Region       Starting_Salary Mid_Career_Salary
  <fct>                  <dbl>             <dbl>
1 California            50073.            91718.
2 Midwestern            44461.            78178.
3 Northeastern          48679.            90929.
4 Southern              44946.            80018.
5 Western               44932.            79541.

A comparative analysis of starting salaries across different schools highlighted variations. Ivy League schools emerged with the highest starting salaries, followed by Engineering and Liberal Arts institutions.

#starting salary comparison for different types of schools
type <- sal_type %>%
  group_by(School_Type)  %>%
  summarise(Starting_Salary = mean(Starting_Median_Salary)) %>%
  arrange(Starting_Salary)
type
# A tibble: 5 × 2
  School_Type  Starting_Salary
  <fct>                  <dbl>
1 State                 44126.
2 Party                 45879.
3 Liberal Arts          46171.
4 Engineering           57440 
5 Ivy League            60475 

4. Data Visualizations

Top 10 Careers with Fastest Growth

This bar chart shows the 10 majors with the fastest growth, ordered by career percentage growth.

# Top 10 majors with fastest growth
ggplot(data = growth,aes(reorder(College_Major,Career_Percent_Growth ),
                         Career_Percent_Growth,fill= College_Major)) +
  # geom_col() draws bars from the values as given (a bar chart, not a histogram)
  geom_col() + coord_flip() +
  theme_minimal() + theme(legend.position = "none") +
  labs(x= "College Major", y = "Career Percentage Growth",
       title = "Top 10 careers with fastest growth")

Starting and Mid-Career Salaries by Major

I generated two plots to illustrate the starting and mid-career salaries for the top-growing majors. The comparison provides insight into salary trajectories for different disciplines. The grey bars in the left plot show mid-career pay as a point of reference. Similarly, the darker bars in the right plot show starting pay.

#starting and mid career salaries by major
start_plot <- ggplot(growth, aes(x = reorder(College_Major, Starting_Median_Salary), Starting_Median_Salary)) +
  geom_col(fill = "green", alpha = 0.5) +
  geom_col(aes(x = reorder(College_Major, Mid_Career_Median_Salary), Mid_Career_Median_Salary), alpha = 0.3) +
  geom_text(aes(label = dollar(Starting_Median_Salary)), size = 3, hjust = 1.1) +
  scale_y_continuous(labels = dollar) +
  labs(x= NULL , y= "Starting Median Salary",title = "Starting salary") + 
  coord_flip() 
mid_plot <- ggplot(growth, aes(x = reorder(College_Major, Mid_Career_Median_Salary), Mid_Career_Median_Salary)) +
  geom_col(alpha = 0.5, fill = 'purple') +
  geom_col(aes(x = reorder(College_Major, Mid_Career_Median_Salary), Starting_Median_Salary), alpha = 0.4) +
  geom_text(aes(label = dollar(Mid_Career_Median_Salary)), size = 3, hjust = -0.1) +
  scale_fill_manual(values = c('green', 'purple')) +
  scale_y_reverse(labels = dollar) +
  scale_x_discrete(position = 'top') +
  labs(x= NULL , y= "Mid Career Median Salary",title = "Mid-career salary") + 
  coord_flip()
#arrange plot
grid.arrange(start_plot, mid_plot, nrow = 1)

Among these ten majors, Physics ($50,300), Economics ($50,100), and Math ($45,400) have the highest starting median salaries. What about long-term pay? At mid-career, Economics leads ($98,600), followed by Physics ($97,300) and Math ($92,400).

Salaries by Region

I then created a line plot of the average starting and mid-career salaries in each region, which helps to identify regional differences in salary levels.

##regions with highest salaries
ggplot(data = reg) +
  geom_path(aes(x = Region, y = Starting_Salary, group = 1, color = "Starting Salary"), linewidth = 1) +
  geom_point(aes(x = Region, y = Starting_Salary, color = "Starting Salary"), size = 2) +
  geom_path(aes(x = Region, y = Mid_Career_Salary, group = 1, color = "Mid Career Salary"), linewidth = 1) +
  geom_point(aes(x = Region, y = Mid_Career_Salary, color = "Mid Career Salary"), size = 2) +
  labs(y = "Average Salaries", title = "Starting and Mid-Career Salaries by Region") +
  theme_minimal() +
  scale_color_manual(name = "Salary Type", values = c("Starting Salary" = "red", "Mid Career Salary" = "blue")) +
  guides(color = guide_legend(title = NULL))

#Percentile_75 of salary
ggplot(data = sal_reg,aes(Percentile_75,fill= Region)) +
  geom_density(alpha = 0.4) + theme_minimal()+
  xlab("75th Percentile of Salary")

California and the Northeastern region offer higher salaries than other regions in the US. Interestingly, the ordering of the regions appears consistent from starting salary through to the 75th percentile of mid-career salary: regions that start off higher also finish higher.
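The "start higher, finish higher" pattern can be checked roughly with a rank correlation. The sketch below re-enters the rounded regional averages from the summary table in Section 3 by hand, and it compares starting salary with the mid-career median rather than the 75th percentile, so it is only an approximate check of the statement above.

```r
# Regional averages copied (rounded) from the summary table in Section 3
reg_summary <- data.frame(
  Region = c("California", "Midwestern", "Northeastern", "Southern", "Western"),
  Starting_Salary = c(50073, 44461, 48679, 44946, 44932),
  Mid_Career_Salary = c(91718, 78178, 90929, 80018, 79541)
)

# Spearman correlation compares the rank order of the two columns
rho <- cor(reg_summary$Starting_Salary, reg_summary$Mid_Career_Salary,
           method = "spearman")

# Regions ordered from highest to lowest starting salary
by_start <- reg_summary$Region[order(-reg_summary$Starting_Salary)]
print(rho)
print(by_start)
```

With only five regions this is a descriptive check, not a statistical test, but the two orderings are identical for these values.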

#college names and type from college data
cols <- colnames(sal_type) %in% c('School_Name', 'School_Type')
# merge sal_reg and  sal_type
df_reg <- merge(x = sal_reg, y = sal_type[, cols],by = 'School_Name')

ggplot(df_reg, aes(Region, fill = School_Type)) +
  geom_bar(position = 'dodge') +
  scale_fill_brewer(palette = 'Set1') +
  theme(legend.position = "bottom")
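One thing to keep in mind with this merge is that a school listed under more than one type contributes one row per type, so it is counted more than once in the bar chart. The sketch below uses made-up school names to show how `merge()` behaves in that case and how such schools could be identified.

```r
# Toy tables with invented school names (not taken from the real data)
toy_reg <- data.frame(
  School_Name = c("School A", "School B", "School C"),
  Region = c("Southern", "Western", "California")
)
toy_types <- data.frame(
  School_Name = c("School A", "School A", "School B", "School C"),
  School_Type = c("Liberal Arts", "Party", "State", "Engineering")
)

# merge() returns one row for every matching pair of rows
merged <- merge(x = toy_reg, y = toy_types, by = "School_Name")

# Schools that appear under more than one type
type_counts <- table(toy_types$School_Name)
multi_type <- names(type_counts[type_counts > 1])

print(nrow(merged))
print(multi_type)
```

Here three schools produce four merged rows because "School A" has two types.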

Starting Salaries by School Type

A boxplot shows the distribution of starting salaries for each type of school. State schools make up most of the institutions in the dataset. As the preceding chart illustrates, several schools are listed under more than one type. Randolph-Macon College is the only liberal arts college that is also classified as a party school.

#starting salary comparison for different types of schools
ggplot(data = sal_type ,aes(School_Type,Starting_Median_Salary,
                            fill= School_Type)) +
  geom_boxplot() + theme_minimal() + theme(legend.position = "none") +
  labs(y = "Starting Salary",title = "Starting salaries for different schools")

A polar bar chart offers a clear overview of the distribution of school types, providing a visual understanding of the prevalence of each type.

#Distribution of School Types
sal_type %>%
  group_by(School_Type) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = "", y = n, fill = School_Type)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  theme_minimal() +
  labs(x = NULL, y = NULL, fill = NULL, title = "Distribution of School Types") +
  theme(axis.text = element_blank(),  # Remove axis labels
        axis.title = element_blank()) +  # Remove axis titles
  geom_text(aes(label = paste0(round(n/sum(n) * 100, 1), "%")), position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette = "Blues")  # name the argument; an unnamed string is used as the legend title

Salary Distributions

Histograms and density plots show the distributions of starting and mid-career median salaries across schools. The starting median salaries are strongly right-skewed and concentrated at the lower end of the range: the highest starting median is $75,500, but the median across schools is $45,100. By mid-career the distribution is more spread out, and its median rises to $82,700.

# select starting and mid-career salaries and reformat to long
sal_type %>%
  select(Starting_Median_Salary, Mid_Career_Median_Salary) %>%
  pivot_longer(everything(), names_to = "timeline", values_to = "salary") %>%
  ggplot(aes(salary, fill = timeline)) +
  geom_density(alpha = 0.2, color = NA) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, position = 'dodge') +
  scale_fill_manual(values = c('green', 'purple')) + labs(fill="")+
  scale_x_continuous(labels = dollar) + theme_minimal()+
  theme(legend.position = "bottom",
        axis.text.y = element_blank(), axis.ticks.y = element_blank())

5. Communication

Key Findings

  1. Majors with Fastest Growth: Mathematics, Philosophy, and International Relations exhibit the fastest salary growth.

  2. Regional Disparities: California and the Northeastern region offer the highest average salaries.

  3. Impact of School Type: Ivy League schools have the highest starting salaries, with notable variations across different school types.

Actionable Insights

For individuals seeking higher salaries:

  • Consider majors with faster career growth.

  • Explore opportunities in regions like California and the Northeast.

  • Weigh the benefits of an Ivy League education.

Limitations

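It is important to acknowledge the limitations of this analysis, such as potential biases in the data and gaps in the dataset: schools with missing 10th or 90th percentile values were dropped before the summaries were computed. Ethical considerations, including privacy and data usage, were taken into account throughout the analysis. The dataset is freely available for access and use.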

Conclusion

In summary, this data study offers insight into the variables affecting graduate income. Through visualizations and targeted research questions, the analysis helps people make well-informed judgments about their education and career choices. The results provide useful information about the relationship between college options and potential wages.