Choosing a college major is a complex decision, involving personal interests, academic challenges, and career potential. If you want to dive into the short and long-term financial implications of this crucial choice, join me in this project where we’ll apply clustering analysis in R to explore salary growth by college major.
Using data from PayScale’s “Degrees That Pay You Back,” as referenced in the Wall Street Journal article Ivy League’s Big Edge: Starting Pay (https://www.wsj.com/articles/SB121746658635199271), we’ll start by cleaning and preprocessing the data. Then, we’ll implement k-means clustering to group majors based on salary percentiles at different career stages. We’ll visualize these clusters to understand which majors offer the best financial prospects and identify outliers. Finally, we’ll compare and interpret the clusters to reveal insights about career growth potential for different fields of study.
This analysis can be extended with additional datasets to explore how factors like the type and region of college attended impact salary potential.
# Read in data from 'college-salaries/degrees-that-pay-back.csv'
df <- read_csv("degrees_that_pay_back.csv", col_types = cols())
head(df)
## # A tibble: 6 × 8
## `Undergraduate Major` `Starting Median Salary` `Mid-Career Median Salary`
## <chr> <chr> <chr>
## 1 Accounting $46,000.00 $77,100.00
## 2 Aerospace Engineering $57,700.00 $101,000.00
## 3 Agriculture $42,600.00 $71,900.00
## 4 Anthropology $36,800.00 $61,500.00
## 5 Architecture $41,600.00 $76,800.00
## 6 Art History $35,800.00 $64,900.00
## # ℹ 5 more variables:
## # `Percent change from Starting to Mid-Career Salary` <dbl>,
## # `Mid-Career 10th Percentile Salary` <chr>,
## # `Mid-Career 25th Percentile Salary` <chr>,
## # `Mid-Career 75th Percentile Salary` <chr>,
## # `Mid-Career 90th Percentile Salary` <chr>
sapply(df, class)
## Undergraduate Major
## "character"
## Starting Median Salary
## "character"
## Mid-Career Median Salary
## "character"
## Percent change from Starting to Mid-Career Salary
## "numeric"
## Mid-Career 10th Percentile Salary
## "character"
## Mid-Career 25th Percentile Salary
## "character"
## Mid-Career 75th Percentile Salary
## "character"
## Mid-Career 90th Percentile Salary
## "character"
# Remove spaces from column names for easier calling
names(df) <- make.names(names(df), unique=TRUE)
colnames(df) <- c('College.Major','Starting.Median.Salary','Mid.Career.Median.Salary','Career.Percent.Growth',
'Percentile.10','Percentile.25','Percentile.75','Percentile.90')
# Convert relevant columns to numeric and strip currency symbols (all except College.Major)
majors = df['College.Major']
salaries <- df %>%
select(-College.Major) %>%
mutate_all(function(x) as.numeric(gsub("[\\$,]","",x))) %>%
mutate(Career.Percent.Growth = Career.Percent.Growth/100)
df = bind_cols(majors,salaries)
head(df)
## # A tibble: 6 × 8
## College.Major Starting.Median.Salary Mid.Career.Median.Salary
## <chr> <dbl> <dbl>
## 1 Accounting 46000 77100
## 2 Aerospace Engineering 57700 101000
## 3 Agriculture 42600 71900
## 4 Anthropology 36800 61500
## 5 Architecture 41600 76800
## 6 Art History 35800 64900
## # ℹ 5 more variables: Career.Percent.Growth <dbl>, Percentile.10 <dbl>,
## # Percentile.25 <dbl>, Percentile.75 <dbl>, Percentile.90 <dbl>
sapply(df,class)
## College.Major Starting.Median.Salary Mid.Career.Median.Salary
## "character" "numeric" "numeric"
## Career.Percent.Growth Percentile.10 Percentile.25
## "numeric" "numeric" "numeric"
## Percentile.75 Percentile.90
## "numeric" "numeric"
# let's begin our clustering analysis using the Elbow Method to determine how many clusters we should be modeling
# Starting.Median.Salary, Mid.Career.Median.Salary, Percentile.10, and Percentile.90
k_means_data <- df %>%
select(Starting.Median.Salary, Mid.Career.Median.Salary, Percentile.10, Percentile.90) %>%
scale()
# Run 15 k_means algorithms using 1:20 clusters and save the total within cluster sum of squares for each model.
set.seed(7)
max_k <- 20
wss <- sapply(1:max_k, function(k){kmeans(k_means_data, k, nstart=50, iter.max = 15)$tot.withinss})
clusters = as.data.frame(wss)
clusters$k = 1:nrow(clusters)
ggplot(clusters,aes(x=k,y=wss)) +
geom_point() +
geom_line() +
xlab('Number of Clusters') +
ylab('Within groups sum of squares')
Looks like our optimal number of clusters might settle somewhere around
5 based on the graph above. Using this information, we can now run our
k-means algorithm on the selected data and add the resulting cluster
information to label our original dataframe.
# Set k equal to the number of clusters corresponding to the elbow location in the graph above
num_clusters <- 5
# Run the k-means algorithm with the specified number of clusters
k_means <- kmeans(k_means_data , num_clusters , iter.max = 15, nstart = 50)
df$clusters <- k_means[[1]]
ggplot(df, aes(x=Starting.Median.Salary,y=Mid.Career.Median.Salary,color=factor(clusters))) +
geom_point(alpha=4/5,size=6) +
scale_x_continuous(labels = scales::comma, limits=c(30000,110000)) +
scale_y_continuous(labels = scales::comma, limits=c(30000,110000)) +
xlab('Starting Median Salary') +
ylab('Mid Career Median Salary') +
coord_fixed() +
scale_color_manual(name="Clusters",values=c("#EC2C73","#29AEC7", "#FFDD30", "#623AD1", "#FFA830")) +
ggtitle('Clusters by Starting vs. Mid Career Median Salaries')
As expected, most data points are concentrated in the top left corner,
showing a relatively linear relationship between starting and mid-career
salaries—higher starting salaries often lead to higher mid-career
salaries. The five clusters clearly highlight this trend, though there’s
some overlap between clusters 4 and 5 and a few outliers in clusters 3
and 4. To better understand these anomalies, we should delve deeper into
the mid-career percentiles and examine the contents of each cluster.
To do this, we can reshape our data to display the salaries for each college major within its respective cluster, grouped by mid-career percentiles. Visualizing this will help us explore the range of mid-career salary potential across clusters. We’ll also use ggplot’s facet_wrap function to compare each cluster side by side easily.
df_perc <- df %>%
select(College.Major, Percentile.10, Percentile.25, Mid.Career.Median.Salary,Percentile.75, Percentile.90, clusters) %>%
gather(key=percentile, value=salary, -c(College.Major, clusters))
# Order the factor levels of percentile so Mid.Career.Median.Salary falls in the middle when graphing
df_perc <- df_perc %>%
mutate(percentile = factor(percentile,
levels = c('Percentile.10',
'Percentile.25',
'Mid.Career.Median.Salary',
'Percentile.75',
'Percentile.90')))
# Graph the range of percentiles by major and cluster side by side
ggplot(df_perc, aes(x=percentile,y=salary, group=College.Major, color=factor(clusters))) +
geom_point() +
geom_line() +
facet_wrap(~ clusters, scales = "free_x") +
ggtitle('Clusters by Mid Career Percentile') +
scale_color_manual(values=c("#EC2C73","#29AEC7", "#FFDD30", "#623AD1", "#FFA830")) +
theme(legend.position="none",axis.text.x = element_text(size=7, angle=25))
Accountants are known for having stable job security, but once you’re in
the big leagues it looks like Marketing, Philosophy, and Political
Science are the winners in this cluster within the 90th percentile.
These majors are fairly middle of the road in this dataset, starting off
not too low and not too high in the lowest percentile, with solid growth
potential.
ggplot(df_perc[df_perc$clusters==1,], aes(x=percentile,y=salary, group=College.Major, color=College.Major, order=salary)) +
geom_point() +
geom_line() +
ggtitle('Cluster 1: The Goldilocks') +
facet_wrap(~ clusters) +
theme(axis.text.x = element_text(size=7, angle=30))
Nursing stands out as the outlier in cluster 2, offering a stronger
safety net from the lowest percentile up to the median. In contrast, the
other majors in this cluster, like Spanish and Religion, show limited
growth potential, particularly within the 90th percentile.
ggplot(df_perc[df_perc$clusters==2,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
geom_point() +
geom_line() +
ggtitle('Cluster 2: The Glass Ceiling') +
facet_wrap(~ clusters) +
theme(axis.text.x = element_text(size=7, angle=30))
These liberal arts majors might find themselves in a precarious position
at the lowest percentile, but there’s still hope for those who
persevere. The added insight from the mid-career percentile range helps
distinguish this cluster from the overlap seen with cluster 2. Music
emerges as the riskiest major with the lowest 10th percentile salary,
while Drama claims the top spot for growth potential in the 90th
percentile.
ggplot(df_perc[df_perc$clusters==3,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
geom_point() +
geom_line() +
ggtitle('Cluster 3: The Liberal Arts') +
facet_wrap(~ clusters) +
theme(axis.text.x = element_text(size=7, angle=30))
The next is the over-acheiver cluster.
ggplot(df_perc[df_perc$clusters==3,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
geom_point() +
geom_line() +
ggtitle('Cluster 3: The Liberal Arts') +
facet_wrap(~ clusters) +
theme(axis.text.x = element_text(size=7, angle=30))
Aside from the previously noted outlier, now identified as the Physician
Assistant, this cluster showcases the financially secure path of
engineering majors. They start strong with the highest 10th percentile
rankings and secure the second highest in the 90th percentile.
ggplot(df_perc[df_perc$clusters==5,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
geom_point() +
geom_line() +
ggtitle('Cluster 5: The Engineers') +
facet_wrap(~ clusters) +
theme(axis.text.x = element_text(size=7, angle=30))
Thank you! Hope you learned something new!