Choosing a college major is a complex decision, involving personal interests, academic challenges, and career potential. If you want to dive into the short and long-term financial implications of this crucial choice, join me in this project where we’ll apply clustering analysis in R to explore salary growth by college major.

Using data from PayScale’s “Degrees That Pay You Back,” as referenced in the Wall Street Journal article Ivy League’s Big Edge: Starting Pay (https://www.wsj.com/articles/SB121746658635199271), we’ll start by cleaning and preprocessing the data. Then, we’ll implement k-means clustering to group majors based on salary percentiles at different career stages. We’ll visualize these clusters to understand which majors offer the best financial prospects and identify outliers. Finally, we’ll compare and interpret the clusters to reveal insights about career growth potential for different fields of study.

This analysis can be extended with additional datasets to explore how factors like the type and region of college attended impact salary potential.

# Read in data from 'college-salaries/degrees-that-pay-back.csv'
df <- read_csv("degrees_that_pay_back.csv", col_types = cols())
head(df)
## # A tibble: 6 × 8
##   `Undergraduate Major` `Starting Median Salary` `Mid-Career Median Salary`
##   <chr>                 <chr>                    <chr>                     
## 1 Accounting            $46,000.00               $77,100.00                
## 2 Aerospace Engineering $57,700.00               $101,000.00               
## 3 Agriculture           $42,600.00               $71,900.00                
## 4 Anthropology          $36,800.00               $61,500.00                
## 5 Architecture          $41,600.00               $76,800.00                
## 6 Art History           $35,800.00               $64,900.00                
## # ℹ 5 more variables:
## #   `Percent change from Starting to Mid-Career Salary` <dbl>,
## #   `Mid-Career 10th Percentile Salary` <chr>,
## #   `Mid-Career 25th Percentile Salary` <chr>,
## #   `Mid-Career 75th Percentile Salary` <chr>,
## #   `Mid-Career 90th Percentile Salary` <chr>
sapply(df, class)
##                               Undergraduate Major 
##                                       "character" 
##                            Starting Median Salary 
##                                       "character" 
##                          Mid-Career Median Salary 
##                                       "character" 
## Percent change from Starting to Mid-Career Salary 
##                                         "numeric" 
##                 Mid-Career 10th Percentile Salary 
##                                       "character" 
##                 Mid-Career 25th Percentile Salary 
##                                       "character" 
##                 Mid-Career 75th Percentile Salary 
##                                       "character" 
##                 Mid-Career 90th Percentile Salary 
##                                       "character"
# Remove spaces from column names for easier calling
names(df) <- make.names(names(df), unique=TRUE)
colnames(df) <- c('College.Major','Starting.Median.Salary','Mid.Career.Median.Salary','Career.Percent.Growth',
                  'Percentile.10','Percentile.25','Percentile.75','Percentile.90')
# Convert relevant columns to numeric and strip currency symbols (all except College.Major)
majors = df['College.Major']

salaries <- df %>%
    select(-College.Major) %>%
    mutate_all(function(x) as.numeric(gsub("[\\$,]","",x))) %>%
    mutate(Career.Percent.Growth = Career.Percent.Growth/100)
               
df = bind_cols(majors,salaries)
head(df)
## # A tibble: 6 × 8
##   College.Major         Starting.Median.Salary Mid.Career.Median.Salary
##   <chr>                                  <dbl>                    <dbl>
## 1 Accounting                             46000                    77100
## 2 Aerospace Engineering                  57700                   101000
## 3 Agriculture                            42600                    71900
## 4 Anthropology                           36800                    61500
## 5 Architecture                           41600                    76800
## 6 Art History                            35800                    64900
## # ℹ 5 more variables: Career.Percent.Growth <dbl>, Percentile.10 <dbl>,
## #   Percentile.25 <dbl>, Percentile.75 <dbl>, Percentile.90 <dbl>
sapply(df,class)
##            College.Major   Starting.Median.Salary Mid.Career.Median.Salary 
##              "character"                "numeric"                "numeric" 
##    Career.Percent.Growth            Percentile.10            Percentile.25 
##                "numeric"                "numeric"                "numeric" 
##            Percentile.75            Percentile.90 
##                "numeric"                "numeric"
# let's begin our clustering analysis using the Elbow Method to determine how many clusters we should be modeling

# Starting.Median.Salary, Mid.Career.Median.Salary, Percentile.10, and Percentile.90
k_means_data <- df %>%
    select(Starting.Median.Salary, Mid.Career.Median.Salary, Percentile.10, Percentile.90) %>%
    scale()
# Run 15 k_means algorithms using 1:20 clusters and save the total within cluster sum of squares for each model.
set.seed(7)
max_k <- 20
wss <- sapply(1:max_k, function(k){kmeans(k_means_data, k, nstart=50, iter.max = 15)$tot.withinss})

clusters = as.data.frame(wss)
clusters$k = 1:nrow(clusters)
ggplot(clusters,aes(x=k,y=wss)) +
    geom_point() +
    geom_line() +
    xlab('Number of Clusters') +
    ylab('Within groups sum of squares')

Looks like our optimal number of clusters might settle somewhere around 5 based on the graph above. Using this information, we can now run our k-means algorithm on the selected data and add the resulting cluster information to label our original dataframe.

# Set k equal to the number of clusters corresponding to the elbow location in the graph above
num_clusters <- 5

# Run the k-means algorithm with the specified number of clusters
k_means <- kmeans(k_means_data , num_clusters , iter.max = 15, nstart = 50)
df$clusters <- k_means[[1]]
ggplot(df, aes(x=Starting.Median.Salary,y=Mid.Career.Median.Salary,color=factor(clusters))) +
    geom_point(alpha=4/5,size=6) +
    scale_x_continuous(labels = scales::comma, limits=c(30000,110000)) +
    scale_y_continuous(labels = scales::comma, limits=c(30000,110000)) +
    xlab('Starting Median Salary') +
    ylab('Mid Career Median Salary') +
    coord_fixed() +
    scale_color_manual(name="Clusters",values=c("#EC2C73","#29AEC7", "#FFDD30", "#623AD1", "#FFA830")) +
    ggtitle('Clusters by Starting vs. Mid Career Median Salaries')

As expected, most data points are concentrated in the top left corner, showing a relatively linear relationship between starting and mid-career salaries—higher starting salaries often lead to higher mid-career salaries. The five clusters clearly highlight this trend, though there’s some overlap between clusters 4 and 5 and a few outliers in clusters 3 and 4. To better understand these anomalies, we should delve deeper into the mid-career percentiles and examine the contents of each cluster.

To do this, we can reshape our data to display the salaries for each college major within its respective cluster, grouped by mid-career percentiles. Visualizing this will help us explore the range of mid-career salary potential across clusters. We’ll also use ggplot’s facet_wrap function to compare each cluster side by side easily.

df_perc <- df %>%
    select(College.Major, Percentile.10, Percentile.25, Mid.Career.Median.Salary,Percentile.75, Percentile.90, clusters) %>%
    gather(key=percentile, value=salary, -c(College.Major, clusters))  

# Order the factor levels of percentile so Mid.Career.Median.Salary falls in the middle when graphing 
df_perc <- df_perc %>%
    mutate(percentile = factor(percentile, 
                               levels = c('Percentile.10', 
                                          'Percentile.25', 
                                          'Mid.Career.Median.Salary', 
                                          'Percentile.75', 
                                          'Percentile.90')))

# Graph the range of percentiles by major and cluster side by side
ggplot(df_perc, aes(x=percentile,y=salary, group=College.Major, color=factor(clusters))) +
    geom_point() +
    geom_line() +
    facet_wrap(~ clusters, scales = "free_x") +
    ggtitle('Clusters by Mid Career Percentile') +
    scale_color_manual(values=c("#EC2C73","#29AEC7", "#FFDD30", "#623AD1", "#FFA830")) +
    theme(legend.position="none",axis.text.x = element_text(size=7, angle=25))

Accountants are known for having stable job security, but once you’re in the big leagues it looks like Marketing, Philosophy, and Political Science are the winners in this cluster within the 90th percentile. These majors are fairly middle of the road in this dataset, starting off not too low and not too high in the lowest percentile, with solid growth potential.

ggplot(df_perc[df_perc$clusters==1,], aes(x=percentile,y=salary, group=College.Major, color=College.Major, order=salary)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 1:  The Goldilocks') +
    facet_wrap(~ clusters) +
    theme(axis.text.x = element_text(size=7, angle=30))

Nursing stands out as the outlier in cluster 2, offering a stronger safety net from the lowest percentile up to the median. In contrast, the other majors in this cluster, like Spanish and Religion, show limited growth potential, particularly within the 90th percentile.

ggplot(df_perc[df_perc$clusters==2,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 2: The Glass Ceiling') +
    facet_wrap(~ clusters) +
    theme(axis.text.x = element_text(size=7, angle=30)) 

These liberal arts majors might find themselves in a precarious position at the lowest percentile, but there’s still hope for those who persevere. The added insight from the mid-career percentile range helps distinguish this cluster from the overlap seen with cluster 2. Music emerges as the riskiest major with the lowest 10th percentile salary, while Drama claims the top spot for growth potential in the 90th percentile.

ggplot(df_perc[df_perc$clusters==3,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 3:  The Liberal Arts') +
    facet_wrap(~ clusters) +
    theme(axis.text.x = element_text(size=7, angle=30)) 

The next is the over-acheiver cluster.

ggplot(df_perc[df_perc$clusters==3,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 3:  The Liberal Arts') +
    facet_wrap(~ clusters) +
    theme(axis.text.x = element_text(size=7, angle=30)) 

Aside from the previously noted outlier, now identified as the Physician Assistant, this cluster showcases the financially secure path of engineering majors. They start strong with the highest 10th percentile rankings and secure the second highest in the 90th percentile.

ggplot(df_perc[df_perc$clusters==5,], aes(x=percentile,y=salary, group=College.Major, color=College.Major)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 5:  The Engineers') +
    facet_wrap(~ clusters) +
    theme(axis.text.x = element_text(size=7, angle=30)) 

Thank you! Hope you learned something new!