Introduction - Why Do This Analysis?

Unsupervised Machine Learning helps us make sense of patterns in data. In particular, clustering analysis helps us find interesting ways to group items based on how similar they are. One useful application of clustering is in higher education. There are so many universities in the U.S., and so many ways to measure how ‘good’ a school is, that it’s easy to get lost in the weeds of the data. For example, consider two schools in my home state of Ohio. Should we think of the College of Wooster as a comparable school to Xavier University? Both are private colleges that enroll less than 6,000 students, accept over half of their applicants, and charge relatively high (>$40,000) tuition. When we’re classifying universities, should they be in the same group?

For this analysis, I’ll attempt to measure schools based on their selectivity. ‘Selectivity’ is an admittedly vague target, but there are a handful of meaningful statistics that represent selectivity based on university prestige and academic rigor. These are metrics such as tuition rates, admission rates, ACT scores for incoming students, and total enrollment. With clustering, there is no single ‘correct’ answer, and as an unsupervised Machine Learning technique we aren’t trying to come up with a single group of ‘best’ schools; rather, this is a way to present a useful framework for revealing patterns in school selectivity data, allowing us to compare schools on the same playing field.

Data Gathering & Exploration

For this analysis, we will make use of the IPEDS (Integrated Postsecondary Education Data System) database, a government source for higher education data. The latest data covers the 2019-2020 school year.

IPEDS publishes a Microsoft Access database every year with hundreds of variables covering thousands of higher education institutions, so we will connect to that to grab the data. The advantage of this setup is that we quickly can access data from every IPEDS table at once instead of grabbing each file individually.

This Access database contains dozens of tables, so a good first step is to explore the data from IPEDS’ site directly, downloading the accompanying README files, and examining the type of data available in each table. Let’s get to it!

Load packages and connect to the Access database

For this analysis, we need the tidyverse as well as the RODBC package, which allows connections to Access databases.

# Connect to Access db
channel19 <- odbcConnectAccess2007("C:\\Users\\jackh\\OneDrive\\Documents\\R Analytics\\Enrollment Data\\IPEDS201920.accdb")
Build the data frame

We will require data on multiple categories. Note that IPEDS stores each variable with a mnemonic - so note that I have taken steps to translate each mnemonic into a pithy but more informative variable name. For instance, in the first table I renamed the IPEDS ‘ACTCM75’ variable as ‘act_score.’ The following code condenses all the data pulls into a series of tables for clarity.

#Institutional characteristics
institution <- sqlFetch(channel19, "HD2019")
institution <- institution %>% 
  select(UNITID, INSTNM, ICLEVEL, CONTROL,  LATITUDE, LONGITUD, ADDR, CITY, STABBR, ZIP, CBSA) %>% 
  rename(school = INSTNM)

#Admissions
adm <- sqlFetch(channel19, "ADM2019")
adm <- adm %>% 
  select(UNITID, ACTCM75) %>% 
  rename(act_score = ACTCM75
         )
#DRAdmissions table
adm_yield <- sqlFetch(channel19, "DRVADM2019")
adm_yield <- adm_yield %>% 
  select(UNITID, DVADM01) %>% 
  rename(pct_admitted = DVADM01)

#Enrollment characteristics
enrollment <- sqlFetch(channel19, "DRVEF2019")
enrollment <- enrollment %>% 
  select(UNITID, EFUG, EFUGFT, EFUGPT, PCUDEEXC, RMINSTTP, RMOUSTTP, RMFRGNCP) %>% 
  rename(ug_enroll = EFUG,
         ft_enroll = EFUGFT,
         pt_enroll = EFUGPT,
         pct_online = PCUDEEXC,
         pct_instate = RMINSTTP,
         pct_outstate = RMOUSTTP,
         pct_foreign = RMFRGNCP)
enrollment_d <- sqlFetch(channel19, "EF2019D")
enrollment_d <- enrollment_d %>%
  select(UNITID, RET_PCF) %>% 
  rename(retention_rate = RET_PCF)
enrollment <- left_join(enrollment, enrollment_d, by = c("UNITID" = "UNITID"))

#Tuition
tuition <- sqlFetch(channel19, "IC2019_AY")
tuition <- tuition %>% 
  select(UNITID, TUITION2, TUITION3) %>% 
  rename(in_state_tuition = TUITION2,
         out_state_tuition = TUITION3)

Now we can join all this data together into a master data frame and filter the resulting data to only include 4-year, non-profit schools (ie, exclude 2 year/technical colleges and for-profit institutions, as those are outside the scope of this analysis).

###Join it all together###
school_data <- institution %>% 
  left_join(y = adm, by = c("UNITID" = "UNITID")) %>% 
  left_join(y = adm_yield, by = c("UNITID" = "UNITID")) %>% 
  left_join(y = enrollment, by = c("UNITID" = "UNITID")) %>% 
  left_join(y = tuition, by = c("UNITID" = "UNITID")) 
#And limit the analysis to 4 year private & public non-profit institutions
school_data <- school_data %>% 
  filter(ICLEVEL == 1,
         CONTROL != 3) %>% 
  select(-ICLEVEL)

#Let's create a key linking each school ID to a school, for later
school_id_key <- school_data %>% 
  select(UNITID, school)

Data Cleaning

We now have a primary data frame. Let’s take a look at it.

school_data <- school_data %>% 
  mutate(UNITID = as.character(UNITID))

glimpse(school_data)
## Rows: 2,540
## Columns: 22
## $ UNITID            <chr> "100654", "100663", "100690", "100706", "100724", "1~
## $ school            <chr> "Alabama A & M University", "University of Alabama a~
## $ CONTROL           <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1~
## $ LATITUDE          <dbl> 34.78337, 33.50570, 32.36261, 34.72456, 32.36432, 33~
## $ LONGITUD          <dbl> -86.56850, -86.79935, -86.17401, -86.64045, -86.2956~
## $ ADDR              <chr> "4900 Meridian Street", "Administration Bldg Suite 1~
## $ CITY              <chr> "Normal", "Birmingham", "Montgomery", "Huntsville", ~
## $ STABBR            <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"~
## $ ZIP               <chr> "35762", "35294-0110", "36117-3553", "35899", "36104~
## $ CBSA              <int> 26620, 13820, 33860, 26620, 33860, 46220, 46220, 266~
## $ act_score         <int> 19, 29, NA, 31, 19, NA, 31, NA, 23, 31, 29, 23, NA, ~
## $ pct_admitted      <int> 92, 74, NA, 83, 97, NA, 83, NA, 90, 81, 54, 78, 78, ~
## $ ug_enroll         <int> 5273, 13836, 365, 7989, 3750, NA, 32795, 2774, 4523,~
## $ ft_enroll         <int> 4975, 10315, 192, 6749, 3446, NA, 29135, 1194, 3505,~
## $ pt_enroll         <int> 298, 3521, 173, 1240, 304, NA, 3660, 1580, 1018, 206~
## $ pct_online        <int> 2, 17, 100, 3, 1, NA, 9, 66, 8, 2, 0, 30, 37, 0, 54,~
## $ pct_instate       <int> NA, 83, NA, 63, 57, NA, 37, NA, 90, 54, 61, 75, NA, ~
## $ pct_outstate      <int> NA, 15, NA, 35, 41, NA, 61, NA, 6, 45, 37, 22, NA, 3~
## $ pct_foreign       <int> NA, 1, NA, 2, 2, NA, 1, NA, 3, 0, 1, 2, NA, 0, NA, 0~
## $ retention_rate    <int> 57, 83, NA, 83, 61, NA, 87, NA, 66, 91, 80, 61, 25, ~
## $ in_state_tuition  <int> 8610, 8568, 9000, 9730, 8328, NA, 10780, 6180, 7752,~
## $ out_state_tuition <int> 17220, 20400, 9000, 22126, 16656, NA, 30250, 12360, ~
summary(school_data)
##     UNITID             school             CONTROL         LATITUDE     
##  Length:2540        Length:2540        Min.   :1.000   Min.   :-14.32  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.: 34.71  
##  Mode  :character   Mode  :character   Median :2.000   Median : 39.73  
##                                        Mean   :1.665   Mean   : 38.04  
##                                        3rd Qu.:2.000   3rd Qu.: 41.79  
##                                        Max.   :2.000   Max.   : 71.32  
##                                                                        
##     LONGITUD           ADDR               CITY              STABBR         
##  Min.   :-170.74   Length:2540        Length:2540        Length:2540       
##  1st Qu.: -95.98   Class :character   Class :character   Class :character  
##  Median : -84.64   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : -88.77                                                           
##  3rd Qu.: -76.65                                                           
##  Max.   : 171.38                                                           
##                                                                            
##      ZIP                 CBSA         act_score      pct_admitted   
##  Length:2540        Min.   :   -2   Min.   :16.00   Min.   :  0.00  
##  Class :character   1st Qu.:19060   1st Qu.:23.00   1st Qu.: 57.00  
##  Mode  :character   Median :31180   Median :25.00   Median : 71.00  
##                     Mean   :29206   Mean   :25.97   Mean   : 67.91  
##                     3rd Qu.:39030   3rd Qu.:28.00   3rd Qu.: 83.00  
##                     Max.   :49780   Max.   :36.00   Max.   :100.00  
##                                     NA's   :1303    NA's   :850     
##    ug_enroll       ft_enroll         pt_enroll         pct_online   
##  Min.   :    0   Min.   :    0.0   Min.   :    0.0   Min.   :  0.0  
##  1st Qu.:  305   1st Qu.:  196.8   1st Qu.:   16.0   1st Qu.:  0.0  
##  Median : 1524   Median : 1195.0   Median :  146.0   Median :  3.0  
##  Mean   : 4307   Mean   : 3244.3   Mean   : 1063.0   Mean   : 10.6  
##  3rd Qu.: 4410   3rd Qu.: 3194.5   3rd Qu.:  785.2   3rd Qu.: 12.0  
##  Max.   :98630   Max.   :98630.0   Max.   :65530.0   Max.   :100.0  
##  NA's   :96      NA's   :96        NA's   :96        NA's   :351    
##   pct_instate      pct_outstate    pct_foreign     retention_rate  
##  Min.   :  0.00   Min.   :  0.0   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 56.00   1st Qu.:  6.0   1st Qu.: 0.000   1st Qu.: 67.00  
##  Median : 78.00   Median : 19.0   Median : 1.000   Median : 76.00  
##  Mean   : 70.85   Mean   : 25.5   Mean   : 3.057   Mean   : 74.55  
##  3rd Qu.: 92.00   3rd Qu.: 39.0   3rd Qu.: 4.000   3rd Qu.: 84.00  
##  Max.   :100.00   Max.   :100.0   Max.   :45.000   Max.   :100.00  
##  NA's   :1341     NA's   :1341    NA's   :1341     NA's   :659     
##  in_state_tuition out_state_tuition
##  Min.   :    0    Min.   :    0    
##  1st Qu.: 7070    1st Qu.:12450    
##  Median :13500    Median :20747    
##  Mean   :19914    Mean   :23465    
##  3rd Qu.:31180    3rd Qu.:32380    
##  Max.   :70400    Max.   :70400    
##  NA's   :363      NA's   :363

Though we have data for in and out of state tuition separately, we need to calculate an ‘average’ tuition rate for all students. Note that the percent of students paying in/out of state don’t add to 100 (in the data, schools can enter some students’ in-state status as ‘unknown’) so we will have to re-scale that data to calculate the average.

school_data <- school_data %>% 
  mutate(in_state_share = pct_instate/ (pct_instate + pct_outstate),
         out_state_share = 1-in_state_share,
         avg_tuition = (in_state_share * in_state_tuition) + (out_state_share * out_state_tuition)
         ) %>% 
  select(-in_state_share, -out_state_share)

summary(school_data$avg_tuition)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    7318   14937   20497   32770   58130    1347

One thing I noticed about the data in the first exploration is that the ACT score and percent admitted fields seem to have the most NAs. I wonder if these NAs are correlated. I’ll make use of the plot_histogram() feature from the DataExplorer package to give a better sense of the NA data here.

library(DataExplorer)

#Could view a table of missing data here...
#school_data %>% 
#  filter(is.na(act_score)) %>% 
#  summary()

#...or get a visual view of missing data here.
school_data %>% 
  filter(is.na(act_score)) %>% 
  plot_histogram()

This is very interesting. I noticed a few things here. 1) All the schools missing ACT data are also missing admissions yield data. 2) There doesn’t seem to be a unifying feature that links each school with missing ACT score except enrollment size - basically every school with missing ACT data has less than 2,000 students. Let’s confirm:

school_data %>% 
  filter(is.na(act_score)) %>% 
  select(ug_enroll) %>% 
  summary()
##    ug_enroll      
##  Min.   :    0.0  
##  1st Qu.:   22.0  
##  Median :  369.5  
##  Mean   : 2594.3  
##  3rd Qu.: 2059.0  
##  Max.   :98630.0  
##  NA's   :95

There are enough missing values for ACT score that substituting with a single missing value would alter the shape of the underlying distribution. Since the missing schools are so small, they likely won’t have an impact on our analysis. So let’s remove them. Also, k-means can only handle numeric data so we will remove any categorical features.

school_full <- school_data

school_data <- school_data %>% 
  filter(!is.na(act_score)) %>% 
  drop_na()

modeled_data <- school_data %>% 
  column_to_rownames(var = 'UNITID') %>% 
  mutate(pct_admitted = ifelse(is.na(pct_admitted), imputed_admissions, pct_admitted)) %>% 
  select(-ft_enroll, -pt_enroll, -pct_online, -pct_foreign, -retention_rate, -in_state_tuition, -out_state_tuition, -school,
         -CITY, -STABBR, -ZIP, -CBSA, -LATITUDE, -LONGITUD, -ADDR, -CONTROL) #These variables can explain the results of the analysis but themselves don't define 'selectivity' in our analysis

plot_histogram(modeled_data)

glimpse(modeled_data)
## Rows: 806
## Columns: 6
## $ act_score    <int> 29, 31, 19, 31, 23, 31, 29, 23, 24, 24, 23, 27, 25, 24, 2~
## $ pct_admitted <int> 74, 83, 97, 83, 90, 81, 54, 78, 59, 55, 65, 79, 50, 88, 7~
## $ ug_enroll    <int> 13836, 7989, 3750, 32795, 4523, 24594, 1209, 2192, 1008, ~
## $ pct_instate  <int> 83, 63, 57, 37, 90, 54, 61, 75, 70, 77, 18, 78, 45, 73, 8~
## $ pct_outstate <int> 15, 35, 41, 61, 6, 45, 37, 22, 30, 23, 78, 21, 52, 25, 12~
## $ avg_tuition  <dbl> 10379.020, 14157.143, 11812.163, 22899.082, 8356.500, 187~
summary(modeled_data)
##    act_score      pct_admitted      ug_enroll      pct_instate    
##  Min.   :16.00   Min.   :  5.00   Min.   :   72   Min.   :  1.00  
##  1st Qu.:24.00   1st Qu.: 57.00   1st Qu.: 1676   1st Qu.: 55.00  
##  Median :26.00   Median : 71.00   Median : 3448   Median : 74.00  
##  Mean   :26.29   Mean   : 67.16   Mean   : 7434   Mean   : 68.23  
##  3rd Qu.:29.00   3rd Qu.: 82.00   3rd Qu.: 9155   3rd Qu.: 88.75  
##  Max.   :36.00   Max.   :100.00   Max.   :76099   Max.   :100.00  
##   pct_outstate    avg_tuition   
##  Min.   : 0.00   Min.   :    0  
##  1st Qu.: 9.00   1st Qu.: 9125  
##  Median :23.00   Median :19081  
##  Mean   :27.87   Mean   :23385  
##  3rd Qu.:41.00   3rd Qu.:34985  
##  Max.   :96.00   Max.   :58130

Now that we cleaned up the data, we can look at what fields will go into our k-means analysis.

colnames(modeled_data)
## [1] "act_score"    "pct_admitted" "ug_enroll"    "pct_instate"  "pct_outstate"
## [6] "avg_tuition"

K-Means Analysis

We have a nice, tidy dataset. Note that removing the schools with missing NA values for ACT score got rid of every single other NA value, so that makes me feel better about removing them – those schools had lots of missing values so keeping them would not add value to the analysis.

Now we can start analyzing our data! K-means requires that all data is on the same magnitude, so that fields with large values or wide arrays of values don’t have an unfair influence on the dataset. We can do that easily with the scale() function.

modeled_data <- modeled_data %>% 
  scale()

summary(modeled_data)
##    act_score         pct_admitted       ug_enroll        pct_instate     
##  Min.   :-2.63655   Min.   :-3.0233   Min.   :-0.7925   Min.   :-2.7288  
##  1st Qu.:-0.58657   1st Qu.:-0.4941   1st Qu.:-0.6198   1st Qu.:-0.5369  
##  Median :-0.07408   Median : 0.1868   Median :-0.4291   Median : 0.2344  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.69466   3rd Qu.: 0.7219   3rd Qu.: 0.1852   3rd Qu.: 0.8331  
##  Max.   : 2.48839   Max.   : 1.5973   Max.   : 7.3912   Max.   : 1.2898  
##   pct_outstate      avg_tuition     
##  Min.   :-1.2484   Min.   :-1.4978  
##  1st Qu.:-0.8452   1st Qu.:-0.9133  
##  Median :-0.2181   Median :-0.2757  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5881   3rd Qu.: 0.7429  
##  Max.   : 3.0518   Max.   : 2.2253

Choosing the right value for ‘k’

While the ‘optimal’ number of clusters is ultimately a judgment call, there are a few ways to determine how many clusters we should settle with. There are three quantitative measures that help us here: 1) the ‘elbow’ method, which calculates the sum of distances between each item within a cluster for various values of k – and the ‘elbow point’ is the optimal number of clusters, since this is the point at which increasing k no longer yields a significant improvement in the total sum of squares; 2) The Silhouette Method - which calculates how closely an item is matched with other items in the same cluster and how loosely it is with other clusters, where a high average value is optimal; and 3) The Gap Statistic, which compares the difference between clusters in an observed dataset and clusters in randomly-generated datasets. These three methods are ultimately suggestions, and it’s up to the analyst to choose the optimal k.

library(factoextra)
###Choose the right k
#Elbow method
fviz_nbclust(modeled_data, kmeans, method = "wss") #This suggests that 2 or 4 is the optimal k

#Average silhouette method
fviz_nbclust(modeled_data, kmeans, method = "silhouette") #This suggests optimal k is 2

fviz_nbclust(modeled_data, kmeans, method = "gap_stat") #This suggests 10 is the optimal k

It seems like two of the three methods seem to point to 2 or 4 clusters, while the third method (the most complicated of the three, I might add) points to 10. While there is no ‘one’ answer to this question, ideally limiting the clusters somewhat would be ideal, since it is more intuitive. However, we can easily run three analyses and compare them!

Performing clustering

set.seed(1234)
k_2 <- kmeans(modeled_data, centers = 2, nstart = 50)
k_4 <- kmeans(modeled_data, centers = 4, nstart = 50)
k_10 <- kmeans(modeled_data, centers = 10, nstart = 50)

k_2$size
## [1] 599 207
k_4$size
## [1]  92 370 261  83
k_10$size
##  [1]  77  47  56 141  83  72  51  59 133  87
#Assign each school to a cluster
school_data <- school_data %>% 
  mutate(cluster_2 = k_2$cluster) %>% 
  mutate(cluster_4 = k_4$cluster) %>% 
  mutate(cluster_10 = k_10$cluster)

Analyzing results

We can start by simply summarizing each of the resulting groups across our three frameworks.

school_data %>% 
  select(cluster_2, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>% 
  group_by(cluster_2) %>% 
  summarize_all('mean')
## # A tibble: 2 x 8
##   cluster_2 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
##       <int>   <dbl>     <dbl>        <dbl>     <dbl>      <dbl>       <dbl>
## 1         1    1.39      25.1         72.2     8252.       7.21        79.8
## 2         2    1.91      29.7         52.7     5068.       3.25        34.8
## # ... with 1 more variable: avg_tuition <dbl>
school_data %>% 
  select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>% 
  group_by(cluster_4) %>% 
  summarize_all('mean')
## # A tibble: 4 x 8
##   cluster_4 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
##       <int>   <dbl>     <dbl>        <dbl>     <dbl>      <dbl>       <dbl>
## 1         1    1.01      29.3         63.1    28508.      4.59         76.2
## 2         2    1.29      24.0         74.4     5749.      8.72         85.7
## 3         3    1.90      26.3         69.7     3080.      5.03         55.4
## 4         4    1.98      32.8         31.2     5280.      0.410        21.7
## # ... with 1 more variable: avg_tuition <dbl>
school_data %>% 
  select(cluster_10, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>% 
  group_by(cluster_10) %>% 
  summarize_all('mean')
## # A tibble: 10 x 8
##    cluster_10 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
##         <int>   <dbl>     <dbl>        <dbl>     <dbl>      <dbl>       <dbl>
##  1          1    1.90      23.5         55.3     1783.      8.55         56.5
##  2          2    1.02      31.4         51.4    31520.      3.62         70.0
##  3          3    1.82      27.1         76.1     4151.      4.29         28.6
##  4          4    2         26.2         74.2     2411.      4.26         76.3
##  5          5    1         26.6         75.7    21091.      6.28         84.9
##  6          6    1.26      24.1         85.9     4413.      9.25         65.2
##  7          7    2         34.0         18.6     5823.      0.510        17.2
##  8          8    1.97      30.6         58.7     4474.      0.780        44.1
##  9          9    1.14      24.3         82.7     5785.      8.19         90.7
## 10         10    1.25      22.7         55.8     4625.     11.2          88.6
## # ... with 1 more variable: avg_tuition <dbl>

We can also visualize the results. Note that this chart reduces the data into a two-dimensional graph via Principal Component Analysis, which makes the chart much less interpretable… but nonetheless gives a good sense of the rough shapes and overlaps present within our clusters.

fviz_cluster(k_2,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 2 Clusters")

fviz_cluster(k_4,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 4 Clusters")

fviz_cluster(k_10,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 10 Clusters")

To me, the ten-group split is too many. While some of the groups stand out – from the table we can see that group 7 is clearly the most academically rigorous of the clusters with low average acceptance rates and very high tuition, and group 2 contains large schools – many of the other groups seem fairly similar along several dimensions. Depending on what you’re trying to get out of the analysis, having 10 groups might make sense: if you want a taxonomy of all schools, being more granular might be useful. However, here I am just trying to see if analytics can give us an effective first cut at describing universities. Two groups doesn’t seem descriptive enough, so sticking with 4 groups feels right.

Let’s see the summary table on the four-group split again.

library(kableExtra)
school_data %>% 
  select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>% 
  group_by(cluster_4) %>% 
  summarize_all('mean') %>%
  kable(digits=2, format = 'markdown',row.names = TRUE) %>% 
  kable_styling(full_width = T,
                font_size = 12) %>% 
  row_spec(row = 0, color = '#660033')
cluster_4 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate avg_tuition
1 1 1.01 29.30 63.11 28507.59 4.59 76.17 12814.54
2 2 1.29 24.04 74.42 5749.21 8.72 85.70 13937.27
3 3 1.90 26.34 69.73 3080.32 5.03 55.44 31807.09
4 4 1.98 32.81 31.20 5280.36 0.41 21.70 50734.84

Evaluating the Clusters

  • Group 1 is the ‘large publics’ bucket - every school in this group is a public school. The average tuition in this group is quite low, with an average of <$13,000. Based on admissions standards, this group is moderately selective.
  • Group 2 is the less selective category. They have the lowest entrance test requirements and the highest share of students that come from within their state. Although not every school in this group is public (ie control is not always 1, which denotes a public school), schools in this group also charge a low average tuition.
  • Group 3 is the ‘small privates’. Although this group has, in general, similar academic requirements as the first two groups, they charge over twice as much. These schools are small, on average, and they do a better job at attracting students from out-of-state.
  • Group 4 are the ‘elites’. This group stands out from the rest on its ability to charge a lot in average tuition – over $50,000 on average! – and accepts the lowest percentage of students.

It’s interesting that even though control is not a metric included in the visualization, the four groups have a relatively clean split between public and private ones.

Now that we have a sense of what makes up our data, we can give our clusters some more informative names.

summary_table <- school_data %>% 
    select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition, retention_rate) %>% 
  group_by(cluster_4) %>% 
  summarize_all('mean')

summary_table <- summary_table %>% 
  mutate(cluster_4 = ifelse(cluster_4=='1', 'Large Publics',
                            ifelse(cluster_4=='2', 'Less Selective',
                                   ifelse(cluster_4=='3', 'Small Privates', 'Elites')))) %>% 
  rename(cluster = cluster_4,
         control = CONTROL)

kable(summary_table, 'html', digits = c(0,2,2,2,0,1,1,0,2), align = 'lcccccccc') %>% 
  kable_styling(full_width = F)
cluster control act_score pct_admitted ug_enroll pct_online pct_instate avg_tuition retention_rate
Large Publics 1.01 29.30 63.11 28508 4.6 76.2 12815 87.20
Less Selective 1.29 24.04 74.42 5749 8.7 85.7 13937 72.15
Small Privates 1.90 26.34 69.73 3080 5.0 55.4 31807 76.76
Elites 1.98 32.81 31.20 5280 0.4 21.7 50735 92.65

Let’s drive this home with some visuals.

school_data <- school_data %>% 
  select(-cluster_10) %>% 
  select(-cluster_2) %>% 
  rename(cluster= cluster_4) %>% 
  mutate(cluster = ifelse(cluster=='1', 'Large Publics',
                            ifelse(cluster=='2', 'Less Selective',
                                   ifelse(cluster=='3', 'Small Privates', 'Elites')))) %>% 
  mutate(cluster = as.factor(cluster))
library(RColorBrewer) #for pretty colors
#Average selectivity metrics by cluster
school_data %>% 
  select(cluster, act_score, pct_admitted, avg_tuition, ug_enroll) %>% 
  group_by(cluster) %>% 
  summarize_all('mean') %>% 
  pivot_longer(cols = -cluster) %>% 
  ggplot() +
  geom_col(aes(x = cluster, y = value, fill = cluster)) +
  facet_wrap(~name, scales = 'free') + 
  scale_fill_brewer(palette = 'Dark2') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

#Scatterplot - tuition and act scores
school_data %>% 
  rename(undergrad_enrollment = ug_enroll) %>% 
  ggplot() +
  geom_point(aes(x = act_score, y = avg_tuition, color = cluster, size = undergrad_enrollment), alpha = 0.5) +
  scale_color_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(x = 'ACT Score (Composite, 75th Percentile)',
       y = 'Average Tuition Rate',
       title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')

Wrap-Up

So there you have it. Even though individual schools can have similar values for various selectivity measures, k-means helps us tell whether a school really is similar to another, not just based on one or two metrics, but the sum total of all the metrics that we deem important.

While this clustering exercise is inherently subjective and decidedly not a predictive model, it’s still useful. At the end of the day, this analysis helps me understand the data… and isn’t deeper understanding what analytics is all about?

A good extension of this analysis could be to use the results of these groupings in a predictive model, allowing you to build different models for each cluster of schools. This is potentially useful since each resulting model’s coefficients need only to describe a certain subset of the data as opposed to all the data, which could ultimately result in a more accurate model.

And what about those schools I brought up in the beginning, College of Wooster and Xavier? Turns out they do belong to different groups in our analysis, with Xavier in the ‘Small Privates’ group and Wooster in the ‘Elites’ group.

library(ggrepel)
school_data %>% 
  rename(undergrad_enrollment = ug_enroll) %>% 
  ggplot(aes(x = act_score, y = avg_tuition)) +
  geom_point(aes(color = cluster, size = undergrad_enrollment), alpha = 0.5) +
  geom_text_repel(aes(label=ifelse(school == 'The College of Wooster' | school == 'Xavier University', school, '')), nudge_x = 2, nudge_y = -4500, max.overlaps = 2000) +
  scale_color_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(x = 'ACT Score (Composite, 75th Percentile)',
       y = 'Average Tuition Rate',
       title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')