Unsupervised Machine Learning helps us make sense of patterns in data. In particular, clustering analysis helps us find interesting ways to group items based on how similar they are. One useful application of clustering is in higher education. There are so many universities in the U.S., and so many ways to measure how ‘good’ a school is, that it’s easy to get lost in the weeds of the data. For example, consider two schools in my home state of Ohio. Should we think of the College of Wooster as a school comparable to Xavier University? Both are private colleges that enroll fewer than 6,000 students, accept over half of their applicants, and charge relatively high (>$40,000) tuition. When we’re classifying universities, should they be in the same group?
For this analysis, I’ll attempt to measure schools based on their selectivity. ‘Selectivity’ is an admittedly vague target, but a handful of meaningful statistics can stand in for it, reflecting university prestige and academic rigor: tuition rates, admission rates, ACT scores for incoming students, and total enrollment. With clustering there is no single ‘correct’ answer, and because this is an unsupervised Machine Learning technique, we aren’t trying to crown a single group of ‘best’ schools. Rather, this is a useful framework for revealing patterns in school selectivity data, letting us compare schools on the same playing field.
For this analysis, we will make use of the IPEDS (Integrated Postsecondary Education Data System) database, a government source for higher education data. The latest data covers the 2019-2020 school year.
IPEDS publishes a Microsoft Access database every year with hundreds of variables covering thousands of higher education institutions, so we will connect to that to grab the data. The advantage of this setup is that we can quickly access data from every IPEDS table at once instead of grabbing each file individually.
This Access database contains dozens of tables, so a good first step is to explore the data on IPEDS’ site directly, download the accompanying README files, and examine the type of data available in each table. Let’s get to it!
For this analysis, we need the tidyverse as well as the RODBC package, which allows connections to Access databases.
#Load packages: tidyverse for wrangling, RODBC for the Access connection
library(tidyverse)
library(RODBC)
# Connect to Access db
channel19 <- odbcConnectAccess2007("C:\\Users\\jackh\\OneDrive\\Documents\\R Analytics\\Enrollment Data\\IPEDS201920.accdb")
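With the connection open, a quick optional check is to enumerate the tables available in the database before deciding which ones to pull (sqlTables() ships with RODBC):
#Optional: list the tables available in the Access database
sqlTables(channel19, tableType = "TABLE")$TABLE_NAME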
We need data from several categories. IPEDS stores each variable under a terse mnemonic, so I have translated each mnemonic into a pithier but more informative variable name. For instance, in the first table I renamed the IPEDS ‘ACTCM75’ variable to ‘act_score.’ The following code condenses all the data pulls into a series of tables for clarity.
#Institutional characteristics
institution <- sqlFetch(channel19, "HD2019")
institution <- institution %>%
  select(UNITID, INSTNM, ICLEVEL, CONTROL, LATITUDE, LONGITUD, ADDR, CITY, STABBR, ZIP, CBSA) %>%
  rename(school = INSTNM)
#Admissions
adm <- sqlFetch(channel19, "ADM2019")
adm <- adm %>%
  select(UNITID, ACTCM75) %>%
  rename(act_score = ACTCM75)
#Derived admissions table
adm_yield <- sqlFetch(channel19, "DRVADM2019")
adm_yield <- adm_yield %>%
  select(UNITID, DVADM01) %>%
  rename(pct_admitted = DVADM01)
#Enrollment characteristics
enrollment <- sqlFetch(channel19, "DRVEF2019")
enrollment <- enrollment %>%
  select(UNITID, EFUG, EFUGFT, EFUGPT, PCUDEEXC, RMINSTTP, RMOUSTTP, RMFRGNCP) %>%
  rename(ug_enroll = EFUG,
         ft_enroll = EFUGFT,
         pt_enroll = EFUGPT,
         pct_online = PCUDEEXC,
         pct_instate = RMINSTTP,
         pct_outstate = RMOUSTTP,
         pct_foreign = RMFRGNCP)
enrollment_d <- sqlFetch(channel19, "EF2019D")
enrollment_d <- enrollment_d %>%
  select(UNITID, RET_PCF) %>%
  rename(retention_rate = RET_PCF)
enrollment <- left_join(enrollment, enrollment_d, by = c("UNITID" = "UNITID"))
#Tuition
tuition <- sqlFetch(channel19, "IC2019_AY")
tuition <- tuition %>%
  select(UNITID, TUITION2, TUITION3) %>%
  rename(in_state_tuition = TUITION2,
         out_state_tuition = TUITION3)
Now we can join all this data together into a master data frame and filter the result to include only 4-year, non-profit schools (i.e., exclude 2-year/technical colleges and for-profit institutions, which are outside the scope of this analysis).
###Join it all together###
school_data <- institution %>%
  left_join(y = adm, by = c("UNITID" = "UNITID")) %>%
  left_join(y = adm_yield, by = c("UNITID" = "UNITID")) %>%
  left_join(y = enrollment, by = c("UNITID" = "UNITID")) %>%
  left_join(y = tuition, by = c("UNITID" = "UNITID"))
#And limit the analysis to 4 year private & public non-profit institutions
school_data <- school_data %>%
  filter(ICLEVEL == 1,
         CONTROL != 3) %>%
  select(-ICLEVEL)
#Let's create a key linking each school ID to a school, for later
school_id_key <- school_data %>%
  select(UNITID, school)
We now have a primary data frame. Let’s take a look at it.
school_data <- school_data %>%
  mutate(UNITID = as.character(UNITID))
glimpse(school_data)
## Rows: 2,540
## Columns: 22
## $ UNITID <chr> "100654", "100663", "100690", "100706", "100724", "1~
## $ school <chr> "Alabama A & M University", "University of Alabama a~
## $ CONTROL <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1~
## $ LATITUDE <dbl> 34.78337, 33.50570, 32.36261, 34.72456, 32.36432, 33~
## $ LONGITUD <dbl> -86.56850, -86.79935, -86.17401, -86.64045, -86.2956~
## $ ADDR <chr> "4900 Meridian Street", "Administration Bldg Suite 1~
## $ CITY <chr> "Normal", "Birmingham", "Montgomery", "Huntsville", ~
## $ STABBR <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"~
## $ ZIP <chr> "35762", "35294-0110", "36117-3553", "35899", "36104~
## $ CBSA <int> 26620, 13820, 33860, 26620, 33860, 46220, 46220, 266~
## $ act_score <int> 19, 29, NA, 31, 19, NA, 31, NA, 23, 31, 29, 23, NA, ~
## $ pct_admitted <int> 92, 74, NA, 83, 97, NA, 83, NA, 90, 81, 54, 78, 78, ~
## $ ug_enroll <int> 5273, 13836, 365, 7989, 3750, NA, 32795, 2774, 4523,~
## $ ft_enroll <int> 4975, 10315, 192, 6749, 3446, NA, 29135, 1194, 3505,~
## $ pt_enroll <int> 298, 3521, 173, 1240, 304, NA, 3660, 1580, 1018, 206~
## $ pct_online <int> 2, 17, 100, 3, 1, NA, 9, 66, 8, 2, 0, 30, 37, 0, 54,~
## $ pct_instate <int> NA, 83, NA, 63, 57, NA, 37, NA, 90, 54, 61, 75, NA, ~
## $ pct_outstate <int> NA, 15, NA, 35, 41, NA, 61, NA, 6, 45, 37, 22, NA, 3~
## $ pct_foreign <int> NA, 1, NA, 2, 2, NA, 1, NA, 3, 0, 1, 2, NA, 0, NA, 0~
## $ retention_rate <int> 57, 83, NA, 83, 61, NA, 87, NA, 66, 91, 80, 61, 25, ~
## $ in_state_tuition <int> 8610, 8568, 9000, 9730, 8328, NA, 10780, 6180, 7752,~
## $ out_state_tuition <int> 17220, 20400, 9000, 22126, 16656, NA, 30250, 12360, ~
summary(school_data)
## UNITID school CONTROL LATITUDE
## Length:2540 Length:2540 Min. :1.000 Min. :-14.32
## Class :character Class :character 1st Qu.:1.000 1st Qu.: 34.71
## Mode :character Mode :character Median :2.000 Median : 39.73
## Mean :1.665 Mean : 38.04
## 3rd Qu.:2.000 3rd Qu.: 41.79
## Max. :2.000 Max. : 71.32
##
## LONGITUD ADDR CITY STABBR
## Min. :-170.74 Length:2540 Length:2540 Length:2540
## 1st Qu.: -95.98 Class :character Class :character Class :character
## Median : -84.64 Mode :character Mode :character Mode :character
## Mean : -88.77
## 3rd Qu.: -76.65
## Max. : 171.38
##
## ZIP CBSA act_score pct_admitted
## Length:2540 Min. : -2 Min. :16.00 Min. : 0.00
## Class :character 1st Qu.:19060 1st Qu.:23.00 1st Qu.: 57.00
## Mode :character Median :31180 Median :25.00 Median : 71.00
## Mean :29206 Mean :25.97 Mean : 67.91
## 3rd Qu.:39030 3rd Qu.:28.00 3rd Qu.: 83.00
## Max. :49780 Max. :36.00 Max. :100.00
## NA's :1303 NA's :850
## ug_enroll ft_enroll pt_enroll pct_online
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 305 1st Qu.: 196.8 1st Qu.: 16.0 1st Qu.: 0.0
## Median : 1524 Median : 1195.0 Median : 146.0 Median : 3.0
## Mean : 4307 Mean : 3244.3 Mean : 1063.0 Mean : 10.6
## 3rd Qu.: 4410 3rd Qu.: 3194.5 3rd Qu.: 785.2 3rd Qu.: 12.0
## Max. :98630 Max. :98630.0 Max. :65530.0 Max. :100.0
## NA's :96 NA's :96 NA's :96 NA's :351
## pct_instate pct_outstate pct_foreign retention_rate
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 56.00 1st Qu.: 6.0 1st Qu.: 0.000 1st Qu.: 67.00
## Median : 78.00 Median : 19.0 Median : 1.000 Median : 76.00
## Mean : 70.85 Mean : 25.5 Mean : 3.057 Mean : 74.55
## 3rd Qu.: 92.00 3rd Qu.: 39.0 3rd Qu.: 4.000 3rd Qu.: 84.00
## Max. :100.00 Max. :100.0 Max. :45.000 Max. :100.00
## NA's :1341 NA's :1341 NA's :1341 NA's :659
## in_state_tuition out_state_tuition
## Min. : 0 Min. : 0
## 1st Qu.: 7070 1st Qu.:12450
## Median :13500 Median :20747
## Mean :19914 Mean :23465
## 3rd Qu.:31180 3rd Qu.:32380
## Max. :70400 Max. :70400
## NA's :363 NA's :363
Though we have in-state and out-of-state tuition separately, we need to calculate an ‘average’ tuition rate across all students. Note that the percentages of students paying in-state and out-of-state tuition don’t always add to 100 (in the data, schools can report some students’ in-state status as ‘unknown’), so we have to re-scale those percentages before taking the weighted average.
school_data <- school_data %>%
  mutate(in_state_share = pct_instate / (pct_instate + pct_outstate),
         out_state_share = 1 - in_state_share,
         avg_tuition = (in_state_share * in_state_tuition) + (out_state_share * out_state_tuition)) %>%
  select(-in_state_share, -out_state_share)
summary(school_data$avg_tuition)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 7318 14937 20497 32770 58130 1347
One thing I noticed in the first exploration is that the ACT score and percent admitted fields have the most NAs, and I wonder whether these NAs are correlated. I’ll use the plot_histogram() function from the DataExplorer package to get a better sense of the missing data.
library(DataExplorer)
#Could view a table of missing data here...
#school_data %>%
#  filter(is.na(act_score)) %>%
#  summary()
#...or get a visual view of missing data here.
school_data %>%
  filter(is.na(act_score)) %>%
  plot_histogram()
This is very interesting. I notice two things here: 1) all the schools missing ACT data are also missing percent-admitted data, and 2) there doesn’t seem to be a unifying feature among the schools with missing ACT scores except enrollment size - basically every school with missing ACT data enrolls fewer than 2,000 students. Let’s confirm:
school_data %>%
  filter(is.na(act_score)) %>%
  select(ug_enroll) %>%
  summary()
## ug_enroll
## Min. : 0.0
## 1st Qu.: 22.0
## Median : 369.5
## Mean : 2594.3
## 3rd Qu.: 2059.0
## Max. :98630.0
## NA's :95
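We can check the first observation the same way - if every school missing an ACT score is also missing its percent admitted, this count should come back as zero:
#Check observation 1: how many ACT-missing schools DO report percent admitted?
school_data %>%
  filter(is.na(act_score)) %>%
  summarize(reporting_pct_admitted = sum(!is.na(pct_admitted)))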
There are enough missing ACT scores that imputing them all with a single value (such as the mean) would alter the shape of the underlying distribution. Since the schools with missing data are mostly small, they likely won’t have much impact on our analysis, so let’s remove them. Also, k-means can only handle numeric data, so we will remove any categorical features.
school_full <- school_data #keep a copy of the full data before dropping rows
school_data <- school_data %>%
  filter(!is.na(act_score)) %>%
  drop_na()
modeled_data <- school_data %>%
  column_to_rownames(var = 'UNITID') %>%
  select(-ft_enroll, -pt_enroll, -pct_online, -pct_foreign, -retention_rate, -in_state_tuition, -out_state_tuition, -school,
         -CITY, -STABBR, -ZIP, -CBSA, -LATITUDE, -LONGITUD, -ADDR, -CONTROL)
#These variables can help explain the results of the analysis but don't themselves define 'selectivity'
plot_histogram(modeled_data)
glimpse(modeled_data)
## Rows: 806
## Columns: 6
## $ act_score <int> 29, 31, 19, 31, 23, 31, 29, 23, 24, 24, 23, 27, 25, 24, 2~
## $ pct_admitted <int> 74, 83, 97, 83, 90, 81, 54, 78, 59, 55, 65, 79, 50, 88, 7~
## $ ug_enroll <int> 13836, 7989, 3750, 32795, 4523, 24594, 1209, 2192, 1008, ~
## $ pct_instate <int> 83, 63, 57, 37, 90, 54, 61, 75, 70, 77, 18, 78, 45, 73, 8~
## $ pct_outstate <int> 15, 35, 41, 61, 6, 45, 37, 22, 30, 23, 78, 21, 52, 25, 12~
## $ avg_tuition <dbl> 10379.020, 14157.143, 11812.163, 22899.082, 8356.500, 187~
summary(modeled_data)
## act_score pct_admitted ug_enroll pct_instate
## Min. :16.00 Min. : 5.00 Min. : 72 Min. : 1.00
## 1st Qu.:24.00 1st Qu.: 57.00 1st Qu.: 1676 1st Qu.: 55.00
## Median :26.00 Median : 71.00 Median : 3448 Median : 74.00
## Mean :26.29 Mean : 67.16 Mean : 7434 Mean : 68.23
## 3rd Qu.:29.00 3rd Qu.: 82.00 3rd Qu.: 9155 3rd Qu.: 88.75
## Max. :36.00 Max. :100.00 Max. :76099 Max. :100.00
## pct_outstate avg_tuition
## Min. : 0.00 Min. : 0
## 1st Qu.: 9.00 1st Qu.: 9125
## Median :23.00 Median :19081
## Mean :27.87 Mean :23385
## 3rd Qu.:41.00 3rd Qu.:34985
## Max. :96.00 Max. :58130
Now that we’ve cleaned up the data, we can look at the fields that will go into our k-means analysis.
colnames(modeled_data)
## [1] "act_score" "pct_admitted" "ug_enroll" "pct_instate" "pct_outstate"
## [6] "avg_tuition"
We have a nice, tidy dataset. Note that dropping the schools with missing ACT scores (plus the final drop_na()) removed every other NA value, which makes me feel better about excluding them – those schools had missing values across the board, so keeping them would not have added much to the analysis.
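As a quick sanity check (every column should report zero NAs):
#Sanity check: no NAs should remain in the modeling data
colSums(is.na(modeled_data))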
Now we can start analyzing our data! K-means is distance-based, so every field needs to be on the same scale – otherwise fields with large values or wide ranges (like enrollment) would have an outsized influence on the distance calculations. We can standardize everything easily with the scale() function.
modeled_data <- modeled_data %>%
  scale()
summary(modeled_data)
## act_score pct_admitted ug_enroll pct_instate
## Min. :-2.63655 Min. :-3.0233 Min. :-0.7925 Min. :-2.7288
## 1st Qu.:-0.58657 1st Qu.:-0.4941 1st Qu.:-0.6198 1st Qu.:-0.5369
## Median :-0.07408 Median : 0.1868 Median :-0.4291 Median : 0.2344
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.69466 3rd Qu.: 0.7219 3rd Qu.: 0.1852 3rd Qu.: 0.8331
## Max. : 2.48839 Max. : 1.5973 Max. : 7.3912 Max. : 1.2898
## pct_outstate avg_tuition
## Min. :-1.2484 Min. :-1.4978
## 1st Qu.:-0.8452 1st Qu.:-0.9133
## Median :-0.2181 Median :-0.2757
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5881 3rd Qu.: 0.7429
## Max. : 3.0518 Max. : 2.2253
While the ‘optimal’ number of clusters is ultimately a judgment call, there are a few quantitative measures that can guide us: 1) the ‘elbow’ method, which calculates the total within-cluster sum of squares for various values of k – the ‘elbow point’ is the suggested number of clusters, since beyond it increasing k no longer yields a meaningful reduction in the total sum of squares; 2) the silhouette method, which measures how closely each item matches its own cluster versus neighboring clusters, where a high average silhouette width is optimal; and 3) the gap statistic, which compares the within-cluster variation in the observed dataset against what we’d expect from randomly-generated reference datasets. All three methods are ultimately suggestions, and it’s up to the analyst to choose the final k.
library(factoextra)
###Choose the right k
#Elbow method
fviz_nbclust(modeled_data, kmeans, method = "wss") #This suggests that 2 or 4 is the optimal k
#Average silhouette method
fviz_nbclust(modeled_data, kmeans, method = "silhouette") #This suggests optimal k is 2
#Gap statistic method
fviz_nbclust(modeled_data, kmeans, method = "gap_stat") #This suggests 10 is the optimal k
Two of the three methods point to 2 or 4 clusters, while the third (the most complicated of the three, I might add) points to 10. While there is no single ‘right’ answer here, fewer clusters are generally more intuitive. Fortunately, we can easily run all three analyses and compare them!
set.seed(1234)
k_2 <- kmeans(modeled_data, centers = 2, nstart = 50)
k_4 <- kmeans(modeled_data, centers = 4, nstart = 50)
k_10 <- kmeans(modeled_data, centers = 10, nstart = 50)
k_2$size
## [1] 599 207
k_4$size
## [1] 92 370 261 83
k_10$size
## [1] 77 47 56 141 83 72 51 59 133 87
#Assign each school to a cluster
school_data <- school_data %>%
  mutate(cluster_2 = k_2$cluster,
         cluster_4 = k_4$cluster,
         cluster_10 = k_10$cluster)
We can start by simply summarizing each of the resulting groups across our three frameworks.
school_data %>%
  select(cluster_2, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>%
  group_by(cluster_2) %>%
  summarize_all('mean')
## # A tibble: 2 x 8
## cluster_2 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.39 25.1 72.2 8252. 7.21 79.8
## 2 2 1.91 29.7 52.7 5068. 3.25 34.8
## # ... with 1 more variable: avg_tuition <dbl>
school_data %>%
  select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>%
  group_by(cluster_4) %>%
  summarize_all('mean')
## # A tibble: 4 x 8
## cluster_4 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.01 29.3 63.1 28508. 4.59 76.2
## 2 2 1.29 24.0 74.4 5749. 8.72 85.7
## 3 3 1.90 26.3 69.7 3080. 5.03 55.4
## 4 4 1.98 32.8 31.2 5280. 0.410 21.7
## # ... with 1 more variable: avg_tuition <dbl>
school_data %>%
  select(cluster_10, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>%
  group_by(cluster_10) %>%
  summarize_all('mean')
## # A tibble: 10 x 8
## cluster_10 CONTROL act_score pct_admitted ug_enroll pct_online pct_instate
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.90 23.5 55.3 1783. 8.55 56.5
## 2 2 1.02 31.4 51.4 31520. 3.62 70.0
## 3 3 1.82 27.1 76.1 4151. 4.29 28.6
## 4 4 2 26.2 74.2 2411. 4.26 76.3
## 5 5 1 26.6 75.7 21091. 6.28 84.9
## 6 6 1.26 24.1 85.9 4413. 9.25 65.2
## 7 7 2 34.0 18.6 5823. 0.510 17.2
## 8 8 1.97 30.6 58.7 4474. 0.780 44.1
## 9 9 1.14 24.3 82.7 5785. 8.19 90.7
## 10 10 1.25 22.7 55.8 4625. 11.2 88.6
## # ... with 1 more variable: avg_tuition <dbl>
We can also visualize the results. Note that this chart reduces the data to two dimensions via Principal Component Analysis, which makes the axes much less interpretable… but it nonetheless gives a good sense of the rough shapes and overlaps present within our clusters.
fviz_cluster(k_2,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 2 Clusters")
fviz_cluster(k_4,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 4 Clusters")
fviz_cluster(k_10,
             data = modeled_data,
             repel = TRUE,
             geom = "point",
             ggtheme = theme_minimal(),
             title = "Clustering Results - 10 Clusters")
To me, ten groups is too many. While some of the groups stand out – from the table we can see that group 7 is clearly the most academically rigorous cluster, with low average acceptance rates and very high tuition, and group 2 contains the large schools – many of the other groups look fairly similar along several dimensions. Depending on what you want out of the analysis, 10 groups might make sense: if your goal is a taxonomy of all schools, more granularity could be useful. Here, though, I just want to see whether analytics can give us an effective first cut at describing universities. Two groups doesn’t feel descriptive enough, so sticking with 4 groups feels right.
Let’s see the summary table on the four-group split again.
library(kableExtra)
school_data %>%
  select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition) %>%
  group_by(cluster_4) %>%
  summarize_all('mean') %>%
  kable(digits = 2, format = 'markdown', row.names = TRUE) %>%
  kable_styling(full_width = T,
                font_size = 12) %>%
  row_spec(row = 0, color = '#660033')
|   | cluster_4 | CONTROL | act_score | pct_admitted | ug_enroll | pct_online | pct_instate | avg_tuition |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1.01 | 29.30 | 63.11 | 28507.59 | 4.59 | 76.17 | 12814.54 |
| 2 | 2 | 1.29 | 24.04 | 74.42 | 5749.21 | 8.72 | 85.70 | 13937.27 |
| 3 | 3 | 1.90 | 26.34 | 69.73 | 3080.32 | 5.03 | 55.44 | 31807.09 |
| 4 | 4 | 1.98 | 32.81 | 31.20 | 5280.36 | 0.41 | 21.70 | 50734.84 |
It’s interesting that even though control (public vs. private) was not one of the clustering variables, the four groups split relatively cleanly between public and private schools.
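A quick count makes that split concrete (in IPEDS, CONTROL = 1 is public and CONTROL = 2 is private non-profit):
#Count public vs. private schools within each cluster
school_data %>%
  count(cluster_4, CONTROL)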
Now that we have a sense of what makes up our data, we can give our clusters some more informative names.
summary_table <- school_data %>%
  select(cluster_4, CONTROL, act_score, pct_admitted, ug_enroll, pct_online, pct_instate, avg_tuition, retention_rate) %>%
  group_by(cluster_4) %>%
  summarize_all('mean')
summary_table <- summary_table %>%
  mutate(cluster_4 = ifelse(cluster_4 == '1', 'Large Publics',
                     ifelse(cluster_4 == '2', 'Less Selective',
                     ifelse(cluster_4 == '3', 'Small Privates', 'Elites')))) %>%
  rename(cluster = cluster_4,
         control = CONTROL)
kable(summary_table, 'html', digits = c(0,2,2,2,0,1,1,0,2), align = 'lcccccccc') %>%
  kable_styling(full_width = F)
| cluster | control | act_score | pct_admitted | ug_enroll | pct_online | pct_instate | avg_tuition | retention_rate |
|---|---|---|---|---|---|---|---|---|
| Large Publics | 1.01 | 29.30 | 63.11 | 28508 | 4.6 | 76.2 | 12815 | 87.20 |
| Less Selective | 1.29 | 24.04 | 74.42 | 5749 | 8.7 | 85.7 | 13937 | 72.15 |
| Small Privates | 1.90 | 26.34 | 69.73 | 3080 | 5.0 | 55.4 | 31807 | 76.76 |
| Elites | 1.98 | 32.81 | 31.20 | 5280 | 0.4 | 21.7 | 50735 | 92.65 |
Let’s drive this home with some visuals.
school_data <- school_data %>%
  select(-cluster_2, -cluster_10) %>%
  rename(cluster = cluster_4) %>%
  mutate(cluster = ifelse(cluster == '1', 'Large Publics',
                   ifelse(cluster == '2', 'Less Selective',
                   ifelse(cluster == '3', 'Small Privates', 'Elites')))) %>%
  mutate(cluster = as.factor(cluster))
library(RColorBrewer) #for pretty colors
#Average selectivity metrics by cluster
school_data %>%
  select(cluster, act_score, pct_admitted, avg_tuition, ug_enroll) %>%
  group_by(cluster) %>%
  summarize_all('mean') %>%
  pivot_longer(cols = -cluster) %>%
  ggplot() +
  geom_col(aes(x = cluster, y = value, fill = cluster)) +
  facet_wrap(~name, scales = 'free') +
  scale_fill_brewer(palette = 'Dark2') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
#Scatterplot - tuition and act scores
school_data %>%
  rename(undergrad_enrollment = ug_enroll) %>%
  ggplot() +
  geom_point(aes(x = act_score, y = avg_tuition, color = cluster, size = undergrad_enrollment), alpha = 0.5) +
  scale_color_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(x = 'ACT Score (Composite, 75th Percentile)',
       y = 'Average Tuition Rate',
       title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')
So there you have it. Even though individual schools can post similar values on any one selectivity measure, k-means helps us tell whether a school really is similar to another - not just on one or two metrics, but on the sum total of all the metrics we deem important.
While this clustering exercise is inherently subjective and decidedly not a predictive model, it’s still useful. At the end of the day, this analysis helps me understand the data… and isn’t deeper understanding what analytics is all about?
A good extension of this analysis would be to feed these groupings into a predictive model, building a separate model for each cluster of schools. This is potentially useful because each resulting model’s coefficients only need to describe a subset of the data rather than all of it, which could ultimately yield more accurate models.
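As a minimal sketch of that idea - using retention_rate as a stand-in outcome and a deliberately simple, hypothetical specification - we could fit one linear model per cluster and compare coefficients:
#Sketch: one linear model per cluster (outcome and predictors are illustrative)
cluster_models <- school_data %>%
  split(.$cluster) %>%
  map(~ lm(retention_rate ~ act_score + avg_tuition, data = .x))
map(cluster_models, coef)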
And what about those schools I brought up in the beginning, College of Wooster and Xavier? Turns out they do belong to different groups in our analysis, with Xavier in the ‘Small Privates’ group and Wooster in the ‘Elites’ group.
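We can confirm their assignments directly (the school names below match how IPEDS records them):
#Look up the cluster assignments for the two schools
school_data %>%
  filter(school %in% c('The College of Wooster', 'Xavier University')) %>%
  select(school, cluster, act_score, pct_admitted, avg_tuition)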
library(ggrepel)
school_data %>%
  rename(undergrad_enrollment = ug_enroll) %>%
  ggplot(aes(x = act_score, y = avg_tuition)) +
  geom_point(aes(color = cluster, size = undergrad_enrollment), alpha = 0.5) +
  geom_text_repel(aes(label = ifelse(school == 'The College of Wooster' | school == 'Xavier University', school, '')),
                  nudge_x = 2, nudge_y = -4500, max.overlaps = 2000) +
  scale_color_brewer(palette = 'Dark2') +
  theme_minimal() +
  labs(x = 'ACT Score (Composite, 75th Percentile)',
       y = 'Average Tuition Rate',
       title = 'University Selectivity - ACT Scores and Avg. Tuition by Cluster')