Let’s work on imputing missing values in the College Scorecard data so we can do clustering. First off, let’s try some very simple methods.
Let’s open up the filtered, processed version of the College Scorecard data we have worked on together and settled on using at this point.
library(readr)
colleges <- read_csv("data/scorecard_reduced_features.csv", na = c("NULL", "PrivacySuppressed"))
dim(colleges)## [1] 2473 57
How many NA values do we have in each column?
sapply(colleges, function(x) sum(is.na(x)))## UNITID INSTNM ZIP
## 0 0 0
## HCM2 CONTROL LOCALE
## 0 0 115
## CCBASIC HBCU PBI
## 200 115 115
## ANNHI TRIBAL AANAPII
## 115 115 115
## HSI NANTI MENONLY
## 115 115 115
## WOMENONLY ADM_RATE_ALL SATVR25
## 115 519 1036
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 1036 831 115
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 338 338 338
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 338 115 130
## PFTFAC C150_4 C150_4_BLACK
## 142 519 679
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 664 767 904
## C150_4_NRA PCTFLOAN UG25abv
## 713 349 335
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 545 516 1457
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 585 281 305
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 368 450 450
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 450 351 390
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 398 492 388
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 426 306 306
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 306 458 306
sapply(colleges, function(x) mean(is.na(x)))## UNITID INSTNM ZIP
## 0.00000000 0.00000000 0.00000000
## HCM2 CONTROL LOCALE
## 0.00000000 0.00000000 0.04650222
## CCBASIC HBCU PBI
## 0.08087343 0.04650222 0.04650222
## ANNHI TRIBAL AANAPII
## 0.04650222 0.04650222 0.04650222
## HSI NANTI MENONLY
## 0.04650222 0.04650222 0.04650222
## WOMENONLY ADM_RATE_ALL SATVR25
## 0.04650222 0.20986656 0.41892438
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 0.41892438 0.33602911 0.04650222
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 0.13667610 0.13667610 0.13667610
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 0.13667610 0.04650222 0.05256773
## PFTFAC C150_4 C150_4_BLACK
## 0.05742014 0.20986656 0.27456531
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 0.26849980 0.31014962 0.36554792
## C150_4_NRA PCTFLOAN UG25abv
## 0.28831379 0.14112414 0.13546300
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 0.22038011 0.20865346 0.58916296
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 0.23655479 0.11362717 0.12333199
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 0.14880712 0.18196522 0.18196522
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 0.18196522 0.14193288 0.15770319
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 0.16093813 0.19894865 0.15689446
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 0.17226041 0.12373635 0.12373635
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 0.12373635 0.18520016 0.12373635
Most of those are not so bad, many around 15% or less. We have made so much improvement!
NAFor a first try at imputing missing values, let’s just try using the median for each column. What are the medians?
library(plyr)
library(dplyr)
median_values <- colleges[,4:57] %>%
summarise_each(funs(median(., na.rm = TRUE)))
median_values## # A tibble: 1 × 54
## HCM2 CONTROL LOCALE CCBASIC HBCU PBI ANNHI TRIBAL AANAPII HSI NANTI MENONLY WOMENONLY
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2 13 21 0 0 0 0 0 0 0 0 0
## ADM_RATE_ALL SATVR25 SATVR75 SAT_AVG_ALL DISTANCEONLY
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.66635 460 570 1049 0
## # ... with 36 more variables: UGDS_WHITE <dbl>, UGDS_BLACK <dbl>, UGDS_HISP <dbl>,
## # UGDS_ASIAN <dbl>, INEXPFTE <dbl>, AVGFACSAL <dbl>, PFTFAC <dbl>, C150_4 <dbl>,
## # C150_4_BLACK <dbl>, C150_4_HISP <dbl>, C150_4_ASIAN <dbl>, C150_4_AIAN <dbl>, C150_4_NRA <dbl>,
## # PCTFLOAN <dbl>, UG25abv <dbl>, LO_INC_COMP_ORIG_YR6_RT <dbl>, MD_INC_COMP_ORIG_YR6_RT <dbl>,
## # HI_INC_COMP_ORIG_YR6_RT <dbl>, FIRSTGEN_COMP_ORIG_YR6_RT <dbl>, PAR_ED_PCT_1STGEN <dbl>,
## # DEBT_MDN <dbl>, mn_earn_wne_p8 <dbl>, pct25_earn_wne_p8 <dbl>, pct75_earn_wne_p8 <dbl>,
## # pct90_earn_wne_p8 <dbl>, RPY_3YR_RT_SUPP <dbl>, LO_INC_RPY_3YR_RT_SUPP <dbl>,
## # MD_INC_RPY_3YR_RT_SUPP <dbl>, HI_INC_RPY_3YR_RT_SUPP <dbl>, NONCOM_RPY_3YR_RT_SUPP <dbl>,
## # FIRSTGEN_RPY_3YR_RT_SUPP <dbl>, PCT_LIBERAL_ARTS <dbl>, PCT_VOCATIONAL <dbl>,
## # PCT_RELIGIOUS <dbl>, COST <dbl>, PCT_PROFESSIONAL <dbl>
Now let’s use these values in the data frame.
colleges[4:57] <- colleges[4:57] %>%
mutate_each(funs(if_else(is.na(.), median(., na.rm = TRUE), .)))Did this work?
sapply(colleges, function(x) mean(is.na(x)))## UNITID INSTNM ZIP
## 0 0 0
## HCM2 CONTROL LOCALE
## 0 0 0
## CCBASIC HBCU PBI
## 0 0 0
## ANNHI TRIBAL AANAPII
## 0 0 0
## HSI NANTI MENONLY
## 0 0 0
## WOMENONLY ADM_RATE_ALL SATVR25
## 0 0 0
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 0 0 0
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 0 0 0
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 0 0 0
## PFTFAC C150_4 C150_4_BLACK
## 0 0 0
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 0 0 0
## C150_4_NRA PCTFLOAN UG25abv
## 0 0 0
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 0 0 0
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 0 0 0
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 0 0 0
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 0 0 0
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 0 0 0
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 0 0 0
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 0 0 0
Let’s do k-means clustering now and see what kind of results we get. Let’s use the broom package to explore how many clusters may be appropriate for this data set. For starters, let’s try out 4 to 20 clusters.
library(broom)
library(tidyr)
set.seed(1234)
kclusts <- data.frame(k=4:20) %>%
group_by(k) %>%
do(kclust = kmeans(colleges[4:57], .$k))Now let’s use tidy, augment, and glance to organize these various sets of k-means clusterings.
clusters <- kclusts %>% group_by(k) %>% do(tidy(.$kclust[[1]]))
assignments <- kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], colleges[4:57]))
clusterings <- kclusts %>% group_by(k) %>% do(glance(.$kclust[[1]]))Specifically, let’s look at the total within sum of squares, saved in the tot.withinss column in clusterings. This shows us the variance within the clusters.
library(ggfortify)
ggplot(clusterings, aes(k, tot.withinss)) +
geom_line(color = "blue", alpha = 0.5, size = 2) +
geom_point(size = 0.8)Notice that the variance decreases with more clusters, but there is a “bend” or “elbow” around 12 clusters. This is around the same place we saw an elbow in the early subset of colleges (!), and tells us we don’t gain much by going beyond 12 or so clusters.
Let’s look at how the colleges are distributed in the space of principal component 1 and 2, clustered with \(k = 12\).
set.seed(1234)
autoplot(kmeans(colleges[4:57], 12), data = colleges[4:57], size = 3, alpha = 0.5) +
ggtitle("K-Means Clustering of College Scorecard Data") +
theme(legend.position="none")Let’s label a few of thes.
library(ggrepel)
set.seed(1234)
my_kmeans <- kmeans(colleges[4:57], 12)
autoplot(my_kmeans, data = colleges[4:57], size = 3, alpha = 0.5) +
ggtitle("K-Means Clustering of College Scorecard Data") +
theme(legend.position="none") +
geom_text_repel(aes(PC1, PC2,
color = factor(my_kmeans$cluster),
label = ifelse(SAT_AVG_ALL > 1475,colleges$INSTNM,'')),
segment.color = NA)Some of these clusters are very close and/or overlapping as viewed in this space (PC1 and PC2) but they may be more separated in other components/dimensions.
I am really happy with these results; I am even super happy that, for example, University of Chicago and Harvey Mudd are in a separate cluster than MIT and Harvard. The clustering algorithm is really picking up on subtle differences in colleges here. Let’s look, just for kicks, a that small and spread out cluster that CalTech is part of.
colleges <- colleges %>%
mutate(cluster = my_kmeans$cluster)
colleges %>%
filter(cluster == 1) %>%
select(INSTNM)## # A tibble: 16 × 1
## INSTNM
## <chr>
## 1 California Institute of Technology
## 2 Morehouse School of Medicine
## 3 Des Moines University-Osteopathic Medical Center
## 4 University of Massachusetts Medical School Worcester
## 5 Washington University in St Louis
## 6 Albany Medical College
## 7 Weill Cornell Medical College
## 8 Icahn School of Medicine at Mount Sinai
## 9 SUNY College of Optometry
## 10 Oklahoma State University Center for Health Sciences
## 11 University of Texas Southwestern Medical Center
## 12 The University of Texas Medical Branch
## 13 SIT Graduate Institute
## 14 Eastern Virginia Medical School
## 15 West Virginia School of Osteopathic Medicine
## 16 Louisiana State University Health Sciences Center-Shreveport
Huh, those are almost all medical schools, or schools where the medical school is very important.