Let’s work on imputing missing values in the College Scorecard data so we can do clustering. This time, let’s try using random forest.
Let’s open up the filtered, processed version of the College Scorecard data that we have worked with and settled on at this point.
library(readr)
colleges <- read_csv("data/scorecard_reduced_features.csv", na = c("NULL", "PrivacySuppressed"))
dim(colleges)
## [1] 2473 57
How many NA values do we have in each column?
sapply(colleges, function(x) sum(is.na(x)))
## UNITID INSTNM ZIP
## 0 0 0
## HCM2 CONTROL LOCALE
## 0 0 115
## CCBASIC HBCU PBI
## 200 115 115
## ANNHI TRIBAL AANAPII
## 115 115 115
## HSI NANTI MENONLY
## 115 115 115
## WOMENONLY ADM_RATE_ALL SATVR25
## 115 519 1036
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 1036 831 115
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 338 338 338
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 338 115 130
## PFTFAC C150_4 C150_4_BLACK
## 142 519 679
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 664 767 904
## C150_4_NRA PCTFLOAN UG25abv
## 713 349 335
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 545 516 1457
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 585 281 305
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 368 450 450
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 450 351 390
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 398 492 388
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 426 306 306
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 306 458 306
sapply(colleges, function(x) mean(is.na(x)))
## UNITID INSTNM ZIP
## 0.00000000 0.00000000 0.00000000
## HCM2 CONTROL LOCALE
## 0.00000000 0.00000000 0.04650222
## CCBASIC HBCU PBI
## 0.08087343 0.04650222 0.04650222
## ANNHI TRIBAL AANAPII
## 0.04650222 0.04650222 0.04650222
## HSI NANTI MENONLY
## 0.04650222 0.04650222 0.04650222
## WOMENONLY ADM_RATE_ALL SATVR25
## 0.04650222 0.20986656 0.41892438
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 0.41892438 0.33602911 0.04650222
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 0.13667610 0.13667610 0.13667610
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 0.13667610 0.04650222 0.05256773
## PFTFAC C150_4 C150_4_BLACK
## 0.05742014 0.20986656 0.27456531
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 0.26849980 0.31014962 0.36554792
## C150_4_NRA PCTFLOAN UG25abv
## 0.28831379 0.14112414 0.13546300
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 0.22038011 0.20865346 0.58916296
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 0.23655479 0.11362717 0.12333199
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 0.14880712 0.18196522 0.18196522
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 0.18196522 0.14193288 0.15770319
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 0.16093813 0.19894865 0.15689446
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 0.17226041 0.12373635 0.12373635
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 0.12373635 0.18520016 0.12373635
Most of those are not so bad; many columns are missing around 15% of their values or less.
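If we want to see the worst offenders at a glance, we can sort these proportions; this is just a convenience view of the same numbers:
# sort columns by proportion of missing values, highest first
head(sort(sapply(colleges, function(x) mean(is.na(x))), decreasing = TRUE))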
Let’s try imputing the missing values using the missForest package, an implementation of random forest for missing value imputation.
library(missForest)
# treat the flag/categorical columns (school type, HBCU status, women-only,
# distance-only, etc.) as factors so missForest classifies them instead of
# regressing them as if they were continuous
colleges[,c(4:16,21)] <- lapply(colleges[,c(4:16,21)], as.factor)
# drop the identifier columns (UNITID, INSTNM, ZIP) before imputing
colleges_subset <- colleges[4:57]
impute_results <- missForest(as.data.frame(colleges_subset), verbose = TRUE)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1287673 0.08863312
## difference(s): 0.00944485 0.006296574
## time: 110.618 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1238418 0.08551048
## difference(s): 0.001349096 0.003725955
## time: 110.275 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.124408 0.08587143
## difference(s): 0.0002890115 0.003350471
## time: 110.33 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.1237487 0.08709473
## difference(s): 0.0001800123 0.00248397
## time: 109.943 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.1220141 0.08599062
## difference(s): 0.0001713945 0.0008376177
## time: 109.909 seconds
##
## missForest iteration 6 in progress...done!
## estimated error(s): 0.1214519 0.08624852
## difference(s): 0.0001789926 0.001010918
## time: 111.994 seconds
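Each of those iterations takes nearly two minutes. For bigger data sets, missForest can grow its forests in parallel instead; a minimal sketch of that option, assuming the doParallel package is available as the backend (the results above were computed serially):
library(doParallel)
# register a parallel backend; adjust the number of cores for your machine
registerDoParallel(cores = 4)
impute_results <- missForest(as.data.frame(colleges_subset),
                             parallelize = "forests", verbose = TRUE)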
Did this work?
sapply(as.data.frame(impute_results$ximp), function(x) mean(is.na(x)))
## HCM2 CONTROL LOCALE
## 0 0 0
## CCBASIC HBCU PBI
## 0 0 0
## ANNHI TRIBAL AANAPII
## 0 0 0
## HSI NANTI MENONLY
## 0 0 0
## WOMENONLY ADM_RATE_ALL SATVR25
## 0 0 0
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 0 0 0
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 0 0 0
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 0 0 0
## PFTFAC C150_4 C150_4_BLACK
## 0 0 0
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 0 0 0
## C150_4_NRA PCTFLOAN UG25abv
## 0 0 0
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 0 0 0
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 0 0 0
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 0 0 0
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 0 0 0
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 0 0 0
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 0 0 0
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 0 0 0
What is the estimate of the OOB (out-of-bag) imputation error? The NRMSE (normalized root mean squared error) covers the continuous variables, and the PFC (proportion of falsely classified) covers the categorical ones.
impute_results$OOBerror
## NRMSE PFC
## 0.12201409 0.08599062
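Those two numbers are averaged over all the continuous and categorical variables, respectively. If we wanted a per-variable error estimate instead, missForest can return one; a sketch, assuming we are willing to rerun the imputation:
# variablewise = TRUE returns one OOB error estimate per column
# (MSE for continuous variables, PFC for factors)
impute_detail <- missForest(as.data.frame(colleges_subset), variablewise = TRUE)
impute_detail$OOBerror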
Let’s do k-means clustering now and see what kind of results we get. Let’s use the broom package to explore how many clusters may be appropriate for this data set. For starters, let’s try out 4 to 20 clusters.
library(dplyr)
library(broom)
library(tidyr)
colleges_imputed <- as.data.frame(impute_results$ximp)
set.seed(1234)
kclusts <- data.frame(k=4:20) %>%
    group_by(k) %>%
    do(kclust = kmeans(colleges_imputed, .$k))
Now let’s use tidy, augment, and glance to organize these various sets of k-means clusterings.
clusters <- kclusts %>% group_by(k) %>% do(tidy(.$kclust[[1]]))
assignments <- kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], colleges_imputed))
clusterings <- kclusts %>% group_by(k) %>% do(glance(.$kclust[[1]]))
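As an aside, the same bookkeeping can be done with purrr’s list-columns instead of do(); a sketch of an equivalent (not run here), with kclusts_alt as a hypothetical name:
library(purrr)
kclusts_alt <- tibble(k = 4:20) %>%
    mutate(kclust = map(k, ~ kmeans(colleges_imputed, .x)),
           glanced = map(kclust, glance)) %>%
    unnest(glanced)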
Specifically, let’s look at the total within sum of squares, saved in the tot.withinss column in clusterings. This shows us the variance within the clusters.
library(ggfortify)
ggplot(clusterings, aes(k, tot.withinss)) +
    geom_line(color = "blue", alpha = 0.5, size = 2) +
    geom_point(size = 0.8)
Notice that the variance decreases as we add more clusters, but there is a “bend” or “elbow” around 12 clusters. This is around the same place we saw an elbow in the earlier subset of colleges, and it tells us we don’t gain much by going beyond 12 or so clusters.
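One way to squint less at that elbow is to compute how much the total within-cluster sum of squares drops with each additional cluster; a quick sketch, with pct_drop as a hypothetical column name:
clusterings %>%
    ungroup() %>%
    arrange(k) %>%
    mutate(pct_drop = 100 * (lag(tot.withinss) - tot.withinss) / lag(tot.withinss))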
Let’s look at how the colleges are distributed in the space of principal component 1 and 2, clustered with \(k = 12\).
# autoplot projects the data onto principal components, which requires
# all-numeric columns, so convert the imputed factors back to numeric
colleges_imputed[] <- lapply(colleges_imputed[], as.numeric)
set.seed(1234)
autoplot(kmeans(colleges_imputed, 12), data = colleges_imputed, size = 3, alpha = 0.5) +
    ggtitle("K-Means Clustering of College Scorecard Data") +
    theme(legend.position = "none")
That looks REMARKABLY like the results from the clustering imputed with medians. Let’s label a few of these.
library(ggrepel)
set.seed(1234)
my_kmeans <- kmeans(colleges_imputed, 12)
autoplot(my_kmeans, data = colleges_imputed, size = 3, alpha = 0.5) +
    ggtitle("K-Means Clustering of College Scorecard Data") +
    theme(legend.position = "none") +
    geom_text_repel(aes(PC1, PC2,
                        color = factor(my_kmeans$cluster),
                        label = ifelse(SAT_AVG_ALL > 1475, colleges$INSTNM, '')),
                    segment.color = NA)
So it turns out the clustering is a little bit different now than before.
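Some of that difference may just be the randomness inherent in k-means itself. One way to stabilize the clustering is to use multiple random starts and keep the best solution; a sketch, with my_kmeans_stable as a hypothetical name:
# nstart runs kmeans from several random initializations and keeps the
# solution with the lowest total within-cluster sum of squares
my_kmeans_stable <- kmeans(colleges_imputed, centers = 12, nstart = 25)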
It’s important to remember that this fancier imputation method is still not adding any new information to the data set; it’s just a matter of which is a more appropriate way of filling in NA values. Let’s look, just for kicks, at a few of these clusters.
colleges <- colleges %>%
    mutate(cluster = my_kmeans$cluster)
colleges %>%
    filter(cluster == 4) %>%
    select(INSTNM)
## # A tibble: 27 × 1
## INSTNM
## <chr>
## 1 University of California-Hastings College of Law
## 2 University of California-San Francisco
## 3 Western University of Health Sciences
## 4 Georgetown University
## 5 Midwestern University-Downers Grove
## 6 Rosalind Franklin University of Medicine and Science
## 7 Harvard University
## 8 Massachusetts Institute of Technology
## 9 University of Massachusetts Medical School Worcester
## 10 A T Still University of Health Sciences
## # ... with 17 more rows
colleges %>%
    filter(cluster == 7) %>%
    select(INSTNM)
## # A tibble: 6 × 1
## INSTNM
## <chr>
## 1 Yale University
## 2 Morehouse School of Medicine
## 3 Weill Cornell Medical College
## 4 University of Texas Southwestern Medical Center
## 5 The University of Texas Health Science Center at San Antonio
## 6 The University of Texas Health Science Center at Houston
colleges %>%
    filter(cluster == 9) %>%
    select(INSTNM)
## # A tibble: 62 × 1
## INSTNM
## <chr>
## 1 University of Arkansas for Medical Sciences
## 2 California Institute of Technology
## 3 California Maritime Academy
## 4 California Western School of Law
## 5 Claremont McKenna College
## 6 Loma Linda University
## 7 University of the Pacific
## 8 Samuel Merritt University
## 9 Marshall B Ketchum University
## 10 University of Southern California
## # ... with 52 more rows
Let’s look at which variables are most important in the principal component analysis, since PCA is so closely related to k-means.
library(reshape2)
colleges_imputed <- scale(colleges_imputed)
college_pca <- prcomp(colleges_imputed)
melted <- melt(college_pca$rotation[,1:9])
ggplot(data = melted[melted$Var2 == "PC1",]) +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 45, hjust = 1),
          axis.ticks.x = element_blank()) +
    labs(x = "Measurements in College Scorecard data",
         y = "Relative importance in principal component",
         title = "Variables in PC1") +
    geom_bar(aes(x = Var1, y = value, fill = Var1), stat = "identity")
A lot of these are very important: SAT scores, the repayment data, the completion data.
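If we want those top contributors as numbers rather than a plot, we can sort the PC1 loadings directly; a quick sketch:
# the ten variables with the largest absolute loadings on PC1
pc1 <- college_pca$rotation[, "PC1"]
head(pc1[order(abs(pc1), decreasing = TRUE)], 10)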