Let’s work on imputing missing values in the College Scorecard data so we can do clustering. This time, let’s try using random forest.
Let’s open up the filtered, processed version of the College Scorecard data that we have worked with and settled on at this point.
library(readr)
colleges <- read_csv("data/scorecard_reduced_features.csv", na = c("NULL", "PrivacySuppressed"))
dim(colleges)
## [1] 2473 57
How many NA values do we have in each column?
sapply(colleges, function(x) sum(is.na(x)))
## UNITID INSTNM ZIP
## 0 0 0
## HCM2 CONTROL LOCALE
## 0 0 115
## CCBASIC HBCU PBI
## 200 115 115
## ANNHI TRIBAL AANAPII
## 115 115 115
## HSI NANTI MENONLY
## 115 115 115
## WOMENONLY ADM_RATE_ALL SATVR25
## 115 519 1036
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 1036 831 115
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 338 338 338
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 338 115 130
## PFTFAC C150_4 C150_4_BLACK
## 142 519 679
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 664 767 904
## C150_4_NRA PCTFLOAN UG25abv
## 713 349 335
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 545 516 1457
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 585 281 305
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 368 450 450
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 450 351 390
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 398 492 388
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 426 306 306
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 306 458 306
sapply(colleges, function(x) mean(is.na(x)))
## UNITID INSTNM ZIP
## 0.00000000 0.00000000 0.00000000
## HCM2 CONTROL LOCALE
## 0.00000000 0.00000000 0.04650222
## CCBASIC HBCU PBI
## 0.08087343 0.04650222 0.04650222
## ANNHI TRIBAL AANAPII
## 0.04650222 0.04650222 0.04650222
## HSI NANTI MENONLY
## 0.04650222 0.04650222 0.04650222
## WOMENONLY ADM_RATE_ALL SATVR25
## 0.04650222 0.20986656 0.41892438
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 0.41892438 0.33602911 0.04650222
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 0.13667610 0.13667610 0.13667610
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 0.13667610 0.04650222 0.05256773
## PFTFAC C150_4 C150_4_BLACK
## 0.05742014 0.20986656 0.27456531
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 0.26849980 0.31014962 0.36554792
## C150_4_NRA PCTFLOAN UG25abv
## 0.28831379 0.14112414 0.13546300
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 0.22038011 0.20865346 0.58916296
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 0.23655479 0.11362717 0.12333199
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 0.14880712 0.18196522 0.18196522
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 0.18196522 0.14193288 0.15770319
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 0.16093813 0.19894865 0.15689446
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 0.17226041 0.12373635 0.12373635
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 0.12373635 0.18520016 0.12373635
Most of those are not so bad; many columns are missing around 15% of their values or less.
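If we want to see the worst offenders at a glance, we can sort these proportions; this is just a convenience view of the same numbers:
# sort columns by proportion of missing values, highest first
head(sort(sapply(colleges, function(x) mean(is.na(x))), decreasing = TRUE))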
Let’s try imputing the missing values using the missForest package, an implementation of random forest for missing value imputation.
library(missForest)
# treat the flag/categorical columns (school type, HBCU status, women-only,
# distance-only, etc.) as factors so missForest classifies them instead of
# regressing them as if they were continuous
colleges[,c(4:16,21)] <- lapply(colleges[,c(4:16,21)], as.factor)
# drop the identifier columns (UNITID, INSTNM, ZIP) before imputing
colleges_subset <- colleges[4:57]
impute_results <- missForest(as.data.frame(colleges_subset), verbose = TRUE)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1287673 0.08863312
## difference(s): 0.00944485 0.006296574
## time: 110.618 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1238418 0.08551048
## difference(s): 0.001349096 0.003725955
## time: 110.275 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.124408 0.08587143
## difference(s): 0.0002890115 0.003350471
## time: 110.33 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.1237487 0.08709473
## difference(s): 0.0001800123 0.00248397
## time: 109.943 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.1220141 0.08599062
## difference(s): 0.0001713945 0.0008376177
## time: 109.909 seconds
##
## missForest iteration 6 in progress...done!
## estimated error(s): 0.1214519 0.08624852
## difference(s): 0.0001789926 0.001010918
## time: 111.994 seconds
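Each of those iterations takes nearly two minutes. For bigger data sets, missForest can grow its forests in parallel instead; a minimal sketch of that option, assuming the doParallel package is available as the backend (the results above were computed serially):
library(doParallel)
# register a parallel backend; adjust the number of cores for your machine
registerDoParallel(cores = 4)
impute_results <- missForest(as.data.frame(colleges_subset),
                             parallelize = "forests", verbose = TRUE)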
Did this work?
sapply(as.data.frame(impute_results$ximp), function(x) mean(is.na(x)))
## HCM2 CONTROL LOCALE
## 0 0 0
## CCBASIC HBCU PBI
## 0 0 0
## ANNHI TRIBAL AANAPII
## 0 0 0
## HSI NANTI MENONLY
## 0 0 0
## WOMENONLY ADM_RATE_ALL SATVR25
## 0 0 0
## SATVR75 SAT_AVG_ALL DISTANCEONLY
## 0 0 0
## UGDS_WHITE UGDS_BLACK UGDS_HISP
## 0 0 0
## UGDS_ASIAN INEXPFTE AVGFACSAL
## 0 0 0
## PFTFAC C150_4 C150_4_BLACK
## 0 0 0
## C150_4_HISP C150_4_ASIAN C150_4_AIAN
## 0 0 0
## C150_4_NRA PCTFLOAN UG25abv
## 0 0 0
## LO_INC_COMP_ORIG_YR6_RT MD_INC_COMP_ORIG_YR6_RT HI_INC_COMP_ORIG_YR6_RT
## 0 0 0
## FIRSTGEN_COMP_ORIG_YR6_RT PAR_ED_PCT_1STGEN DEBT_MDN
## 0 0 0
## mn_earn_wne_p8 pct25_earn_wne_p8 pct75_earn_wne_p8
## 0 0 0
## pct90_earn_wne_p8 RPY_3YR_RT_SUPP LO_INC_RPY_3YR_RT_SUPP
## 0 0 0
## MD_INC_RPY_3YR_RT_SUPP HI_INC_RPY_3YR_RT_SUPP NONCOM_RPY_3YR_RT_SUPP
## 0 0 0
## FIRSTGEN_RPY_3YR_RT_SUPP PCT_LIBERAL_ARTS PCT_VOCATIONAL
## 0 0 0
## PCT_RELIGIOUS COST PCT_PROFESSIONAL
## 0 0 0
What is the estimate of the OOB (out-of-bag) imputation error? The NRMSE (normalized root mean squared error) covers the continuous variables, and the PFC (proportion of falsely classified) covers the categorical ones.
impute_results$OOBerror
## NRMSE PFC
## 0.12201409 0.08599062
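Those two numbers are averaged over all the continuous and categorical variables, respectively. If we wanted a per-variable error estimate instead, missForest can return one; a sketch, assuming we are willing to rerun the imputation:
# variablewise = TRUE returns one OOB error estimate per column
# (MSE for continuous variables, PFC for factors)
impute_detail <- missForest(as.data.frame(colleges_subset), variablewise = TRUE)
impute_detail$OOBerror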
Let’s do k-means clustering now and see what kind of results we get. Let’s use the broom package to explore how many clusters may be appropriate for this data set. For starters, let’s try out 4 to 20 clusters.
library(dplyr)
library(broom)
library(tidyr)
colleges_imputed <- as.data.frame(impute_results$ximp)
set.seed(1234)
kclusts <- data.frame(k=4:20) %>%
    group_by(k) %>%
    do(kclust = kmeans(colleges_imputed, .$k))
Now let’s use tidy, augment, and glance to organize these various sets of k-means clusterings.
clusters <- kclusts %>% group_by(k) %>% do(tidy(.$kclust[[1]]))
assignments <- kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], colleges_imputed))
clusterings <- kclusts %>% group_by(k) %>% do(glance(.$kclust[[1]]))
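As an aside, the same bookkeeping can be done with purrr’s list-columns instead of do(); a sketch of an equivalent (not run here), with kclusts_alt as a hypothetical name:
library(purrr)
kclusts_alt <- tibble(k = 4:20) %>%
    mutate(kclust = map(k, ~ kmeans(colleges_imputed, .x)),
           glanced = map(kclust, glance)) %>%
    unnest(glanced)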
Specifically, let’s look at the total within sum of squares, saved in the tot.withinss column in clusterings. This shows us the variance within the clusters.
library(ggfortify)
ggplot(clusterings, aes(k, tot.withinss)) +
    geom_line(color = "blue", alpha = 0.5, size = 2) +
    geom_point(size = 0.8)
Notice that the variance decreases as we add more clusters, but there is a “bend” or “elbow” around 12 clusters. This is around the same place we saw an elbow in the earlier subset of colleges, and it tells us we don’t gain much by going beyond 12 or so clusters.
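One way to squint less at that elbow is to compute how much the total within-cluster sum of squares drops with each additional cluster; a quick sketch, with pct_drop as a hypothetical column name:
clusterings %>%
    ungroup() %>%
    arrange(k) %>%
    mutate(pct_drop = 100 * (lag(tot.withinss) - tot.withinss) / lag(tot.withinss))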
Let’s look at how the colleges are distributed in the space of principal component 1 and 2, clustered with \(k = 12\).
# autoplot projects the data onto principal components, which requires
# all-numeric columns, so convert the imputed factors back to numeric
colleges_imputed[] <- lapply(colleges_imputed[], as.numeric)
set.seed(1234)
autoplot(kmeans(colleges_imputed, 12), data = colleges_imputed, size = 3, alpha = 0.5) +
    ggtitle("K-Means Clustering of College Scorecard Data") +
    theme(legend.position = "none")
That looks REMARKABLY like the results from the clustering imputed with medians. Let’s label a few of these.
library(ggrepel)
set.seed(1234)
my_kmeans <- kmeans(colleges_imputed, 12)
autoplot(my_kmeans, data = colleges_imputed, size = 3, alpha = 0.5) +
    ggtitle("K-Means Clustering of College Scorecard Data") +
    theme(legend.position = "none") +
    geom_text_repel(aes(PC1, PC2,
                        color = factor(my_kmeans$cluster),
                        label = ifelse(SAT_AVG_ALL > 1475, colleges$INSTNM, '')),
                    segment.color = NA)
So it turns out the clustering is a little bit different now than before.
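Some of that difference may just be the randomness inherent in k-means itself. One way to stabilize the clustering is to use multiple random starts and keep the best solution; a sketch, with my_kmeans_stable as a hypothetical name:
# nstart runs kmeans from several random initializations and keeps the
# solution with the lowest total within-cluster sum of squares
my_kmeans_stable <- kmeans(colleges_imputed, centers = 12, nstart = 25)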
It’s important to remember that this fancier imputation method is still not adding any new information to the data set; it’s just a matter of which is a more appropriate way of filling in NA values. Let’s look, just for kicks, at a few of these clusters.
colleges <- colleges %>%
    mutate(cluster = my_kmeans$cluster)
colleges %>%
    filter(cluster == 4) %>%
    select(INSTNM)
## # A tibble: 27 × 1
## INSTNM
## <chr>
## 1 University of California-Hastings College of Law
## 2 University of California-San Francisco
## 3 Western University of Health Sciences
## 4 Georgetown University
## 5 Midwestern University-Downers Grove
## 6 Rosalind Franklin University of Medicine and Science
## 7 Harvard University
## 8 Massachusetts Institute of Technology
## 9 University of Massachusetts Medical School Worcester
## 10 A T Still University of Health Sciences
## # ... with 17 more rows
colleges %>%
    filter(cluster == 7) %>%
    select(INSTNM)
## # A tibble: 6 × 1
## INSTNM
## <chr>
## 1 Yale University
## 2 Morehouse School of Medicine
## 3 Weill Cornell Medical College
## 4 University of Texas Southwestern Medical Center
## 5 The University of Texas Health Science Center at San Antonio
## 6 The University of Texas Health Science Center at Houston
colleges %>%
    filter(cluster == 9) %>%
    select(INSTNM)
## # A tibble: 62 × 1
## INSTNM
## <chr>
## 1 University of Arkansas for Medical Sciences
## 2 California Institute of Technology
## 3 California Maritime Academy
## 4 California Western School of Law
## 5 Claremont McKenna College
## 6 Loma Linda University
## 7 University of the Pacific
## 8 Samuel Merritt University
## 9 Marshall B Ketchum University
## 10 University of Southern California
## # ... with 52 more rows
Let’s look at which variables are most important in the principal component analysis, since PCA is so closely related to k-means.
library(reshape2)
colleges_imputed <- scale(colleges_imputed)
college_pca <- prcomp(colleges_imputed)
melted <- melt(college_pca$rotation[,1:9])
ggplot(data = melted[melted$Var2 == "PC1",]) +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 45, hjust = 1),
          axis.ticks.x = element_blank()) +
    labs(x = "Measurements in College Scorecard data",
         y = "Relative importance in principal component",
         title = "Variables in PC1") +
    geom_bar(aes(x = Var1, y = value, fill = Var1), stat = "identity")
A lot of these are very important: SAT scores, the repayment data, the completion data.
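If we want those top contributors as numbers rather than a plot, we can sort the PC1 loadings directly; a quick sketch:
# the ten variables with the largest absolute loadings on PC1
pc1 <- college_pca$rotation[, "PC1"]
head(pc1[order(abs(pc1), decreasing = TRUE)], 10)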