Let’s work on imputing missing values in the College Scorecard data so we can do clustering. First off, let’s try some very simple methods.

Loading the Data and Counting the Missing Values

Let’s open up the filtered, processed version of the College Scorecard data we have worked on together and settled on using at this point.

library(readr)
colleges <- read_csv("data/scorecard_reduced_features.csv", na = c("NULL", "PrivacySuppressed"))
dim(colleges)
## [1] 2473   57

How many NA values do we have in each column?

sapply(colleges, function(x) sum(is.na(x)))
##                    UNITID                    INSTNM                       ZIP 
##                         0                         0                         0 
##                      HCM2                   CONTROL                    LOCALE 
##                         0                         0                       115 
##                   CCBASIC                      HBCU                       PBI 
##                       200                       115                       115 
##                     ANNHI                    TRIBAL                   AANAPII 
##                       115                       115                       115 
##                       HSI                     NANTI                   MENONLY 
##                       115                       115                       115 
##                 WOMENONLY              ADM_RATE_ALL                   SATVR25 
##                       115                       519                      1036 
##                   SATVR75               SAT_AVG_ALL              DISTANCEONLY 
##                      1036                       831                       115 
##                UGDS_WHITE                UGDS_BLACK                 UGDS_HISP 
##                       338                       338                       338 
##                UGDS_ASIAN                  INEXPFTE                 AVGFACSAL 
##                       338                       115                       130 
##                    PFTFAC                    C150_4              C150_4_BLACK 
##                       142                       519                       679 
##               C150_4_HISP              C150_4_ASIAN               C150_4_AIAN 
##                       664                       767                       904 
##                C150_4_NRA                  PCTFLOAN                   UG25abv 
##                       713                       349                       335 
##   LO_INC_COMP_ORIG_YR6_RT   MD_INC_COMP_ORIG_YR6_RT   HI_INC_COMP_ORIG_YR6_RT 
##                       545                       516                      1457 
## FIRSTGEN_COMP_ORIG_YR6_RT         PAR_ED_PCT_1STGEN                  DEBT_MDN 
##                       585                       281                       305 
##            mn_earn_wne_p8         pct25_earn_wne_p8         pct75_earn_wne_p8 
##                       368                       450                       450 
##         pct90_earn_wne_p8           RPY_3YR_RT_SUPP    LO_INC_RPY_3YR_RT_SUPP 
##                       450                       351                       390 
##    MD_INC_RPY_3YR_RT_SUPP    HI_INC_RPY_3YR_RT_SUPP    NONCOM_RPY_3YR_RT_SUPP 
##                       398                       492                       388 
##  FIRSTGEN_RPY_3YR_RT_SUPP          PCT_LIBERAL_ARTS            PCT_VOCATIONAL 
##                       426                       306                       306 
##             PCT_RELIGIOUS                      COST          PCT_PROFESSIONAL 
##                       306                       458                       306
sapply(colleges, function(x) mean(is.na(x)))
##                    UNITID                    INSTNM                       ZIP 
##                0.00000000                0.00000000                0.00000000 
##                      HCM2                   CONTROL                    LOCALE 
##                0.00000000                0.00000000                0.04650222 
##                   CCBASIC                      HBCU                       PBI 
##                0.08087343                0.04650222                0.04650222 
##                     ANNHI                    TRIBAL                   AANAPII 
##                0.04650222                0.04650222                0.04650222 
##                       HSI                     NANTI                   MENONLY 
##                0.04650222                0.04650222                0.04650222 
##                 WOMENONLY              ADM_RATE_ALL                   SATVR25 
##                0.04650222                0.20986656                0.41892438 
##                   SATVR75               SAT_AVG_ALL              DISTANCEONLY 
##                0.41892438                0.33602911                0.04650222 
##                UGDS_WHITE                UGDS_BLACK                 UGDS_HISP 
##                0.13667610                0.13667610                0.13667610 
##                UGDS_ASIAN                  INEXPFTE                 AVGFACSAL 
##                0.13667610                0.04650222                0.05256773 
##                    PFTFAC                    C150_4              C150_4_BLACK 
##                0.05742014                0.20986656                0.27456531 
##               C150_4_HISP              C150_4_ASIAN               C150_4_AIAN 
##                0.26849980                0.31014962                0.36554792 
##                C150_4_NRA                  PCTFLOAN                   UG25abv 
##                0.28831379                0.14112414                0.13546300 
##   LO_INC_COMP_ORIG_YR6_RT   MD_INC_COMP_ORIG_YR6_RT   HI_INC_COMP_ORIG_YR6_RT 
##                0.22038011                0.20865346                0.58916296 
## FIRSTGEN_COMP_ORIG_YR6_RT         PAR_ED_PCT_1STGEN                  DEBT_MDN 
##                0.23655479                0.11362717                0.12333199 
##            mn_earn_wne_p8         pct25_earn_wne_p8         pct75_earn_wne_p8 
##                0.14880712                0.18196522                0.18196522 
##         pct90_earn_wne_p8           RPY_3YR_RT_SUPP    LO_INC_RPY_3YR_RT_SUPP 
##                0.18196522                0.14193288                0.15770319 
##    MD_INC_RPY_3YR_RT_SUPP    HI_INC_RPY_3YR_RT_SUPP    NONCOM_RPY_3YR_RT_SUPP 
##                0.16093813                0.19894865                0.15689446 
##  FIRSTGEN_RPY_3YR_RT_SUPP          PCT_LIBERAL_ARTS            PCT_VOCATIONAL 
##                0.17226041                0.12373635                0.12373635 
##             PCT_RELIGIOUS                      COST          PCT_PROFESSIONAL 
##                0.12373635                0.18520016                0.12373635

Most of those are not so bad, many around 15% or less. We have made so much improvement!

Substituting a Median Value for Each NA

For a first try at imputing missing values, let’s just try using the median for each column. What are the medians?

library(plyr)
library(dplyr)
median_values <- colleges[,4:57] %>% 
    summarise_each(funs(median(., na.rm = TRUE)))

median_values
## # A tibble: 1 × 54
##    HCM2 CONTROL LOCALE CCBASIC  HBCU   PBI ANNHI TRIBAL AANAPII   HSI NANTI MENONLY WOMENONLY
##   <dbl>   <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl>   <dbl>     <dbl>
## 1     0       2     13      21     0     0     0      0       0     0     0       0         0
##   ADM_RATE_ALL SATVR25 SATVR75 SAT_AVG_ALL DISTANCEONLY
##          <dbl>   <dbl>   <dbl>       <dbl>        <dbl>
## 1      0.66635     460     570        1049            0
## # ... with 36 more variables: UGDS_WHITE <dbl>, UGDS_BLACK <dbl>, UGDS_HISP <dbl>,
## #   UGDS_ASIAN <dbl>, INEXPFTE <dbl>, AVGFACSAL <dbl>, PFTFAC <dbl>, C150_4 <dbl>,
## #   C150_4_BLACK <dbl>, C150_4_HISP <dbl>, C150_4_ASIAN <dbl>, C150_4_AIAN <dbl>, C150_4_NRA <dbl>,
## #   PCTFLOAN <dbl>, UG25abv <dbl>, LO_INC_COMP_ORIG_YR6_RT <dbl>, MD_INC_COMP_ORIG_YR6_RT <dbl>,
## #   HI_INC_COMP_ORIG_YR6_RT <dbl>, FIRSTGEN_COMP_ORIG_YR6_RT <dbl>, PAR_ED_PCT_1STGEN <dbl>,
## #   DEBT_MDN <dbl>, mn_earn_wne_p8 <dbl>, pct25_earn_wne_p8 <dbl>, pct75_earn_wne_p8 <dbl>,
## #   pct90_earn_wne_p8 <dbl>, RPY_3YR_RT_SUPP <dbl>, LO_INC_RPY_3YR_RT_SUPP <dbl>,
## #   MD_INC_RPY_3YR_RT_SUPP <dbl>, HI_INC_RPY_3YR_RT_SUPP <dbl>, NONCOM_RPY_3YR_RT_SUPP <dbl>,
## #   FIRSTGEN_RPY_3YR_RT_SUPP <dbl>, PCT_LIBERAL_ARTS <dbl>, PCT_VOCATIONAL <dbl>,
## #   PCT_RELIGIOUS <dbl>, COST <dbl>, PCT_PROFESSIONAL <dbl>

Now let’s use these values in the data frame.

colleges[4:57] <- colleges[4:57] %>%
    mutate_each(funs(if_else(is.na(.), median(., na.rm = TRUE), .)))

Did this work?

sapply(colleges, function(x) mean(is.na(x)))
##                    UNITID                    INSTNM                       ZIP 
##                         0                         0                         0 
##                      HCM2                   CONTROL                    LOCALE 
##                         0                         0                         0 
##                   CCBASIC                      HBCU                       PBI 
##                         0                         0                         0 
##                     ANNHI                    TRIBAL                   AANAPII 
##                         0                         0                         0 
##                       HSI                     NANTI                   MENONLY 
##                         0                         0                         0 
##                 WOMENONLY              ADM_RATE_ALL                   SATVR25 
##                         0                         0                         0 
##                   SATVR75               SAT_AVG_ALL              DISTANCEONLY 
##                         0                         0                         0 
##                UGDS_WHITE                UGDS_BLACK                 UGDS_HISP 
##                         0                         0                         0 
##                UGDS_ASIAN                  INEXPFTE                 AVGFACSAL 
##                         0                         0                         0 
##                    PFTFAC                    C150_4              C150_4_BLACK 
##                         0                         0                         0 
##               C150_4_HISP              C150_4_ASIAN               C150_4_AIAN 
##                         0                         0                         0 
##                C150_4_NRA                  PCTFLOAN                   UG25abv 
##                         0                         0                         0 
##   LO_INC_COMP_ORIG_YR6_RT   MD_INC_COMP_ORIG_YR6_RT   HI_INC_COMP_ORIG_YR6_RT 
##                         0                         0                         0 
## FIRSTGEN_COMP_ORIG_YR6_RT         PAR_ED_PCT_1STGEN                  DEBT_MDN 
##                         0                         0                         0 
##            mn_earn_wne_p8         pct25_earn_wne_p8         pct75_earn_wne_p8 
##                         0                         0                         0 
##         pct90_earn_wne_p8           RPY_3YR_RT_SUPP    LO_INC_RPY_3YR_RT_SUPP 
##                         0                         0                         0 
##    MD_INC_RPY_3YR_RT_SUPP    HI_INC_RPY_3YR_RT_SUPP    NONCOM_RPY_3YR_RT_SUPP 
##                         0                         0                         0 
##  FIRSTGEN_RPY_3YR_RT_SUPP          PCT_LIBERAL_ARTS            PCT_VOCATIONAL 
##                         0                         0                         0 
##             PCT_RELIGIOUS                      COST          PCT_PROFESSIONAL 
##                         0                         0                         0

Clustering Results with Median Imputation

Let’s do k-means clustering now and see what kind of results we get. Let’s use the broom package to explore how many clusters may be appropriate for this data set. For starters, let’s try out 4 to 20 clusters.

library(broom)
library(tidyr)
set.seed(1234)
kclusts <- data.frame(k=4:20) %>% 
    group_by(k) %>% 
    do(kclust = kmeans(colleges[4:57], .$k))

Now let’s use tidy, augment, and glance to organize these various sets of k-means clusterings.

clusters <- kclusts %>% group_by(k) %>% do(tidy(.$kclust[[1]]))
assignments <- kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], colleges[4:57]))
clusterings <- kclusts %>% group_by(k) %>% do(glance(.$kclust[[1]]))

Specifically, let’s look at the total within sum of squares, saved in the tot.withinss column in clusterings. This shows us the variance within the clusters.

library(ggfortify)
ggplot(clusterings, aes(k, tot.withinss)) +
    geom_line(color = "blue", alpha = 0.5, size = 2) +
    geom_point(size = 0.8)

Notice that the variance decreases with more clusters, but there is a “bend” or “elbow” around 12 clusters. This is around the same place we saw an elbow in the early subset of colleges (!), and tells us we don’t gain much by going beyond 12 or so clusters.

Plotting Clustering Results

Let’s look at how the colleges are distributed in the space of principal component 1 and 2, clustered with \(k = 12\).

set.seed(1234)
autoplot(kmeans(colleges[4:57], 12), data = colleges[4:57], size = 3, alpha = 0.5) + 
    ggtitle("K-Means Clustering of College Scorecard Data") +
    theme(legend.position="none")

Let’s label a few of thes.

library(ggrepel)
set.seed(1234)
my_kmeans <- kmeans(colleges[4:57], 12)
autoplot(my_kmeans, data = colleges[4:57], size = 3, alpha = 0.5) + 
    ggtitle("K-Means Clustering of College Scorecard Data") +
    theme(legend.position="none") +
    geom_text_repel(aes(PC1, PC2, 
                        color = factor(my_kmeans$cluster), 
                        label = ifelse(SAT_AVG_ALL > 1475,colleges$INSTNM,'')),
                    segment.color = NA)

Some of these clusters are very close and/or overlapping as viewed in this space (PC1 and PC2) but they may be more separated in other components/dimensions.

Summary

I am really happy with these results; I am even super happy that, for example, University of Chicago and Harvey Mudd are in a separate cluster than MIT and Harvard. The clustering algorithm is really picking up on subtle differences in colleges here. Let’s look, just for kicks, a that small and spread out cluster that CalTech is part of.

colleges <- colleges %>% 
    mutate(cluster = my_kmeans$cluster) 

colleges %>%
    filter(cluster == 1) %>%
    select(INSTNM)
## # A tibble: 16 × 1
##                                                          INSTNM
##                                                           <chr>
## 1                            California Institute of Technology
## 2                                  Morehouse School of Medicine
## 3              Des Moines University-Osteopathic Medical Center
## 4          University of Massachusetts Medical School Worcester
## 5                             Washington University in St Louis
## 6                                        Albany Medical College
## 7                                 Weill Cornell Medical College
## 8                       Icahn School of Medicine at Mount Sinai
## 9                                     SUNY College of Optometry
## 10         Oklahoma State University Center for Health Sciences
## 11              University of Texas Southwestern Medical Center
## 12                       The University of Texas Medical Branch
## 13                                       SIT Graduate Institute
## 14                              Eastern Virginia Medical School
## 15                 West Virginia School of Osteopathic Medicine
## 16 Louisiana State University Health Sciences Center-Shreveport

Huh, those are almost all medical schools, or schools where the medical school is very important.