Project 2: PubHealth 251D

Visualizing Similarity Among University of California Schools

The goal of this project is to determine which University of California campuses are most similar to one another. To achieve this, we will use the institution-level data in the file MERGED2014_15_PP.csv, loaded below.

Load libraries

library(ggplot2)
library(ggthemes)
library(ggfortify)
library(ape)
library(ggdendro)
library(cluster)

Loading the Data

# -------------------------------------------------------------------------------------
# Read in dataset
# -------------------------------------------------------------------------------------
setwd("~/Desktop/Education_Project/data/")


my_file  <- "MERGED2014_15_PP.csv"

dat      <- read.csv(my_file, row.names = 1, stringsAsFactors = FALSE)

This dataset is rather large and high-dimensional:

dim(dat)

## [1] 7703 1742

Moreover, as is typical with real-world data, many of the values are missing. We can either impute these missing values or remove them. Because the data is so high-dimensional, I opt to identify the features with many missing values and remove them from the dataset. Note that this is done for brevity; in a real-world project it would not be a good idea, as we might lose potentially useful information.

# -------------------------------------------------------------------------------------
# Count NULL and PrivacySuppressed entries in each column
# -------------------------------------------------------------------------------------
# na.rm = TRUE keeps the counts defined for columns that contain NA values
null_counts    <- apply(dat, 2, function(x) sum(x=="NULL", na.rm = TRUE))
privacy_counts <- apply(dat, 2, function(x) sum(x=="PrivacySuppressed", na.rm = TRUE))
total_counts   <- null_counts + privacy_counts

# -------------------------------------------------------------------------------------
# Keep variables with fewer than 3500 NULL/PrivacySuppressed values (of 7703 rows)
# -------------------------------------------------------------------------------------
good_vars     <- colnames(dat)[total_counts < 3500]
good_vars     <- unique(good_vars)



good_vars     <- good_vars[c(3,12:length(good_vars))]
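As an alternative to counting the sentinel strings by hand, `read.csv` can convert them to `NA` on import through its `na.strings` argument. The sketch below uses a tiny in-memory CSV with made-up values (not the real dataset) to show the idea. The column names and the 50% cutoff are assumptions chosen purely for illustration.

```r
# Toy CSV standing in for the real file (all values are made up)
csv_text <- "INSTNM,A,B,C
School 1,1.5,NULL,PrivacySuppressed
School 2,2.0,NULL,0.3
School 3,NULL,NULL,0.7
School 4,4.1,7,PrivacySuppressed"

# Treat both sentinel strings as missing while reading
toy <- read.csv(text = csv_text,
                na.strings = c("NULL", "PrivacySuppressed", "NA"),
                stringsAsFactors = FALSE)

# Fraction of missing entries in each column
missing_frac <- colMeans(is.na(toy))

# Keep columns with at most half of their entries missing (illustrative cutoff)
keep      <- names(missing_frac)[missing_frac <= 0.5]
toy_clean <- toy[, keep, drop = FALSE]

print(missing_frac)
print(keep)
```

A side benefit of this approach is that columns such as `A` are read as numeric directly, so no later coercion step is needed for them.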

Similarity Pre-processing

There are a number of ways to measure similarity between elements in a set. For this case, I will create a distance matrix between the observations. Thus, our data must be encoded as numeric.

# -------------------------------------------------------------------------------------
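# Coerce the data to numeric and drop mostly-zero columns
# -------------------------------------------------------------------------------------
# Drop three columns by position, then coerce everything to numeric; entries that
# cannot be parsed (e.g. "NULL", "PrivacySuppressed") become NA
dat_eda                 <- dat[,-c(1,219,40)]
dat_eda                 <- apply(dat_eda, 2, function(x) as.numeric(x))
dat_eda                 <- data.frame(dat_eda)
# Replace NA with 0 (a crude form of imputation)
dat_eda[is.na(dat_eda)] <- 0
# NOTE: zero_counts is computed on `dat`, which has three more columns than
# `dat_eda`, so the mask is misaligned; na.omit() drops any NA names that result.
# Computing the counts on `dat_eda` would align them but may change the results below.
zero_counts             <- apply(dat, 2, function(x) sum(x==0))
good_vars2              <- colnames(dat_eda)[zero_counts < 5000]
good_vars2              <- unique(good_vars2)
dat_eda                 <- dat_eda[,na.omit(good_vars2)]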

Getting UC schools

As we want to view the similarity between UC schools, we’ll need to select them.

# -------------------------------------------------------------------------------------
# Find data pertaining to UC schools only
# -------------------------------------------------------------------------------------
cali               <- grep("University of California-", dat$INSTNM)
dat_cali           <- dat_eda[cali,]

# -------------------------------------------------------------------------------------
# Remove zero-variance (constant) columns
# -------------------------------------------------------------------------------------
non_zero_var       <- as.vector(sapply(dat_cali, function(x) var(x) != 0))
nonzero_columns    <- names(dat_cali)[non_zero_var]
cols_to_keep       <- names(dat_cali)[names(dat_cali) %in% nonzero_columns]

dat_cali           <- dat_cali[,cols_to_keep]


rownames(dat_cali) <- as.character(dat$INSTNM[cali])

Now our data is ready. Let’s view its dimensions and take a brief look at it to see what we’re working with.

dim(dat_cali)

## [1]  11 276

dat_cali[, sample(ncol(dat_cali), 2)]  # two randomly chosen columns

##                                                  RET_PT4 HI_INC_RPY_3YR_RT
## University of California-Hastings College of Law  0.0000         0.0000000
## University of California-Berkeley                 0.7586         0.9526847
## University of California-Davis                    0.6800         0.9634581
## University of California-Irvine                   0.6364         0.9407059
## University of California-Los Angeles              0.6154         0.9591042
## University of California-Riverside                0.5000         0.8833552
## University of California-San Diego                0.8000         0.9616924
## University of California-San Francisco            0.0000         0.0000000
## University of California-Santa Barbara            0.5185         0.9523810
## University of California-Santa Cruz               0.1429         0.9401955
## University of California-Merced                   1.0000         0.8820513

Create Distance Matrix

d  <- dist(dat_cali)
hc <- hclust(d)
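By default, `dist` computes Euclidean distance on the raw values, so columns measured on large scales can dominate the result. The sketch below, on made-up data with hypothetical column names, shows how standardizing the columns with `scale` can change which observation is nearest. This is not part of the analysis above, only an illustration of the sensitivity.

```r
# Toy data: one small-scale column and one large-scale column (values are made up)
toy <- data.frame(
  rate  = c(0.10, 0.90, 0.12, 0.50),
  count = c(1000, 1010, 2000, 9000),
  row.names = c("A", "B", "C", "D")
)

d_raw    <- as.matrix(dist(toy))         # Euclidean distance on raw values
d_scaled <- as.matrix(dist(scale(toy)))  # Euclidean distance after z-scoring columns

# Name of the nearest other row to `who` in a square distance matrix `m`
nearest <- function(m, who) {
  others <- m[who, colnames(m) != who]
  names(others)[which.min(others)]
}

print(nearest(d_raw, "A"))     # driven almost entirely by `count`
print(nearest(d_scaled, "A"))  # both columns contribute comparably
```

Whether to scale depends on whether large-magnitude variables should carry more weight in the comparison.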

Now we can begin to visualize the similarity.

Visualizing Similarity

Multi-Dimensional Scaling (MDS)

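MDS is similar to PCA, but instead of finding the directions of greatest variance, it looks for a low-dimensional configuration of the observations that preserves a given pairwise distance as closely as possible. In our case, that distance is Euclidean.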
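The plotting code is not shown in this write-up, so here is a minimal sketch of classical MDS using base R's `cmdscale`. It runs on a small random matrix that stands in for `dat_cali`; the toy data and the names `toy`, `d_toy`, and `mds` are assumptions for illustration, and the original figure may have been produced differently.

```r
set.seed(1)
# Toy stand-in for dat_cali: 5 observations, 4 numeric features (random values)
toy <- matrix(rnorm(20), nrow = 5,
              dimnames = list(paste0("School", 1:5), paste0("V", 1:4)))

d_toy <- dist(toy)               # Euclidean distance matrix
mds   <- cmdscale(d_toy, k = 2)  # classical MDS with 2 output dimensions
colnames(mds) <- c("Dim1", "Dim2")

# Base-graphics scatter plot with one label per observation
plot(mds, type = "n", xlab = "Dim1", ylab = "Dim2")
text(mds, labels = rownames(mds))
```

With the real data, passing `d` to `cmdscale` in place of `d_toy` would give one point per campus.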

View raw distance from each other

PCA
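A minimal sketch of a PCA projection with base R's `prcomp`, again on random toy data rather than `dat_cali` (the names `toy`, `pca`, and `scores` are hypothetical). It also checks numerically that classical MDS on Euclidean distances gives the same coordinates as unscaled PCA, up to the sign of each axis.

```r
set.seed(1)
# Toy stand-in for dat_cali: 5 observations, 4 numeric features (random values)
toy <- matrix(rnorm(20), nrow = 5,
              dimnames = list(paste0("School", 1:5), paste0("V", 1:4)))

pca    <- prcomp(toy)   # centers columns by default, does not scale them
scores <- pca$x[, 1:2]  # coordinates on the first two principal components

# Classical MDS on the Euclidean distances of the same data
mds <- cmdscale(dist(toy), k = 2)

# Compare absolute values, since each axis is only defined up to sign
same_up_to_sign <- isTRUE(all.equal(abs(unname(scores)), abs(unname(mds))))
print(same_up_to_sign)

plot(scores, type = "n", xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(scores))
```

Because of this equivalence, the MDS and PCA views should differ only when PCA is run on scaled data or MDS uses a non-Euclidean distance.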

Phylo Trees

Tree 1: Fan Style

Tree 2: Unrooted
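Both tree styles can be sketched by converting an `hclust` object to a `phylo` object with `ape::as.phylo` and plotting it with different `type` values. The example below uses random toy data in place of `dat_cali`; the names `toy`, `hc_toy`, and `tree` are hypothetical, and the original figures may have used other options.

```r
set.seed(1)
# Toy stand-in for dat_cali: 6 observations, 4 numeric features (random values)
toy <- matrix(rnorm(24), nrow = 6,
              dimnames = list(paste0("School", 1:6), paste0("V", 1:4)))

hc_toy <- hclust(dist(toy))  # hierarchical clustering on Euclidean distances

# Only draw the trees if the ape package is installed
if (requireNamespace("ape", quietly = TRUE)) {
  tree <- ape::as.phylo(hc_toy)  # convert the hclust object to a phylo tree
  plot(tree, type = "fan")       # Tree 1: fan layout
  plot(tree, type = "unrooted")  # Tree 2: unrooted layout
}
```

With the real data, `hc` from the distance step would take the place of `hc_toy`.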

Hierarchical Clustering
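A minimal sketch of a dendrogram with base R, using random toy data in place of `dat_cali` (the names `toy`, `hc_toy`, and `groups` are hypothetical). It also shows `cutree`, which turns the tree into a fixed number of clusters; the choice of two clusters is arbitrary here.

```r
set.seed(1)
# Toy stand-in for dat_cali: 6 observations, 4 numeric features (random values)
toy <- matrix(rnorm(24), nrow = 6,
              dimnames = list(paste0("School", 1:6), paste0("V", 1:4)))

hc_toy <- hclust(dist(toy))  # complete linkage is hclust's default method

# Dendrogram with all labels aligned at the bottom
plot(hc_toy, hang = -1, main = "Toy dendrogram", xlab = "", sub = "")

# Cut the tree into two clusters and show the assignment per observation
groups <- cutree(hc_toy, k = 2)
print(groups)
```

Observations that merge low in the tree are the most similar under the chosen distance and linkage.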

Schools most similar to Berkeley

##                University of California-Berkeley 
##                                             0.00 
##           University of California-Santa Barbara 
##                                         15891.86 
##                  University of California-Irvine 
##                                         16745.43 
##               University of California-San Diego 
##                                         20358.18 
##               University of California-Riverside 
##                                         24294.76 
##                   University of California-Davis 
##                                         96049.72 
##              University of California-Santa Cruz 
##                                         97530.24 
##             University of California-Los Angeles 
##                                         97867.28 
##                  University of California-Merced 
##                                        109678.31 
## University of California-Hastings College of Law 
##                                        166069.70 
##           University of California-San Francisco 
##                                        192245.12
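A ranking like the one above can be obtained by converting the `dist` object to a full matrix and sorting one row. The sketch below does this on made-up points with generic labels; the names `toy`, `d_mat`, and `closest` are hypothetical, and this is an assumption about how such a listing could be produced rather than the code actually used.

```r
# Toy data: four labelled points in two dimensions (values are made up)
toy <- data.frame(
  x = c(0, 1, 5, 2),
  y = c(0, 1, 5, 0),
  row.names = c("School A", "School B", "School C", "School D")
)

d_mat <- as.matrix(dist(toy))  # full symmetric distance matrix with row/column names

# Distances from "School A" to every school, smallest first (itself at 0)
closest <- sort(d_mat["School A", ])
print(closest)
```

With the real data, `as.matrix(d)` indexed by the Berkeley row name would play the role of `d_mat`.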

UC similarity

Jared Wilber SID #24881068

December 3, 2016