The goal of this analysis is to determine which University of California campuses are most similar to one another. To achieve this, we will use data from .
Load libraries
library(ggplot2)
library(ggthemes)
library(ggfortify)
library(ape)
library(ggdendro)
library(cluster)
# -------------------------------------------------------------------------------------
# Read in dataset
# -------------------------------------------------------------------------------------
setwd("~/Desktop/Education_Project/data/")
my_file <- "MERGED2014_15_PP.csv"
dat <- read.csv(my_file, row.names = 1, stringsAsFactors = FALSE)
This dataset is rather large and high-dimensional:
dim(dat)
## [1] 7703 1742
Moreover, as is typical with real-world data, many of the values are missing. We can either impute these missing values or remove them. Because the data is so high-dimensional, I opt to identify the features that have many missing values and remove them from the dataset. Note that this is done for brevity; in a real-world project it would not be a good idea, as we may lose potentially useful information.
# -------------------------------------------------------------------------------------
# count NULL and PrivacySuppressed entries in each column
# -------------------------------------------------------------------------------------
# na.rm=TRUE keeps a genuine NA in a column from turning its whole count into NA
null_counts <- apply(dat, 2, function(x) sum(x=="NULL", na.rm=TRUE))
privacy_counts <- apply(dat, 2, function(x) sum(x=="PrivacySuppressed", na.rm=TRUE))
total_counts <- null_counts + privacy_counts
# -------------------------------------------------------------------------------------
# selecting variables that do not contain majority NULL/PrivacySuppressed values
# -------------------------------------------------------------------------------------
good_vars <- colnames(dat)[total_counts < 3500]; good_vars <- unique(good_vars)
good_vars <- good_vars[c(3,12:length(good_vars))]
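For comparison, the imputation alternative mentioned above could look roughly like the following. This is only a sketch on a toy data frame: `toy` and `impute_median` are hypothetical names, the values are made up, and median imputation is just one possible choice rather than the approach taken in this analysis.

```r
# Toy data frame standing in for a few numeric columns (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, 10, 20, NA)
)

# Replace each missing value with the median of the observed values in its column
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

toy_imputed <- as.data.frame(lapply(toy, impute_median))
toy_imputed
```

In the real dataset the `"NULL"` and `"PrivacySuppressed"` strings would first have to be converted to `NA` before a function like this could be applied.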
There are a number of ways to measure similarity between elements in a set. In this case, I will compute a distance matrix between the observations, which requires the data to be encoded as numeric.
# -------------------------------------------------------------------------------------
# keeping data with only numeric mode
# -------------------------------------------------------------------------------------
dat_eda <- dat[,-c(1,219,40)]
# non-numeric strings such as "NULL" become NA here (with a coercion warning)
dat_eda <- data.frame(apply(dat_eda, 2, as.numeric))
dat_eda[is.na(dat_eda)] <- 0
# count zeros per column of dat_eda (not dat) so the counts line up with its columns
zero_counts <- colSums(dat_eda == 0)
good_vars2 <- colnames(dat_eda)[zero_counts < 5000]
dat_eda <- dat_eda[,good_vars2]
As we want to view the similarity between UC schools, we’ll need to select them.
# -------------------------------------------------------------------------------------
# Find data pertaining to UC schools only
# -------------------------------------------------------------------------------------
cali <- grep("University of California-", dat$INSTNM)
dat_cali <- dat_eda[cali,]
# -------------------------------------------------------------------------------------
# Remove zero-variance (constant) columns
# -------------------------------------------------------------------------------------
non_zero_var <- sapply(dat_cali, function(x) var(x) != 0)
dat_cali <- dat_cali[,non_zero_var]
rownames(dat_cali) <- as.character(dat$INSTNM[cali])
Now our data is ready. Let’s view its size and take a brief look at two randomly chosen columns to see what we’re working with.
dim(dat_cali)
## [1] 11 276
dat_cali[,sample(ncol(dat_cali),2)]
##                                                  RET_PT4 HI_INC_RPY_3YR_RT
## University of California-Hastings College of Law  0.0000         0.0000000
## University of California-Berkeley                 0.7586         0.9526847
## University of California-Davis                    0.6800         0.9634581
## University of California-Irvine                   0.6364         0.9407059
## University of California-Los Angeles              0.6154         0.9591042
## University of California-Riverside                0.5000         0.8833552
## University of California-San Diego                0.8000         0.9616924
## University of California-San Francisco            0.0000         0.0000000
## University of California-Santa Barbara            0.5185         0.9523810
## University of California-Santa Cruz               0.1429         0.9401955
## University of California-Merced                   1.0000         0.8820513
d <- dist(dat_cali)
hc <- hclust(d)
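Both calls above rely on defaults: `dist()` computes Euclidean distance and `hclust()` uses complete linkage. A small sketch on made-up data shows what these objects contain; the matrix `toy` and its row names are hypothetical and are not taken from the dataset.

```r
# Four made-up observations in two dimensions (hypothetical values)
toy <- matrix(
  c(0, 0,
    3, 4,
    0, 1,
    10, 10),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), c("x", "y"))
)

d_toy <- dist(toy)                 # method = "euclidean" by default
as.matrix(d_toy)["A", "B"]         # sqrt(3^2 + 4^2) = 5

hc_toy <- hclust(d_toy)            # method = "complete" by default
clusters <- cutree(hc_toy, k = 2)  # assign each observation to one of two clusters
clusters
```

Euclidean distance is sensitive to the scale of each column, so variables measured in large units contribute most to `d`.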
Now we can begin to visualize the similarity.
Multidimensional scaling (MDS) is similar to PCA, but instead of preserving the variance of the observations it preserves a given pairwise distance between them as closely as possible. In our case, that distance is Euclidean.
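The code behind the MDS plot is not shown here. As a rough sketch, classical MDS is available in base R as `cmdscale()`, and the resulting coordinates can be plotted with `ggplot2`. The toy matrix and the `Campus_1` ... `Campus_6` labels below are hypothetical stand-ins for `dat_cali`; with the real data one would presumably pass `d` directly.

```r
library(ggplot2)

# Toy numeric data standing in for dat_cali (hypothetical values)
set.seed(1)
toy <- matrix(rnorm(6 * 4), nrow = 6,
              dimnames = list(paste0("Campus_", 1:6), NULL))

d_toy <- dist(toy)                 # Euclidean distances
mds <- cmdscale(d_toy, k = 2)      # classical MDS, two dimensions

# One row per observation with its two MDS coordinates
mds_df <- data.frame(
  name = rownames(mds),
  dim1 = mds[, 1],
  dim2 = mds[, 2]
)

p <- ggplot(mds_df, aes(x = dim1, y = dim2, label = name)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  labs(x = "MDS dimension 1", y = "MDS dimension 2")
print(p)
```

Points that land close together in this plot are observations whose pairwise distances in the original space are small.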
Tree 1: Fan Style
Tree 2: Unrooted
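The plotting code for the two trees is not shown. One way such figures are commonly drawn, assuming the `ape` package loaded earlier was used for them, is to convert the `hclust` result with `as.phylo()` and pick a layout through the `type` argument of `plot()`. The toy data below is hypothetical; with the real data the conversion would presumably be `as.phylo(hc)`.

```r
library(ape)

# Toy numeric data standing in for dat_cali (hypothetical values)
set.seed(1)
toy <- matrix(rnorm(6 * 4), nrow = 6,
              dimnames = list(paste0("Campus_", 1:6), NULL))

hc_toy <- hclust(dist(toy))
tree <- as.phylo(hc_toy)          # convert the hclust object to an ape "phylo" tree

plot(tree, type = "fan")          # fan layout, as in Tree 1
plot(tree, type = "unrooted")     # unrooted layout, as in Tree 2
```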
## University of California-Berkeley
## 0.00
## University of California-Santa Barbara
## 15891.86
## University of California-Irvine
## 16745.43
## University of California-San Diego
## 20358.18
## University of California-Riverside
## 24294.76
## University of California-Davis
## 96049.72
## University of California-Santa Cruz
## 97530.24
## University of California-Los Angeles
## 97867.28
## University of California-Merced
## 109678.31
## University of California-Hastings College of Law
## 166069.70
## University of California-San Francisco
## 192245.12
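The listing above reads as the Euclidean distance from UC Berkeley to every campus, sorted from nearest to farthest. The code that produced it is not shown; a ranking of this kind can be sketched by converting the `dist` object to a full matrix and sorting one row. The toy matrix and the labels `A` to `D` are hypothetical.

```r
# Toy data standing in for dat_cali (hypothetical values)
toy <- matrix(
  c(0, 0,
    3, 4,
    0, 1,
    10, 10),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), NULL)
)

d_mat <- as.matrix(dist(toy))     # full symmetric distance matrix
ranking <- sort(d_mat["A", ])     # distances from "A", nearest first
ranking
```

With the real data, the row to sort would be the one named `University of California-Berkeley` in `as.matrix(d)`.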