Unsupervised Learning

Task: We are given a set of observations \(x_1, x_2, \cdots, x_n\) where each \(x_i \in R^p\). The interesting cases are those where \(p\) and \(n\) are large, that is, lots of high-dimensional observations. What can you discover from the data?

Aside: Semi-supervised learning is the setting where your data consists of \[ \{ (y_i, x_i) \}_{i=1:n_1}\] \[ \{ x_i \}_{i=n_1+1:n} \] where we assume that all the \(x_i\) are samples from the same distribution.

Principal Component Analysis

Think of your data as an \(n\times p\) matrix \[ X = \left[ \begin{array}{c} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{array} \right]\] Suppose each column has been normalized to have mean zero.

We look for a unit vector \(\phi_1 \in R^p\) such that the scores \[ z_{i1} = \phi_1^T x_i \] have the largest sample variance. That is, we want to solve \[ \max_{\phi_1} \frac{1}{n} \sum_i (\phi_1^T x_i)^2 \] subject to \(\|\phi_1\|=1\). (Since the columns of \(X\) are centered, the \(z_{i1}\) have mean zero, so this objective is indeed their sample variance.)

How do we solve this? Dropping the constant factor \(1/n\), the problem can be rewritten as \[ \max_{\phi_1} \phi_1^T X^T X \phi_1 \] subject to \(\|\phi_1\|=1\). This is the Rayleigh quotient of \(X^T X\), whose maximum over unit vectors is the largest eigenvalue \(\lambda_1\). That is, the maximizer satisfies \[ X^T X \phi_1 = \lambda_1 \phi_1 \] In the language of statistics, \(\phi_1\) is the first principal loading vector because it is the direction most important in explaining the variation in the data; it is also referred to as the first principal component direction. Next we find the second principal loading vector (or second principal component direction) by solving \[ \max_{\phi_2} \phi_2^T X^T X \phi_2 \] subject to \(\|\phi_2\|=1\) and \(\phi_2^T\phi_1=0\). If the data lie close to a low-dimensional subspace, we expect that as we repeat this process, each time finding a unit vector perpendicular to the previous ones, the captured variance will keep going down, and we can stop when it becomes small.

This process can be formalized by recognizing its relation to the Singular Value Decomposition. We are essentially writing down the decomposition \[ X = U \Sigma V^T \] where \(U\) is \(n\times n\), \(V\) is \(p\times p\) with columns \(\phi_j\), and \(\Sigma\) is an \(n\times p\) diagonal matrix; for \(n>p\) it looks like \[ \Sigma = \left[ \begin{array}{ccccc} \sigma_1 & 0 & \cdots & \cdots & 0 \\ 0 & \sigma_2 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \cdots & \cdots & \sigma_p \\ 0 & \cdots & \cdots & \cdots & 0 \\ \vdots & & & & \vdots \\ 0 & \cdots & \cdots & \cdots & 0 \end{array} \right] \] and \(\lambda_j = \sigma_j^2\). If, as we expect, the variance captured by successive loading vectors decreases quickly, then the \(\sigma_j\) should drop off rapidly.
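
Here is a minimal sketch of this connection in R, on simulated data (the simulated matrix and the variable names are illustrative, not part of the USArrests example below):

# verify the SVD/eigenvalue connection on a small simulated, centered data matrix
set.seed(1)
n = 100; p = 4
X = matrix(rnorm(n*p), n, p)
X = scale(X, center = TRUE, scale = FALSE)  # center each column
sv = svd(X)                                 # X = U Sigma V^T
phi1 = sv$v[,1]                             # first principal loading vector
lambda = sv$d^2                             # eigenvalues of X^T X
# compare with prcomp (loading vectors are only defined up to a sign flip)
pr = prcomp(X, center = FALSE)
pr$rotation[,1]                             # should match phi1 up to sign
lambda/(n-1)                                # should match pr$sdev^2

The division by \(n-1\) appears because prcomp reports standard deviations computed with the usual sample-variance convention.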

You can view the \(\phi_j\) as a very special basis for \(R^p\), designed to explain the data \(X\). How many \(\phi_j\) do we need? We calculate the Proportion of Variance Explained (PVE), \[ \mbox{PVE}_j = \frac{ \phi_j^T X^T X \phi_j }{\|X\|_F^2} \] Note that the numerator is just \(\lambda_j\). So we can quit after \(m\) components if \[ \sum_{j=1}^m \mbox{PVE}_j \geq \mbox{tol} \] for some chosen tolerance tol.

Let’s do an example. USArrests has 50 rows (one for each state); each row reports arrests for Murder, Assault, and Rape per 100,000 residents, along with UrbanPop, the percentage of the state’s population living in urban areas.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.3.3
names(USArrests)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
# calculate the mean of the columns
apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916
# let's grab the names of the states for later use
states = row.names(USArrests)
# let's do PCA but scale the data to have mean 0 and var 1
pr.out = prcomp(USArrests, scale = TRUE)
# let's see what's in pr.out
names(pr.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
pr.out$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pr.out$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385
# let's look at the principal components
pr.out$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

The PC1 loading vector has large weights on Murder, Assault, and Rape, while PC2 has a large weight on UrbanPop. A biplot is a convenient way to visualize the outcome. To match the picture in the book, we need to negate rotation and x in pr.out.

pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale = 0)

Let’s see how much of the variance in the data is explained by the principal components

pr.var = pr.out$sdev^2
pve = pr.var/sum(pr.var)
plot(pve, xlab = "PC", ylab = "PVE", ylim = c(0,1), type = 'b')

plot(cumsum(pve), xlab = "PC", ylab = "Cumulative PVE", ylim = c(0,1), type = 'b')
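
To connect this back to the stopping rule above, we can choose the smallest number of components whose cumulative PVE reaches a tolerance; the 0.9 used below is just an illustrative choice:

# smallest m with cumulative PVE of at least 90%
m = which(cumsum(pve) >= 0.9)[1]
m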

Neural Network Example

In this example, the data set is about breakfast cereals. Each row is a cereal brand; the columns are calories, protein, fat, sodium, fiber, etc. We will fit a neural network that predicts a cereal's rating from a handful of these attributes.

data = read.csv("C:/Users/santosa/Documents/R/cereal.csv", header=T)
# Random sampling: use 60% of the rows for the training set
samplesize = floor(0.60 * nrow(data))
set.seed(80)
index = sample(seq_len(nrow(data)), size = samplesize)

# Create training and test set
datatrain = data[ index, ]
datatest = data[ -index, ]
## Scale data for the neural network

# alternative: min-max scaling of every column to [0, 1]
#max = apply(data, 2, "max")
#min = apply(data, 2, "min")
#scaled = as.data.frame(scale(data, center = min, scale = max - min))

# here we z-score the numeric columns (4:16) and keep the first three columns as-is
scaled <- cbind(data[,c(1:3)], scale(data[,c(4:16)], center=TRUE, scale = TRUE))

# load library
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.3.3
# creating training and test set
trainNN = scaled[index , ]
testNN = scaled[-index , ]

# fit neural network
set.seed(2)
NN = neuralnet(rating ~ calories + protein + fat + sodium + fiber, data = trainNN, hidden = 5, linear.output = TRUE)

# plot neural network
plot(NN)

Test NN

# feed the test-set predictor columns to the fitted network
predict_testNN = compute(NN, testNN[,c(4:8)])

plot(testNN$rating, predict_testNN$net.result, col='blue', pch=16, ylab = "predicted rating NN", xlab = "real rating")

abline(0,1)   # points on this line would be perfectly predicted
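
As a follow-up (not part of the original code), we can summarize the test fit with the root mean squared error; since rating was z-scored along with the other numeric columns, this RMSE is on the standardized scale:

# RMSE of the network's predictions on the (standardized) test ratings
rmse = sqrt(mean((testNN$rating - predict_testNN$net.result)^2))
rmse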

Projects for the afternoon (done in groups)

Here are some data sets that I think would be interesting to look at, along with some suggestions for the type of analysis you might do. Do as much as you can by 3:00pm. We will recap at 4:00pm.

http://archive.ics.uci.edu/ml/datasets/Wholesale+customers

  • Analyze the data by region
  • Cluster the data into groups; do the clusters correspond to the regions?
  • Do PCA and examine the leading loading vectors over the six spending attributes (a starter sketch follows this list)
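
A possible starting point for this data set is sketched below; the file name and the column names (Channel, Region, and the six spending columns) are assumptions about the downloaded CSV, so adjust them to match your copy.

# read the downloaded data (adjust the path/file name to your copy)
wholesale = read.csv("Wholesale customers data.csv", header = TRUE)
# keep the six spending attributes; drop Channel and Region
spend = scale(wholesale[, -c(1, 2)])
# k-means with 3 clusters; compare cluster labels to Region
km = kmeans(spend, centers = 3, nstart = 20)
table(km$cluster, wholesale$Region)
# PCA on the spending attributes
pr.wholesale = prcomp(spend)
pr.wholesale$rotation[, 1:2]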

http://archive.ics.uci.edu/ml/datasets/seeds

  • Cluster the data into 3 groups and assess the result
  • Classification task: Do the attributes tell you their class?

http://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition

  • Classification problem with 5 classes

http://archive.ics.uci.edu/ml/datasets/HTRU2

  • Do the data (without the labels) cluster into 2 groups?
  • Classification problem