Dimension reduction

This is an R Markdown document created with examples of techniques of data reduction. Based on Coursera’s Exploratory Data Analysis course and some webpages.


Why

Taken from the web: “Dimensionality Reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions convey much more information. This is typically done while solving machine learning problems to get better features for a classification or regression task.

Heres a contrived example - Suppose you have a list of 100 movies and 1000 people and for each person, you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i’th movie, 1 otherwise ]. You can perform your machine learning task on these vectors directly.. but instead you could decide upon 5 genres of movies and using the data you already have, figure out whether the person likes or dislikes the entire genre and, in this way reduce your data from a vector of size 100 into a vector of size 5 [position i is 1 if the person likes genre i]

The vector of length 5 can be thought of as a good representative of the vector of length 100 because most people might be liking movies only in their preferred genres.

However its not going to be an exact representative because there might be cases where a person hates all movies of a genre except one. “”


PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components.


How does it work?

Starting point: matrix M, where the rows are observations and the columns are variables and the variables have been normalised (subtract the mean, divide by the standard deviation)

We find the eigenvalues and eigenvectors of \(C=MM^T\), the covariance matrix. If some eigenvalues are 0 or very small, we can essentially discard those eigenvalues and the corresponding eigenvectors, hence reducing the dimensionality of the new basis.

How to interpret what we obtained? * the eigenvectors of C are new the orthogonal variables we are looking for * the eigenvalues of C tell us how much variance is explained by each single eigenvector

Now how do we compute these eigenvalues and eigenvectors? We may use SVD.


SVD: singular value decomposition

  1. singular value decomposition: A matrix \(M\) may be decomposed in the form \(M=UDV^T\), where \(V^T\) is the transpose of \(V\), the columns of \(U\) and \(V\) are orthogonal (\(U=U^T\) and \(V=V^T\)) and \(D\) is a diagonal matrix.

  2. diagonalisation of the covariance matrix: since the covariance matrix is symmetric, it is diagonalisable, and the eigenvectors can be normalized such that they are orthonormal: \(C=MM^T=WDW^T\), where \(W^T\) is the transpose of \(W\), the columns of \(W\) are the eigenvectors of \(C\) and \(D\) is a diagonal matrix containing the eigenvalues of \(C\).

  3. Since \(M=UDV^T\), then it is easy to see that \(MM^T=UD^2U^T\). Then,

  • the eigenvectors of the covariance matrix \(C\) are the columns of \(U\)
  • the (non-negative) eigenvalues of the covariance matrix \(C\) are the square root of the non-negative diagonal values of \(D\).

Notation: * vectors of \(U\): left-singular vectors of M * diagonal elements of \(D\): singular values of M * vectors of \(V\): right-singular vectors of M

The singular vectors or the singular values are sometimes referred to as singular components.

Example

Let’s take the following matrix (where again, think of each row as an observation and each column as a variable):

set.seed(12345); 
dataMatrix <- matrix(rnorm(400),nrow=40)

set.seed(678910)
for(i in 1:40){
  # flip a coin
  coinFlip <- rbinom(1,size=1,prob=0.5)
  # if coin is heads add a common pattern to that row
  if(coinFlip){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,3),each=5)
  }
}

hh <- hclust(dist(dataMatrix)); dataMatrixOrdered <- dataMatrix[hh$order,]

image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])

plot of chunk unnamed-chunk-1

Can we use PCA to understand the variation in the data?

svd1 <- svd(scale(dataMatrixOrdered))

Note that svd1 has three components: svd1\(u, svd1\)v, svd1$d. What we see next is that the first component picked up a pattern that automatically separates the rows.

par(mfrow=c(1,2))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd1$u[,1],40:1,,xlab="Row",ylab="First left singular vector",pch=19)

plot of chunk unnamed-chunk-3

The second component is not able to explain as much of the variability between the rows:

par(mfrow=c(1,2))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd1$u[,2],40:1,,xlab="Row",ylab="First left singular vector",pch=19)

plot of chunk unnamed-chunk-4

So exactly how much variability is explained by each singular vector? We can think of each singular value as representing a percent of the total variation in the data that’s explained by that particular component.

par(mfrow=c(1,2))
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)

plot of chunk unnamed-chunk-5


Note that the first right singular vector automatically separates the columns.

par(mfrow=c(1,2))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd1$v[,1],xlab="Column",ylab="First right singular vector",pch=19)

plot of chunk unnamed-chunk-6

Again, not much information is contained in the second singular vector:

par(mfrow=c(1,2))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd1$v[,1],xlab="Column",ylab="First right singular vector",pch=19)

plot of chunk unnamed-chunk-7

SVD and PCA: how to do it in R

Though SVD and PCA are numerically two different techniques (from what i understand: one requires that we compute the correlation matrix, which is not always wise to do), the results are the same, as we would expect

svd1 <- svd(scale(dataMatrixOrdered))
pca1 <- prcomp(dataMatrixOrdered,scale=TRUE)
plot(pca1$rotation[,1],svd1$v[,1],pch=19,xlab="Principal Component 1",ylab="Right Singular Vector 1")
abline(c(0,1))

plot of chunk unnamed-chunk-8

Word of caution

The svd function does not work if there are infinite or missing values on the data matrix. Let’s consider this data matrix:

dataMatrix2 <- dataMatrixOrdered
## Randomly insert some missing data
dataMatrix2[sample(1:100,size=40,replace=FALSE)] <- NA
head(dataMatrix2)
##         [,1]    [,2]   [,3]     [,4]     [,5]  [,6]  [,7]   [,8]  [,9]
## [1,]      NA -1.0494  1.261  0.19528  0.78680 3.167 3.577 4.3203 1.534
## [2,]      NA  0.4762  2.180  0.06777 -0.44677 3.628 2.044 5.2679 3.473
## [3,]  0.2987      NA     NA -0.42258 -1.81571 3.069 1.841 3.4643 3.461
## [4,]  1.8051      NA -0.536  0.54017  0.07642 3.672 2.837 3.0382 2.858
## [5,]  2.1968 -0.7846     NA  0.54840 -0.50519 2.221 2.298 2.9122 2.913
## [6,] -0.8864  0.6912 -0.505 -1.01112  2.48355 3.660 2.585 0.8497 1.481
##      [,10]
## [1,] 3.120
## [2,] 2.967
## [3,] 5.260
## [4,] 2.849
## [5,] 3.498
## [6,] 4.163

The function does not work on this dataset:

svd1 <- svd(scale(dataMatrix2)) ## Doesn't work!
## Error: infinite or missing values in 'x'

What we can do is use the inpute.knn function, which takes the missing value in a row and replaces it by the k nearest neighbours

library(impute) ## Available from http://bioconductor.org
dataMatrix2 <- impute.knn(dataMatrix2, k=5)$data
svd2 <- svd(scale(dataMatrix2))