Abstract

This is home project number 3 on SVD and PCA. Trying to familiarize myself with concepts (part1) and application of SVD and PCR on real-life imaging data (part2).

PART 1: SVD AND PCA ON PROTEIN EXPRESSION DATA

1. Data Sourse

Data set used in this project if ‘Mice Protein Expression Data Set’ from UCI machine learning repository, and meta-data is here (http://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression#). This data set is from a research trying to understand molecular events during associative learning process in two different groups of mice, normal and Down syndrome. Different groups of mice were exposed to context fear conditioning and expression of 77 proteins detectable in their cerebral cortex were measured. 15 registries were taken for each protein per mice. This real data set is at dimension of 1080 by 77. You see several columns more on labelings.

Classes of mice: c-CS-s: control mice, stimulated to learn, injected with saline (9 mice) c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice) c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice) c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)

t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice) t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice) t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice) t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)

2. Data loading and processing

Download dataset from “http://archive.ics.uci.edu/ml/machine-learning-databases/00342/” as .xls file. Converted it to csv and read it into r stored in df1 as a dataframe. Missing values were delt with and extra columns deleted.

library(ggplot2)
library(R.matlab)

## R.matlab v3.3.0 (2015-09-22) successfully loaded. See ?R.matlab for help.
## 
## Attaching package: 'R.matlab'
## 
## The following objects are masked from 'package:base':
## 
##     getOption, isOpen

df1 <- read.csv("Data_Cortex_Nuclear.csv")

Imputing NAs Many NAs exist in this data frame.

#How many NAs each rows has?
apply(df1, 1, function(x)sum(is.na(x)))
#How many rows has at least 1 NAs?
sum(apply(df1, 1, function(x)sum(is.na(x)))!=0)
#How much % of rows has NAs?
sum(apply(df1, 1, function(x)sum(is.na(x)))!=0)/nrow(df1)

As you see, 528 rows which account for 48.9% of all rows has NAs. So it is better not to exlude all row s with NA, because we are loosing too much information Let’s take a further look at NAs

#How those NAs distributed among rows
table(apply(df1, 1, function(x)sum(is.na(x))))

## 
##   0   1   2   3   4   5  43 
## 552 165 120 117 104  19   3

Most of the rows contains less than 6 NAs out of 77 variables. For then we can fill the value in by using the average of the same class mice and treatments (according to column “class”). For the rest 3 rows which contain 43 NAs out of 77 variables, we will exclude them.

#Exclude 3 rows with 43 NAs.
df2<-df1[!(apply(df1, 1, function(x)sum(is.na(x)))==43),]
dim(df2)

## [1] 1077   82

#Filling in the local means (mean calculated from available values from same class based on variable "class") in NAs. Before that, subset columns with data only, store class columne in separate variable names class
classes <- df2[,"class"]
df3 <- df2[, -c(1, (ncol(df2)-3):ncol(df2))] 
dim(df3)

## [1] 1077   77

for(i in 1:nrow(df3)){
  for(j in 1:ncol(df3)){
    if(is.na(df3[i,j])){
      df3[i,j] <- tapply(df3[,j],classes, mean, na.rm=T)[classes[j]]
    }
  }
}

Now we have a matrix of dimension 1077 * 77 with no NAs.

Normalization of data to make the data fit PCA better.

normalizeVector <- function(x){
  #center it
  y <- x - mean(x)
  #make length one
  z <- y/as.numeric(sqrt(y%*%y))
  return(z)
}
df4 <- apply(df3, 2, normalizeVector)

3. PCA and plot

sv<-svd(df4)
#plot1 of lambda/sum(lambda) v.s. index
l <- sv$d^2
sl <- sum(l)
qplot(1:length(l), l/sl, main="Proportions of Eigen Values", ylab="Eigenvalue(i)/sum(eigenvalues)", xlab="index")

dim(sv$u)

## [1] 1077   77

dim(sv$v)

## [1] 77 77

# get matrix W = XV
w <- df4 %*% sv$v
dim(w)

## [1] 1077   77

#Plot principal components
par(mfrow=c(1,2), pin=c(1.9,1.9))
plot(w[,1],w[,2], main="PC1 vs PC2", xlab="PC1", ylab="PC2", xlim=c(-0.5,0.5), ylim=c(-0.5,0.5))
plot(w[,1],w[,3], main="PC1 vs PC3", xlab="PC1", ylab="PC3", xlim=c(-0.5,0.5), ylim=c(-0.5,0.5))

plot(w[,2],w[,3], main="PC2 vs PC3", xlab="PC2", ylab="PC3", xlim=c(-0.5,0.5), ylim=c(-0.5,0.5))
plot(w[,1],w[,70], main="PC1 vs PC70", xlab="PC1", ylab="PC70", xlim=c(-0.5,0.5), ylim=c(-0.5,0.5))

As you see according to the figuers, the wider those points spans, means higher variation in that principal direction. So from 1,2,3,70th principal direction. 1 is highest and 70 is smallest just as we expected.

PART 2: PCA ON IMAGES

1. Read in data

Data are provided by Dr. Jingyong Su and downloaded from course webpage in blackboard system of Texas Tech University. Data provided is in matlab file form, so R.matlab package is used to read in the data.

img<- readMat("HA4c_dat.mat")[[1]]
dim(img)

## [1] 200 644

2. SVD on Matrix

# center the data
img2 <- scale(img, scale=F)
sv2 <- svd(cov(img2))
#get eigenvectors/principal directions/loadings
V <- sv2$v
dim(sv2$v)

## [1] 644 644

dim(sv2$u)

## [1] 644 644

length(sv2$d)

## [1] 644

#image first three principal directions
par(pin=c(2.3*0.5,2.8*.5), mfrow=c(1,3))
image(t(matrix(V[,1], nrow=28)), ylim=c(1,0), main="v1", col = grey(seq(0, 1, length = 256)))
image(t(matrix(V[,2], nrow=28)), ylim=c(1,0), main="v2", col = grey(seq(0, 1, length = 256)))
image(t(matrix(V[,3], nrow=28)), ylim=c(1,0), main="v3", col = grey(seq(0, 1, length = 256)))

# image(t(matrix(V[,4], nrow=28)), ylim=c(1,0), main="v4", col = grey(seq(0, 1, length = 256)))

From the 4 pictures of v1 through v4, we can see the configur of human face is from clear to obscure. Which means the first principal contain most variation from all the human face pictures so we can still tell roughly it is a human. However when we proceed to v4, it contains much less variations that we saw in all the human face pictures, so we can barely tell it is a human face.

3. Projection of first picture onto first several principal subspaces.

Take picture one and map to first several principal subspace.

# Function for projection of a vector on first n principal subspace
proj <- function(x, V, n){
  if(n==0){return(x)}
  if(n==1){
    y <- x %*% V[,1] * V[,1]
    return(y)
  }
  y <- proj(x,V,n-1) + x %*% V[,n] * V[,n]
  return(y)
}
#plot original pic1
par(pin=c(2.3*0.5,2.8*.5), mfrow=c(1,3))
p0 <- img[1,]
image(t(matrix(p0, nrow=28)), ylim=c(1,0), col = grey(seq(0, 1, length = 256)))

# plot
par(pin=c(2.3*0.5,2.8*.5), mfrow=c(3,3))

for(i in c(1,6,20,50,80,150,300,450,644)){
  image(t(matrix(proj(p0, V, i), nrow=28)), ylim=c(1,0), col = grey(seq(0, 1, length = 256)))
}

As you see, the 1st projection picture is exactly the same as the picture produced by first principal direction v1. Because we just scale it up and R image function will adjuct the scale automatically so they are the same. When we are adding more projections on orthogonal dimensions back, it is more and more similar to original pictures. When you add all 644 projections back, it is exactly the same as original one.

So basically, PCA on all 200 pictures has weighted those 644 pixels in 644 different ways. which means it form a new basis of 644 orthogonal vectors in V. So each pictures could be decomposed to 644 projections on to these 644 vectors. And the 644 vectors has been well ordered according to their importance regarding how much variation they captured. So first several vectors could be used to represent a human face in general. That’s why you are see human face by imaging the first principal direction v1.

However, I am wondering if this could be used in human face recogonition by consider only the first several dimension of a picture?

Thanks for reading.

Home Work 3 for STAT5371

Li Sun

Wednesday, September 23, 2015