1. Principal Components Analysis

Perform PCA on USArrests dataset.

states=row.names(USArrests)
names(USArrests)

## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

Brief examination of the data shows vastly different means.

apply(USArrests,2,mean)

##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

# note that these features have vastly different variances
apply(USArrests,2,var)

##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

Standardize the dataset s.t. vars have mean zero and SD 1 and perform PCA.

pr.out=prcomp(USArrests,scale=TRUE) # prcomp centers vars to mean zero by default; scale set to TRUE sets SD's to 1.
names(pr.out)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

The center and scale components are the means and SDs of the vars pre-scaling.

pr.out$center

##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

pr.out$scale

##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

The rotation matrix is akin to the covariance matrix. Each column contains the corresponding principal component loading vector, before the reduction.

pr.out$rotation

##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

biplot(pr.out,scale=0)

# because principal components are unique up to scalar multiplication:
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out,scale=0)

We get SDs for the PCs.

pr.out$sdev

## [1] 1.5748783 0.9948694 0.5971291 0.4164494

# similarly, var
pr.var=pr.out$sdev^2

Compute proportion of var explained by each PC by divinding the var explained by each PC by total var.

pve=pr.var/sum(pr.var)
pve

## [1] 0.62006039 0.24744129 0.08914080 0.04335752

The first PC explains 62%, the second 24.7% of total var, and so forth. Plot PVE explained by each PC, as well as cumulative PVE.

plot(pve,xlab="PC",ylab="Proportion of Var Explained",ylim=c(0,1),type='b')

plot(cumsum(pve),xlab="PC",ylab="Cumulative Propotion of Var Explained",ylim=c(0,1),type='b')

2. K-Means Clustering

Start with example dataset where there are two true clusters, and the first 25 obs have a mean shift relative to the next 25.

set.seed(2)
x=matrix(rnorm(50*2),ncol=2)
x[1:25,1]=x[1:25,1]+3
x[1:25,2]=x[1:25,2]-4
plot(x)

Now perform Kmeans with K=2 like so.

km.out=kmeans(x,2,nstart=20)

Cluster assignments of the 50 observations in km.out$cluster.

km.out$cluster

##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Replot data with cluster colored.

plot(x,col=(km.out$cluster+1),main="K-Means Clustering Results with K=2",xlab="",ylab="",pch=20,cex=2)

PCA_and_K_Means

1. Principal Components Analysis

2. K-Means Clustering