Perform PCA on USArrests dataset.
states=row.names(USArrests)
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
Brief examination of the data shows vastly different means.
apply(USArrests,2,mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
# note that these features have vastly different variances
apply(USArrests,2,var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
Standardize the dataset s.t. vars have mean zero and SD 1 and perform PCA.
pr.out=prcomp(USArrests,scale=TRUE) # prcomp centers vars to mean zero by default; scale set to TRUE sets SD's to 1.
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
The center and scale components are the means and SDs of the vars pre-scaling.
pr.out$center
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
pr.out$scale
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
The rotation matrix is akin to the covariance matrix. Each column contains the corresponding principal component loading vector, before the reduction.
pr.out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
biplot(pr.out,scale=0)
# because principal components are unique up to scalar multiplication:
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out,scale=0)
We get SDs for the PCs.
pr.out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
# similarly, var
pr.var=pr.out$sdev^2
Compute proportion of var explained by each PC by divinding the var explained by each PC by total var.
pve=pr.var/sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
The first PC explains 62%, the second 24.7% of total var, and so forth. Plot PVE explained by each PC, as well as cumulative PVE.
plot(pve,xlab="PC",ylab="Proportion of Var Explained",ylim=c(0,1),type='b')
plot(cumsum(pve),xlab="PC",ylab="Cumulative Propotion of Var Explained",ylim=c(0,1),type='b')
Start with example dataset where there are two true clusters, and the first 25 obs have a mean shift relative to the next 25.
set.seed(2)
x=matrix(rnorm(50*2),ncol=2)
x[1:25,1]=x[1:25,1]+3
x[1:25,2]=x[1:25,2]-4
plot(x)
Now perform Kmeans with K=2 like so.
km.out=kmeans(x,2,nstart=20)
Cluster assignments of the 50 observations in km.out$cluster.
km.out$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Replot data with cluster colored.
plot(x,col=(km.out$cluster+1),main="K-Means Clustering Results with K=2",xlab="",ylab="",pch=20,cex=2)