December 6, 2016

What is Principal Components Analysis (PCA)?

The basic idea of principal components analysis is to transform the data so that the result has certain desirable properties:

  • The data are transformed into multiple components
  • Each component captures the maximum remaining variability along its "direction"
  • The components are mutually orthogonal (every pair has a correlation of 0)
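
These properties can be checked numerically. The following is a minimal sketch, assuming the built-in USArrests data and standardized variables; the object names (pr, scores, score_cor, loading_cross, comp_var) are illustrative only.

```r
# Fit PCA on the built-in USArrests data, standardizing each variable
pr <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Component scores: one row per state, one column per component
scores <- pr$x

# Correlations between distinct components are zero up to rounding error
score_cor <- cor(scores)
print(round(score_cor, 10))

# The loading vectors are orthonormal: t(V) %*% V is the identity matrix
loading_cross <- crossprod(pr$rotation)
print(round(loading_cross, 10))

# The variance captured by each component is in decreasing order
comp_var <- apply(scores, 2, var)
print(comp_var)
```

Both printed matrices should be the 4 x 4 identity, which is what "orthogonal" means here for the scores and for the loading vectors.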

Flirting with the Math

Without delving into the mathematical details, PCA performs a matrix decomposition of the data. Assuming the variables are the columns and the observations are the rows, then

  • PCA computes the eigenvalues and eigenvectors of the covariance matrix of the data (the correlation matrix when the variables are scaled)
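
The link to eigenvalues and eigenvectors can be verified directly. This is a sketch, assuming the USArrests data with scaled variables, so the relevant matrix is the correlation matrix; the names eig, eigval_match and max_diff are illustrative. Note that prcomp itself uses a singular value decomposition internally, so the comparison is of results, not of method.

```r
X <- USArrests

# PCA on standardized variables
pr <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigen decomposition of the correlation matrix of the same data
eig <- eigen(cor(X))

# Eigenvalues equal the variances of the principal components
eigval_match <- all.equal(eig$values, pr$sdev^2)
print(eigval_match)

# Eigenvectors equal the loadings, up to the sign of each column
max_diff <- max(abs(abs(eig$vectors) - abs(unname(pr$rotation))))
print(max_diff)
```

The absolute values are compared because the sign of each eigenvector is arbitrary: flipping a whole column of loadings describes the same component.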

How is PCA used?

Multiple Uses of PCA

  • Unsupervised learning (dimension reduction)
  • Supervised learning (removing highly correlated predictors, i.e. multicollinearity)
  • Principal components regression
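
As an illustration of the last item, principal components regression replaces the original predictors with a few component scores and then fits an ordinary linear model. The sketch below makes several assumptions that are not part of the slides: Murder is treated as the response, the other three USArrests variables are the predictors, and k = 2 components are kept.

```r
# Assumed setup: predict Murder from the remaining three variables
predictors <- USArrests[, c("Assault", "UrbanPop", "Rape")]
response <- USArrests$Murder

# PCA on the standardized predictors only (the response is left out)
pr_x <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Keep the first k component scores (k = 2 is an arbitrary choice here)
k <- 2
pcr_data <- data.frame(Murder = response, pr_x$x[, 1:k, drop = FALSE])

# Ordinary least squares on the uncorrelated component scores
pcr_fit <- lm(Murder ~ ., data = pcr_data)
print(coef(pcr_fit))
print(summary(pcr_fit)$r.squared)
```

Because the scores are uncorrelated with each other, the multicollinearity among the original predictors does not carry over to this regression. In practice k would usually be chosen by cross-validation rather than fixed in advance.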

Examine some code and analysis for PCA

usdata<-USArrests
head(usdata)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

# mean of each variable (column)
apply(usdata,2,mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
# variance of each variable (column)
apply(usdata,2,var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

The variables are measured in different units, and their variances differ by orders of magnitude

Principal Component Loadings

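# PCA on standardized variables (mean 0, standard deviation 1)
pr_comp <- prcomp(usdata, center = TRUE, scale. = TRUE)
# PCA on centered but unscaled variables, for comparison
pr_comp_unscale <- prcomp(usdata, center = TRUE, scale. = FALSE)
# loadings of each variable on each component
pr_comp$rotation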
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
pr_comp_unscale$rotation
##                 PC1         PC2         PC3         PC4
## Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
## Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
## UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
## Rape     0.07515550  0.20071807  0.97408059  0.07232502

Scaled and Unscaled PCA

Scaling is critical for PCA
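
One way to see the effect of scaling is to draw the two fits side by side and compare the first component's loadings. This is a sketch, assuming biplots are a suitable display; the plot titles and object names are illustrative.

```r
# Fit PCA with and without scaling the variables
pr_scaled <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pr_unscaled <- prcomp(USArrests, center = TRUE, scale. = FALSE)

# Side-by-side biplots; scale = 0 draws the loading arrows unrescaled
op <- par(mfrow = c(1, 2))
biplot(pr_scaled, scale = 0, main = "Scaled")
biplot(pr_unscaled, scale = 0, main = "Unscaled")
par(op)

# Loadings of the first component under each fit
pc1_scaled <- pr_scaled$rotation[, "PC1"]
pc1_unscaled <- pr_unscaled$rotation[, "PC1"]
print(round(cbind(scaled = pc1_scaled, unscaled = pc1_unscaled), 3))
```

Without scaling, the first component is almost entirely Assault, simply because Assault has by far the largest variance. With scaling, the loadings on the first component are spread across all four variables.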

Variance Explained by the Components
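
The proportion of variance explained by each component can be computed from the component standard deviations. This is a minimal sketch, assuming the scaled fit on USArrests; the names pr_var and pve are illustrative.

```r
# PCA on standardized variables
pr <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance of each component is the square of its standard deviation
pr_var <- pr$sdev^2

# Proportion of total variance explained by each component
pve <- pr_var / sum(pr_var)
print(round(pve, 4))

# Cumulative proportion, useful for choosing how many components to keep
print(round(cumsum(pve), 4))

# Scree plot of the proportion of variance explained
plot(pve, type = "b", ylim = c(0, 1),
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```

The same figures are reported by summary(pr) in its "Proportion of Variance" and "Cumulative Proportion" rows.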

Reference

An Introduction to Statistical Learning
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani