Principal Component Analysis (PCA)

Ashish Dalal
30 November, 2015

Crux PCA Ideas

  • Dimensionality Reduction Technique (Variable Reduction Strategy)
  • A tool for data visualization and preprocessing
  • Transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components
  • Captures as much information in the original variables as possible
  • Less Noisy Results in the end
  • Essence: When presented with a large set of correlated variables, principal components allow us to summarize the dataset with a smaller number of representative variables that collectively explain most of the variability in the original set

Principal Components

  • Consider a case of visualizing n observations with p features
  • Scatterplots of the data in 2D are useful for capturing correlation
  • However, as p grows large, we need alternatives, since there are p(p-1)/2 possible pairwise scatterplots
  • Specifically, we want to find a low-dimensional representation of the data that captures as much information as possible
  • PCA seeks a small number of dimensions that are as interesting as possible, where interesting is measured by how much the observations vary along each dimension
  • Each of the dimensions found by PCA is a linear combination of the p features, so this is technically not a form of feature selection.
  • Pre-requisite: Before computing PCA, the variables must be centered to have mean zero, and they should also be scaled to unit variance unless they are all measured in the same units
  • For an explanation on how principal components are found, please refer here
  • Once PCA is done, we measure how much information is lost by utilizing principal components. To do this we can compute the proportion of variance explained (PVE) by each principal component
  • It is generally best explained as a cumulative plot, such that we can visualize the PVE for each component and the total variance explained
  • Once we have this measurement, we can judge whether the principal components explain enough of the variance to provide an accurate summary
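
The pre-requisite and the linear-combination point above can be checked on a small example. This is only a minimal sketch: it uses R's built-in USArrests data as a stand-in (an assumption, not the dataset analysed later), and the names pca_raw, pca_std and manual_scores are hypothetical.

```r
# Stand-in data (assumption): USArrests ships with base R
X <- USArrests

# PCA without and with standardization of the variables
pca_raw <- prcomp(X, center = TRUE, scale. = FALSE)
pca_std <- prcomp(X, center = TRUE, scale. = TRUE)

# Without scaling, the variable with the largest variance dominates PC1
sapply(X, var)
round(pca_raw$rotation[, 1], 3)
round(pca_std$rotation[, 1], 3)

# Each component is a linear combination of the standardized features:
# scores = standardized data %*% loadings
Z <- scale(X, center = TRUE, scale = TRUE)
manual_scores <- Z %*% pca_std$rotation
all.equal(unname(manual_scores), unname(pca_std$x))
```

Comparing the two PC1 loading vectors shows why standardizing matters when variables are measured on different scales.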

When to say enough?

  • In general, we would like to use the smallest number of principal components required to get a good understanding of the data
  • However, there is no single threshold that answers this question
  • Arguably, the best way to do this is to visualize the data in a scree plot
  • It is simply a plot of the PVE of each principal component, in order
  • We look for the elbow: the point in the scree plot where the PVE drops off, such that additional principal components add little further explained variance
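
One way to make this choice concrete is to look at the running total of the PVE. The sketch below again uses the built-in USArrests data as a stand-in (an assumption), and the 90% cut-off is a hypothetical rule of thumb chosen only for illustration, since there is no single correct threshold.

```r
# Stand-in data (assumption): built-in USArrests, standardized before PCA
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained (PVE) and its running total
pve <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)

# Hypothetical rule of thumb: keep the fewest components that reach 90%
threshold <- 0.90
n_keep <- which(cum_pve >= threshold)[1]

round(rbind(pve = pve, cum_pve = cum_pve), 3)
n_keep
```

A fixed threshold like this is best treated as a starting point and checked against the elbow in the scree plot.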

PCA in action

#clearing the workspace prior to start
rm(list = ls())

#loading up libraries
library(stats)
library(graphics)
library(knitr) #provides kable(), used below to print tables

#reading the dataset
data <- read.csv("~/Desktop/data.csv")

#checking the dimension of data
dim(data)
[1] 768  10
#getting brief data description
names(data)
 [1] "Compactness"    "SurfaceArea"    "WallArea"       "RoofArea"      
 [5] "OverallHeight"  "Orientation"    "GlazingArea"    "Gadistribution"
 [9] "HeatingLoad"    "CoolingLoad"   
kable(summary(data[,1:7]))
Statistic  Compactness  SurfaceArea  WallArea  RoofArea  OverallHeight  Orientation  GlazingArea
Min.       0.6200       514.5        245.0     110.2     3.50           2.00         0.0000
1st Qu.    0.6825       606.4        294.0     140.9     3.50           2.75         0.1000
Median     0.7500       673.8        318.5     183.8     5.25           3.50         0.2500
Mean       0.7642       671.7        318.5     176.6     5.25           3.50         0.2344
3rd Qu.    0.8300       741.1        343.0     220.5     7.00           4.25         0.4000
Max.       0.9800       808.5        416.5     220.5     7.00           5.00         0.4000
kable(summary(data[,8:10]))
Statistic  Gadistribution  HeatingLoad  CoolingLoad
Min.       0.000           6.01         10.90
1st Qu.    1.750           12.99        15.62
Median     3.000           18.95        22.08
Mean       2.812           22.31        24.59
3rd Qu.    4.000           31.67        33.13
Max.       5.000           43.10        48.03
  • Conclusion: We can see the data have very different descriptive statistics. Further, the variables are measured on totally different scales
  • Lesson: If we don't standardize the data now, the variables with the largest variances will dominate the principal components
  • Perform PCA using the prcomp() function; the rotation matrix provides the principal component loadings
pr.out <- prcomp(data, scale. = TRUE)
kable(pr.out$rotation[,1:5])
Variable               PC1         PC2         PC3         PC4         PC5
Compactness     -0.3782382   0.3780101  -0.0961734   0.0004256  -0.0056578
SurfaceArea      0.3876436  -0.3621610   0.0909008  -0.0004457   0.0038601
WallArea        -0.1062748  -0.6842761   0.3728286  -0.0074353  -0.1388195
RoofArea         0.4293324  -0.0226808  -0.0914195   0.0031563   0.0708081
OverallHeight   -0.4291436   0.0019210   0.0925547  -0.0028939  -0.0627936
Orientation     -0.0011190  -0.0049823   0.0011794   0.9998752  -0.0038127
GlazingArea     -0.0463028  -0.3041805  -0.6465749  -0.0000632   0.6363451
Gadistribution  -0.0154822  -0.1882539  -0.6382946  -0.0029063  -0.7459251
HeatingLoad     -0.4021382  -0.2718396  -0.0304048  -0.0063587   0.0774642
CoolingLoad     -0.4034644  -0.2351787   0.0125631   0.0112517   0.0661125
kable(pr.out$rotation[,5:10])
Variable               PC5         PC6         PC7         PC8         PC9        PC10
Compactness     -0.0056578  -0.3088552   0.3809042  -0.0194190   0.6811065   0.0000000
SurfaceArea      0.0038601   0.1313702  -0.0320973  -0.0021671   0.5065943   0.6598204
WallArea        -0.1388195  -0.4789628   0.1286283   0.0476003   0.0844790  -0.3267898
RoofArea         0.0708081   0.3594229  -0.0934213  -0.0251022   0.4531998  -0.6766428
OverallHeight   -0.0627936   0.0113953  -0.8504495   0.1171563   0.2572829   0.0000000
Orientation     -0.0038127  -0.0083547  -0.0041101  -0.0109894  -0.0001494   0.0000000
GlazingArea      0.6363451  -0.2607372  -0.0899318   0.0786798  -0.0086035   0.0000000
Gadistribution  -0.7459251  -0.0075680  -0.0045351   0.0202427  -0.0007590   0.0000000
HeatingLoad      0.0774642   0.3848062   0.0982822  -0.7741000   0.0213652   0.0000000
CoolingLoad      0.0661125   0.5589665   0.2959053   0.6140397   0.0142768   0.0000000
  • We see that there are 10 principal components in total, one for each of the 10 variables
biplot(pr.out, scale=0)

[Figure: biplot of the first two principal components]

Interpretation of Biplot

  • Start by looking at the axes: PC1 on the x-axis and PC2 on the y-axis. Each arrow shows a variable's loadings on these two components, that is, how strongly and in which direction it contributes to each
  • We see that two of the variables, OverallHeight and RoofArea, lie almost entirely along the first principal component, whereas the rest of the variables load on both principal components
  • The row numbers in black show where each observation falls along the PC directions. For example, the 48th observation points to a large roof area, whereas the 768th observation points to large surface area, large wall area, large glazing area and high Gadistribution
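
What the biplot shows can also be read off numerically. The sketch below is a minimal illustration on the built-in USArrests data (an assumption, used only as a stand-in); loadings_12 and arrow_angle are hypothetical names. With scale = 0, the biplot arrows are drawn from the loadings themselves, so ranking the loadings gives the same picture as the arrows.

```r
# Stand-in data (assumption): built-in USArrests, standardized before PCA
pca <- prcomp(USArrests, scale. = TRUE)

# Loadings on the first two components (the arrow coordinates when scale = 0)
loadings_12 <- pca$rotation[, 1:2]

# Rank variables by the absolute size of their loading on each component
sort(abs(loadings_12[, "PC1"]), decreasing = TRUE)
sort(abs(loadings_12[, "PC2"]), decreasing = TRUE)

# Direction of each arrow in the PC1-PC2 plane, in degrees from the PC1 axis
arrow_angle <- atan2(loadings_12[, "PC2"], loadings_12[, "PC1"]) * 180 / pi
round(arrow_angle, 1)

# Scores of the first few observations: where their labels are drawn
head(pca$x[, 1:2])
```

An arrow whose angle is close to 0 or 180 degrees lies along PC1, while one close to 90 or -90 degrees lies along PC2.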

Checking out contribution of principal components

  • The $sdev element holds the standard deviation of each component; the variance explained by each component can be computed by squaring these:
pr.var <- pr.out$sdev^2
pr.var
 [1] 5.222885e+00 1.533386e+00 1.218894e+00 1.000177e+00 8.047700e-01
 [6] 1.630918e-01 3.308280e-02 1.935870e-02 4.353904e-03 4.582035e-30
  • Then, to compute the proportion of variance explained by each component, we simply divide it by the total variance
pve <- pr.var / sum(pr.var)
pve * 100
 [1] 5.222885e+01 1.533386e+01 1.218894e+01 1.000177e+01 8.047700e+00
 [6] 1.630918e+00 3.308280e-01 1.935870e-01 4.353904e-02 4.582035e-29
  • Here we see that the first PC explains about 52.2% of the variability in the data, the second PC explains about 15.3%, and so on. We can also plot this information.

Scree Plot

[Figure: scree plot of the proportion of variance explained by each principal component]
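
The code behind this plot is not shown in the original. A plot of this kind could be produced along the following lines; this is a sketch under assumptions, not the author's code, and it uses the built-in USArrests data as a stand-in so that it runs on its own.

```r
# Stand-in data (assumption): built-in USArrests, standardized before PCA
pca <- prcomp(USArrests, scale. = TRUE)
pve <- pca$sdev^2 / sum(pca$sdev^2)

# Two panels: per-component PVE (scree plot) and cumulative PVE
op <- par(mfrow = c(1, 2))
plot(pve, type = "b", ylim = c(0, 1),
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
plot(cumsum(pve), type = "b", ylim = c(0, 1),
     xlab = "Principal component",
     ylab = "Cumulative PVE")
par(op)

# Base R also offers a built-in variant that plots the component variances
screeplot(pca, type = "lines")
```

With the objects from the analysis above, the same calls would apply to pr.out and pve directly.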

Interpreting the Scree Plot

  • We can see that the most variability in the data is captured by the very first principal component
  • We see a decreasing trend in the proportion of variance explained by the remaining principal components
  • Principal components 7 through 10 explain the least variance in the data (each under 0.5%); the 10th explains essentially none, which points to an exact linear dependence among the variables
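
A component with essentially zero variance, like the 10th one here, typically appears when one variable is an exact linear combination of others. The toy example below illustrates the effect; the data and the relationship surface = wall + 2 * roof are made up for this sketch and are not taken from the dataset above.

```r
set.seed(1)  # reproducible toy data (hypothetical example)
n <- 100
toy <- data.frame(wall   = rnorm(n, 300, 40),
                  roof   = rnorm(n, 180, 45),
                  height = rnorm(n, 5, 1.5))

# Construct a column that is an exact linear combination of two others
toy$surface <- toy$wall + 2 * toy$roof

pca_toy <- prcomp(toy, scale. = TRUE)

# Variances of the components: the last one is zero up to rounding error
pca_toy$sdev^2

# Loadings of the last component: only the linearly dependent variables appear
round(pca_toy$rotation[, ncol(pca_toy$rotation)], 3)
```

The loadings of the last component single out the variables involved in the dependence, which is the same pattern PC10 shows for SurfaceArea, WallArea and RoofArea in the rotation matrix above.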

PCA Variable Factor Map (Quick Trick)

library(FactoMineR)
PCA(data)

[Figures: individuals factor map and variables factor map produced by PCA()]

**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 768 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"
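
The objects listed above can be inspected directly once the result of PCA() is stored. The sketch below is a minimal illustration on the built-in USArrests data (an assumption, used as a stand-in); res is a hypothetical name, and the requireNamespace() guard is only there so the code still runs where FactoMineR is not installed.

```r
# FactoMineR is not part of base R; skip quietly if it is unavailable
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # scale.unit = TRUE standardizes the variables;
  # graph = FALSE suppresses the two factor maps
  res <- FactoMineR::PCA(USArrests, scale.unit = TRUE, graph = FALSE)

  # Eigenvalues with percentage and cumulative percentage of variance
  print(res$eig)

  # Contribution (in %) of each variable to each retained dimension
  print(round(res$var$contrib, 2))

  # Coordinates of the first few individuals on the retained dimensions
  print(head(res$ind$coord))
}
```

The percentage column of $eig plays the same role as the PVE computed by hand from prcomp() earlier.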