Principal Component Analysis (PCA)

Ashish Dalal
30 November, 2015

Crux PCA Ideas

Dimensionality Reduction Technique (Variable Reduction Strategy)
A tool for data visualization and preprocessing
Transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components
Captures as much information in the original variables as possible
Gives less noisy results in the end
Essence: When presented with a large set of correlated variables, principal components allow us to summarize the dataset with a smaller number of representative variables that collectively explain most of the variability in the original set

Principal Components

Consider a case of visualizing n observations with p features
Scatterplots of the data in 2D are useful for capturing correlation
However, as p grows large, the number of pairwise scatterplots becomes unmanageable and we need alternatives
Specifically, we want to find a low-dimensional representation of the data that captures as much information as possible
PCA seeks a small number of dimensions that are as interesting as possible, where "interesting" is measured by how much the observations vary along each dimension
Each of the dimensions found by PCA is a linear combination of the p features, so this is technically not a form of feature selection.
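
To make the "linear combination" point concrete, here is a minimal sketch. It uses R's built-in USArrests data set purely as an assumption for illustration (it is not the data analysed later), and the names X, pr.demo and scores.manual are hypothetical. It checks that the scores returned by prcomp() are just the standardized features multiplied by the loadings.

```r
# Illustration only: USArrests is a built-in data set, not the one used in this document
X <- scale(USArrests)                 # center and scale each feature
pr.demo <- prcomp(X)                  # PCA on the standardized data

# Scores computed by hand: each PC is a linear combination of the p features,
# with the weights given by the corresponding column of the rotation matrix
scores.manual <- X %*% pr.demo$rotation

# The hand-computed scores agree with the ones prcomp() returns
all.equal(unname(scores.manual), unname(pr.demo$x))
```

Because every feature can receive a non-zero weight, no original variable is dropped outright, which is why this is not feature selection.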

Pre-requisite: Before computing PCA, the variables must be centered to have mean zero, and they should also be scaled to unit standard deviation unless they are all measured in the same units
For an explanation on how principal components are found, please refer here
Once PCA is done, we measure how much information is lost by utilizing principal components. To do this we can compute the proportion of variance explained (PVE) by each principal component
It is generally best explained as a cumulative plot, such that we can visualize the PVE for each component and the total variance explained
Once we have this measurement, we can judge whether the principal components explain enough of the variance to provide an accurate summary

When to say enough?

In general, we would like to use the smallest number of principal components required to get a good understanding of the data
However, there is no single threshold that answers this question
Arguably, the best way to do this is to visualize the data in a scree plot
It is simply a plot of the PVE of each principal component, in order
We examine the elbow effect: look for the point in the scree plot where the PVE drops off, such that additional principal components don’t add a significant amount of explained variance
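
Alongside the visual elbow check, the cumulative PVE can be inspected numerically. The sketch below is only a rough heuristic, not a rule: it again assumes the built-in USArrests data for illustration, and the 90% target and the names pve.demo, cum.pve, target and n.keep are hypothetical choices.

```r
# Illustration only: built-in USArrests data, not the data set analysed below
pr.demo <- prcomp(USArrests, scale. = TRUE)

pve.demo <- pr.demo$sdev^2 / sum(pr.demo$sdev^2)   # PVE of each component
cum.pve  <- cumsum(pve.demo)                       # cumulative PVE

target <- 0.90                                     # hypothetical cut-off, chosen for illustration
n.keep <- which(cum.pve >= target)[1]              # smallest number of PCs reaching the target
n.keep
```

A different target would give a different answer, which is why such a cut-off is best read together with the scree plot rather than on its own.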

PCA in action

#clearing the workspace prior to start
rm(list = ls())

#loading up libraries
library(stats)
library(graphics)
library(knitr)   #provides kable(), used below to print tables

#reading the dataset
data <- read.csv("~/Desktop/data.csv")

#checking the dimension of data
dim(data)

[1] 768  10

#getting brief data description
names(data)

 [1] "Compactness"    "SurfaceArea"    "WallArea"       "RoofArea"      
 [5] "OverallHeight"  "Orientation"    "GlazingArea"    "Gadistribution"
 [9] "HeatingLoad"    "CoolingLoad"

kable(summary(data[,1:7]))

Compactness	SurfaceArea	WallArea	RoofArea	OverallHeight	Orientation	GlazingArea
Min. :0.6200	Min. :514.5	Min. :245.0	Min. :110.2	Min. :3.50	Min. :2.00	Min. :0.0000
1st Qu.:0.6825	1st Qu.:606.4	1st Qu.:294.0	1st Qu.:140.9	1st Qu.:3.50	1st Qu.:2.75	1st Qu.:0.1000
Median :0.7500	Median :673.8	Median :318.5	Median :183.8	Median :5.25	Median :3.50	Median :0.2500
Mean :0.7642	Mean :671.7	Mean :318.5	Mean :176.6	Mean :5.25	Mean :3.50	Mean :0.2344
3rd Qu.:0.8300	3rd Qu.:741.1	3rd Qu.:343.0	3rd Qu.:220.5	3rd Qu.:7.00	3rd Qu.:4.25	3rd Qu.:0.4000
Max. :0.9800	Max. :808.5	Max. :416.5	Max. :220.5	Max. :7.00	Max. :5.00	Max. :0.4000

kable(summary(data[,8:10]))

Gadistribution	HeatingLoad	CoolingLoad
Min. :0.000	Min. : 6.01	Min. :10.90
1st Qu.:1.750	1st Qu.:12.99	1st Qu.:15.62
Median :3.000	Median :18.95	Median :22.08
Mean :2.812	Mean :22.31	Mean :24.59
3rd Qu.:4.000	3rd Qu.:31.67	3rd Qu.:33.13
Max. :5.000	Max. :43.10	Max. :48.03

Conclusion: We can see the data have very different descriptive statistics. Further, the variables are measured on totally different scales
Lesson: If we don't standardize the data now, trouble lies ahead
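
To see what kind of trouble, here is a minimal sketch comparing PCA with and without scaling. It assumes the built-in USArrests data for illustration (not the data set above), where one variable, Assault, has a much larger variance than the others; pr.raw and pr.scaled are hypothetical names.

```r
# Illustration only: built-in USArrests data, not the data set used here
pr.raw    <- prcomp(USArrests, scale. = FALSE)  # variables left on their original scales
pr.scaled <- prcomp(USArrests, scale. = TRUE)   # variables scaled to unit standard deviation

# Without scaling, PC1 is dominated by Assault, the variable with the largest variance
round(pr.raw$rotation[, 1], 3)

# With scaling, the PC1 loadings are spread across the variables
round(pr.scaled$rotation[, 1], 3)
```

Without scaling, the first component mostly reflects the units a variable happens to be measured in, rather than any shared structure in the data.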

Perform PCA using the prcomp() function. The rotation matrix it returns provides the principal component loadings

pr.out <- prcomp(data, scale. = TRUE)
kable(pr.out$rotation[,1:5])

	PC1	PC2	PC3	PC4	PC5
Compactness	-0.3782382	0.3780101	-0.0961734	0.0004256	-0.0056578
SurfaceArea	0.3876436	-0.3621610	0.0909008	-0.0004457	0.0038601
WallArea	-0.1062748	-0.6842761	0.3728286	-0.0074353	-0.1388195
RoofArea	0.4293324	-0.0226808	-0.0914195	0.0031563	0.0708081
OverallHeight	-0.4291436	0.0019210	0.0925547	-0.0028939	-0.0627936
Orientation	-0.0011190	-0.0049823	0.0011794	0.9998752	-0.0038127
GlazingArea	-0.0463028	-0.3041805	-0.6465749	-0.0000632	0.6363451
Gadistribution	-0.0154822	-0.1882539	-0.6382946	-0.0029063	-0.7459251
HeatingLoad	-0.4021382	-0.2718396	-0.0304048	-0.0063587	0.0774642
CoolingLoad	-0.4034644	-0.2351787	0.0125631	0.0112517	0.0661125

kable(pr.out$rotation[,5:10])

	PC5	PC6	PC7	PC8	PC9	PC10
Compactness	-0.0056578	-0.3088552	0.3809042	-0.0194190	0.6811065	0.0000000
SurfaceArea	0.0038601	0.1313702	-0.0320973	-0.0021671	0.5065943	0.6598204
WallArea	-0.1388195	-0.4789628	0.1286283	0.0476003	0.0844790	-0.3267898
RoofArea	0.0708081	0.3594229	-0.0934213	-0.0251022	0.4531998	-0.6766428
OverallHeight	-0.0627936	0.0113953	-0.8504495	0.1171563	0.2572829	0.0000000
Orientation	-0.0038127	-0.0083547	-0.0041101	-0.0109894	-0.0001494	0.0000000
GlazingArea	0.6363451	-0.2607372	-0.0899318	0.0786798	-0.0086035	0.0000000
Gadistribution	-0.7459251	-0.0075680	-0.0045351	0.0202427	-0.0007590	0.0000000
HeatingLoad	0.0774642	0.3848062	0.0982822	-0.7741000	0.0213652	0.0000000
CoolingLoad	0.0661125	0.5589665	0.2959053	0.6140397	0.0142768	0.0000000

We see that there are 10 principal components in total, one for each variable
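
The PC10 column above has non-zero loadings only on SurfaceArea, WallArea and RoofArea. That pattern is what one would expect if those variables were exactly linearly related, in which case the last component carries essentially no variance. The sketch below reproduces the effect on simulated data; the variables a, b and c are hypothetical and are not taken from the building data.

```r
# Illustration only: simulated data in which one column is an exact
# linear combination of the other two
set.seed(1)
a <- rnorm(100)
b <- rnorm(100)
sim <- data.frame(a = a, b = b, c = a + 2 * b)

pr.sim <- prcomp(sim, scale. = TRUE)

# Variance of each component: the last one is numerically zero
pr.sim$sdev^2

# Loadings of the last component: the weights describe the linear relationship
pr.sim$rotation[, 3]
```

In such a case the data effectively have one fewer dimension than the number of variables suggests.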

biplot(pr.out, scale=0)

[Figure: biplot of the observations and variable loadings on PC1 and PC2]

Interpretation of Biplot

Start by looking at the axes: PC1 is on the x-axis and PC2 is on the y-axis. The arrows show how each variable loads on these two components
We see that two of the variables, namely OverallHeight and RoofArea, lie almost entirely along the first principal component, whereas the rest of the variables vary along both principal components
The row numbers shown in black indicate where each observation falls along the PC directions. For example, the 48th observation points to a large roof area, whereas the 768th observation points to a large surface area, large wall area, large glazing area and high GA distribution

Checking out contribution of principal components

The $sdev component holds the standard deviation of each principal component; the variance explained by each component can be computed by squaring these:

pr.var <- pr.out$sdev^2
pr.var

 [1] 5.222885e+00 1.533386e+00 1.218894e+00 1.000177e+00 8.047700e-01
 [6] 1.630918e-01 3.308280e-02 1.935870e-02 4.353904e-03 4.582035e-30

Then, to compute the proportion of variance explained by each component, we simply divide its variance by the total variance

pve <- pr.var / sum(pr.var)
pve * 100

 [1] 5.222885e+01 1.533386e+01 1.218894e+01 1.000177e+01 8.047700e+00
 [6] 1.630918e+00 3.308280e-01 1.935870e-01 4.353904e-02 4.582035e-29

Here we see that the first PC explains about 52.2% of the variability in the data, the second PC explains about 15.3%, and so on. We can also plot this information.

Scree Plot

[Figure: scree plot of the proportion of variance explained by each principal component]
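
The code that drew this plot is not shown. A sketch that would draw a comparable pair of plots might look like the following; to stay self-contained it assumes the built-in USArrests data, and the names pr.demo and pve.demo are hypothetical. With this document's objects, the pve vector computed above would be plotted instead.

```r
# Illustration only: built-in USArrests data; with this document's objects, plot `pve` instead
pr.demo  <- prcomp(USArrests, scale. = TRUE)
pve.demo <- pr.demo$sdev^2 / sum(pr.demo$sdev^2)

# Two panels side by side: PVE per component and cumulative PVE
par(mfrow = c(1, 2))
plot(pve.demo, type = "b", ylim = c(0, 1),
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained")
plot(cumsum(pve.demo), type = "b", ylim = c(0, 1),
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained")
```

The left panel is the scree plot proper; the right panel makes it easier to read off the total variance explained by the first few components.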

Interpreting the Scree Plot

We can see that the largest share of the variability in the data is captured by the very first principal component
We see a decreasing trend in the proportion of variance explained by the rest of the principal components
Principal components 7 to 10 explain the least variance in the data

PCA Variable Factor Map (Quick Trick)

library(FactoMineR)
PCA(data)

[Figure: variables factor map produced by PCA()]

**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 768 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"