Principal Components Analysis

Principal components analysis is a technique for reducing the dimensionality of a dataset. Reducing dimensionality would be easy if we didn’t have to worry about the explanatory power of our model– just drop variables! But how can you reduce dimensionality while preserving within the remaining explanatory variables as much information as possible about the response? PCA is one strategy for doing this.

One everyday example of dimensionality reduction is the process of casting a shadow onto a surface. The shadow retains some information about the shape of the object, but exactly how much information is retained by the shadow depends on the positions of the light source, the object, and the surface on which the shadow is cast. The shadow of a person at noon might look like a puddle beneath their feet. But the shadow of that same person at sunset might be recognizable as a human, albeit elongated across the ground to the east.

PCA seeks to position a surface just right so that the shadow cast by high-dimensional data retains as much information as possible. It does this by identifying the dimensions along which the data vary the greatest– and incorporating those dimensions into the lower-dimensional surface where the reduced data will reside.

Step-by-step

Let’s conduct PCA on the mtcars dataset, step by step. At the end, we’ll compare our results to those given by the built-in function prcomp. This blog post builds on work that I did for DATA 609.

  1. Subtract the mean from each column of data, so that each column’s mean is 0. Divide each column by its standard deviation, so that the standard deviation of each column is 1.
df <- mtcars
df <- scale(df, scale = TRUE)
summary(df)
##       mpg               cyl              disp               hp         
##  Min.   :-1.6079   Min.   :-1.225   Min.   :-1.2879   Min.   :-1.3810  
##  1st Qu.:-0.7741   1st Qu.:-1.225   1st Qu.:-0.8867   1st Qu.:-0.7320  
##  Median :-0.1478   Median :-0.105   Median :-0.2777   Median :-0.3455  
##  Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4495   3rd Qu.: 1.015   3rd Qu.: 0.7688   3rd Qu.: 0.4859  
##  Max.   : 2.2913   Max.   : 1.015   Max.   : 1.9468   Max.   : 2.7466  
##       drat               wt               qsec                vs        
##  Min.   :-1.5646   Min.   :-1.7418   Min.   :-1.87401   Min.   :-0.868  
##  1st Qu.:-0.9661   1st Qu.:-0.6500   1st Qu.:-0.53513   1st Qu.:-0.868  
##  Median : 0.1841   Median : 0.1101   Median :-0.07765   Median :-0.868  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000  
##  3rd Qu.: 0.6049   3rd Qu.: 0.4014   3rd Qu.: 0.58830   3rd Qu.: 1.116  
##  Max.   : 2.4939   Max.   : 2.2553   Max.   : 2.82675   Max.   : 1.116  
##        am               gear              carb        
##  Min.   :-0.8141   Min.   :-0.9318   Min.   :-1.1222  
##  1st Qu.:-0.8141   1st Qu.:-0.9318   1st Qu.:-0.5030  
##  Median :-0.8141   Median : 0.4236   Median :-0.5030  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.1899   3rd Qu.: 0.4236   3rd Qu.: 0.7352  
##  Max.   : 1.1899   Max.   : 1.7789   Max.   : 3.2117

Notice that the mean of each column is 0.

  1. Compute the covariance matrix for the data.
df_cov <- cov(df)
head(df_cov)
##             mpg        cyl       disp         hp       drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.0000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.7124406  1.0000000
##             qsec         vs         am       gear       carb
## mpg   0.41868403  0.6640389  0.5998324  0.4802848 -0.5509251
## cyl  -0.59124207 -0.8108118 -0.5226070 -0.4926866  0.5269883
## disp -0.43369788 -0.7104159 -0.5912270 -0.5555692  0.3949769
## hp   -0.70822339 -0.7230967 -0.2432043 -0.1257043  0.7498125
## drat  0.09120476  0.4402785  0.7127111  0.6996101 -0.0907898
## wt   -0.17471588 -0.5549157 -0.6924953 -0.5832870  0.4276059

The principal components we construct will be derived from this covariance matrix.

  1. Find the eigenvectors and eigenvalues of the covariance matrix.
df_eigens <- eigen(df_cov)
head(df_eigens)
## $values
##  [1] 6.60840025 2.65046789 0.62719727 0.26959744 0.22345110 0.21159612
##  [7] 0.13526199 0.12290143 0.07704665 0.05203544 0.02204441
## 
## $vectors
##             [,1]        [,2]        [,3]         [,4]        [,5]        [,6]
##  [1,]  0.3625305 -0.01612440 -0.22574419  0.022540255 -0.10284468  0.10879743
##  [2,] -0.3739160 -0.04374371 -0.17531118  0.002591838 -0.05848381 -0.16855369
##  [3,] -0.3681852  0.04932413 -0.06148414 -0.256607885 -0.39399530  0.33616451
##  [4,] -0.3300569 -0.24878402  0.14001476  0.067676157 -0.54004744 -0.07143563
##  [5,]  0.2941514 -0.27469408  0.16118879 -0.854828743 -0.07732727 -0.24449705
##  [6,] -0.3461033  0.14303825  0.34181851 -0.245899314  0.07502912  0.46493964
##  [7,]  0.2004563  0.46337482  0.40316904 -0.068076532  0.16466591  0.33048032
##  [8,]  0.3065113  0.23164699  0.42881517  0.214848616 -0.59953955 -0.19401702
##  [9,]  0.2349429 -0.42941765 -0.20576657  0.030462908 -0.08978128  0.57081745
## [10,]  0.2069162 -0.46234863  0.28977993  0.264690521 -0.04832960  0.24356284
## [11,] -0.2140177 -0.41357106  0.52854459  0.126789179  0.36131875 -0.18352168
##               [,7]         [,8]         [,9]       [,10]        [,11]
##  [1,]  0.367723810 -0.754091423  0.235701617  0.13928524  0.124895628
##  [2,]  0.057277736 -0.230824925  0.054035270 -0.84641949  0.140695441
##  [3,]  0.214303077  0.001142134  0.198427848  0.04937979 -0.660606481
##  [4,] -0.001495989 -0.222358441 -0.575830072  0.24782351  0.256492062
##  [5,]  0.021119857  0.032193501 -0.046901228 -0.10149369  0.039530246
##  [6,] -0.020668302 -0.008571929  0.359498251  0.09439426  0.567448697
##  [7,]  0.050010522 -0.231840021 -0.528377185 -0.27067295 -0.181361780
##  [8,] -0.265780836  0.025935128  0.358582624 -0.15903909 -0.008414634
##  [9,] -0.587305101 -0.059746952 -0.047403982 -0.17778541 -0.029823537
## [10,]  0.605097617  0.336150240 -0.001735039 -0.21382515  0.053507085
## [11,] -0.174603192 -0.395629107  0.170640677  0.07225950 -0.319594676

The eigenvectors of the covariance matrix are the principal components. The function eigen() arranges the vectors in order from the one associated with the greatest eigenvalue, to the one associated with the least. This means that the first few eigenvectors are the principal components that account for a large fraction of the variability in the original data. They define the best surface on which to “cast the shadow of” (that is, project) the original data.

The first few principal components are:

df_eigens$vectors[,c(1:3)]
##             [,1]        [,2]        [,3]
##  [1,]  0.3625305 -0.01612440 -0.22574419
##  [2,] -0.3739160 -0.04374371 -0.17531118
##  [3,] -0.3681852  0.04932413 -0.06148414
##  [4,] -0.3300569 -0.24878402  0.14001476
##  [5,]  0.2941514 -0.27469408  0.16118879
##  [6,] -0.3461033  0.14303825  0.34181851
##  [7,]  0.2004563  0.46337482  0.40316904
##  [8,]  0.3065113  0.23164699  0.42881517
##  [9,]  0.2349429 -0.42941765 -0.20576657
## [10,]  0.2069162 -0.46234863  0.28977993
## [11,] -0.2140177 -0.41357106  0.52854459

Their associated eigenvalues are:

df_eigens$values[1:3]
## [1] 6.6084003 2.6504679 0.6271973

We can compare these results to those given by the prcomp function:

princomps <- prcomp(df)
princomps$rotation[,c(1:3)]
##             PC1         PC2         PC3
## mpg  -0.3625305  0.01612440 -0.22574419
## cyl   0.3739160  0.04374371 -0.17531118
## disp  0.3681852 -0.04932413 -0.06148414
## hp    0.3300569  0.24878402  0.14001476
## drat -0.2941514  0.27469408  0.16118879
## wt    0.3461033 -0.14303825  0.34181851
## qsec -0.2004563 -0.46337482  0.40316904
## vs   -0.3065113 -0.23164699  0.42881517
## am   -0.2349429  0.42941765 -0.20576657
## gear -0.2069162  0.46234863  0.28977993
## carb  0.2140177  0.41357106  0.52854459

The computed results match those given by the built-in function except for, in some cases, a sign inversion.