Principal components analysis is a technique for reducing the dimensionality of a dataset. Reducing dimensionality would be easy if we didn’t have to worry about the explanatory power of our model– just drop variables! But how can you reduce dimensionality while preserving within the remaining explanatory variables as much information as possible about the response? PCA is one strategy for doing this.
One everyday example of dimensionality reduction is the process of casting a shadow onto a surface. The shadow retains some information about the shape of the object, but exactly how much information is retained by the shadow depends on the positions of the light source, the object, and the surface on which the shadow is cast. The shadow of a person at noon might look like a puddle beneath their feet. But the shadow of that same person at sunset might be recognizable as a human, albeit elongated across the ground to the east.
PCA seeks to position a surface just right so that the shadow cast by high-dimensional data retains as much information as possible. It does this by identifying the dimensions along which the data vary the greatest– and incorporating those dimensions into the lower-dimensional surface where the reduced data will reside.
Let’s conduct PCA on the mtcars dataset, step by step. At the end, we’ll compare our results to those given by the built-in function prcomp. This blog post builds on work that I did for DATA 609.
df <- mtcars
df <- scale(df, scale = TRUE)
summary(df)
## mpg cyl disp hp
## Min. :-1.6079 Min. :-1.225 Min. :-1.2879 Min. :-1.3810
## 1st Qu.:-0.7741 1st Qu.:-1.225 1st Qu.:-0.8867 1st Qu.:-0.7320
## Median :-0.1478 Median :-0.105 Median :-0.2777 Median :-0.3455
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4495 3rd Qu.: 1.015 3rd Qu.: 0.7688 3rd Qu.: 0.4859
## Max. : 2.2913 Max. : 1.015 Max. : 1.9468 Max. : 2.7466
## drat wt qsec vs
## Min. :-1.5646 Min. :-1.7418 Min. :-1.87401 Min. :-0.868
## 1st Qu.:-0.9661 1st Qu.:-0.6500 1st Qu.:-0.53513 1st Qu.:-0.868
## Median : 0.1841 Median : 0.1101 Median :-0.07765 Median :-0.868
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.: 0.6049 3rd Qu.: 0.4014 3rd Qu.: 0.58830 3rd Qu.: 1.116
## Max. : 2.4939 Max. : 2.2553 Max. : 2.82675 Max. : 1.116
## am gear carb
## Min. :-0.8141 Min. :-0.9318 Min. :-1.1222
## 1st Qu.:-0.8141 1st Qu.:-0.9318 1st Qu.:-0.5030
## Median :-0.8141 Median : 0.4236 Median :-0.5030
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1899 3rd Qu.: 0.4236 3rd Qu.: 0.7352
## Max. : 1.1899 Max. : 1.7789 Max. : 3.2117
Notice that the mean of each column is 0.
df_cov <- cov(df)
head(df_cov)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.6999381 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7102139 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4487591 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.0000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406 1.0000000
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.5998324 0.4802848 -0.5509251
## cyl -0.59124207 -0.8108118 -0.5226070 -0.4926866 0.5269883
## disp -0.43369788 -0.7104159 -0.5912270 -0.5555692 0.3949769
## hp -0.70822339 -0.7230967 -0.2432043 -0.1257043 0.7498125
## drat 0.09120476 0.4402785 0.7127111 0.6996101 -0.0907898
## wt -0.17471588 -0.5549157 -0.6924953 -0.5832870 0.4276059
The principal components we construct will be derived from this covariance matrix.
df_eigens <- eigen(df_cov)
head(df_eigens)
## $values
## [1] 6.60840025 2.65046789 0.62719727 0.26959744 0.22345110 0.21159612
## [7] 0.13526199 0.12290143 0.07704665 0.05203544 0.02204441
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.3625305 -0.01612440 -0.22574419 0.022540255 -0.10284468 0.10879743
## [2,] -0.3739160 -0.04374371 -0.17531118 0.002591838 -0.05848381 -0.16855369
## [3,] -0.3681852 0.04932413 -0.06148414 -0.256607885 -0.39399530 0.33616451
## [4,] -0.3300569 -0.24878402 0.14001476 0.067676157 -0.54004744 -0.07143563
## [5,] 0.2941514 -0.27469408 0.16118879 -0.854828743 -0.07732727 -0.24449705
## [6,] -0.3461033 0.14303825 0.34181851 -0.245899314 0.07502912 0.46493964
## [7,] 0.2004563 0.46337482 0.40316904 -0.068076532 0.16466591 0.33048032
## [8,] 0.3065113 0.23164699 0.42881517 0.214848616 -0.59953955 -0.19401702
## [9,] 0.2349429 -0.42941765 -0.20576657 0.030462908 -0.08978128 0.57081745
## [10,] 0.2069162 -0.46234863 0.28977993 0.264690521 -0.04832960 0.24356284
## [11,] -0.2140177 -0.41357106 0.52854459 0.126789179 0.36131875 -0.18352168
## [,7] [,8] [,9] [,10] [,11]
## [1,] 0.367723810 -0.754091423 0.235701617 0.13928524 0.124895628
## [2,] 0.057277736 -0.230824925 0.054035270 -0.84641949 0.140695441
## [3,] 0.214303077 0.001142134 0.198427848 0.04937979 -0.660606481
## [4,] -0.001495989 -0.222358441 -0.575830072 0.24782351 0.256492062
## [5,] 0.021119857 0.032193501 -0.046901228 -0.10149369 0.039530246
## [6,] -0.020668302 -0.008571929 0.359498251 0.09439426 0.567448697
## [7,] 0.050010522 -0.231840021 -0.528377185 -0.27067295 -0.181361780
## [8,] -0.265780836 0.025935128 0.358582624 -0.15903909 -0.008414634
## [9,] -0.587305101 -0.059746952 -0.047403982 -0.17778541 -0.029823537
## [10,] 0.605097617 0.336150240 -0.001735039 -0.21382515 0.053507085
## [11,] -0.174603192 -0.395629107 0.170640677 0.07225950 -0.319594676
The eigenvectors of the covariance matrix are the principal components. The function eigen() arranges the vectors in order from the one associated with the greatest eigenvalue, to the one associated with the least. This means that the first few eigenvectors are the principal components that account for a large fraction of the variability in the original data. They define the best surface on which to “cast the shadow of” (that is, project) the original data.
The first few principal components are:
df_eigens$vectors[,c(1:3)]
## [,1] [,2] [,3]
## [1,] 0.3625305 -0.01612440 -0.22574419
## [2,] -0.3739160 -0.04374371 -0.17531118
## [3,] -0.3681852 0.04932413 -0.06148414
## [4,] -0.3300569 -0.24878402 0.14001476
## [5,] 0.2941514 -0.27469408 0.16118879
## [6,] -0.3461033 0.14303825 0.34181851
## [7,] 0.2004563 0.46337482 0.40316904
## [8,] 0.3065113 0.23164699 0.42881517
## [9,] 0.2349429 -0.42941765 -0.20576657
## [10,] 0.2069162 -0.46234863 0.28977993
## [11,] -0.2140177 -0.41357106 0.52854459
Their associated eigenvalues are:
df_eigens$values[1:3]
## [1] 6.6084003 2.6504679 0.6271973
We can compare these results to those given by the prcomp function:
princomps <- prcomp(df)
princomps$rotation[,c(1:3)]
## PC1 PC2 PC3
## mpg -0.3625305 0.01612440 -0.22574419
## cyl 0.3739160 0.04374371 -0.17531118
## disp 0.3681852 -0.04932413 -0.06148414
## hp 0.3300569 0.24878402 0.14001476
## drat -0.2941514 0.27469408 0.16118879
## wt 0.3461033 -0.14303825 0.34181851
## qsec -0.2004563 -0.46337482 0.40316904
## vs -0.3065113 -0.23164699 0.42881517
## am -0.2349429 0.42941765 -0.20576657
## gear -0.2069162 0.46234863 0.28977993
## carb 0.2140177 0.41357106 0.52854459
The computed results match those given by the built-in function except for, in some cases, a sign inversion.