DATA 621 blog 4

Step-by-step

Let’s conduct PCA on the mtcars dataset, step by step. At the end, we’ll compare our results to those given by the built-in function prcomp. This blog post builds on work that I did for DATA 609.

Subtract the mean from each column of data, so that each column’s mean is 0. Divide each column by its standard deviation, so that the standard deviation of each column is 1.

df <- mtcars
df <- scale(df, scale = TRUE)
summary(df)

##       mpg               cyl              disp               hp         
##  Min.   :-1.6079   Min.   :-1.225   Min.   :-1.2879   Min.   :-1.3810  
##  1st Qu.:-0.7741   1st Qu.:-1.225   1st Qu.:-0.8867   1st Qu.:-0.7320  
##  Median :-0.1478   Median :-0.105   Median :-0.2777   Median :-0.3455  
##  Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4495   3rd Qu.: 1.015   3rd Qu.: 0.7688   3rd Qu.: 0.4859  
##  Max.   : 2.2913   Max.   : 1.015   Max.   : 1.9468   Max.   : 2.7466  
##       drat               wt               qsec                vs        
##  Min.   :-1.5646   Min.   :-1.7418   Min.   :-1.87401   Min.   :-0.868  
##  1st Qu.:-0.9661   1st Qu.:-0.6500   1st Qu.:-0.53513   1st Qu.:-0.868  
##  Median : 0.1841   Median : 0.1101   Median :-0.07765   Median :-0.868  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000  
##  3rd Qu.: 0.6049   3rd Qu.: 0.4014   3rd Qu.: 0.58830   3rd Qu.: 1.116  
##  Max.   : 2.4939   Max.   : 2.2553   Max.   : 2.82675   Max.   : 1.116  
##        am               gear              carb        
##  Min.   :-0.8141   Min.   :-0.9318   Min.   :-1.1222  
##  1st Qu.:-0.8141   1st Qu.:-0.9318   1st Qu.:-0.5030  
##  Median :-0.8141   Median : 0.4236   Median :-0.5030  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.1899   3rd Qu.: 0.4236   3rd Qu.: 0.7352  
##  Max.   : 1.1899   Max.   : 1.7789   Max.   : 3.2117

Notice that the mean of each column is 0.

Compute the covariance matrix for the data.

df_cov <- cov(df)
head(df_cov)

##             mpg        cyl       disp         hp       drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.0000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.7124406  1.0000000
##             qsec         vs         am       gear       carb
## mpg   0.41868403  0.6640389  0.5998324  0.4802848 -0.5509251
## cyl  -0.59124207 -0.8108118 -0.5226070 -0.4926866  0.5269883
## disp -0.43369788 -0.7104159 -0.5912270 -0.5555692  0.3949769
## hp   -0.70822339 -0.7230967 -0.2432043 -0.1257043  0.7498125
## drat  0.09120476  0.4402785  0.7127111  0.6996101 -0.0907898
## wt   -0.17471588 -0.5549157 -0.6924953 -0.5832870  0.4276059

The principal components we construct will be derived from this covariance matrix.

Find the eigenvectors and eigenvalues of the covariance matrix.

df_eigens <- eigen(df_cov)
head(df_eigens)

## $values
##  [1] 6.60840025 2.65046789 0.62719727 0.26959744 0.22345110 0.21159612
##  [7] 0.13526199 0.12290143 0.07704665 0.05203544 0.02204441
## 
## $vectors
##             [,1]        [,2]        [,3]         [,4]        [,5]        [,6]
##  [1,]  0.3625305 -0.01612440 -0.22574419  0.022540255 -0.10284468  0.10879743
##  [2,] -0.3739160 -0.04374371 -0.17531118  0.002591838 -0.05848381 -0.16855369
##  [3,] -0.3681852  0.04932413 -0.06148414 -0.256607885 -0.39399530  0.33616451
##  [4,] -0.3300569 -0.24878402  0.14001476  0.067676157 -0.54004744 -0.07143563
##  [5,]  0.2941514 -0.27469408  0.16118879 -0.854828743 -0.07732727 -0.24449705
##  [6,] -0.3461033  0.14303825  0.34181851 -0.245899314  0.07502912  0.46493964
##  [7,]  0.2004563  0.46337482  0.40316904 -0.068076532  0.16466591  0.33048032
##  [8,]  0.3065113  0.23164699  0.42881517  0.214848616 -0.59953955 -0.19401702
##  [9,]  0.2349429 -0.42941765 -0.20576657  0.030462908 -0.08978128  0.57081745
## [10,]  0.2069162 -0.46234863  0.28977993  0.264690521 -0.04832960  0.24356284
## [11,] -0.2140177 -0.41357106  0.52854459  0.126789179  0.36131875 -0.18352168
##               [,7]         [,8]         [,9]       [,10]        [,11]
##  [1,]  0.367723810 -0.754091423  0.235701617  0.13928524  0.124895628
##  [2,]  0.057277736 -0.230824925  0.054035270 -0.84641949  0.140695441
##  [3,]  0.214303077  0.001142134  0.198427848  0.04937979 -0.660606481
##  [4,] -0.001495989 -0.222358441 -0.575830072  0.24782351  0.256492062
##  [5,]  0.021119857  0.032193501 -0.046901228 -0.10149369  0.039530246
##  [6,] -0.020668302 -0.008571929  0.359498251  0.09439426  0.567448697
##  [7,]  0.050010522 -0.231840021 -0.528377185 -0.27067295 -0.181361780
##  [8,] -0.265780836  0.025935128  0.358582624 -0.15903909 -0.008414634
##  [9,] -0.587305101 -0.059746952 -0.047403982 -0.17778541 -0.029823537
## [10,]  0.605097617  0.336150240 -0.001735039 -0.21382515  0.053507085
## [11,] -0.174603192 -0.395629107  0.170640677  0.07225950 -0.319594676

The eigenvectors of the covariance matrix are the principal components. The function eigen() arranges the vectors in order from the one associated with the greatest eigenvalue, to the one associated with the least. This means that the first few eigenvectors are the principal components that account for a large fraction of the variability in the original data. They define the best surface on which to “cast the shadow of” (that is, project) the original data.

The first few principal components are:

df_eigens$vectors[,c(1:3)]

##             [,1]        [,2]        [,3]
##  [1,]  0.3625305 -0.01612440 -0.22574419
##  [2,] -0.3739160 -0.04374371 -0.17531118
##  [3,] -0.3681852  0.04932413 -0.06148414
##  [4,] -0.3300569 -0.24878402  0.14001476
##  [5,]  0.2941514 -0.27469408  0.16118879
##  [6,] -0.3461033  0.14303825  0.34181851
##  [7,]  0.2004563  0.46337482  0.40316904
##  [8,]  0.3065113  0.23164699  0.42881517
##  [9,]  0.2349429 -0.42941765 -0.20576657
## [10,]  0.2069162 -0.46234863  0.28977993
## [11,] -0.2140177 -0.41357106  0.52854459

Their associated eigenvalues are:

df_eigens$values[1:3]

## [1] 6.6084003 2.6504679 0.6271973

We can compare these results to those given by the prcomp function:

princomps <- prcomp(df)
princomps$rotation[,c(1:3)]

##             PC1         PC2         PC3
## mpg  -0.3625305  0.01612440 -0.22574419
## cyl   0.3739160  0.04374371 -0.17531118
## disp  0.3681852 -0.04932413 -0.06148414
## hp    0.3300569  0.24878402  0.14001476
## drat -0.2941514  0.27469408  0.16118879
## wt    0.3461033 -0.14303825  0.34181851
## qsec -0.2004563 -0.46337482  0.40316904
## vs   -0.3065113 -0.23164699  0.42881517
## am   -0.2349429  0.42941765 -0.20576657
## gear -0.2069162  0.46234863  0.28977993
## carb  0.2140177  0.41357106  0.52854459

The computed results match those given by the built-in function except for, in some cases, a sign inversion.

DATA 621 blog 4

Daniel Moscoe

11/27/2021

Principal Components Analysis

Step-by-step