Principal Components Analysis in R

We will use the Places Rated Almanac data (Boyer and Savageau), which rate 329 communities on nine criteria: Climate and Terrain, Housing, Health Care & Environment, Crime, Transportation, Education, The Arts, Recreation, and Economics.

Load the Data
library(rmarkdown)  # provides paged_table()
places <- read.table("D:/Stat 56/places.txt", header=FALSE, sep = '')
paged_table(places)
head(places)
   V1   V2   V3   V4   V5   V6   V7   V8   V9 V10
1 521 6200  237  923 4031 2757  996 1405 7633   1
2 575 8138 1656  886 4883 2438 5564 2632 4350   2
3 468 7339  618  970 2531 2560  237  859 5250   3
4 476 7908 1431  610 6883 3399 4655 1617 5864   4
5 659 8393 1853 1483 6558 3026 4496 2612 5727   5
6 520 5819  640  727 2444 2972  334 1018 5254   6

We will apply PCA to the nine continuous rating variables (columns V1 to V9). The tenth column, V10, simply numbers the communities in the rows shown above, so it is left out of the analysis. Notice that in the following code we apply a log transformation to the continuous variables and set both center and scale. to TRUE in the call to prcomp, which standardizes the variables before PCA is applied:

Log Transform
log.pl <- log(places[, 1:9])
paged_table(log.pl)
Apply PCA - scale. = TRUE is highly advisable, but the default is FALSE.
pl.pca <- prcomp(log.pl,
                 center = TRUE,
                 scale. = TRUE)
pl.pca
Standard deviations (1, .., p=9):
[1] 1.8159827 1.1016178 1.0514418 0.9525124 0.9277008 0.7497905 0.6955721
[8] 0.5639789 0.5011269

Rotation (n x k) = (9 x 9):
         PC1         PC2         PC3         PC4         PC5           PC6
V1 0.1579414  0.06862938 -0.79970997  0.37680952 -0.04104588  0.2166949681
V2 0.3844053  0.13920883 -0.07961647  0.19654301  0.57986793 -0.0822200812
V3 0.4099096 -0.37181203  0.01947537  0.11252206 -0.02956935 -0.5348756017
V4 0.2591017  0.47413246 -0.12846722 -0.04229962 -0.69217100 -0.1399009169
V5 0.3748890 -0.14148642  0.14106828 -0.43007675 -0.19141608  0.3238913974
V6 0.2743254 -0.45235526  0.24105584  0.45694297 -0.22474374  0.5265827320
V7 0.4738471 -0.10441020 -0.01102628 -0.14688130 -0.01193024 -0.3210570706
V8 0.3534118  0.29194243 -0.04181639 -0.40401889  0.30565371  0.3941387718
V9 0.1640135  0.54045312  0.50731026  0.47578009  0.03710776 -0.0009737383
          PC7        PC8         PC9
V1 -0.1513516 -0.3411282 -0.03009755
V2 -0.2751971  0.6061010  0.04226906
V3  0.1349750 -0.1500575 -0.59412763
V4  0.1095036  0.4201255 -0.05101188
V5 -0.6785670 -0.1188325 -0.13584327
V6  0.2620958  0.2111749  0.11012420
V7  0.1204986 -0.2598673  0.74672678
V8  0.5530938 -0.1377181 -0.22636544
V9 -0.1468669 -0.4147736 -0.04790278

Since the skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply a skewness-reducing transformation (here, the log) and to center and scale the variables before applying PCA.
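The effect of scaling can be checked on a small example. The sketch below uses simulated right-skewed data, not the places data; the data frame sim, its columns a, b, c, and the distribution parameters are made up for illustration. It checks that prcomp with center = TRUE and scale. = TRUE gives the same variances as an eigendecomposition of the correlation matrix, and that without scaling the first PC is driven by the variable with the largest variance.

```r
# Simulated stand-in data (hypothetical; NOT the places data):
# three right-skewed variables on very different scales.
set.seed(1)
sim <- data.frame(a = rlnorm(200, meanlog = 0, sdlog = 1),
                  b = rlnorm(200, meanlog = 5, sdlog = 1),
                  c = rlnorm(200, meanlog = 8, sdlog = 1))
log.sim <- log(sim)  # the log transform reduces the right skew

pca.scaled <- prcomp(log.sim, center = TRUE, scale. = TRUE)

# With scale. = TRUE the PC variances are the eigenvalues
# of the correlation matrix of the (transformed) variables.
eig <- eigen(cor(log.sim))$values
same.variances <- isTRUE(all.equal(pca.scaled$sdev^2, eig))
same.variances

# On the raw, unscaled data the first PC is dominated by the
# variable with the largest variance (column c in this simulation).
pca.raw <- prcomp(sim, center = TRUE, scale. = FALSE)
dominant <- names(which.max(abs(pca.raw$rotation[, 1])))
dominant
```

Because the standardized variables each have variance 1, the PC variances of pca.scaled add up to the number of variables.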

Plot Method
plot(pl.pca, type = "l")

The plot method draws a scree plot: the variance of each PC (y-axis) against the PC number (x-axis). This figure helps to decide how many PCs to retain for further analysis. Here there are only nine PCs, and the variance drops sharply after the first one.
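The same plot can be drawn by hand from the standard deviations printed earlier. The sketch below also applies the eigenvalue-greater-than-one rule, a common heuristic that is not part of the original analysis and is shown only as one possible way to read the plot; the names variances and n.keep are illustrative.

```r
# Standard deviations copied from the pl.pca output printed earlier.
sdev <- c(1.8159827, 1.1016178, 1.0514418, 0.9525124, 0.9277008,
          0.7497905, 0.6955721, 0.5639789, 0.5011269)
variances <- sdev^2  # the quantities on the y-axis of the scree plot

# Hand-drawn version of the scree plot.
plot(seq_along(variances), variances, type = "b",
     xlab = "Principal component", ylab = "Variance")
abline(h = 1, lty = 2)  # reference line: variance of one standardized variable

# Heuristic (an assumption, not the author's rule): keep the PCs
# whose variance exceeds 1.
n.keep <- sum(variances > 1)
n.keep
```

Other rules, such as looking for the elbow of the curve or fixing a target for the cumulative proportion of variance, may suggest a different number of components.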

Summary Method
summary(pl.pca)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.8160 1.1016 1.0514 0.9525 0.92770 0.74979 0.69557
Proportion of Variance 0.3664 0.1348 0.1228 0.1008 0.09563 0.06247 0.05376
Cumulative Proportion  0.3664 0.5013 0.6241 0.7249 0.82053 0.88300 0.93676
                           PC8    PC9
Standard deviation     0.56398 0.5011
Proportion of Variance 0.03534 0.0279
Cumulative Proportion  0.97210 1.0000

The summary method describes the importance of the PCs. The first row repeats the standard deviation of each PC. The second row shows the proportion of the total variance explained by each component, and the third row gives the cumulative proportion of explained variance.
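The three rows can be reproduced from the standard deviations alone. The following sketch recomputes them from the values printed by pl.pca; the 80% threshold used at the end is a hypothetical cut-off chosen only for illustration.

```r
# Standard deviations copied from the pl.pca output printed earlier.
sdev <- c(1.8159827, 1.1016178, 1.0514418, 0.9525124, 0.9277008,
          0.7497905, 0.6955721, 0.5639789, 0.5011269)

prop <- sdev^2 / sum(sdev^2)  # proportion of variance per PC
cum  <- cumsum(prop)          # cumulative proportion

importance <- rbind("Standard deviation"     = sdev,
                    "Proportion of Variance" = prop,
                    "Cumulative Proportion"  = cum)
colnames(importance) <- paste0("PC", seq_along(sdev))
round(importance, 4)

# Smallest number of PCs reaching a (hypothetical) 80% threshold.
threshold <- 0.80
k <- which(cum >= threshold)[1]
k
```

With this threshold the first five PCs are needed, which agrees with the cumulative proportions in the summary output (0.7249 after four PCs, 0.82053 after five).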

Predict PCs
predict(pl.pca, 
        newdata=tail(log.pl, 2))
           PC1       PC2       PC3        PC4        PC5        PC6        PC7
328 -0.3577229 -1.332915 -1.044542 -0.2574776 -0.6787995 -0.3418898  0.5519518
329 -2.9065986  1.253365 -1.345258  0.2585665 -0.4538613  0.1515577 -0.8514209
          PC8         PC9
328 0.3463322  0.43894223
329 1.1935587 -0.08967011

We can use the predict function to compute the PC scores of newly observed data. For illustration, pretend that the last two rows of the places data have just arrived and we want to see their PC scores.
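For a prcomp fit, predict centers and scales the new rows with the values stored in the fitted object and then multiplies by the rotation matrix. The sketch below checks this on simulated data; the data frame sim and its dimensions are made up, and it is not the places data.

```r
# Simulated stand-in data (hypothetical; NOT the places data).
set.seed(42)
sim <- as.data.frame(matrix(rlnorm(50 * 4), ncol = 4))
log.sim <- log(sim)

pca <- prcomp(log.sim, center = TRUE, scale. = TRUE)
new <- tail(log.sim, 2)  # pretend these two rows have just arrived

# Scores from predict().
by.predict <- predict(pca, newdata = new)

# The same scores by hand: standardize with the stored center and
# scale, then project onto the rotation (loadings) matrix.
by.hand <- scale(new, center = pca$center, scale = pca$scale) %*% pca$rotation

agree <- isTRUE(all.equal(by.predict, by.hand, check.attributes = FALSE))
agree

# Because these rows were part of the fit, the scores also match
# the last two rows of pca$x.
in.fit <- isTRUE(all.equal(by.predict, tail(pca$x, 2),
                           check.attributes = FALSE))
in.fit
```

For genuinely new observations only the first comparison applies; the new rows must have the same columns, on the same (log) scale, as the data used to fit the PCA.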

Biplot of the Principal Components
biplot(pl.pca)