We will use the Places Rated Almanac data (Boyer and Savageau), which rate 329 communities on nine criteria: Climate and Terrain, Housing, Health Care & Environment, Crime, Transportation, Education, The Arts, Recreation and Economics.
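library(rmarkdown)  # provides paged_table()
# The file has no header row, so the columns are named V1 to V10
places <- read.table("D:/Stat 56/places.txt", header = FALSE, sep = "")
paged_table(places)
head(places)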
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 521 6200 237 923 4031 2757 996 1405 7633 1
2 575 8138 1656 886 4883 2438 5564 2632 4350 2
3 468 7339 618 970 2531 2560 237 859 5250 3
4 476 7908 1431 610 6883 3399 4655 1617 5864 4
5 659 8393 1853 1483 6558 3026 4496 2612 5727 5
6 520 5819 640 727 2444 2972 334 1018 5254 6
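The default names V1 to V10 are hard to read in later output. A minimal sketch of renaming them follows, rebuilding only the six rows printed above so that it runs without the data file. The short names are hypothetical, and the mapping assumes that the columns follow the order of the nine criteria listed earlier and that V10 is a community identifier.

```r
# First six rows, copied from the head(places) output above
places6 <- data.frame(
  V1 = c(521, 575, 468, 476, 659, 520),
  V2 = c(6200, 8138, 7339, 7908, 8393, 5819),
  V3 = c(237, 1656, 618, 1431, 1853, 640),
  V4 = c(923, 886, 970, 610, 1483, 727),
  V5 = c(4031, 4883, 2531, 6883, 6558, 2444),
  V6 = c(2757, 2438, 2560, 3399, 3026, 2972),
  V7 = c(996, 5564, 237, 4655, 4496, 334),
  V8 = c(1405, 2632, 859, 1617, 2612, 1018),
  V9 = c(7633, 4350, 5250, 5864, 5727, 5254),
  V10 = 1:6
)

# Hypothetical names; the column order is an assumption, not confirmed by the file
names(places6) <- c("climate", "housing", "health", "crime", "transport",
                    "education", "arts", "recreation", "economics", "id")
str(places6)
```

The same names() assignment could be applied to the full places data frame if the column order is confirmed.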
We will apply PCA to the nine continuous variables (V1 to V9) and set the tenth column (V10) aside for visualizing the PCs later. In the following code we apply a log transformation to the continuous variables and set both center and scale. to TRUE in the call to prcomp, so that the variables are standardized before PCA is applied:
log.pl <- log(places[, 1:9])
paged_table(log.pl)
pl.pca <- prcomp(log.pl,
center = TRUE,
scale. = TRUE)
pl.pca
Standard deviations (1, .., p=9):
[1] 1.8159827 1.1016178 1.0514418 0.9525124 0.9277008 0.7497905 0.6955721
[8] 0.5639789 0.5011269
Rotation (n x k) = (9 x 9):
PC1 PC2 PC3 PC4 PC5 PC6
V1 0.1579414 0.06862938 -0.79970997 0.37680952 -0.04104588 0.2166949681
V2 0.3844053 0.13920883 -0.07961647 0.19654301 0.57986793 -0.0822200812
V3 0.4099096 -0.37181203 0.01947537 0.11252206 -0.02956935 -0.5348756017
V4 0.2591017 0.47413246 -0.12846722 -0.04229962 -0.69217100 -0.1399009169
V5 0.3748890 -0.14148642 0.14106828 -0.43007675 -0.19141608 0.3238913974
V6 0.2743254 -0.45235526 0.24105584 0.45694297 -0.22474374 0.5265827320
V7 0.4738471 -0.10441020 -0.01102628 -0.14688130 -0.01193024 -0.3210570706
V8 0.3534118 0.29194243 -0.04181639 -0.40401889 0.30565371 0.3941387718
V9 0.1640135 0.54045312 0.50731026 0.47578009 0.03710776 -0.0009737383
PC7 PC8 PC9
V1 -0.1513516 -0.3411282 -0.03009755
V2 -0.2751971 0.6061010 0.04226906
V3 0.1349750 -0.1500575 -0.59412763
V4 0.1095036 0.4201255 -0.05101188
V5 -0.6785670 -0.1188325 -0.13584327
V6 0.2620958 0.2111749 0.11012420
V7 0.1204986 -0.2598673 0.74672678
V8 0.5530938 -0.1377181 -0.22636544
V9 -0.1468669 -0.4147736 -0.04790278
Since both the skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply a skewness-reducing transformation (here, the log) and to center and scale the variables before applying PCA.
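The effect of these preprocessing steps can be checked on a small example. The sketch below uses synthetic right-skewed data as a stand-in for places[, 1:9] (an assumption, since the file is not bundled here). It compares skewness before and after the log, and confirms that PCA on centered and scaled data has the same variances as the eigenvalues of the correlation matrix.

```r
set.seed(1)
# Synthetic right-skewed data; a stand-in for the real variables (assumption)
x <- as.data.frame(matrix(rlnorm(200 * 4, meanlog = 6, sdlog = 0.8), ncol = 4))

# Sample skewness computed by hand, to avoid extra packages
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
raw_skew <- sapply(x, skew)
log_skew <- sapply(log(x), skew)
round(rbind(raw = raw_skew, log = log_skew), 2)

# With center = TRUE and scale. = TRUE, the PC variances are the
# eigenvalues of the correlation matrix of the (logged) data
pca <- prcomp(log(x), center = TRUE, scale. = TRUE)
eig <- eigen(cor(log(x)))$values
all.equal(pca$sdev^2, eig)
```

Because every standardized variable has variance 1, the PC variances also sum to the number of variables.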
print(pl.pca)
The print method returns the standard deviation of each of the nine PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.
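To make "coefficients of the linear combinations" concrete, the following sketch rebuilds the PC scores by multiplying the standardized data by the rotation matrix. It uses small synthetic data with hypothetical column names a, b and c rather than the places file.

```r
set.seed(2)
# Synthetic positive data with hypothetical column names (assumption)
x <- matrix(rlnorm(50 * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
pca <- prcomp(log(x), center = TRUE, scale. = TRUE)

# Standardize with the center and scale stored in the fitted object
z <- scale(log(x), center = pca$center, scale = pca$scale)

# Each PC score is a linear combination of the standardized variables,
# weighted by the corresponding column of the rotation matrix
scores <- z %*% pca$rotation
all.equal(unname(scores), unname(pca$x))

# The loading vectors are orthonormal, so this is the identity matrix
round(crossprod(pca$rotation), 10)
```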
plot(pl.pca, type = "l")
The plot method draws the variances (y-axis) associated with the PCs (x-axis). This scree plot is useful for deciding how many PCs to retain for further analysis. Since there are only nine PCs in this case, all of them are shown.
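The plotted variances are simply the squared standard deviations printed earlier, so they can be inspected directly. The sketch below copies those standard deviations from the output above. The "variance above 1" cutoff is a common heuristic added here for illustration, not a rule stated in this text.

```r
# Standard deviations copied from the printed output of pl.pca
sdev <- c(1.8159827, 1.1016178, 1.0514418, 0.9525124, 0.9277008,
          0.7497905, 0.6955721, 0.5639789, 0.5011269)

# The y-axis values of the plot are the variances, i.e. sdev squared
variances <- sdev^2
round(variances, 3)

# Heuristic (an assumption): keep PCs whose variance exceeds 1,
# the variance of a single standardized variable
which(variances > 1)

# Redraw the scree plot by hand when running interactively
if (interactive()) {
  plot(variances, type = "b", xlab = "Principal component", ylab = "Variance")
}
```

Under that heuristic the first three PCs would be retained. Other cutoffs are equally defensible.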
summary(pl.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.8160 1.1016 1.0514 0.9525 0.92770 0.74979 0.69557
Proportion of Variance 0.3664 0.1348 0.1228 0.1008 0.09563 0.06247 0.05376
Cumulative Proportion 0.3664 0.5013 0.6241 0.7249 0.82053 0.88300 0.93676
PC8 PC9
Standard deviation 0.56398 0.5011
Proportion of Variance 0.03534 0.0279
Cumulative Proportion 0.97210 1.0000
The summary method describes the importance of the PCs. The first row again gives the standard deviation associated with each PC. The second row shows the proportion of the variance in the data explained by each component, while the third row gives the cumulative proportion of explained variance.
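The second and third rows can be reproduced from the standard deviations alone. The sketch below does so using the values copied from the output above. The 80% threshold at the end is a hypothetical choice for illustration.

```r
# Standard deviations copied from the summary output above
sdev <- c(1.8159827, 1.1016178, 1.0514418, 0.9525124, 0.9277008,
          0.7497905, 0.6955721, 0.5639789, 0.5011269)

# Proportion of variance: each PC's variance over the total variance
prop <- sdev^2 / sum(sdev^2)
# Cumulative proportion: running total of the proportions
cum <- cumsum(prop)

round(rbind("Proportion of Variance" = prop,
            "Cumulative Proportion" = cum), 4)

# Smallest number of PCs reaching a hypothetical 80% threshold
n_keep <- which(cum >= 0.80)[1]
n_keep
```

With these values the first five PCs are needed to pass 80% of the variance, in agreement with the cumulative proportion of 0.82053 reported for PC5.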
predict(pl.pca,
newdata=tail(log.pl, 2))
PC1 PC2 PC3 PC4 PC5 PC6 PC7
328 -0.3577229 -1.332915 -1.044542 -0.2574776 -0.6787995 -0.3418898 0.5519518
329 -2.9065986 1.253365 -1.345258 0.2585665 -0.4538613 0.1515577 -0.8514209
PC8 PC9
328 0.3463322 0.43894223
329 1.1935587 -0.08967011
We can use the predict function to compute the PC values of newly observed data. For illustration, pretend that the last two rows of the Places Rated data have just arrived and we want to see their PC values.
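What predict does for a prcomp fit can be written out by hand: the new rows are centered and scaled with the values stored in the fitted object, then multiplied by the rotation matrix. The sketch below checks this on synthetic data (an assumption). Unlike the call above, it holds the two "new" rows out of the fit, which is closer to the situation of data that has truly just arrived.

```r
set.seed(3)
# Synthetic positive data; a stand-in for the real variables (assumption)
x <- as.data.frame(matrix(rlnorm(40 * 3, meanlog = 5), ncol = 3))
log_x <- log(x)

train <- head(log_x, -2)  # all rows except the last two
new   <- tail(log_x, 2)   # the two rows treated as newly arrived

pca <- prcomp(train, center = TRUE, scale. = TRUE)

# PC values via predict()
by_predict <- predict(pca, newdata = new)

# The same values by hand: standardize with the training center and
# scale, then project onto the loadings
by_hand <- scale(as.matrix(new), center = pca$center, scale = pca$scale) %*%
  pca$rotation

all.equal(unname(by_predict), unname(by_hand))
```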
biplot(pl.pca)