Principal Component Analysis (PCA) is used in exploratory data analysis and in building predictive models.
PCA is commonly used for dimensionality reduction: each data point is projected onto only the first few principal components (in most cases the first and second) to obtain lower-dimensional data while preserving as much of the data's variation as possible.
The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.
The principal components are often computed by eigendecomposition of the data covariance matrix or by singular value decomposition (SVD) of the data matrix.
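The equivalence of the two routes can be checked numerically. The following is a minimal sketch, assuming R's built-in USArrests data as a stand-in (it is not the Places Rated data used below); it compares both routes against prcomp.

```r
# Stand-in data (an assumption for illustration): the built-in USArrests set
X <- scale(as.matrix(USArrests), center = TRUE, scale = TRUE)
n <- nrow(X)

# Route 1: eigendecomposition of the covariance matrix of the standardized data
eig <- eigen(cov(X))
sdev_eigen <- sqrt(eig$values)

# Route 2: SVD of the standardized data matrix
sv <- svd(X)
sdev_svd <- sv$d / sqrt(n - 1)

# Reference result from prcomp
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Standard deviations agree; loadings agree up to the sign of each column
all.equal(sdev_eigen, pca$sdev)
all.equal(sdev_svd, pca$sdev)
all.equal(abs(eig$vectors), abs(unname(pca$rotation)))
all.equal(abs(sv$v), abs(unname(pca$rotation)))
```

The sign of each component is arbitrary, which is why the loadings are compared in absolute value.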
This time we will use the Places Rated Almanac data (Boyer and Savageau), which rates 329 communities according to nine criteria: Climate and Terrain, Housing, Health Care & Environment, Crime, Transportation, Education, The Arts, Recreation, and Economics.
library(rmarkdown)  # provides paged_table()
places <- read.table("D:/MARV BS MATH/4th year, 2nd sem/Nonparametric Statistics/places.txt", header = FALSE, sep = "")
paged_table(places)
str(places)
'data.frame': 329 obs. of 10 variables:
$ V1 : int 521 575 468 476 659 520 559 537 561 609 ...
$ V2 : int 6200 8138 7339 7908 8393 5819 8288 6487 6191 6546 ...
$ V3 : int 237 1656 618 1431 1853 640 621 965 432 669 ...
$ V4 : int 923 886 970 610 1483 727 514 706 399 1073 ...
$ V5 : int 4031 4883 2531 6883 6558 2444 2881 4975 4246 4902 ...
$ V6 : int 2757 2438 2560 3399 3026 2972 3144 2945 2778 2852 ...
$ V7 : int 996 5564 237 4655 4496 334 2333 1487 256 1235 ...
$ V8 : int 1405 2632 859 1617 2612 1018 1117 1280 1210 1109 ...
$ V9 : int 7633 4350 5250 5864 5727 5254 5097 5795 4230 6241 ...
$ V10: int 1 2 3 4 5 6 7 8 9 10 ...
summary(places)
V1 V2 V3 V4 V5
Min. :105.0 Min. : 5159 Min. : 43 Min. : 308.0 Min. :1145
1st Qu.:480.0 1st Qu.: 6760 1st Qu.: 583 1st Qu.: 707.0 1st Qu.:3141
Median :542.0 Median : 7877 Median : 833 Median : 947.0 Median :4080
Mean :538.7 Mean : 8347 Mean :1186 Mean : 961.1 Mean :4210
3rd Qu.:592.0 3rd Qu.: 9015 3rd Qu.:1445 3rd Qu.:1156.0 3rd Qu.:5205
Max. :910.0 Max. :23640 Max. :7850 Max. :2498.0 Max. :8625
V6 V7 V8 V9 V10
Min. :1701 Min. : 52 Min. : 300 Min. :3045 Min. : 1
1st Qu.:2619 1st Qu.: 778 1st Qu.:1316 1st Qu.:4842 1st Qu.: 83
Median :2794 Median : 1871 Median :1670 Median :5384 Median :165
Mean :2815 Mean : 3151 Mean :1846 Mean :5525 Mean :165
3rd Qu.:3012 3rd Qu.: 3844 3rd Qu.:2176 3rd Qu.:6113 3rd Qu.:247
Max. :3781 Max. :56745 Max. :4800 Max. :9980 Max. :329
head(places)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 521 6200 237 923 4031 2757 996 1405 7633 1
2 575 8138 1656 886 4883 2438 5564 2632 4350 2
3 468 7339 618 970 2531 2560 237 859 5250 3
4 476 7908 1431 610 6883 3399 4655 1617 5864 4
5 659 8393 1853 1483 6558 3026 4496 2612 5727 5
6 520 5819 640 727 2444 2972 334 1018 5254 6
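Because the file is read with header=FALSE, the columns carry the default names V1 to V10. The sketch below shows one way to attach descriptive names, using only the first two rows printed above. The mapping of columns to criteria is an assumption (that the columns follow the order in which the criteria were listed earlier, with the last column as an identifier), and the short names themselves are hypothetical.

```r
# First two rows of the data, copied from the head(places) output
places_head <- data.frame(
  V1 = c(521, 575),   V2 = c(6200, 8138), V3 = c(237, 1656),
  V4 = c(923, 886),   V5 = c(4031, 4883), V6 = c(2757, 2438),
  V7 = c(996, 5564),  V8 = c(1405, 2632), V9 = c(7633, 4350),
  V10 = c(1, 2)
)

# ASSUMPTION: column order matches the order of the criteria in the text;
# verify this against the data source before relying on these labels
criteria <- c("Climate", "Housing", "Health", "Crime", "Transport",
              "Education", "Arts", "Recreation", "Economics", "Id")
names(places_head) <- criteria
places_head
```

With the full data frame, the same assignment would be names(places) <- criteria, which makes the loadings reported later easier to read.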
Now we will apply PCA to the nine rating variables (V1 to V9). The tenth column, V10, runs from 1 to 329 and appears to be a community identifier rather than a rating, so it is left out. In the following code we apply a log transformation to the nine variables and set center and scale. to TRUE in the call to prcomp so that the variables are standardized before PCA is applied:
log.pl <- log(places[, 1:9])
paged_table(log.pl)
pl.pca <- prcomp(log.pl,
center = TRUE,
scale. = TRUE)
Since skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply a skewness-reducing transformation (here, the log) and to center and scale the variables before applying PCA.
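The effect of scaling can be seen in a small experiment. This is a sketch on R's built-in USArrests data (an assumption, chosen because one of its variables has a much larger variance than the others), not on the Places Rated data.

```r
# Variances differ by orders of magnitude across the columns
sapply(USArrests, var)

# PCA without and with scaling to unit variance
unscaled <- prcomp(USArrests, center = TRUE, scale. = FALSE)
scaled   <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Without scaling, PC1 is dominated by the highest-variance variable;
# with scaling, the loadings are spread across the variables
round(unscaled$rotation[, "PC1"], 3)
round(scaled$rotation[, "PC1"], 3)
```

Without scaling, the first component mostly reproduces the variable measured on the largest scale, which is rarely what an exploratory analysis is after.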
print(pl.pca)
Standard deviations (1, .., p=9):
[1] 1.8159827 1.1016178 1.0514418 0.9525124 0.9277008 0.7497905 0.6955721
[8] 0.5639789 0.5011269
Rotation (n x k) = (9 x 9):
PC1 PC2 PC3 PC4 PC5 PC6
V1 0.1579414 0.06862938 -0.79970997 0.37680952 -0.04104588 0.2166949681
V2 0.3844053 0.13920883 -0.07961647 0.19654301 0.57986793 -0.0822200812
V3 0.4099096 -0.37181203 0.01947537 0.11252206 -0.02956935 -0.5348756017
V4 0.2591017 0.47413246 -0.12846722 -0.04229962 -0.69217100 -0.1399009169
V5 0.3748890 -0.14148642 0.14106828 -0.43007675 -0.19141608 0.3238913974
V6 0.2743254 -0.45235526 0.24105584 0.45694297 -0.22474374 0.5265827320
V7 0.4738471 -0.10441020 -0.01102628 -0.14688130 -0.01193024 -0.3210570706
V8 0.3534118 0.29194243 -0.04181639 -0.40401889 0.30565371 0.3941387718
V9 0.1640135 0.54045312 0.50731026 0.47578009 0.03710776 -0.0009737383
PC7 PC8 PC9
V1 -0.1513516 -0.3411282 -0.03009755
V2 -0.2751971 0.6061010 0.04226906
V3 0.1349750 -0.1500575 -0.59412763
V4 0.1095036 0.4201255 -0.05101188
V5 -0.6785670 -0.1188325 -0.13584327
V6 0.2620958 0.2111749 0.11012420
V7 0.1204986 -0.2598673 0.74672678
V8 0.5530938 -0.1377181 -0.22636544
V9 -0.1468669 -0.4147736 -0.04790278
The print method returns the standard deviation of each of the nine PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.
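To make "coefficients of the linear combinations" concrete, the sketch below rebuilds the PC scores by multiplying the standardized data by the rotation matrix. It again assumes the built-in USArrests data as a stand-in.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Standardize the data the same way prcomp did
Z <- scale(USArrests, center = TRUE, scale = TRUE)

# Each PC score is a linear combination of the standardized variables,
# with the corresponding rotation column as coefficients
scores_manual <- Z %*% pca$rotation
all.equal(scores_manual, pca$x, check.attributes = FALSE)

# The loading vectors are orthonormal: t(R) %*% R is the identity matrix
round(crossprod(pca$rotation), 10)
```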
plot(pl.pca, type = "l")
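The plot method returns a plot of the variances (y-axis) associated with the PCs (x-axis). This scree plot is useful for deciding how many PCs to retain for further analysis: a common heuristic is to keep the components that come before the curve flattens out. With nine variables there are only nine PCs to choose from.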
summary(pl.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.8160 1.1016 1.0514 0.9525 0.92770 0.74979 0.69557
Proportion of Variance 0.3664 0.1348 0.1228 0.1008 0.09563 0.06247 0.05376
Cumulative Proportion 0.3664 0.5013 0.6241 0.7249 0.82053 0.88300 0.93676
PC8 PC9
Standard deviation 0.56398 0.5011
Proportion of Variance 0.03534 0.0279
Cumulative Proportion 0.97210 1.0000
The summary method describes the importance of the PCs. The first row again gives the standard deviation associated with each PC. The second row shows the proportion of the variance in the data explained by each component, and the third row gives the cumulative proportion of explained variance. Here the first two PCs together explain about 50% of the variance, and the first five about 82%.
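The second and third rows of the summary follow directly from the standard deviations in the first row. The sketch below recomputes them from the values printed above; the 80% threshold used at the end is an arbitrary illustration, not a rule taken from the analysis.

```r
# Standard deviations of the nine PCs, copied from the output above
sdev <- c(1.8159827, 1.1016178, 1.0514418, 0.9525124, 0.9277008,
          0.7497905, 0.6955721, 0.5639789, 0.5011269)

# Variance of each PC, its share of the total, and the running total
variances <- sdev^2
prop_var  <- variances / sum(variances)
cum_var   <- cumsum(prop_var)
round(rbind(prop_var, cum_var), 4)

# Smallest number of PCs reaching a chosen share of variance
# (the 0.80 cutoff is an assumption for illustration)
n_keep <- which(cum_var >= 0.80)[1]
n_keep
```

Because the variables were standardized, the variances sum to 9, the number of variables.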
predict(pl.pca,
newdata=tail(log.pl, 2))
PC1 PC2 PC3 PC4 PC5 PC6 PC7
328 -0.3577229 -1.332915 -1.044542 -0.2574776 -0.6787995 -0.3418898 0.5519518
329 -2.9065986 1.253365 -1.345258 0.2585665 -0.4538613 0.1515577 -0.8514209
PC8 PC9
328 0.3463322 0.43894223
329 1.1935587 -0.08967011
We can use the predict function when we observe new data and want to compute their PC scores. For illustration, pretend that the last two rows of the Places Rated data have just arrived and we want to see their PC values.
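Under the hood, predicting PC scores amounts to standardizing the new rows with the center and scale stored in the fitted object and then applying the rotation. A minimal sketch, assuming the built-in USArrests data as a stand-in and, unlike the illustration above, holding the "new" rows out of the fit:

```r
# Fit on all but the last two rows; treat the last two as new observations
train   <- USArrests[1:48, ]
new_obs <- USArrests[49:50, ]
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Scores from predict()
pred <- predict(pca, newdata = new_obs)

# The same scores by hand: center and scale with the training statistics,
# then multiply by the rotation matrix
manual <- scale(new_obs, center = pca$center, scale = pca$scale) %*% pca$rotation
all.equal(pred, manual, check.attributes = FALSE)
```

The new rows are standardized with the training means and standard deviations, not their own, so that they land in the same coordinate system as the original scores.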
biplot(pl.pca)