Principal Components Analysis in R

Principal Components Analysis (PCA) is used in exploratory data analysis and in building predictive models.

PCA is commonly used for dimensionality reduction: each data point is projected onto only the first few principal components (in most cases the first two) to obtain lower-dimensional data while keeping as much of the data’s variation as possible.

The first principal component can be defined as the direction that maximizes the variance of the projected data. Each later component maximizes the remaining variance while being orthogonal to the components before it.

The principal components are usually computed by eigendecomposition of the data covariance matrix or by singular value decomposition (SVD) of the data matrix. R's prcomp function, used below, takes the SVD route.
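To make the two routes concrete, here is a minimal sketch that compares them with prcomp. It uses R's built-in USArrests data purely as a stand-in (an assumption made so the example runs without places.txt); the variable names X, eig, sv and pca are illustrative.

```r
# Stand-in data: the built-in USArrests data set (50 rows, 4 numeric columns).
# It is NOT the Places Rated data; it is used only so the example is runnable.
X <- scale(as.matrix(USArrests), center = TRUE, scale = TRUE)
n <- nrow(X)

# Route 1: eigendecomposition of the covariance matrix of the standardized data.
eig <- eigen(cov(X))

# Route 2: singular value decomposition of the standardized data matrix.
sv <- svd(X)

# Reference: prcomp() on the same data with centering and scaling.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The three routes give the same component variances.
var_eigen  <- eig$values
var_svd    <- sv$d^2 / (n - 1)
var_prcomp <- pca$sdev^2
print(round(cbind(var_eigen, var_svd, var_prcomp), 4))

# They also give the same directions, up to the sign of each column.
max_diff <- max(abs(abs(eig$vectors) - abs(pca$rotation)))

# Variance of the data projected onto the first direction: it equals the
# largest eigenvalue, which is the maximum attainable over unit directions.
var_pc1 <- var(as.vector(X %*% eig$vectors[, 1]))
```

The sign of a principal direction is arbitrary, which is why the comparison of directions uses absolute values.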

Here we use the Places Rated Almanac data (Boyer and Savageau), which rates 329 communities according to nine criteria: Climate and Terrain, Housing, Health Care & Environment, Crime, Transportation, Education, The Arts, Recreation and Economics.

Getting Data

library(rmarkdown)  # provides paged_table()
# The file is whitespace-delimited and has no header row
places <- read.table("D:/MARV BS MATH/4th year, 2nd sem/Nonparametric Statistics/places.txt", header=FALSE, sep = '')
paged_table(places)
str(places)
'data.frame':   329 obs. of  10 variables:
 $ V1 : int  521 575 468 476 659 520 559 537 561 609 ...
 $ V2 : int  6200 8138 7339 7908 8393 5819 8288 6487 6191 6546 ...
 $ V3 : int  237 1656 618 1431 1853 640 621 965 432 669 ...
 $ V4 : int  923 886 970 610 1483 727 514 706 399 1073 ...
 $ V5 : int  4031 4883 2531 6883 6558 2444 2881 4975 4246 4902 ...
 $ V6 : int  2757 2438 2560 3399 3026 2972 3144 2945 2778 2852 ...
 $ V7 : int  996 5564 237 4655 4496 334 2333 1487 256 1235 ...
 $ V8 : int  1405 2632 859 1617 2612 1018 1117 1280 1210 1109 ...
 $ V9 : int  7633 4350 5250 5864 5727 5254 5097 5795 4230 6241 ...
 $ V10: int  1 2 3 4 5 6 7 8 9 10 ...

This dataset contains 329 observations of 10 variables. Because the file has no header row, read.table names the columns V1 to V10.

summary(places)
       V1              V2              V3             V4               V5      
 Min.   :105.0   Min.   : 5159   Min.   :  43   Min.   : 308.0   Min.   :1145  
 1st Qu.:480.0   1st Qu.: 6760   1st Qu.: 583   1st Qu.: 707.0   1st Qu.:3141  
 Median :542.0   Median : 7877   Median : 833   Median : 947.0   Median :4080  
 Mean   :538.7   Mean   : 8347   Mean   :1186   Mean   : 961.1   Mean   :4210  
 3rd Qu.:592.0   3rd Qu.: 9015   3rd Qu.:1445   3rd Qu.:1156.0   3rd Qu.:5205  
 Max.   :910.0   Max.   :23640   Max.   :7850   Max.   :2498.0   Max.   :8625  
       V6             V7              V8             V9            V10     
 Min.   :1701   Min.   :   52   Min.   : 300   Min.   :3045   Min.   :  1  
 1st Qu.:2619   1st Qu.:  778   1st Qu.:1316   1st Qu.:4842   1st Qu.: 83  
 Median :2794   Median : 1871   Median :1670   Median :5384   Median :165  
 Mean   :2815   Mean   : 3151   Mean   :1846   Mean   :5525   Mean   :165  
 3rd Qu.:3012   3rd Qu.: 3844   3rd Qu.:2176   3rd Qu.:6113   3rd Qu.:247  
 Max.   :3781   Max.   :56745   Max.   :4800   Max.   :9980   Max.   :329  
head(places)
   V1   V2   V3   V4   V5   V6   V7   V8   V9 V10
1 521 6200  237  923 4031 2757  996 1405 7633   1
2 575 8138 1656  886 4883 2438 5564 2632 4350   2
3 468 7339  618  970 2531 2560  237  859 5250   3
4 476 7908 1431  610 6883 3399 4655 1617 5864   4
5 659 8393 1853 1483 6558 3026 4496 2612 5727   5
6 520 5819  640  727 2444 2972  334 1018 5254   6
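The default names V1 to V10 make the later output hard to read, so it can help to attach the criteria as column names. The sketch below does this on the six rows printed by head(places). It assumes that V1 to V9 follow the order in which the criteria are listed in the introduction and that V10 is a community identifier; neither is confirmed by the file itself, so check the data's documentation before relying on these labels. The short names are illustrative.

```r
# First six rows of the data, copied from the head(places) output.
places_head <- data.frame(
  V1  = c(521, 575, 468, 476, 659, 520),
  V2  = c(6200, 8138, 7339, 7908, 8393, 5819),
  V3  = c(237, 1656, 618, 1431, 1853, 640),
  V4  = c(923, 886, 970, 610, 1483, 727),
  V5  = c(4031, 4883, 2531, 6883, 6558, 2444),
  V6  = c(2757, 2438, 2560, 3399, 3026, 2972),
  V7  = c(996, 5564, 237, 4655, 4496, 334),
  V8  = c(1405, 2632, 859, 1617, 2612, 1018),
  V9  = c(7633, 4350, 5250, 5864, 5727, 5254),
  V10 = 1:6
)

# ASSUMPTION: column order matches the order of the criteria in the text,
# and V10 is an identifier. These short labels are hypothetical.
criteria <- c("Climate", "Housing", "HealthCare", "Crime", "Transportation",
              "Education", "Arts", "Recreation", "Economics", "Id")
names(places_head) <- criteria

str(places_head)
```

On the full data the same assignment would be names(places) <- criteria, under the same assumption.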

Now we will apply PCA to the nine rating variables (V1 to V9). V10, which appears to be a community index running from 1 to 329, is left out. In the following code we apply a log transformation to the nine variables and set both center and scale. to TRUE in the call to prcomp, so that the variables are standardized before PCA is applied:

Log Transform
log.pl <- log(places[, 1:9])
paged_table(log.pl)
Apply PCA - scale. = TRUE is highly advisable, but default is FALSE.
pl.pca <- prcomp(log.pl,
                 center = TRUE,
                 scale. = TRUE)
pl.pca
Standard deviations (1, .., p=9):
[1] 1.8159827 1.1016178 1.0514418 0.9525124 0.9277008 0.7497905 0.6955721
[8] 0.5639789 0.5011269

Rotation (n x k) = (9 x 9):
         PC1         PC2         PC3         PC4         PC5           PC6
V1 0.1579414  0.06862938 -0.79970997  0.37680952 -0.04104588  0.2166949681
V2 0.3844053  0.13920883 -0.07961647  0.19654301  0.57986793 -0.0822200812
V3 0.4099096 -0.37181203  0.01947537  0.11252206 -0.02956935 -0.5348756017
V4 0.2591017  0.47413246 -0.12846722 -0.04229962 -0.69217100 -0.1399009169
V5 0.3748890 -0.14148642  0.14106828 -0.43007675 -0.19141608  0.3238913974
V6 0.2743254 -0.45235526  0.24105584  0.45694297 -0.22474374  0.5265827320
V7 0.4738471 -0.10441020 -0.01102628 -0.14688130 -0.01193024 -0.3210570706
V8 0.3534118  0.29194243 -0.04181639 -0.40401889  0.30565371  0.3941387718
V9 0.1640135  0.54045312  0.50731026  0.47578009  0.03710776 -0.0009737383
          PC7        PC8         PC9
V1 -0.1513516 -0.3411282 -0.03009755
V2 -0.2751971  0.6061010  0.04226906
V3  0.1349750 -0.1500575 -0.59412763
V4  0.1095036  0.4201255 -0.05101188
V5 -0.6785670 -0.1188325 -0.13584327
V6  0.2620958  0.2111749  0.11012420
V7  0.1204986 -0.2598673  0.74672678
V8  0.5530938 -0.1377181 -0.22636544
V9 -0.1468669 -0.4147736 -0.04790278

Since skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply a skewness-reducing transformation (here the log) and to center and scale the variables before applying PCA.
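What center = TRUE and scale. = TRUE do can be checked by standardizing the data by hand and running prcomp without them. This is a minimal sketch on the built-in USArrests data, used only as a runnable stand-in for the places data (an assumption); all of its values are positive, so the log is defined.

```r
# Stand-in data (NOT the Places Rated data): log-transformed USArrests.
log_us <- log(USArrests)

# Option A: let prcomp() center and scale the variables.
pca_auto <- prcomp(log_us, center = TRUE, scale. = TRUE)

# Option B: standardize by hand (subtract column means, divide by column
# standard deviations), then run prcomp() with both options switched off.
z <- scale(log_us, center = TRUE, scale = TRUE)
pca_manual <- prcomp(z, center = FALSE, scale. = FALSE)

# Both options give the same standard deviations and the same loadings.
same_sdev     <- isTRUE(all.equal(pca_auto$sdev, pca_manual$sdev))
same_loadings <- isTRUE(all.equal(abs(pca_auto$rotation),
                                  abs(pca_manual$rotation)))
print(c(same_sdev = same_sdev, same_loadings = same_loadings))

# After standardization every variable has variance 1, so the PC variances
# add up to the number of variables.
total_variance <- sum(pca_auto$sdev^2)
```

Without scaling, variables measured on larger scales would contribute more variance and so pull the first components toward themselves.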

Plot Method
plot(pl.pca, type = "l")

The plot method draws the variance (y-axis) associated with each PC (x-axis), often called a scree plot. The resulting figure is useful for deciding how many PCs to retain for further analysis. In this case there are only nine PCs, one per input variable.
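The variances in this plot are the squares of the standard deviations printed with pl.pca. The sketch below recomputes them from those printed values, copied by hand. The rule of keeping components whose variance exceeds 1 is a common heuristic added here for illustration; it is not prescribed by the analysis above.

```r
# Standard deviations of the nine PCs, copied from the printed pl.pca output.
sdev <- c(1.8159827, 1.1016178, 1.0514418, 0.9525124, 0.9277008,
          0.7497905, 0.6955721, 0.5639789, 0.5011269)

# The y-axis of plot(pl.pca) shows the variances, i.e. sdev squared.
variances <- sdev^2
names(variances) <- paste0("PC", seq_along(variances))
print(round(variances, 3))

# With standardized variables the variances sum to the number of variables.
total_variance <- sum(variances)

# Heuristic (an assumption, not from the source): keep PCs with variance > 1,
# i.e. PCs that carry more variance than one standardized variable.
keep <- names(variances)[variances > 1]
print(keep)
```

On these numbers the heuristic keeps the first three PCs. Other criteria, such as a target share of explained variance, can give a different answer.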

Summary Method
summary(pl.pca)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.8160 1.1016 1.0514 0.9525 0.92770 0.74979 0.69557
Proportion of Variance 0.3664 0.1348 0.1228 0.1008 0.09563 0.06247 0.05376
Cumulative Proportion  0.3664 0.5013 0.6241 0.7249 0.82053 0.88300 0.93676
                           PC8    PC9
Standard deviation     0.56398 0.5011
Proportion of Variance 0.03534 0.0279
Cumulative Proportion  0.97210 1.0000

The summary method describes the importance of the PCs. The first row again gives the standard deviation associated with each PC. The second row shows the proportion of the variance in the data explained by each component, while the third row gives the cumulative proportion of explained variance. Here the first two PCs together explain about 50% of the variance and the first five about 82%.
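The three rows of the summary can be reproduced from the standard deviations alone. The sketch below uses the values printed above, copied by hand. The 80% threshold at the end is an arbitrary choice for illustration.

```r
# Standard deviations of the nine PCs, copied from the summary output.
sdev <- c(1.8159827, 1.1016178, 1.0514418, 0.9525124, 0.9277008,
          0.7497905, 0.6955721, 0.5639789, 0.5011269)

# Proportion of variance: each PC's variance divided by the total variance.
variances <- sdev^2
prop_var  <- variances / sum(variances)

# Cumulative proportion: running total of the proportions.
cum_prop <- cumsum(prop_var)

# Rebuild the table shown by summary(pl.pca).
importance <- rbind("Standard deviation"     = sdev,
                    "Proportion of Variance" = prop_var,
                    "Cumulative Proportion"  = cum_prop)
colnames(importance) <- paste0("PC", seq_along(sdev))
print(round(importance, 4))

# Smallest number of PCs that explains at least 80% of the variance
# (the threshold is an illustrative assumption).
n_keep <- which(cum_prop >= 0.80)[1]
print(n_keep)
```

With a fitted object the same quantities are available as summary(pl.pca)$importance.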

Predict PCs
predict(pl.pca, 
        newdata=tail(log.pl, 2))
           PC1       PC2       PC3        PC4        PC5        PC6        PC7
328 -0.3577229 -1.332915 -1.044542 -0.2574776 -0.6787995 -0.3418898  0.5519518
329 -2.9065986  1.253365 -1.345258  0.2585665 -0.4538613  0.1515577 -0.8514209
          PC8         PC9
328 0.3463322  0.43894223
329 1.1935587 -0.08967011

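We can use the predict function when new observations arrive and we want their PC scores. Just for illustration, pretend that the last two rows of the Places Rated data have just arrived and we want to see their PC scores. The new data must be on the same scale as the training data, so here the rows are taken from the log-transformed table.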
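Under the hood, predict standardizes the new rows with the means and standard deviations stored in the fitted object and then multiplies by the rotation matrix. The sketch below checks this by hand on the built-in USArrests data, used as a runnable stand-in for the places data (an assumption).

```r
# Stand-in data (NOT the Places Rated data): log-transformed USArrests.
log_us <- log(USArrests)
pca <- prcomp(log_us, center = TRUE, scale. = TRUE)

# Pretend the last two rows are new observations.
new_rows <- tail(log_us, 2)

# Scores from predict().
scores_predict <- predict(pca, newdata = new_rows)

# The same scores by hand: subtract the training means, divide by the
# training standard deviations, then project onto the rotation matrix.
z <- scale(as.matrix(new_rows), center = pca$center, scale = pca$scale)
scores_manual <- z %*% pca$rotation

print(scores_predict)
print(scores_manual)

# Because these rows were part of the training data, their scores are
# also stored in pca$x.
scores_stored <- tail(pca$x, 2)
```

The key point is that new observations are standardized with the training statistics (pca$center and pca$scale), not with their own means and standard deviations.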

Biplot of the Principal Components
biplot(pl.pca)