Preliminary

We will use the DescTools, caret, and factoextra packages in the lesson that follows.

First, we load the packages for use.

library(caret)
library(DescTools)
library(factoextra)

Principal Components Analysis

Example 1

The USArrests data set, which is built into R, contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. We would like to analyze the data, but there are high correlations among the variables.

data("USArrests")

Exploration and Preprocessing

We use the Abstract() function from the DescTools package to obtain information about the dataframe.

Abstract(x = USArrests) 
## ------------------------------------------------------------------------------ 
## USArrests
## 
## data frame:  50 obs. of  4 variables
##      50 complete cases (100.0%)
## 
##   Nr  ColName   Class    NAs  Levels
##   1   Murder    numeric  .          
##   2   Assault   integer  .          
##   3   UrbanPop  integer  .          
##   4   Rape      numeric  .

We can use the summary() function to obtain basic summary statistics.

summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

We can use the pairs() function to visually assess relationships among our variables.

pairs(USArrests, 
      upper.panel = NULL)

We can use the cor() function to obtain a correlation matrix. We use the round() function and round the matrix values to one decimal place (digits = 1).

round(cor(x = USArrests), 
      digits = 1) # rounded to 1 decimal place
##          Murder Assault UrbanPop Rape
## Murder      1.0     0.8      0.1  0.6
## Assault     0.8     1.0      0.3  0.7
## UrbanPop    0.1     0.3      1.0  0.4
## Rape        0.6     0.7      0.4  1.0

Analysis

The prcomp() function in the stats package performs PCA. By default, the function centers the data. To also scale the data so that each variable is standardized, we set the scale. argument to TRUE (scale. = TRUE).

pc <- prcomp(x = USArrests, 
             scale. = TRUE)
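
The means and standard deviations used for centering and scaling are stored in the object itself; a quick check of the center and scale components:

pc$center # variable means used for centering
pc$scale  # variable standard deviations used for scaling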

The summary() function provides information about the variance explained by the PCs.

summary(pc)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

The importance component of the summary() output stores the variance-explained information as a matrix, which we can save for later use.

var_info <- summary(pc)
var_info$importance
##                             PC1       PC2       PC3       PC4
## Standard deviation     1.574878 0.9948694 0.5971291 0.4164494
## Proportion of Variance 0.620060 0.2474400 0.0891400 0.0433600
## Cumulative Proportion  0.620060 0.8675000 0.9566400 1.0000000

We can use a Scree Plot to visualize the proportion of variance explained by our PCs, using either the fviz_eig() function from the factoextra package or the screeplot() function from the stats package. Based on the output, the elbow point is at 3.

fviz_eig(X = pc)
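
For comparison, a minimal sketch of the same plot using the base screeplot() function mentioned above:

screeplot(x = pc, 
          type = "lines", # line plot instead of the default bars
          main = "Scree Plot")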

Another way to choose the number of PCs is to create a Scree Plot that includes the cumulative proportion of variance explained. Here, we can choose a cutoff point for the minimum proportion of variance that we would like to retain, and then select the smallest number of PCs that meets that criterion.

barplot(var_info$importance[2, ], # plot PVE 
        names.arg = 1:4, # 4 PCs
        ylim = c(0, 1.1), 
        main = "Cumulative Variance Explained",
        xlab = "Principal Component",
        ylab = "Proportion of Variance Explained")
lines(var_info$importance[3, ], # add cumulative PVE
      type = "o",
      pch = 16)
abline(h = .8, col = "red", lty = 2) # add cutoff at .8
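
We can also find that minimum number programmatically; a quick sketch using the cumulative proportions stored in the importance matrix:

which(var_info$importance[3, ] >= 0.8)[1] # first PC meeting the .8 cutoff
## PC2 
##   2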

Describing the Solution

PC loading information is available for each of the variables and each of the PCs. It is contained in the rotation component of an object created using the prcomp() function. We can focus on the first 2 PCs by subsetting the column dimension.

pc$rotation[ ,1:2] # Loadings for 1st 2 PCs
##                 PC1        PC2
## Murder   -0.5358995  0.4181809
## Assault  -0.5831836  0.1879856
## UrbanPop -0.2781909 -0.8728062
## Rape     -0.5434321 -0.1673186

Based on the output, the variables with the largest loadings on each PC are:

  • PC1: Murder, Assault, Rape
  • PC2: UrbanPop

Based on this, we would describe the first PC as ‘Crime’ and the second PC as ‘Population’.

We can create a Loading plot for our first 2 PCs using the fviz_pca_var() function in the factoextra package.

fviz_pca_var(X = pc)

For each of our PCs, we have PC scores for each of our observations. We can obtain them from the x component of an object created using the prcomp() function.

head(pc$x[, 1:2])
##                   PC1        PC2
## Alabama    -0.9756604  1.1220012
## Alaska     -1.9305379  1.0624269
## Arizona    -1.7454429 -0.7384595
## Arkansas    0.1399989  1.1085423
## California -2.4986128 -1.5274267
## Colorado   -1.4993407 -0.9776297
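
These scores are simply the standardized data multiplied by the loading matrix; a minimal sketch that reproduces pc$x:

scores_manual <- scale(USArrests) %*% pc$rotation # standardize, then project
head(scores_manual[, 1:2]) # matches head(pc$x[, 1:2]) above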

We can plot the PC scores for the first 2 PCs using the fviz_pca_ind() function in the factoextra package.

fviz_pca_ind(X = pc, 
             labelsize = 2)

We can use a biplot to view the variables and observations together for the first 2 PCs. We use the biplot() function to create the plot.

biplot(x = pc, 
       scale = 0, 
       cex = .5) # reduce size of text

Example 2

Data was collected on the nutritional information and consumer ratings of 77 breakfast cereal types. A supermarket would like to perform PCA on the 12 numeric variables to reduce the dimensionality. Then, they would like to predict the rating variable using Linear Regression.

The data is available in the “Cereals.csv” file. We import the data as a dataframe named cereal. The row names are in the first column, so we set row.names = 1.

cereal <- read.csv(file = "Cereals.csv",
                   row.names = 1)

Exploration and Preprocessing

We can first obtain information about our cereal dataframe.

Abstract(x = cereal)
## ------------------------------------------------------------------------------ 
## cereal
## 
## data frame:  77 obs. of  15 variables
##      74 complete cases (96.1%)
## 
##   Nr  ColName   Class      NAs       Levels
##   1   mfr       character  .               
##   2   type      character  .               
##   3   calories  integer    .               
##   4   protein   integer    .               
##   5   fat       integer    .               
##   6   sodium    integer    .               
##   7   fiber     numeric    .               
##   8   carbo     numeric    1 (1.3%)        
##   9   sugars    integer    1 (1.3%)        
##   10  potass    integer    2 (2.6%)        
##   11  vitamins  integer    .               
##   12  shelf     integer    .               
##   13  weight    numeric    .               
##   14  cups      numeric    .               
##   15  rating    numeric    .

For convenience, we can create vectors of our categorical (fac) and numerical (num) variable names. We exclude the rating variable from our num vector, since we will not include it in the PCA.

fac <- c("mfr", "type")
num <- names(cereal)[!names(cereal) %in% c(fac, "rating")]
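
As a quick check, num should contain the 12 numeric variables that will enter the PCA:

length(num)
## [1] 12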

We can obtain summary statistic information using the summary() function.

summary(cereal)
##      mfr                type              calories        protein     
##  Length:77          Length:77          Min.   : 50.0   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:100.0   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :110.0   Median :3.000  
##                                        Mean   :106.9   Mean   :2.545  
##                                        3rd Qu.:110.0   3rd Qu.:3.000  
##                                        Max.   :160.0   Max.   :6.000  
##                                                                       
##       fat            sodium          fiber            carbo     
##  Min.   :0.000   Min.   :  0.0   Min.   : 0.000   Min.   : 5.0  
##  1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000   1st Qu.:12.0  
##  Median :1.000   Median :180.0   Median : 2.000   Median :14.5  
##  Mean   :1.013   Mean   :159.7   Mean   : 2.152   Mean   :14.8  
##  3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000   3rd Qu.:17.0  
##  Max.   :5.000   Max.   :320.0   Max.   :14.000   Max.   :23.0  
##                                                   NA's   :1     
##      sugars           potass          vitamins          shelf      
##  Min.   : 0.000   Min.   : 15.00   Min.   :  0.00   Min.   :1.000  
##  1st Qu.: 3.000   1st Qu.: 42.50   1st Qu.: 25.00   1st Qu.:1.000  
##  Median : 7.000   Median : 90.00   Median : 25.00   Median :2.000  
##  Mean   : 7.026   Mean   : 98.67   Mean   : 28.25   Mean   :2.208  
##  3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00   3rd Qu.:3.000  
##  Max.   :15.000   Max.   :330.00   Max.   :100.00   Max.   :3.000  
##  NA's   :1        NA's   :2                                        
##      weight          cups           rating     
##  Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :1.00   Median :0.750   Median :40.40  
##  Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :1.50   Max.   :1.500   Max.   :93.70  
## 

We can use the PlotMiss() function to visualize information about missing data.

PlotMiss(x = cereal)

We can use the caret package to replace NA values with the column median before performing PCA.

pp <- preProcess(x = cereal,
                 method = "medianImpute")
cereal <- predict(object = pp, 
                   newdata = cereal)
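
As a quick check that the imputation worked, we can confirm that no missing values remain:

sum(is.na(cereal)) # 0 NAs after median imputation
## [1] 0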

Since we have many variables, we will use symnum() to look at a symbolic version of our correlation matrix to check for high correlations among our variables.

symnum(cor(x = cereal[ ,num]), 
       corr = TRUE)
##          cl pr ft sd fb cr sg pt v sh w cp
## calories 1                                
## protein     1                             
## fat      .     1                          
## sodium   .        1                       
## fiber       .        1                    
## carbo                .  1                 
## sugars   .     .        .  1              
## potass      .        *  .     1           
## vitamins          .              1        
## shelf                         .    1      
## weight   ,        .        .  .  .    1   
## cups                 .  .     .    .    1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
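
The '*' between fiber and potass flags a correlation of at least 0.9; a quick check to pull out the exact value:

cor(x = cereal$fiber, y = cereal$potass) # exact fiber-potass correlation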

Analysis

We use the prcomp() function to perform PCA on our numeric variables.

pc2 <- prcomp(x = cereal[ ,num], 
              scale. = TRUE) # scale

The summary() function provides information about the variance explained by the PCs.

summary(pc2)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8135 1.6278 1.3215 1.04043 0.98411 0.83250 0.81251
## Proportion of Variance 0.2741 0.2208 0.1455 0.09021 0.08071 0.05775 0.05501
## Cumulative Proportion  0.2741 0.4949 0.6404 0.73063 0.81134 0.86909 0.92411
##                           PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.6609 0.54789 0.30296 0.23927 0.15709
## Proportion of Variance 0.0364 0.02502 0.00765 0.00477 0.00206
## Cumulative Proportion  0.9605 0.98552 0.99317 0.99794 1.00000

The importance component of the summary() output gives us information about how much variance is explained.

var_info2 <- summary(pc2)
var_info2$importance
##                             PC1      PC2      PC3      PC4       PC5       PC6
## Standard deviation     1.813539 1.627814 1.321523 1.040427 0.9841076 0.8324993
## Proportion of Variance 0.274080 0.220810 0.145540 0.090210 0.0807100 0.0577500
## Cumulative Proportion  0.274080 0.494890 0.640430 0.730630 0.8113400 0.8690900
##                              PC7     PC8       PC9      PC10      PC11
## Standard deviation     0.8125136 0.66090 0.5478901 0.3029585 0.2392712
## Proportion of Variance 0.0550100 0.03640 0.0250200 0.0076500 0.0047700
## Cumulative Proportion  0.9241100 0.96051 0.9855200 0.9931700 0.9979400
##                             PC12
## Standard deviation     0.1570939
## Proportion of Variance 0.0020600
## Cumulative Proportion  1.0000000

We can use a Scree Plot to visualize the proportion of variance explained by our PCs, using either the fviz_eig() function from the factoextra package or the screeplot() function from the stats package. Based on the Scree Plot, we have an elbow point at 4, suggesting we should retain 4 PCs.

fviz_eig(X = pc2)
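
A related rule of thumb, not otherwise used in this lesson, is the Kaiser criterion: retain the components whose standard deviation exceeds 1, since each of those explains more variance than a single standardized variable does. A quick sketch:

sum(pc2$sdev > 1) # number of PCs with standard deviation above 1
## [1] 4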

We can also create a Scree Plot that includes the cumulative proportion of variance explained. Here, we can choose a cutoff point for the minimum proportion of variance that we would like to retain, and then select the smallest number of PCs that meets that criterion.

barplot(var_info2$importance[2, ], # plot PVE 
        names.arg = 1:12,
        ylim=c(0, 1.1),
        main = "Cumulative Variance Explained",
        xlab = "Principal Component",
        ylab = "Proportion of Variance Explained")
lines(var_info2$importance[3, ], # add cumulative PVE
      type = "o",
      pch = 16)
abline(h = .8, col = "red", lty = 2) # add cutoff at .8

Based on the output, we can explain approximately 81% of the variance by retaining PC1 - PC5.

Describing the Solution

PC loading information is available for each of the variables and each of the PCs. It is contained in the rotation component of an object created using the prcomp() function. We can focus on the first 5 PCs by subsetting the column dimension.

pc2$rotation[ ,1:5] # Loadings for 1st 5 PCs
##                  PC1         PC2         PC3         PC4         PC5
## calories  0.04453048 -0.56612303  0.05649216 -0.26918354  0.08019453
## protein   0.31814974  0.08245802 -0.31214924 -0.46234688  0.17684923
## fat       0.22198380 -0.24054462  0.29729134 -0.35158848  0.45689426
## sodium   -0.04818453 -0.31006185 -0.33249151  0.12199200 -0.31904935
## fiber     0.45710646  0.20873834 -0.18216653  0.06793186 -0.23649828
## carbo    -0.27613247 -0.15246418 -0.52237589 -0.20978935  0.13921704
## sugars    0.09961866 -0.37583984  0.48444020  0.15010799 -0.30853389
## potass    0.50191850  0.08287566 -0.14764210 -0.02704716 -0.15860250
## vitamins  0.02078775 -0.30802816 -0.32568194  0.52762016  0.19613194
## shelf     0.29603622 -0.07784817 -0.01676226  0.45779825  0.58030569
## weight    0.26286896 -0.43744710 -0.15546873 -0.09345034 -0.27795633
## cups     -0.37548375 -0.10706400 -0.08588833 -0.08181433  0.06004537

Based on the output, the variables with the largest loadings on each PC are:

  • PC1: Fiber, Potass, Cups
  • PC2: Calories, Weight
  • PC3: Sodium, Carbo, Sugars
  • PC4: Protein, Vitamins
  • PC5: Fat, Shelf
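
Finally, recall that the supermarket's goal was to predict rating with Linear Regression. A minimal sketch of that step, assuming we retain the first 5 PCs and use their scores as predictors:

pc_data <- data.frame(pc2$x[, 1:5], # scores for the first 5 PCs
                      rating = cereal$rating)
fit <- lm(formula = rating ~ ., data = pc_data)
summary(fit) # examine coefficients and model fit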