Principal component analysis, or PCA, is a common approach to dimensionality reduction. Learn exactly what PCA does, visualize the results of PCA with biplots and scree plots, and deal with practical issues such as centering and scaling the data before performing PCA.

library(readr)
library(dplyr)
library(ggplot2)
library(stringr)

3.1: PCA using prcomp()

In this exercise, you will create your first PCA model and observe the diagnostic results.

We have loaded the Pokemon data from earlier, which has four dimensions, and placed it in a variable called pokemon. Your task is to create a PCA model of the data, then to inspect the resulting model using the summary() function.

pokemon <- read.csv("Pokemon.csv")
# Keep the four numeric stats used for the PCA
pokemon_pr <- pokemon %>% select(HP, Attack, Defense, Speed)
glimpse(pokemon_pr)
Observations: 800
Variables: 4
$ HP      <int> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44, 59, 79, 79, 45, 50, 60, 40, 45, 65, 65, 40...
$ Attack  <int> 49, 62, 82, 100, 52, 64, 84, 130, 104, 48, 63, 83, 103, 30, 20, 45, 35, 25, 90, 15...
$ Defense <int> 49, 63, 83, 123, 43, 58, 78, 111, 78, 65, 80, 100, 120, 35, 55, 50, 30, 50, 40, 40...
$ Speed   <int> 45, 60, 80, 80, 65, 80, 100, 100, 100, 43, 58, 78, 78, 45, 30, 70, 50, 35, 75, 145...
summary(pokemon_pr)
       HP             Attack       Defense           Speed       
 Min.   :  1.00   Min.   :  5   Min.   :  5.00   Min.   :  5.00  
 1st Qu.: 50.00   1st Qu.: 55   1st Qu.: 50.00   1st Qu.: 45.00  
 Median : 65.00   Median : 75   Median : 70.00   Median : 65.00  
 Mean   : 69.26   Mean   : 79   Mean   : 73.84   Mean   : 68.28  
 3rd Qu.: 80.00   3rd Qu.:100   3rd Qu.: 90.00   3rd Qu.: 90.00  
 Max.   :255.00   Max.   :190   Max.   :230.00   Max.   :180.00  
pr.out <- prcomp(x = pokemon_pr, scale. = TRUE, center = TRUE)
summary(pr.out)
Importance of components:
                          PC1    PC2    PC3    PC4
Standard deviation     1.3721 0.9933 0.8526 0.6354
Proportion of Variance 0.4707 0.2467 0.1817 0.1009
Cumulative Proportion  0.4707 0.7173 0.8991 1.0000
biplot(pr.out)

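# Note: PCbiplot() is not defined in base R or in the packages loaded above, and no
# object named pr.pokemon exists (the model is stored in pr.out), so this call fails:
PCbiplot(pr.pokemon)
Error in PCbiplot(pr.pokemon) : could not find function "PCbiplot"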

Remark: In the biplot, the Attack and HP arrows point in similar directions, so these two variables have approximately the same loadings on the first two principal components.
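
To make concrete what prcomp() computes when centering and scaling are both on, the following minimal sketch compares it with an eigendecomposition of the correlation matrix. It uses the built-in USArrests data as a stand-in (an assumption, since Pokemon.csv may not be at hand); the variable names X, pr, and eig are illustrative only.

```r
# Stand-in data: USArrests ships with R (assumption: used here instead of the Pokemon file)
X <- USArrests

# PCA on centered, standardized variables
pr <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigendecomposition of the correlation matrix of the same data
eig <- eigen(cor(X))

# The variance of each principal component equals the corresponding eigenvalue
var_match <- all.equal(pr$sdev^2, eig$values)

# The scores in pr$x are the standardized data projected onto the loadings
scores_manual <- scale(X) %*% pr$rotation
scores_match <- all.equal(unname(scores_manual), unname(pr$x))

var_match
scores_match
```

The biplot draws both pieces at once: the points are the scores in pr$x and the arrows are the loadings in pr$rotation.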

3.2: Variance explained

The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.

In this and the next exercise, you will prepare data from the pr.out model you created at the beginning of the chapter for use in a scree plot. This preparation is needed because the built-in screeplot() function plots the raw variance of each component, not the proportion of variance explained.

# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
pve
[1] 0.4706937 0.2466505 0.1817326 0.1009233
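
As a quick sanity check, the same calculation can be reproduced from the rounded standard deviations printed by summary(pr.out) earlier. This is only a sketch: the numbers are copied from that output rather than read from the model, and the names sdev_rounded, pr_var, and pve_check are illustrative.

```r
# Standard deviations as printed by summary(pr.out), rounded to 4 decimals
sdev_rounded <- c(1.3721, 0.9933, 0.8526, 0.6354)

# Variance of each component
pr_var <- sdev_rounded^2

# With scaled inputs, the total variance is (up to rounding) the number of variables
total_var <- sum(pr_var)

# Proportion of variance explained, and its running total
pve_check <- pr_var / total_var
cum_check <- cumsum(pve_check)

round(pve_check, 4)
round(cum_check, 4)
```

Up to rounding, the result agrees with the "Proportion of Variance" and "Cumulative Proportion" rows of the summary, which confirms that pve is the same quantity computed by hand.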

3.3: Visualize variance explained

Now you will create a scree plot showing the proportion of variance explained by each principal component, as well as the cumulative proportion of variance explained.

Recall from the video that these plots can help to determine the number of principal components to retain. One way to do so is to look for an elbow in the scree plot: the point beyond which each additional component explains substantially less variance. In the absence of a clear elbow, you can use the cumulative plot as a guide and keep the smallest number of components that reaches a chosen threshold of variance explained.

Instructions

The proportion of variance explained is still available in the pve object you created in the last exercise.

Use plot() to plot the proportion of variance explained by each principal component.

Use plot() and cumsum() (cumulative sum) to plot the cumulative proportion of variance explained as a function of the number of principal components.

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
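
When the scree plot shows no clear elbow, a threshold on the cumulative proportion can be applied directly. The sketch below uses the pve values printed in the previous exercise; the 80% cutoff is a hypothetical choice for illustration, not a recommendation from the course.

```r
# Proportion of variance explained, copied from the pve output above
pve <- c(0.4706937, 0.2466505, 0.1817326, 0.1009233)

# Hypothetical cutoff: keep enough components to explain at least 80% of the variance
threshold <- 0.80

# Running total of variance explained
cum_pve <- cumsum(pve)

# Index of the first component at which the running total reaches the cutoff
n_keep <- which(cum_pve >= threshold)[1]
n_keep
```

With these values, two components explain about 72% of the variance and three explain about 90%, so an 80% cutoff keeps three components.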

3.4: Practical issues: scaling

You saw in the video that scaling your data before doing PCA changes the results of the PCA modeling. Here, you will perform PCA with and without scaling, then visualize the results using biplots.

Scaling is usually appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Deciding whether to scale is an important step in performing a principal component analysis.

# Load the new data set
pokemon_new <- read.csv("new_pokemon.csv")
# Mean of each variable
colMeans(pokemon_new[, 2:6])
    Total HitPoints    Attack   Defense     Speed 
   448.82     71.08     81.22     78.44     66.58 
# Standard deviation of each variable
apply(pokemon_new[, 2:6], 2, sd)
    Total HitPoints    Attack   Defense     Speed 
119.32321  25.62193  33.03078  32.05809  27.51036 
# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(x = pokemon_new[, 2:6], scale. = TRUE, center = TRUE)
# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(x = pokemon_new[, 2:6], scale. = FALSE, center = TRUE)
# Create biplots of both for comparison
biplot(pr.with.scaling)

biplot(pr.without.scaling)

Remark: The new Total column varies far more than the other four columns (a standard deviation of about 119 versus 26 to 33), so it has a disproportionate effect on the PCA model when scaling is not performed. After scaling the data, the loading vectors are much more evenly distributed.
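
The same effect can be checked numerically rather than read off a biplot. The following is a minimal sketch on the built-in USArrests data, used as a stand-in because new_pokemon.csv may not be available; in that data set Assault plays the role of the high-variance column, and all object names are illustrative.

```r
# Stand-in data (assumption): USArrests, where Assault has by far the largest spread
X <- USArrests
sapply(X, sd)

# PCA with and without scaling
pr_scaled <- prcomp(X, center = TRUE, scale. = TRUE)
pr_raw    <- prcomp(X, center = TRUE, scale. = FALSE)

# Variable with the largest absolute loading on PC1 when no scaling is applied
dominant_raw <- names(which.max(abs(pr_raw$rotation[, 1])))

# Largest absolute PC1 loading in each model
max_loading_raw    <- max(abs(pr_raw$rotation[, 1]))
max_loading_scaled <- max(abs(pr_scaled$rotation[, 1]))

# Share of total variance captured by PC1 in each model
pve1_raw    <- pr_raw$sdev[1]^2 / sum(pr_raw$sdev^2)
pve1_scaled <- pr_scaled$sdev[1]^2 / sum(pr_scaled$sdev^2)

dominant_raw
c(raw = max_loading_raw, scaled = max_loading_scaled)
c(raw = pve1_raw, scaled = pve1_scaled)
```

Without scaling, the first component is essentially the high-variance variable on its own; with scaling, the PC1 loadings are spread more evenly across the variables.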

