Principal component analysis, or PCA, is a common approach to dimensionality reduction. Learn exactly what PCA does, visualize the results of PCA with biplots and scree plots, and deal with practical issues such as centering and scaling the data before performing PCA.
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
3.1: PCA using prcomp()
In this exercise, you will create your first PCA model and observe the diagnostic results.
We have loaded the Pokemon data from earlier, which has four dimensions, and placed it in a variable called pokemon. Your task is to create a PCA model of the data, then to inspect the resulting model using the summary() function.
Instructions
100 XP
Create a PCA model of the data in pokemon, setting scale to TRUE. Store the result in pr.out.
Inspect the result with the summary() function.
pokemon <- read.csv("Pokemon.csv")
pokemon_pr <- pokemon %>% select(HP, Attack, Defense, Speed)
glimpse(pokemon_pr)
Observations: 800
Variables: 4
$ HP <int> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44, 59, 79, 79, 45, 50, 60, 40, 45, 65, 65, 40...
$ Attack <int> 49, 62, 82, 100, 52, 64, 84, 130, 104, 48, 63, 83, 103, 30, 20, 45, 35, 25, 90, 15...
$ Defense <int> 49, 63, 83, 123, 43, 58, 78, 111, 78, 65, 80, 100, 120, 35, 55, 50, 30, 50, 40, 40...
$ Speed <int> 45, 60, 80, 80, 65, 80, 100, 100, 100, 43, 58, 78, 78, 45, 30, 70, 50, 35, 75, 145...
summary(pokemon_pr)
       HP             Attack       Defense           Speed
 Min.   :  1.00   Min.   :  5   Min.   :  5.00   Min.   :  5.00
 1st Qu.: 50.00   1st Qu.: 55   1st Qu.: 50.00   1st Qu.: 45.00
 Median : 65.00   Median : 75   Median : 70.00   Median : 65.00
 Mean   : 69.26   Mean   : 79   Mean   : 73.84   Mean   : 68.28
 3rd Qu.: 80.00   3rd Qu.:100   3rd Qu.: 90.00   3rd Qu.: 90.00
 Max.   :255.00   Max.   :190   Max.   :230.00   Max.   :180.00
pr.out <- prcomp(x = pokemon_pr, scale. = TRUE, center = TRUE)
summary(pr.out)
Importance of components:
                          PC1    PC2    PC3    PC4
Standard deviation     1.3721 0.9933 0.8526 0.6354
Proportion of Variance 0.4707 0.2467 0.1817 0.1009
Cumulative Proportion  0.4707 0.7173 0.8991 1.0000
biplot(pr.out)

Remark: In the biplot, the Attack and HP loading vectors point in nearly the same direction, so these two variables have approximately the same loadings on the first two principal components.
3.2: Variance explained
The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.
In this and the next exercise, you will prepare data from the pr.out model you created at the beginning of the chapter for use in a scree plot. Preparing the data for plotting is required because R's built-in screeplot() function plots the raw component variances, not the proportion of variance explained.
Instructions
100 XP
pr.out and the pokemon data are still available in your workspace.
Assign to the variable pr.var the square of the standard deviations of the principal components (i.e. the variance). The standard deviation of the principal components is available in the sdev component of the PCA model object.
Assign to the variable pve the proportion of the variance explained, calculated by dividing pr.var by the total variance explained by all principal components.
# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
pve
[1] 0.4706937 0.2466505 0.1817326 0.1009233
3.3: Visualize variance explained
Now you will create a scree plot showing the proportion of variance explained by each principal component, as well as the cumulative proportion of variance explained.
Recall from the video that these plots can help to determine the number of principal components to retain. One way to determine the number of principal components to retain is by looking for an elbow in the scree plot showing that as the number of principal components increases, the rate at which variance is explained decreases substantially. In the absence of a clear elbow, you can use the scree plot as a guide for setting a threshold.
Instructions
100 XP
The proportion of variance explained is still available in the pve object you created in the last exercise.
Use plot() to plot the proportion of variance explained by each principal component.
Use plot() and cumsum() (cumulative sum) to plot the cumulative proportion of variance explained as a function of the number of principal components.
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")

3.4: Practical issues: scaling
You saw in the video that scaling your data before doing PCA changes the results of the PCA modeling. Here, you will perform PCA with and without scaling, then visualize the results using biplots.
Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.
Instructions
100 XP
The same Pokemon dataset is available in your workspace as pokemon, but one new variable has been added: Total.
There is some code at the top of the editor to calculate the mean and standard deviation of each variable in the model. Run this code to see how the scale of the variables differs in the original data.
Create a PCA model of pokemon with scaling, assigning the result to pr.with.scaling.
Create a PCA model of pokemon without scaling, assigning the result to pr.without.scaling.
Use biplot() to plot both models (one at a time) and compare their outputs.
# Load the Pokemon data that includes the Total column
pokemon_new <- read.csv("new_pokemon.csv")
# Mean of each variable
colMeans(pokemon_new[, 2:6])
    Total HitPoints    Attack   Defense     Speed
   448.82     71.08     81.22     78.44     66.58
# Standard deviation of each variable
apply(pokemon_new[,2:6], 2, sd)
    Total HitPoints    Attack   Defense     Speed
119.32321  25.62193  33.03078  32.05809  27.51036
# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(x = pokemon_new[, 2:6], scale. = TRUE, center = TRUE)
# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(x = pokemon_new[, 2:6], scale. = FALSE, center = TRUE)
# Create biplots of both for comparison
biplot(pr.with.scaling)

biplot(pr.without.scaling)

Remark: The new Total column contains much more variation, on average, than the other four columns, so it has a disproportionate effect on the PCA model when scaling is not performed. After scaling the data, there’s a much more even distribution of the loading vectors.
---
title: "Datacamp R - Unsupervised Learning in R : Chapter 3 (Dimensionality reduction with PCA)"
author: "Chen Weiqiang"
date: "November 28, 2018"
output: html_notebook
---

Principal component analysis, or PCA, is a common approach to dimensionality reduction. Learn exactly what PCA does, visualize the results of PCA with biplots and scree plots, and deal with practical issues such as centering and scaling the data before performing PCA.

```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
```

# 3.1: PCA using prcomp()

In this exercise, you will create your first PCA model and observe the diagnostic results.

We have loaded the Pokemon data from earlier, which has four dimensions, and placed it in a variable called pokemon. Your task is to create a PCA model of the data, then to inspect the resulting model using the summary() function.

Instructions

100 XP

- Create a PCA model of the data in pokemon, setting scale to TRUE. Store the result in pr.out.

- Inspect the result with the summary() function.

```{r}
# Load the Pokemon data and keep the four numeric columns used for PCA
pokemon <- read.csv("Pokemon.csv")
pokemon_pr <- pokemon %>% select(HP, Attack, Defense, Speed)
glimpse(pokemon_pr)
```

```{r}
summary(pokemon_pr)
```

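```{r}
# Center and standardize each variable before computing the components
pr.out <- prcomp(x = pokemon_pr, scale. = TRUE, center = TRUE)
summary(pr.out)
biplot(pr.out)
```

Remark: In the biplot, the Attack and HP loading vectors point in nearly the same direction, so these two variables have approximately the same loadings on the first two principal components.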
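
To make concrete what prcomp() computes here, the sketch below compares it with an eigendecomposition of the correlation matrix. It is only an illustration under one loud assumption: the built-in USArrests dataset (four numeric columns) stands in for the Pokemon data so that the chunk runs without Pokemon.csv. The names x, pr and eig are hypothetical.

```{r}
# Assumption: USArrests is a stand-in for the Pokemon data (both have four numeric columns)
x <- USArrests

# PCA on centered, standardized variables
pr <- prcomp(x, center = TRUE, scale. = TRUE)

# Eigendecomposition of the correlation matrix of the same data
eig <- eigen(cor(x))

# Component standard deviations equal the square roots of the eigenvalues
sdev_match <- isTRUE(all.equal(pr$sdev, sqrt(eig$values)))

# Loadings equal the eigenvectors up to the sign of each column
loadings_match <- isTRUE(all.equal(abs(unname(pr$rotation)), abs(eig$vectors)))

c(sdev_match = sdev_match, loadings_match = loadings_match)
```

The sign of each principal component is arbitrary, which is why the loadings are compared on their absolute values.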


# 3.2: Variance explained

The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.

In this and the next exercise, you will prepare data from the pr.out model you created at the beginning of the chapter for use in a scree plot. Preparing the data for plotting is required because R's built-in screeplot() function plots the raw component variances, not the proportion of variance explained.

Instructions

100 XP

- pr.out and the pokemon data are still available in your workspace.


- Assign to the variable pr.var the square of the standard deviations of the principal components (i.e. the variance). The standard deviation of the principal components is available in the sdev component of the PCA model object.

- Assign to the variable pve the proportion of the variance explained, calculated by dividing pr.var by the total variance explained by all principal components.

```{r}
# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
pve
```
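
As a sanity check on this calculation, the sketch below recomputes pve by hand and compares it with the "Proportion of Variance" row that summary() reports. It assumes the built-in USArrests dataset as a stand-in for the Pokemon data so the chunk is self-contained; the name reported is hypothetical.

```{r}
# Assumption: USArrests stands in for the Pokemon data
x <- USArrests
pr <- prcomp(x, center = TRUE, scale. = TRUE)

# Variance of each component and its share of the total
pr.var <- pr$sdev^2
pve <- pr.var / sum(pr.var)

# summary() stores the same proportions, rounded, in its importance matrix
reported <- summary(pr)$importance["Proportion of Variance", ]

# The two agree up to the rounding applied by summary()
pve_match <- isTRUE(all.equal(pve, unname(reported), tolerance = 1e-4))
pve_match

# With standardized variables the total variance equals the number of variables
sum(pr.var)
```

This also explains why the squared standard deviations in the Pokemon summary add up to roughly 4: each of the four standardized variables contributes one unit of variance.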

# 3.3: Visualize variance explained

Now you will create a scree plot showing the proportion of variance explained by each principal component, as well as the cumulative proportion of variance explained.

Recall from the video that these plots can help to determine the number of principal components to retain. One way to determine the number of principal components to retain is by looking for an elbow in the scree plot showing that as the number of principal components increases, the rate at which variance is explained decreases substantially. In the absence of a clear elbow, you can use the scree plot as a guide for setting a threshold.

Instructions

100 XP

- The proportion of variance explained is still available in the pve object you created in the last exercise.

- Use plot() to plot the proportion of variance explained by each principal component.

- Use plot() and cumsum() (cumulative sum) to plot the cumulative proportion of variance explained as a function of the number of principal components.

```{r}
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
```
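
When the elbow is not obvious, the threshold approach mentioned above can be sketched as follows. The pve values are the ones the Pokemon model in this chapter produces; the 0.85 cutoff and the names threshold and n_keep are assumptions chosen purely for illustration.

```{r}
# Proportion of variance explained by the four Pokemon components
pve <- c(0.4706937, 0.2466505, 0.1817326, 0.1009233)

# Assumption: 0.85 is an arbitrary illustrative cutoff, not a recommended value
threshold <- 0.85

# Smallest number of components whose cumulative share reaches the threshold
n_keep <- which(cumsum(pve) >= threshold)[1]
n_keep
```

With these values the first two components explain about 72% of the variance and the first three about 90%, so n_keep is 3.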

# 3.4: Practical issues: scaling

You saw in the video that scaling your data before doing PCA changes the results of the PCA modeling. Here, you will perform PCA with and without scaling, then visualize the results using biplots.

Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.
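
What the scaling option does can be checked directly: asking prcomp() to scale should give the same model as standardizing the columns yourself with scale() first. The sketch below assumes the built-in USArrests dataset as a stand-in for the Pokemon data; pr_arg and pr_manual are hypothetical names.

```{r}
# Assumption: USArrests stands in for the Pokemon data
x <- USArrests

# Let prcomp() center and scale the variables
pr_arg <- prcomp(x, center = TRUE, scale. = TRUE)

# Standardize manually, then run PCA with no further centering or scaling
pr_manual <- prcomp(scale(x), center = FALSE, scale. = FALSE)

# Both routes give the same component standard deviations and loadings
same_sdev <- isTRUE(all.equal(pr_arg$sdev, pr_manual$sdev))
same_loadings <- isTRUE(all.equal(abs(pr_arg$rotation), abs(pr_manual$rotation)))

c(same_sdev = same_sdev, same_loadings = same_loadings)
```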

Instructions

100 XP

- The same Pokemon dataset is available in your workspace as pokemon, but one new variable has been added: Total.

- There is some code at the top of the editor to calculate the mean and standard deviation of each variable in the model. Run this code to see how the scale of the variables differs in the original data.

- Create a PCA model of pokemon with scaling, assigning the result to pr.with.scaling.

- Create a PCA model of pokemon without scaling, assigning the result to pr.without.scaling.

- Use biplot() to plot both models (one at a time) and compare their outputs.

```{r}
# Load the Pokemon data that includes the Total column
pokemon_new <- read.csv("new_pokemon.csv")

# Mean of each variable
colMeans(pokemon_new[, 2:6])

# Standard deviation of each variable
apply(pokemon_new[, 2:6], 2, sd)

# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(x = pokemon_new[, 2:6], scale. = TRUE, center = TRUE)

# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(x = pokemon_new[, 2:6], scale. = FALSE, center = TRUE)

# Create biplots of both for comparison
biplot(pr.with.scaling)
biplot(pr.without.scaling)
```

Remark: The new Total column contains much more variation, on average, than the other four columns, so it has a disproportionate effect on the PCA model when scaling is not performed. After scaling the data, there's a much more even distribution of the loading vectors.
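
The same effect can be reproduced numerically rather than read off a biplot. The sketch below assumes the built-in USArrests dataset as a stand-in, where the Assault column plays the role of Total because its variance is far larger than that of the other columns. The names pr_raw, pr_std and dominant are hypothetical.

```{r}
# Assumption: USArrests stands in for the Pokemon data; Assault plays the role of Total
x <- USArrests

# Variances differ by orders of magnitude across columns
apply(x, 2, var)

# PCA without and with scaling
pr_raw <- prcomp(x, center = TRUE, scale. = FALSE)
pr_std <- prcomp(x, center = TRUE, scale. = TRUE)

# Absolute PC1 loadings: unscaled PCA puts almost all weight on one variable
round(abs(pr_raw$rotation[, "PC1"]), 3)
round(abs(pr_std$rotation[, "PC1"]), 3)

# Variable with the largest absolute PC1 loading in the unscaled model
dominant <- names(which.max(abs(pr_raw$rotation[, "PC1"])))
dominant
```

Without scaling, the first component is essentially the high-variance column on its own; after scaling, the loadings are spread much more evenly across the variables.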