1 Take a quick look at the dataset

2 Compute the principal components

3 Plot Principal Components

3.1 Add labels to the points
3.2 Add country of origin groupings

4 How can we determine the number of principal components?

5 Code Appendix

1 Take a quick look at the dataset

library(tidyverse)

## -- Attaching packages ----------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts -------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggbiplot)

## Loading required package: plyr

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

## Loading required package: scales

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

## Loading required package: grid

glimpse(mtcars)

## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...

2 Compute the principal components

Here, we set center and scale to ‘TRUE’ so R will center and scale our data (i.e., standardize our data).

You can see there are 9 principal components, with each explaining a proportion of the variability in the data. For example, PC1 explains 62.8% of the total variance: that is, almost two-thirds of the variance in the entire dataset can be explained by the first principal component.

cars.pca <- prcomp(mtcars[,c(1:7,10:11)], 
                     center=TRUE,
                     scale=TRUE)
summary(cars.pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion  0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
##                           PC8     PC9
## Standard deviation     0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion  0.9975 1.00000

3 Plot Principal Components

The biplot is used to visualize principal components. This plots the first and second principal components. The closer the variable is the the center, the less contribution that variable has to either principal component.

You can see that hp, cyl, and disp contribute to the first principal component because they are farthest to the right. But, this plot isn’t informative because we just have the relationship of each variable to the axes.

ggbiplot(cars.pca)

3.1 Add labels to the points

Here, we added labels to the points to see the car to which each point corresponds.

ggbiplot(cars.pca, labels=rownames(mtcars))

3.2 Add country of origin groupings

We can also add a group for origin and you will now see that the American-made cars cluster to the right and are generally characterized by high values of cyl (# cylinders), disp (displacement), and wt (weight).

Japanese cars tend to be characterized by high mpg (miles per gallon).

origin <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))

ggbiplot(cars.pca,ellipse=TRUE,  labels=rownames(mtcars), groups=origin)

4 How can we determine the number of principal components?

We aim to find the components with the maximum variance so we can retain as much information about the original dataset as possible.

To determine the number of principal components to retain in our analysis we need to compute the proportion of variance explained.

We can plot the cumulative proportion of variance explained in a scree plot to determine the number of principal components to retain.

# Pull SD
sd.cars <- cars.pca$sdev

# Compute variance
var.cars <- sd.cars^2

# Proportion of variance explained
prop.cars <- var.cars/sum(var.cars)

# Plot the cumulative proportion of variance explained
plot(cumsum(prop.cars), xlab='Principal components', ylab='Proportion of variance explained', type='b')

$~$

5 Code Appendix

$~$

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(ggbiplot)
glimpse(mtcars)
cars.pca <- prcomp(mtcars[,c(1:7,10:11)], 
                     center=TRUE,
                     scale=TRUE)
summary(cars.pca)
ggbiplot(cars.pca)
ggbiplot(cars.pca, labels=rownames(mtcars))
origin <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))

ggbiplot(cars.pca,ellipse=TRUE,  labels=rownames(mtcars), groups=origin)

# Pull SD
sd.cars <- cars.pca$sdev

# Compute variance
var.cars <- sd.cars^2

# Proportion of variance explained
prop.cars <- var.cars/sum(var.cars)

# Plot the cumulative proportion of variance explained
plot(cumsum(prop.cars), xlab='Principal components', ylab='Proportion of variance explained', type='b')

Principal Component Analysis

J Kyle Armstrong, PhD

08 April 2020