library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggbiplot)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
## Loading required package: grid
glimpse(mtcars)
## Observations: 32
## Variables: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...
Here, we set center and scale to ‘TRUE’ so R will center and scale our data (i.e., standardize our data).
You can see there are 9 principal components, with each explaining a proportion of the variability in the data. For example, PC1 explains 62.8% of the total variance: that is, almost two-thirds of the variance in the entire dataset can be explained by the first principal component.
cars.pca <- prcomp(mtcars[,c(1:7,10:11)],
center=TRUE,
scale=TRUE)
summary(cars.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
## PC8 PC9
## Standard deviation 0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion 0.9975 1.00000
The biplot is used to visualize principal components. This plots the first and second principal components. The closer the variable is the the center, the less contribution that variable has to either principal component.
You can see that hp, cyl, and disp contribute to the first principal component because they are farthest to the right. But, this plot isn’t informative because we just have the relationship of each variable to the axes.
ggbiplot(cars.pca)
Here, we added labels to the points to see the car to which each point corresponds.
ggbiplot(cars.pca, labels=rownames(mtcars))
We can also add a group for origin and you will now see that the American-made cars cluster to the right and are generally characterized by high values of cyl (# cylinders), disp (displacement), and wt (weight).
Japanese cars tend to be characterized by high mpg (miles per gallon).
origin <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))
ggbiplot(cars.pca,ellipse=TRUE, labels=rownames(mtcars), groups=origin)
We aim to find the components with the maximum variance so we can retain as much information about the original dataset as possible.
To determine the number of principal components to retain in our analysis we need to compute the proportion of variance explained.
We can plot the cumulative proportion of variance explained in a scree plot to determine the number of principal components to retain.
# Pull SD
sd.cars <- cars.pca$sdev
# Compute variance
var.cars <- sd.cars^2
# Proportion of variance explained
prop.cars <- var.cars/sum(var.cars)
# Plot the cumulative proportion of variance explained
plot(cumsum(prop.cars), xlab='Principal components', ylab='Proportion of variance explained', type='b')
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggbiplot)
glimpse(mtcars)
cars.pca <- prcomp(mtcars[,c(1:7,10:11)],
center=TRUE,
scale=TRUE)
summary(cars.pca)
ggbiplot(cars.pca)
ggbiplot(cars.pca, labels=rownames(mtcars))
origin <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))
ggbiplot(cars.pca,ellipse=TRUE, labels=rownames(mtcars), groups=origin)
# Pull SD
sd.cars <- cars.pca$sdev
# Compute variance
var.cars <- sd.cars^2
# Proportion of variance explained
prop.cars <- var.cars/sum(var.cars)
# Plot the cumulative proportion of variance explained
plot(cumsum(prop.cars), xlab='Principal components', ylab='Proportion of variance explained', type='b')