Here I try to replicate in R the PCA on the Iris data set as shown in the Columbus Machine Learners presentation by Charles Bergeron on June 6, 2018.
library(tidyverse)
## -- Attaching packages ---------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
First let’s take a look at this using the function prcomp. The scale function is used to center the values, but with the parameter scale = FALSE because we don’t want to actually change the magnatude of the values.
iris_pca <- iris %>%
mutate_at(vars(-Species), scale, center = TRUE, scale = FALSE) %>%
prcomp(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, .)
The Proportion of Variance looks to be about what we saw in the presentation.
summary(iris_pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion 0.9246 0.97769 0.9948 1.00000
The biplot function for prcomp isn’t the easiest to read, and no coloration by Species here, but I think some of the same clustering is apparent. The dual is also on this plot and looks like it may be similar, but it is hard to read. (The y-axis PC2 is flipped, but the orientation is arbitrary so it doesn’t really matter.)
biplot(iris_pca)
Now let’s try it via Singular Value Decomposition. I use the svd() function to compute the decomposition and then pull out the \(U\) and \(V\) matrices as data frames. Note that although the SVD formulation for \(X\) is \(USV^T\), we want the columns of \(V\), not \(V^T\). This time we temporarily drop the Species variable because svd doesn’t accept formula notation.
iris_svd <- iris %>%
select(-Species) %>%
mutate_all(scale, center = TRUE, scale = FALSE) %>%
svd
u <- as.data.frame(iris_svd$u)
names(u) <- c('PC1', 'PC2', 'PC3', 'PC4')
u <- cbind(u, iris['Species'])
v <- as.data.frame(iris_svd$v)
names(v) <- c('PC1', 'PC2', 'PC3', 'PC4')
v$feature <- names(iris)[-5]
This plot looks much more like what we saw in the presentation. In this case I’ve flipped the y-axis to make it even more similar.
ggplot(u, aes(x = PC1, y = -PC2, color = Species)) +
geom_point() +
xlim(-0.3, 0.3) +
ylim(-0.3, 0.3)
The dual looks similar to what we saw also, minus the dashed-line circle.
ggplot(v, aes(x = PC1, y = -PC2)) +
geom_point() +
geom_text(aes(label = feature), vjust = 1.2) +
xlim(-1, 1) +
ylim(-1, 1)