Iris PCA demo

Here I try to replicate in R the PCA on the Iris data set as shown in the Columbus Machine Learners presentation by Charles Bergeron on June 6, 2018.

library(tidyverse)

## -- Attaching packages ---------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)

First let’s take a look at this using the function prcomp. The scale function is used to center the values, but with the parameter scale = FALSE because we don’t want to actually change the magnatude of the values.

iris_pca <- iris %>%
  mutate_at(vars(-Species), scale, center = TRUE, scale = FALSE) %>%
  prcomp(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, .)

The Proportion of Variance looks to be about what we saw in the presentation.

summary(iris_pca)

## Importance of components:
##                           PC1     PC2    PC3     PC4
## Standard deviation     2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion  0.9246 0.97769 0.9948 1.00000

The biplot function for prcomp isn’t the easiest to read, and no coloration by Species here, but I think some of the same clustering is apparent. The dual is also on this plot and looks like it may be similar, but it is hard to read. (The y-axis PC2 is flipped, but the orientation is arbitrary so it doesn’t really matter.)

biplot(iris_pca)

Now let’s try it via Singular Value Decomposition. I use the svd() function to compute the decomposition and then pull out the \(U\) and \(V\) matrices as data frames. Note that although the SVD formulation for \(X\) is \(USV^T\), we want the columns of \(V\), not \(V^T\). This time we temporarily drop the Species variable because svd doesn’t accept formula notation.

iris_svd <- iris %>%
  select(-Species) %>%
  mutate_all(scale, center = TRUE, scale = FALSE) %>%
  svd
  
u <- as.data.frame(iris_svd$u)
names(u) <- c('PC1', 'PC2', 'PC3', 'PC4')
u <- cbind(u, iris['Species'])

v <- as.data.frame(iris_svd$v)
names(v) <- c('PC1', 'PC2', 'PC3', 'PC4')
v$feature <- names(iris)[-5]

This plot looks much more like what we saw in the presentation. In this case I’ve flipped the y-axis to make it even more similar.

ggplot(u, aes(x = PC1, y = -PC2, color = Species)) +
  geom_point() +
  xlim(-0.3, 0.3) +
  ylim(-0.3, 0.3)

The dual looks similar to what we saw also, minus the dashed-line circle.

ggplot(v, aes(x = PC1, y = -PC2)) +
  geom_point() +
  geom_text(aes(label = feature), vjust = 1.2) +
  xlim(-1, 1) +
  ylim(-1, 1)

Iris PCA demo

Brian Stamper

June 11, 2018