A primer on principal component analysis

Principal component analysis (PCA) is often touted as a means of learning the intrinsic structure of a dataset. It takes the columns/variables of a dataset and produces linear combinations of those columns to create new, mutually orthogonal dimensions called principal components. Each principal component is a mixture of the original variables of the dataset and, as such, can often be given a physical interpretation. Here, we walk through some basic elements of PCA.

In many statistics courses, you will be taught that PCA is done by taking an eigendecomposition of the correlation matrix of a dataset. For reasons of computational stability, PCA tends to be performed a little differently in practice; many implementations use the singular value decomposition (SVD) of the data matrix instead. Let’s try it on the famous iris dataset in R.

# Using a standard R function
pca <- prcomp(iris[, 1:4], scale. = TRUE)
# Using the SVD (the same thing)
scaled_iris <- apply(iris[,1:4], 2, function(x){(x - mean(x))/sd(x)})
pc2 <- svd(scaled_iris)

# Note that our loadings matrices are the same
pca$rotation
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
pc2$v
##            [,1]        [,2]       [,3]       [,4]
## [1,]  0.5210659 -0.37741762  0.7195664  0.2612863
## [2,] -0.2693474 -0.92329566 -0.2443818 -0.1235096
## [3,]  0.5804131 -0.02449161 -0.1421264 -0.8014492
## [4,]  0.5648565 -0.06694199 -0.6342727  0.5235971
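
For completeness, the textbook route via an eigendecomposition of the correlation matrix gives the same loadings. Here is a quick check (keeping in mind that eigenvectors are only defined up to sign, so some columns may come out with flipped signs relative to prcomp):

# Eigendecomposition of the correlation matrix
eig <- eigen(cor(iris[, 1:4]))
# The eigenvectors are the loadings (columns may differ by a sign flip)
eig$vectors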

The code makes reference to the ‘loadings’ matrix. This is just the matrix containing the linear combinations we apply to the original data variables to produce the principal components. For example, in the matrix below, reading down the columns, we see that the first principal component is made up of \(0.52 \times Sepal~Length - 0.27 \times Sepal~Width + 0.58 \times Petal~Length + 0.56 \times Petal~Width\). The other principal components are interpreted similarly. We obtain the principal components by multiplying the (scaled) data matrix by this loadings matrix. In linear algebra terms, this multiplication is a rotation, which is why PCA is often described as a rotation of the data.

pca$rotation
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

The actual principal components look like this:

head(pca$x)
##            PC1        PC2         PC3          PC4
## [1,] -2.257141 -0.4784238  0.12727962  0.024087508
## [2,] -2.074013  0.6718827  0.23382552  0.102662845
## [3,] -2.356335  0.3407664 -0.04405390  0.028282305
## [4,] -2.291707  0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825  0.006586116
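
As a quick sanity check on the ‘rotation’ idea, we can reproduce these values ourselves by multiplying the scaled data matrix from earlier by the loadings:

# Multiply the scaled data by the loadings matrix
manual_pcs <- scaled_iris %*% pca$rotation
# This should match pca$x (up to floating point error)
head(manual_pcs)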

The fact that we obtained the principal components via eigenanalysis guarantees that they are mutually uncorrelated. This property can be very helpful in applications where the original data are highly correlated but are required as inputs for a model. High correlation among covariates can interfere with model-fitting, so it can be helpful to use the principal components as covariates instead.
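
To see why this matters for the iris data, we can first look at how correlated the original measurements are (Petal.Length and Petal.Width in particular are very strongly correlated):

# Correlations among the original variables
round(cor(iris[, 1:4]), 4)

Compare that with the correlations between the principal components: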

with(pca, round(cor(x), 4))
##     PC1 PC2 PC3 PC4
## PC1   1   0   0   0
## PC2   0   1   0   0
## PC3   0   0   1   0
## PC4   0   0   0   1

It is common practice to plot the principal components against each other when trying to identify natural clusters in the data. Often only the first two are plotted, but for \(n\) principal components a total of \(\sum_{i=1}^{n-1}i = n(n-1)/2\) pairwise plots can be produced (six, in the case of the iris data).

par(mfrow = c(2, 3))
with(
  pca,
  {
    plot(x[, c(1, 2)], asp = 1)
    plot(x[, c(1, 3)], asp = 1)
    plot(x[, c(1, 4)], asp = 1)
    plot(x[, c(2, 3)], asp = 1)
    plot(x[, c(2, 4)], asp = 1)
    plot(x[, c(3, 4)], asp = 1)
  }
)

Not all principal components are created equal. Each principal component is oriented such that it explains some amount of the variation in the original dataset, and the proportion it explains is its associated eigenvalue divided by the sum of all the principal components’ eigenvalues. (Remember, we obtain the principal components via eigenanalysis, so each principal component is an eigenvector and each eigenvector has an eigenvalue.) The smaller this proportion, the less important the principal component is for explaining total variation in the dataset. (This does not mean it is useless, though; especially when used for clustering, the smaller principal components can have a lot of value.) In R, we get the amount of variation explained by each principal component as follows:

# pca$sdev holds the standard deviations of the components,
# i.e. the square roots of the eigenvalues, so we square them
eigenvalues <- pca$sdev^2
# Compute percentage variation explained
data.frame(
  `Principal component` = 1:4,
  `Variation explained` = 100 * eigenvalues / sum(eigenvalues)
)
##   Principal.component Variation.explained
## 1                   1          72.9624454
## 2                   2          22.8507618
## 3                   3           3.6689219
## 4                   4           0.5178709
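
If you want to convince yourself of the relationship between the SVD and the eigenvalues, the singular values from our earlier svd() call give the same numbers once rescaled (under the hood, prcomp divides the singular values by \(\sqrt{n - 1}\) to get sdev):

# Eigenvalues recovered from the singular values
pc2$d^2 / (nrow(scaled_iris) - 1)
# The same values come from the eigendecomposition of the correlation matrix
eigen(cor(iris[, 1:4]))$values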

That’s our primer on PCA done. It covers a lot of ground quite quickly. Now for the fun part.

As applied to images

Images are a common application area for PCA. There are a few ways to apply PCA to images. The one we deal with here is for multiband (e.g. colour) images. A colour image has red, green and blue bands. The combination of these colour bands produces the single, cohesive image that we see. However, there is another way to think of this image data. We can think of the image as an \(m \times k\) matrix where each of the \(m\) rows is a pixel which has a value in each of \(k\) dimensions or variables. This means we can apply PCA, yielding a maximum of \(k\) principal components (there is a small sketch of this reshaping after the list below). It can be useful to do this because

  1. We can sometimes learn things about the intrinsic structure of the image, and important features of it (e.g. in facial recognition); and
  2. Although the RGB colour system encodes a lot of redundant information, the principal components computed from it are mutually uncorrelated, so each one encodes non-redundant information.
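
Before we work with a real image, here is a tiny toy sketch of the reshaping idea described above, using a made-up \(4 \times 5\) image with three bands (the numbers are random and purely illustrative):

# A made-up 4 x 5 image with 3 colour bands
toy_img <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))
# Reshape so that each row is a pixel and each column is a band
toy_mat <- matrix(toy_img, ncol = 3)
dim(toy_mat)  # 20 pixels (rows) by 3 bands (columns)
# This pixel-by-band matrix is what we would hand to prcomp()

The raster package does this bookkeeping for us when we import the real image below.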

An example with a cat

It’s hard to appreciate the above abstractly. Here, we will actually perform PCA on an image. The image is a picture I took of a cat that I used to live with. She is a very beautiful blue Burmese called Pepper.

Let’s import the image into R.

# Load the raster package
library(raster)
## Loading required package: sp
# Import the picture as a raster
pepper <- stack("Pepper.PNG")[[1:3]]

# Plot it
plotRGB(pepper)

We can plot each of the RGB bands separately, if it interests us. Here is the blue band (plotted in blue).

plot(
  pepper[[3]], 
  col = rev(blues9), 
  asp = 1, 
  axes = FALSE,
  legend = FALSE
)

To perform PCA, we can’t leave the data in image format. We want to create a data.frame with each pixel as a single row and each colour band as a column.

# Extract all pixels as data frame
pepper.df <- as.data.frame(pepper)
# Have a quick look
head(pepper.df)
##   Pepper.1 Pepper.2 Pepper.3
## 1      210      211      213
## 2      193      194      196
## 3      189      190      192
## 4      200      201      203
## 5      214      215      217
## 6      219      220      222

Now we can perform PCA.

# Compute the principal components
# Remember to scale the variables
pca <- prcomp(pepper.df, scale. = TRUE)
# Save the principal components
pepper.pc <- pca$x
pepper$pc1 <- pepper.pc[,1]
pepper$pc2 <- pepper.pc[,2]
pepper$pc3 <- pepper.pc[,3]

The interesting thing about these principal components is that each pixel in the image has a value for each of the principal components, so we can plot each principal component as an image. This is very cool. See below!

plot(pepper$pc1, col = cm.colors(15), axes = FALSE)

plot(pepper$pc2, col = cm.colors(15), axes = FALSE)

plot(pepper$pc3, col = cm.colors(15), axes = FALSE)

The first principal component image is basically the original image. But the other two are very interesting. The second principal component picks out my hand – a distinct part of the image that isn’t white-grey-black. The third principal component picks out Pepper’s collar, which is another distinct part of the image.

Before we proceed further to look at an application of this in image segmentation, we should have a look at the principal components and what each one represents in terms of the original data. Here is the loadings matrix.

pca$rotation
##                 PC1        PC2        PC3
## Pepper.1 -0.5719742 -0.7730972 -0.2741646
## Pepper.2 -0.5836163  0.1486844  0.7983013
## Pepper.3 -0.5764005  0.6166147 -0.5362358

I can offer a brief interpretation of these, especially in light of what we have just seen (Pepper.1, Pepper.2 and Pepper.3 are the red, green and blue bands respectively).

  1. The first principal component weights all three bands almost equally (all with the same sign), so, up to a sign flip, it is essentially overall brightness. That is why its image looks like a greyscale version of the original.
  2. The second principal component contrasts the blue band against the red band. Skin has relatively more red than blue, which is presumably why this component picks out my hand.
  3. The third principal component contrasts the green band against the blue (and, to a lesser extent, the red) band, and it is this contrast that makes Pepper’s collar stand out.

That’s very cool! We have indeed learned the internal structure of the image.

One final note is how we might use this principal component stuff to do image analysis. One thing we can do with this image is to segment it. For example, let’s assume we want to pick out which parts of this image contain skin and which parts don’t. We can do this quite efficiently by using the second principal component only and training some sort of classifier or clustering algorithm on it. With k-means clustering, it might look something like this:

# Run k-means on the second principal component
# Two clusters only, because we only need skin vs non-skin.
pepper$class.pc2 <- kmeans(pepper.pc[,2], 2)$cluster
# Plot the result
plot(
  pepper$class.pc2, 
  col = cm.colors(2), 
  axes = FALSE,
  legend = FALSE
)

Very nice! The only part of the image that gets ‘misclassified’ is the little bit of skin showing through thin fur in Pepper’s ear.

Summary

PCA is a useful technique to learn the intrinsic structure of data. It can be extended to multiband images, such as colour images. In images, PCA is particularly interesting because we can clearly visualise the results in image format.