In real-world data analysis we are often faced with high dimensionality in terms of measurement variables. For example, for every transaction we might record the time of the transaction, the amount of the transaction, the location of the transaction, and so on. In practice many of these measurements are either redundant (by way of correlations, for example) or simply not relevant. Therefore, while the data were obtained in a high-dimensional vector space, the relevant bits of the data live in a low-dimensional subspace or manifold. Many techniques have been developed for finding those relevant bits of high-dimensional data. Here we discuss one of the most popular tools: Principal Component Analysis (PCA).
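Before diving into an example, here is a minimal sketch of what PCA actually computes, using a small toy matrix (the name M and the seed are purely illustrative): the principal directions returned by prcomp are, up to sign, the eigenvectors of the sample covariance matrix, ordered by how much variance they capture.
set.seed(1)                       ## toy example only
M = matrix(rnorm(200), ncol = 2)  ## a small 100 x 2 toy matrix
eigen(cov(M))$vectors             ## eigenvectors of the covariance matrix
prcomp(M)$rotation                ## the same directions, possibly with flipped signs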
We illustrate the idea by generating synthetic data that has an interesting cluster structure. Here we generate three clusters in a two-dimensional space.
N = 1e3 ## sample size
## setup cluster 1
X1 = rnorm(N, mean = 0, sd = 1)
X2 = 2*X1 + rnorm(N, mean = 0, sd = .2)
clust.1 = as.data.frame(cbind(X1, X2))
clust.1$cluster = "cluster 1"
## setup cluster 2
X1 = rnorm(N, mean = 10, sd = 1)
X2 = -3*X1 + 30 + rnorm(N, mean = 0, sd = .5)
clust.2 = as.data.frame(cbind(X1,X2))
clust.2$cluster = "cluster 2"
## setup cluster 3
X1 = rnorm(N, mean = -1, sd = 1)
X2 = -2*X1 + rnorm(N, mean = 0, sd = .9)
clust.3 = as.data.frame(cbind(X1, X2))
clust.3$cluster = "cluster 3"
## setup data
X = rbind(clust.1, clust.2, clust.3)
The two-dimensional data that we have created is shown below in a simple scatter plot:
library(ggplot2)
ggplot(X, aes(x = X1, y = X2, colour = cluster)) + geom_point()
Note that this simple scatter plot is very effective at revealing the structure of the clusters in our two-dimensional toy data set! Unfortunately, real-life data never looks this clean, and we typically have far more measurements than just two. Let's create some irrelevant measurements. In this case, we will just use normal random variables for our additional measurements.
noise.measurements = data.frame( X3 = rnorm(nrow(X), mean = 0, sd = .1),
X4 = rnorm(nrow(X), mean = 0, sd = .1),
X5 = rnorm(nrow(X), mean = 2, sd = .1),
X6 = rnorm(nrow(X), mean = 0, sd = .5),
X7 = rnorm(nrow(X), mean = -20, sd = .1))
X.high.dim = cbind(X, noise.measurements)
head(X.high.dim)
## X1 X2 cluster X3 X4 X5
## 1 -1.7957877 -3.3372732 cluster 1 0.080407328 -0.10029174 1.991634
## 2 1.2050282 2.4461670 cluster 1 0.023725489 -0.08277757 2.066074
## 3 -0.6076655 -1.1671989 cluster 1 0.021810609 0.03458204 1.985655
## 4 0.6193193 0.7765847 cluster 1 -0.003444616 0.00574662 2.099485
## 5 0.8411653 1.7031921 cluster 1 0.109122522 0.10799262 2.039890
## 6 0.8018945 1.9704643 cluster 1 -0.041402834 -0.07297691 2.042977
## X6 X7
## 1 -0.6614068 -20.14552
## 2 0.5346990 -19.99500
## 3 0.4933627 -20.04390
## 4 -0.4545440 -20.18974
## 5 0.2361324 -20.08009
## 6 0.8399814 -19.94611
I randomly selected the means for these noise measurements and purposefully selected the variances to be “small.” Our data set now has 7 measurements, but only two of them are actually useful. Of course, since we constructed this data set, we knew beforehand that this is indeed the case and exactly which two variables are relevant. In practice this is much harder to know, but we can illustrate how, in this case, PCA reveals the relevant subspace to us.
pca.result <- prcomp(as.matrix(X.high.dim[,-3]), scale. = FALSE)  ## column 3 is the cluster label, so drop it
pca.df = as.data.frame(pca.result$x)
pca.df$cluster = X.high.dim$cluster
ggplot(pca.df, aes(x = PC1, y = PC2, colour = cluster)) + geom_point()
The above figure is a scatter plot of the top two PCs of our 7-dimensional data set, and we can see that in this case the original clusters were recovered perfectly, up to a rotation (PCA re-expresses the data in a rotated coordinate system, so the orientation of the original axes is not preserved). If we look at the loadings of the principal directions (the eigenvectors), we see that indeed the largest loadings are associated with X1 and X2 (the two components of our original data that have actual structure):
plot(100*cumsum(pca.result$sdev^2/sum(pca.result$sdev^2)), pch = 19,  ## variance explained uses the squared sdev
     ylab = "Percent Variance Explained", xlab = "Principal Component")
pca.result$rotation[,c("PC1", "PC2")]
## PC1 PC2
## X1 9.820629e-01 0.1884956009
## X2 -1.884916e-01 0.9820728559
## X3 -3.191266e-05 0.0011025382
## X4 4.650467e-04 0.0001105043
## X5 8.209444e-05 0.0004049522
## X6 -4.805237e-03 0.0004800889
## X7 -3.069073e-04 -0.0008317638
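As a quick sanity check, the PC scores plotted above are nothing more than the centered data projected onto these principal directions (this is just one way to verify the relationship; prcomp does the centering for us by default):
X.centered = scale(as.matrix(X.high.dim[, -3]), center = TRUE, scale = FALSE)
all.equal(unname(X.centered %*% pca.result$rotation), unname(pca.result$x))  ## should be TRUE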
This was a very easy data set, designed specifically to work well with PCA. If one of our noise measurements is too noisy, i.e., its variance is large, then PCA will have trouble finding the clusters. Here we illustrate this issue by increasing the variance of one of our noise measurements.
noise.measurements = data.frame( X3 = rnorm(nrow(X), mean = 0, sd = .1),
X4 = rnorm(nrow(X), mean = 0, sd = .1),
X5 = rnorm(nrow(X), mean = 2, sd = .1),
X6 = rnorm(nrow(X), mean = 0, sd = 10),
X7 = rnorm(nrow(X), mean = -20, sd = .1))
X.high.dim = cbind(X, noise.measurements)
head(X.high.dim)
## X1 X2 cluster X3 X4 X5
## 1 -1.7957877 -3.3372732 cluster 1 -0.054106768 -0.04172511 2.081470
## 2 1.2050282 2.4461670 cluster 1 0.003378462 0.15682638 2.115784
## 3 -0.6076655 -1.1671989 cluster 1 0.113895020 0.03279528 1.983742
## 4 0.6193193 0.7765847 cluster 1 -0.153768013 0.04512010 2.194018
## 5 0.8411653 1.7031921 cluster 1 -0.100932143 -0.05628544 2.197096
## 6 0.8018945 1.9704643 cluster 1 -0.063193527 -0.03270714 1.975130
## X6 X7
## 1 14.080637 -19.88594
## 2 -18.120711 -19.95886
## 3 -13.415219 -19.91427
## 4 15.716931 -20.03932
## 5 -5.813199 -19.95159
## 6 -10.403403 -20.13627
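The exact numbers vary from run to run since the data are random, but a quick look at the per-column variances already shows the problem:
sapply(X.high.dim[, -3], var)  ## X6's variance dwarfs that of every other column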
Again, the only relevant parts of this data are X1 and X2, but in real life we would not know this. Note that X6 now has a very large variance, and since PCA finds the directions of largest variance, we are bound to run into trouble. If we run the same PCA analysis again and look at the top two PCs, we have the following:
pca.result <- prcomp(as.matrix(X.high.dim[,-3]), scale. = FALSE)
pca.df = as.data.frame(pca.result$x)
pca.df$cluster = X.high.dim$cluster
ggplot(pca.df, aes(x = PC1, y = PC2, colour = cluster)) + geom_point()
As can be seen, in this analysis it appears that we have only two clusters instead of the original three. Taking a look at the loadings as before, we have:
pca.result$rotation[,c("PC1", "PC2")]
## PC1 PC2
## X1 5.312643e-05 -9.820743e-01
## X2 -4.900824e-04 1.884927e-01
## X3 9.714257e-05 -6.093872e-05
## X4 -1.379200e-04 -2.257973e-04
## X5 -3.430470e-04 -5.669019e-04
## X6 -9.999998e-01 -1.443111e-04
## X7 6.174213e-05 3.295380e-04
As you can see, X6 takes almost all of the loading on the first principal component and hence obscures the cluster structure we saw before. Most often we look at the top PCs, but in this case the top PC is only capturing noise and is not interesting.
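One quick way to see how much of the total variance the noisy first PC soaks up is the importance table from summary(); the exact percentages depend on the random draw, but PC1 dominates by a wide margin:
summary(pca.result)$importance[, 1:3]  ## proportion of variance captured by the top PCs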
I often advocate also looking at the other PCs besides the top two, since those dimensions may reveal hidden structure. Because we have already computed all of the PCs, we might as well look at the other ones to get an idea of the more nuanced details of the data.
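A quick way to scan several PCs at once is a pairs plot; this is just one convenient option (here using base R's pairs(), though a ggplot-based version would work just as well):
pairs(pca.df[, 1:4], col = as.integer(factor(pca.df$cluster)), pch = 19, cex = 0.4)
Zooming in on the second and third PCs specifically: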
ggplot(pca.df, aes(x = PC2, y = PC3, colour = cluster)) + geom_point()
Thus, looking at the higher-order PCs, we see that even in this case the original clusters can be found. Looking at the magnitudes of the loadings of PC2 and PC3, we can see that the relevant subspace of the data is selected.
pca.result$rotation[,c("PC2", "PC3")]
## PC2 PC3
## X1 -9.820743e-01 -0.1884920531
## X2 1.884927e-01 -0.9820738737
## X3 -6.093872e-05 -0.0001266521
## X4 -2.257973e-04 0.0003059037
## X5 -5.669019e-04 -0.0007572577
## X6 -1.443111e-04 0.0004715420
## X7 3.295380e-04 0.0008644031
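If we want to pick out programmatically which original measurement carries the largest absolute loading on each PC, a small one-liner along these lines does the trick (just one of many ways to summarize the rotation matrix):
rownames(pca.result$rotation)[apply(abs(pca.result$rotation), 2, which.max)]  ## e.g. X6 for PC1, X1 for PC2, X2 for PC3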