The key aspect of matrix data is that every element of the matrix is the same type and represents the same kind of measurement. This is in contrast to a data frame, where every column of a data frame can potentially be of a different class.
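For example (a quick illustration, not from the original text), this difference is easy to see with coercion:

## A matrix forces every element to a single type
m <- matrix(1:6, nrow = 2)
class(m[, 1])   ## "integer" -- every column shares one type

## A data frame allows a different class per column
df <- data.frame(x = 1:2, y = c("a", "b"))
sapply(df, class)   ## "integer" "character" (assumes R >= 4.0; earlier versions default strings to factors)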
Matrix data have some special statistical methods that can be applied to them. One category of statistical dimension reduction techniques is commonly called principal components analysis (PCA) or the singular value decomposition (SVD).
These techniques generally are applied in situations where the rows of a matrix represent observations of some sort and the columns of the matrix represent features or variables (but this is by no means a requirement).
In an abstract sense, the SVD or PCA can be thought of as a way to approximate a high-dimensional matrix (i.e. one with a large number of columns) with a few lower-dimensional matrices. So there's a data compression angle to it. We'll take a look at what's going on in this chapter.
First, we can simulate some matrix data. Here, we simulate some random Normal data in a matrix that has 40 rows and 10 columns.
set.seed(12345)
dataMatrix <- matrix(rnorm(400), nrow = 40)
par(mar = rep(0.2, 4))
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
heatmap(dataMatrix)
Not surprisingly, there aren’t really any interesting patterns given that we just simulated random noise. At least it’s good to know that the clustering algorithm won’t pick up something when there’s nothing there!
But now what if there were a pattern in the data? How would we discover it?
Let's first simulate some data that does indeed have a pattern. In the code below, we cycle through all the rows of the matrix and, for a randomly chosen half of the rows (determined by a coin flip), add 3 to the last 5 columns of that row.
set.seed(678910)
for (i in 1:40) {
    ## Flip a coin
    coinFlip <- rbinom(1, size = 1, prob = 0.5)
    ## If the coin is heads, add a common pattern to that row
    if (coinFlip) {
        dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 3), each = 5)
    }
}
## Image of the new data (yellow indicates higher values, red lower values)
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
heatmap(dataMatrix)
library(dplyr)
hh <- dist(dataMatrix) %>% hclust
dataMatrixOrdered <- dataMatrix[hh$order, ]
par(mfrow = c(1, 3))
## Complete data
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
## Show the row means
plot(rowMeans(dataMatrixOrdered), 40:1, xlab = "Row Mean", ylab = "Row", pch = 19)
## Show the column means
plot(colMeans(dataMatrixOrdered), xlab = "Column", ylab = "Column Mean", pch = 19)
If \(X\) is a matrix with each variable in a column and each observation in a row, then the SVD is a matrix decomposition that represents \(X\) as the product of three matrices:

\[
X = U D V^\top
\]

where the columns of \(U\) (the left singular vectors) are orthogonal, the columns of \(V\) (the right singular vectors) are orthogonal, and \(D\) is a diagonal matrix of singular values.
[Figure: dimmension-reduction-orthogonal.png]
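As a quick numerical sanity check (a small addition, not part of the original text), we can verify these properties on a small random matrix:

## Verify the SVD properties on a small example
X <- matrix(rnorm(12), nrow = 4)
s <- svd(X)
## X is recovered from the product U D V'
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))   ## effectively zero
## The columns of U and V are orthonormal
round(crossprod(s$u), 10)   ## identity matrix
round(crossprod(s$v), 10)   ## identity matrix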
Principal components analysis (PCA) is simply an application of the SVD. The principal components are equal to the right singular vectors if you first scale the data by subtracting the column mean and dividing each column by its standard deviation (which can be done with the scale() function).
# Execute SVD - Singular Value Decomposition
svd1 <- svd(scale(dataMatrixOrdered))
The svd() function returns a list containing three components named u, d, and v.
The u and v components correspond to the matrices of left and right singular vectors, respectively, while the d component is a vector of singular values, corresponding to the diagonal of the matrix D described above.
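To see this concretely (a quick check, not in the original text), we can inspect the dimensions of each component for our 40 x 10 matrix:

dim(svd1$u)      ## 40 x 10: one left singular vector per column
length(svd1$d)   ## 10 singular values
dim(svd1$v)      ## 10 x 10: one right singular vector per column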
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1], main = "Original Data")
plot(svd1$u[, 1], 40:1, ylab = "Row", xlab = "First left singular vector", pch = 19)
plot(svd1$v[, 1], xlab = "Column", ylab = "First right singular vector", pch = 19)
We can approximate the original data matrix with the outer product of the first left and right singular vectors:

\[
X \approx u_1 v_1^\top
\]

(Strictly, the approximation is \(d_1 u_1 v_1^\top\), but dropping the scale factor \(d_1\) does not change the pattern.)
## Approximate original data with outer product of first singular vectors
approx <- with(svd1, outer(u[, 1], v[, 1]))
## Plot original data and approximated data
par(mfrow = c(1, 2))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1], main = "Original Matrix")
image(t(approx)[, nrow(approx):1], main = "Approximated Matrix")
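The same idea extends to higher ranks. Here is a minimal sketch (rank_k_approx is a hypothetical helper, not from the original text) of a rank-k approximation that keeps the first k components, this time including the singular values:

## Rank-k approximation from the first k singular vectors/values
rank_k_approx <- function(s, k) {
    s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*%
        t(s$v[, 1:k, drop = FALSE])
}
approx2 <- rank_k_approx(svd1, 2)
image(t(approx2)[, nrow(approx2):1], main = "Rank-2 Approximation")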
The statistical interpretation of singular values is in terms of the variance in the data explained by the various components. The singular values produced by svd() are in order from largest to smallest, and when squared they are proportional to the amount of variance explained by the corresponding singular vector.
To show how this works, here’s a very simple example. First, we’ll simulate a “dataset” that just takes two values, 0 and 1.
## Start with a matrix of the same dimensions, filled with zeros
constantMatrix <- dataMatrixOrdered * 0
## Give every row the same fixed 0/1 pattern
for (i in 1:dim(dataMatrixOrdered)[1]) {
    constantMatrix[i, ] <- rep(c(0, 1), each = 5)
}
svd1 <- svd(constantMatrix)
par(mfrow = c(1, 3))
image(t(constantMatrix)[, nrow(constantMatrix):1], main = "Original Data")
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained", pch = 19)
As we can see from the right-most plot, 100% of the variation in this "dataset" can be explained by the first singular vector; in other words, all of the variation in this dataset occurs along a single dimension. This makes sense because all of the variation in the data occurs as you go from left to right across the columns; within each column, the values are constant.
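We can confirm this numerically (a quick check, not in the original text): the first component alone reconstructs the matrix exactly.

## The rank-1 reconstruction recovers the matrix exactly
recon <- svd1$d[1] * outer(svd1$u[, 1], svd1$v[, 1])
max(abs(constantMatrix - recon))   ## effectively zero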
In the plots below, we show the singular values (left) and the proportion of variance explained (right) for the slightly more complex dataset that we'd been using previously.
par(mfrow = c(1, 2))
svd1 <- svd(scale(dataMatrixOrdered))
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained", pch = 19)
We can see that the first component explains about 40% of all the variation in the data. In other words, even though there are 10 dimensions in the data, 40% of the variation in the data can be explained by a single dimension.
That suggests that the data could be simplified quite a bit, a phenomenon we observed in the last section where it appeared the data could be reasonably approximated by the first left and right singular vectors.
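One way to quantify how much simplification is possible (a small addition to the original) is to look at the cumulative proportion of variance explained:

## Cumulative proportion of variance explained by the first k components
cumsum(svd1$d^2) / sum(svd1$d^2)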
As we mentioned above, the SVD has a close connection to principal components analysis (PCA). PCA can be applied to the data by calling the prcomp() function in R.
Here, we show that the first right singular vector from the SVD is equal to the first principal component vector returned by PCA.
# Scale the data
svd1 <- svd(scale(dataMatrixOrdered))
# Execute PCA on the scaled data
pca1 <- prcomp(dataMatrixOrdered, scale = TRUE)
# Plot the results
plot(pca1$rotation[, 1], svd1$v[, 1], pch = 19, xlab = "Principal Component 1", ylab = "Right Singular Vector 1")
# Add a 45-degree reference line (intercept 0, slope 1)
abline(c(0, 1))
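The correspondence goes beyond the first vector. As a further check (not in the original text), the standard deviations reported by prcomp() equal the singular values divided by \(\sqrt{n - 1}\), where n is the number of rows:

## prcomp()'s sdev equals the singular values divided by sqrt(n - 1)
n <- nrow(dataMatrixOrdered)
max(abs(pca1$sdev - svd1$d / sqrt(n - 1)))   ## effectively zero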
Whether you call this procedure SVD or PCA really just depends on who you talk to.
Statisticians and people with that kind of background will typically call it PCA while engineers and mathematicians will tend to call it SVD.
Let's now make the example a bit more complicated by adding a second pattern to the data: a block pattern, as before, plus an alternating pattern.

set.seed(678910)
for (i in 1:40) {
    coinFlip1 <- rbinom(1, size = 1, prob = 0.5)
    coinFlip2 <- rbinom(1, size = 1, prob = 0.5)
    ## If the first coin is heads, add the block pattern to that row
    if (coinFlip1) {
        dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 5), each = 5)
    }
    ## If the second coin is heads, add the alternating pattern
    if (coinFlip2) {
        dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 5), 5)
    }
}
hh <- hclust(dist(dataMatrix))
dataMatrixOrdered <- dataMatrix[hh$order, ]
Here is a plot of this new dataset along with the two different patterns.
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1], main = "Data")
plot(rep(c(0, 1), each = 5), pch = 19, xlab = "Column", ylab = "Pattern 1", main = "Block pattern")
plot(rep(c(0, 1), 5), pch = 19, xlab = "Column", ylab = "Pattern 2", main = "Alternating pattern")
Now, of course, the plot above shows the truth, which in general we will not know; we need to learn it from the data. Note that the "truth" here is that the first pattern is a shift pattern, where the first 5 columns have a low mean and the last 5 columns have a high mean, whereas the second pattern is an alternating one, with columns alternating between low and high means.
We can apply the SVD/PCA to this matrix and see how well the patterns are picked up.
svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(svd2$v[, 1], pch = 19, xlab = "Column", ylab = "First right singular vector")
plot(svd2$v[, 2], pch = 19, xlab = "Column", ylab = "Second right singular vector")
We can see that the first right singular vector seems to pick up both the alternating pattern as well as the block/step pattern in the data. The second right singular vector seems to pick up a similar pattern.
When we look at the variance explained, we can see that the first singular vector picks up a little more than 50% of the variation in the data.
svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1, 2))
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained", pch = 19)
Missing values are a problem that plagues any data analysis and the analysis of matrix data is no exception. Most SVD and PCA routines simply cannot be applied if there are missing values in the dataset. In the event of missing data, there are typically a series of questions that should be asked:
Determine the reason for the missing data: what is the process that led to the data being missing?
Is the proportion of missing values so high as to invalidate any sort of analysis?
Is there information in the dataset that would allow you to predict/infer the values of the missing data?
In the example below, we take our dataset and randomly insert some missing data.
dataMatrix2 <- dataMatrixOrdered
## Randomly insert some missing data
dataMatrix2[sample(seq_along(dataMatrix2), size = 40, replace = FALSE)] <- NA
## svd() cannot handle missing values, so this now fails:
## svd1 <- svd(scale(dataMatrix2))
## Error in svd(scale(dataMatrix2)) : infinite or missing values in 'x'
Note that trying install.packages("impute") fails because the impute package is not on CRAN; it is distributed through Bioconductor.
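One common fix is to impute the missing values before running the SVD. Below is a sketch using the impute.knn() function from the impute package, which fills in each missing value from the k nearest neighboring rows; since the package lives on Bioconductor, it is installed with BiocManager rather than install.packages().

## impute is distributed through Bioconductor, not CRAN
## install.packages("BiocManager")
## BiocManager::install("impute")
library(impute)

## Replace each NA using its k nearest neighboring rows (default k = 10)
completed <- impute.knn(dataMatrix2)$data

## The SVD can now be computed, and the first right singular vector
## from the imputed data can be compared with the complete-data version
svd3 <- svd(scale(completed))
par(mfrow = c(1, 2))
plot(svd(scale(dataMatrixOrdered))$v[, 1], pch = 19, main = "Complete data")
plot(svd3$v[, 1], pch = 19, main = "Imputed data")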
For PCA/SVD, the scale/units of the data matters: columns measured in very different units will contribute very unequally to the decomposition, which is why we scaled the data with scale() throughout this chapter.
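To see why, here is a small illustration (not from the original text): when one column is measured in much larger units, it dominates the unscaled decomposition.

## One column in much larger units dominates the unscaled SVD
set.seed(42)
Y <- matrix(rnorm(400), nrow = 40)
Y[, 1] <- Y[, 1] * 100   ## column 1 now has far larger variance

## Centered but unscaled: the first right singular vector loads
## almost entirely on column 1
svd_raw <- svd(sweep(Y, 2, colMeans(Y)))
round(svd_raw$v[, 1], 2)

## Centered and scaled: no single column dominates
svd_scaled <- svd(scale(Y))
round(svd_scaled$v[, 1], 2)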
Alternatives and variations on this kind of decomposition exist; one notable example is latent semantic analysis.