Principal Components Analysis (PCA) is an unsupervised approach to reducing the dimensionality of a dataset. Reducing the number of variables in a dataset confers three advantages. First, it makes explanations of variability easier for humans to understand. Second, it shrinks the dataset, simplifying the computations required for further analysis. Finally, PCA gives us an efficient way to treat any observation as a point in a low-dimensional space: describing an observation amounts to identifying the coordinates of that point, and this list of coordinates usually carries far less information than the full observation.
This document shows a process for determining principal components for a dataset containing 17 images of shoes. The majority of the code here was previously published at https://rpubs.com/R-Minator/eigenshoes. My aim in this document is to give my best explanation for this code, elaborate on it where possible, and record specific questions for further study.
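Before running the code below, load the packages it relies on. readJPEG() and writeJPEG() come from the jpeg package, and dot() from pracma. I believe resize() is EBImage's and imageShow() is OpenImageR's, but since several packages export functions with these names, treat those two attributions as my assumption rather than something confirmed by the original document.
library(jpeg)        # readJPEG(), writeJPEG()
library(EBImage)     # resize() -- assumed source
library(OpenImageR)  # imageShow() -- assumed source
library(pracma)      # dot()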
1A. Identify the number and location of images in the dataset.
num = 17
files = list.files("C:/jpgs", pattern = "\\.jpg")[1:num]
1B. Define plot_jpeg, a function for viewing a shoe image.
height = 1200; width = 2500; scale = 20
plot_jpeg = function(path, add = FALSE){
  jpg = readJPEG(path, native = T)  # load the image as a native raster
  res = dim(jpg)[2:1]               # resolution as (width, height)
  if (!add)                         # open an empty, axis-free plot sized to the image
    plot(1, 1, xlim = c(1, res[1]), ylim = c(1, res[2]),
         asp = 1, type = 'n', xaxs = 'i', yaxs = 'i',
         xaxt = 'n', yaxt = 'n', xlab = '', ylab = '', bty = 'n')
  rasterImage(jpg, 1, 1, res[1], res[2])  # draw the image into the plot region
}
1C. Initialize im, an array for holding image data. The array has dimensions (number of images) x (pixels along height of scaled image) x (pixels along width of scaled image) x (3 channels per pixel).
im = array(rep(0, length(files) * height/scale * width/scale * 3),
           dim = c(length(files), height/scale, width/scale, 3))
1D. Populate the array. For each image, create temp, which is the image scaled by a factor of 1/scale = 1/20. Load the scaled image information into the ith place of im. Note that temp itself has dimensions (pixels along height of scaled image) x (pixels along width of scaled image) x (3 channels per pixel); the call to array() adds a leading dimension of length 1 so that it fills the ith slice of im.
for (i in 1:num){
  temp = resize(readJPEG(paste0("C:/jpgs/", files[i])),
                height/scale, width/scale)
  im[i,,,] = array(temp, dim = c(1, height/scale, width/scale, 3))
}
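As a quick sanity check (my addition, not part of the original code), the dimensions of im should now match the description in 1C: 17 images, each 60 x 125 pixels, with 3 channels.
dim(im)
## [1]  17  60 125   3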
1E. Rearrange the data in im as flat, a matrix with dimensions (number of images) x ((pixels along height of scaled image) x (pixels along width of scaled image) x (3 channels per pixel)). flat contains one row for each image and one column for each pixel-channel of the scaled image.
flat = matrix(0, num, height/scale * width/scale * 3)
for (i in 1:num){
  r = as.vector(im[i,,,1])  # red channel
  g = as.vector(im[i,,,2])  # green channel
  b = as.vector(im[i,,,3])  # blue channel
  flat[i,] = c(r, g, b)     # one row per image: all red values, then green, then blue
}
1F. Display the shoes.
par(mfrow = c(3, 3))
par(mai = c(.3, .3, .3, .3))
for (i in 1:num){
  plot_jpeg(writeJPEG(im[i,,,]))  # writeJPEG() returns a raw vector that plot_jpeg() can read
}
1G. Rearrange the data in flat as shoes, a data frame that is the transpose of flat. Now each column represents the image information of one shoe.
shoes = as.data.frame(t(flat))
Working with the transpose of flat means that \(\text{shoes}^T \times \text{shoes}\), a component of the covariance matrix, will have reasonable dimensions: 17 x 17, rather than the 22,500 x 22,500 matrix we would get from flat. But it also means that we're now treating each image as a variable and each pixel-position as a single observation. For example, the [1,1]th pixel is considered an observation with 17 values: its value in the first image, its value in the second image, and so on. This seems inconsistent with one of the goals of PCA, which is to take a new observation (a shoe image) and locate it with respect to other observations. Adding a shoe image should add a row, an observation, to our data table. But under the current arrangement, adding a shoe contributes a new column.
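To make the size argument concrete, here is a quick check of the dimensions involved (my addition).
dim(flat)   # 17 rows (images) x 22,500 columns (pixel-channels)
## [1]    17 22500
dim(shoes)  # 22,500 rows (pixel-channels) x 17 columns (images)
## [1] 22500    17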
Why do we work with the transpose of flat instead of with flat itself?
2A. Transform shoes to construct scaled. shoes is transformed so that each value is sent to its z-score with respect to the other values in its column. That is, subtract the column-mean from each value, and divide by the column’s standard deviation.
scaled = scale(shoes, center = TRUE, scale = TRUE)
mean.shoe = attr(scaled, "scaled:center")
std.shoe = attr(scaled, "scaled:scale")
What information is stored in mean.shoe and std.shoe? At first glance it looks like they should store the column means and standard deviations of scaled, but those are 0 and 1 by construction. In fact, scale() records the centering and scaling values it computed from its input: mean.shoe holds the column means of shoes, and std.shoe holds the column standard deviations of shoes, which makes it possible to undo the standardization later.
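A quick check confirms this (my addition): mean.shoe matches the column means of shoes, while the column means of scaled are zero up to floating-point error.
all.equal(mean.shoe, colMeans(shoes))
## [1] TRUE
max(abs(colMeans(scaled)))  # effectively zero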
3A. Compute the covariance matrix of scaled, which contains the standardized image data. Because every column of scaled has mean 0 and standard deviation 1, the correlation matrix computed below is identical to the covariance matrix.
Sigma_ = cor(scaled)
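A quick check confirms the equivalence (my addition):
all.equal(cor(scaled), cov(scaled))
## [1] TRUE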
4A. Compute myeigen, the eigenvectors and eigenvalues of the covariance matrix, Sigma_.
myeigen = eigen(Sigma_)
4B. Display the cumulative proportion of variance accounted for by successive eigenvectors.
cumsum(myeigen$values) / sum(myeigen$values)
## [1] 0.6928202 0.7940449 0.8451072 0.8723847 0.8913841 0.9076337 0.9216282
## [8] 0.9336889 0.9433871 0.9524454 0.9609037 0.9688907 0.9765235 0.9832209
## [15] 0.9894033 0.9953587 1.0000000
Notice that each eigenvector has length 1, and that distinct eigenvectors are orthogonal.
dot(myeigen$vectors[,4], myeigen$vectors[,4])
## [1] 1
dot(myeigen$vectors[,4], myeigen$vectors[,5])
## [1] -2.803747e-16
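Rather than checking one pair at a time, we can check all 17 eigenvectors at once (my addition): if they are orthonormal, the matrix of their pairwise dot products is the identity.
V = myeigen$vectors
max(abs(crossprod(V) - diag(num)))  # should be zero up to floating-point error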
4C. Compute eigenshoes from the first five eigenvectors and display the results.
scaling = diag(myeigen$values[1:5]^(-1/2)) / sqrt(nrow(scaled) - 1)
eigenshoes = scaled %*% myeigen$vectors[,1:5] %*% scaling
for (i in 1:5){
  imageShow(array(eigenshoes[,i], c(60, 125, 3)))
}
The matrix scaling is a diagonal matrix with entries \(\frac{1}{\sqrt{\lambda_{i} \times (n - 1)}}\) for \(i = 1, 2, ..., 5\), where \(n = \) nrow(scaled). Since the ith score vector, scaled %*% myeigen$vectors[,i], has squared length \(\lambda_{i} \times (n - 1)\), this scaling normalizes each column of eigenshoes to unit length. But I'm not sure how to interpret these images. Are they projections of the original shoe images onto the “shoe-space” spanned by the first five principal components? If so, how does the matrix eigenshoes accomplish the projection? What is the meaning of the values in scaling?
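We can at least verify the unit-length claim directly (my addition): each column of eigenshoes should have squared length 1.
colSums(eigenshoes^2)  # all five values should be 1, up to floating-point error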
A second group of images, also called eigenshoes, is generated by the following code.
height = 1200
width = 2500
scale = 20
newdata = im
dim(newdata) = c(length(files), height * width * 3 / scale^2)  # flatten each image into one row
mypca = princomp(t(as.matrix(newdata)), scores = TRUE, cor = TRUE)
mypca2 = t(mypca$scores)
dim(mypca2) = c(length(files), height/scale, width/scale, 3)   # reshape the scores back into image form
par(mfrow = c(5, 5))
par(mai = c(.001, .001, .001, .001))
for (i in 1:num){
  plot_jpeg(writeJPEG(mypca2[i,,,], bg = "white"))
}
Here are 17 eigenshoes distinct from those generated by the code above. Since they're generated by princomp(), I take these to be the principal components of the original dataset. The fact that there are 17 of them suggests that we've been treating each shoe image as a variable, since the number of principal components equals the number of variables in the original dataset.
If we were to add additional shoe images to this dataset and follow the same procedure, we would generate additional principal components. (Notice that this dataset generates 17 principal components, which equals the number of shoes. And the dataset at https://rpubs.com/R-Minator/eigenshoes generates 20 principal components, which is the number of shoes in the dataset examined there.) This seems contrary to procedures and results elsewhere (for example, at https://uc-r.github.io/pca).
This question about what counts as an observation and what counts as a variable in this procedure is where I'm currently stuck. I'm not sure how I would use these principal components to locate a new observation in the “Eigenshoe-space,” and I'm not sure how I could use them to reconstruct an element of the original dataset.
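As a possible direction for the reconstruction question, princomp() stores the centering and scaling vectors it applied, so under the arrangement above a rank-k approximation of every image can be rebuilt from the first k scores and loadings. The following is a sketch of my own, not part of the original code; it assumes the mypca object from the previous block and an arbitrary choice of k = 5.
k = 5  # number of components to keep (arbitrary, for illustration)
approx = mypca$scores[, 1:k] %*% t(mypca$loadings[, 1:k])  # rank-k reconstruction of the standardized data
approx = sweep(approx, 2, mypca$scale, `*`)  # undo princomp's scaling
approx = sweep(approx, 2, mypca$center, `+`) # undo princomp's centering
recon = t(approx)                            # back to one row per image
recon = pmin(pmax(recon, 0), 1)              # clamp to [0, 1] for writeJPEG()
dim(recon) = c(length(files), height/scale, width/scale, 3)
plot_jpeg(writeJPEG(recon[1,,,]))            # a rank-5 approximation of the first shoe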