Reading in shoe images
shoes <- matrix(0, nrow=17, ncol=9000000)
files <- list.files("../data/shoes")
# Read in each image in our data directory
for (i in 1:length(files)){
filepath <- file.path("../data/shoes", files[i])
# Resizing images, otherwise we run into local memory problems as the image matrix becomes too large
img <- resizeImage(readJPEG(filepath, native=FALSE), 1200 / 20, 2500 / 20)
# Extract the RGB values from our array and append them to a single array
r <- as.vector(img[,,1])
g <- as.vector(img[,,2])
b <- as.vector(img[,,3])
vec <- c(r, g, b)
shoes[i, ] <- c(r, g, b)
}
print(dim(shoes))
## [1] 17 9000000
Let’s convert our matrix to a dataframe. We’re taking the transpose
(t()) of our image matrix because we want each column of
our dataframe to be a vector representing one input image.
shoes <- data.frame(x=t(shoes))
Now we need to center the mean dataframe using the built-in
scale function. The idea here is to make the subtract the
mean form our data, so that the dataset has a mean of zero.
scaled <- scale(shoes, center = TRUE, scale = TRUE)
mean_shoe <- attr(scaled, "scaled:center")
std_shoe <- attr(scaled, "scaled:scale")
Calculate covariance matrix of our scaled matrix. We can use the
built-in R cov function. Since we’re dealing with such a
large array, we’ll need to use the
Sys.setenv('R_MAX_VSIZE'=32000000000) command to allocate
enough memory.
sigma <- cov(scaled)
Get eigenvalues and eigenvectors from our covariance matrix. These should represent the principal components of our input dataset.
eigen <- eigen(sigma)
eigenvalues <- eigen$values
eigenvectors <- eigen$vectors
Choosing principal components from the
cumulative_sum <- cumsum(eigenvalues) / sum(eigenvalues)
# We want to account for 80% of the variability in our input image dataset, so we'll set the threshold criterion to 0.8
thres <- min(which(cumulative_sum > 0.80))
Creating a scree plot of our eigenvalues. The “turning point/elbow” of this graph shows us the number of eigenvalues/eigenvectors we should be using.
pca_data <- data.frame(eigenvalues, cumulative_sum)
ggplot(data=pca_data, aes(x = eigenvalues, y = cumulative_sum)) + geom_line()
scaling <- diag(eigenvalues[1:thres] ^ (-1/2)) / (sqrt(nrow(scaled)-1))
eigenshoes <- scaled %*% eigenvectors[, 1:thres] %*% scaling
# Get eigneshoe image and plot
eigenshoe <- array(eigenshoes[,2], c(60,125,3))
imageShow(eigenshoe)