Reading in shoe images

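# Packages assumed from the functions used below
library(jpeg)        # readJPEG
library(OpenImageR)  # resizeImage, imageShow

shoes <- matrix(0, nrow=17, ncol=9000000)
files <- list.files("../data/shoes")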

# Read in each image in our data directory
for (i in seq_along(files)) {
  filepath <- file.path("../data/shoes", files[i])

  # Resize each image to 60 x 125 pixels; at full size the image matrix becomes too large for local memory
  img <- resizeImage(readJPEG(filepath, native=FALSE), 1200 / 20, 2500 / 20)

  # Extract the RGB channels from our array and concatenate them into a single vector
  r  <- as.vector(img[,,1])
  g  <- as.vector(img[,,2])
  b  <- as.vector(img[,,3])
  vec <- c(r, g, b)

  # The resized vector has 60 * 125 * 3 = 22,500 values, so R recycles it across the 9,000,000 columns of this row
  shoes[i, ] <- vec
}

print(dim(shoes))
## [1]      17 9000000
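
As a minimal sketch of the flattening step in the loop, the snippet below uses a small made-up array (not real shoe data) to show that concatenating the three channels gives the same vector as flattening the whole array, and that the image can be recovered by reshaping. The names `img_small`, `vec_small` and `restored` are hypothetical.

```r
# Hypothetical 2 x 3 RGB "image" holding the values 1..18 (illustration only)
img_small <- array(1:18, dim = c(2, 3, 3))

# Flatten channel by channel, as in the loop
vec_small <- c(as.vector(img_small[, , 1]),
               as.vector(img_small[, , 2]),
               as.vector(img_small[, , 3]))

# R stores arrays in column-major order, so this equals as.vector(img_small)
same_as_direct <- identical(vec_small, as.vector(img_small))

# Reshaping with the original dimensions recovers the image
restored <- array(vec_small, dim = dim(img_small))
round_trip <- identical(restored, img_small)
```

This round trip is what later allows a column of pixel values to be turned back into a 60 x 125 x 3 image with array().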

Let’s convert our matrix to a data frame. We take the transpose (t()) of our image matrix because we want each column of the data frame to be a vector representing one input image.

shoes <- data.frame(x=t(shoes))
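
To make the effect of the transpose concrete, here is a small sketch with a made-up matrix of 2 "images" and 3 "pixels" (the names `imgs_small` and `df_small` are hypothetical). After transposing, each image becomes one column of the data frame.

```r
# Hypothetical matrix: 2 images (rows) x 3 pixel values (columns)
imgs_small <- matrix(1:6, nrow = 2)

# Transpose so that each image becomes a column, then wrap in a data frame
df_small <- data.frame(x = t(imgs_small))

dims_small <- dim(df_small)     # 3 rows (pixels), 2 columns (images)
first_image <- df_small[[1]]    # pixel values of the first image
```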

Now we need to center and scale the data frame using the built-in scale function. The idea here is to subtract the mean from our data, so that each column has a mean of zero. With scale = TRUE, each column is also divided by its standard deviation.

scaled <- scale(shoes, center = TRUE, scale = TRUE)
mean_shoe <- attr(scaled, "scaled:center")
std_shoe  <- attr(scaled, "scaled:scale")
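
The following sketch shows what scale returns on a tiny made-up matrix, and how the stored "scaled:center" and "scaled:scale" attributes can be used to undo the transformation. The names `m_small`, `s_small`, `centers`, `spreads` and `restored_m` are assumptions for illustration.

```r
# Hypothetical 3 x 2 matrix (illustration only)
m_small <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)

s_small <- scale(m_small, center = TRUE, scale = TRUE)
centers <- attr(s_small, "scaled:center")  # column means: 2 and 20
spreads <- attr(s_small, "scaled:scale")   # column standard deviations: 1 and 10

# Undo the scaling: multiply by the standard deviations, then add the means back
restored_m <- sweep(sweep(s_small, 2, spreads, `*`), 2, centers, `+`)
```

Keeping the two attributes around, as mean_shoe and std_shoe do, is what makes it possible to map scaled values back to the original pixel range.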

Calculate the covariance matrix of our scaled matrix using the built-in R cov function. The result has one row and one column per input image. Since we’re dealing with such a large array, we’ll need to use the Sys.setenv('R_MAX_VSIZE'=32000000000) command to raise the memory limit.

sigma <- cov(scaled)
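
As a sketch of what cov computes here, the snippet below builds the covariance matrix by hand on random made-up data and compares it with cov. It also checks that, because the columns were standardized, the result equals the correlation matrix of the unscaled data. The names `x_demo`, `z_demo` and `sigma_manual` are hypothetical.

```r
set.seed(1)
x_demo <- matrix(rnorm(50 * 4), nrow = 50)   # hypothetical data: 50 rows, 4 columns
z_demo <- scale(x_demo, center = TRUE, scale = TRUE)

# Covariance of centered data: t(Z) %*% Z / (n - 1)
sigma_manual <- crossprod(z_demo) / (nrow(z_demo) - 1)

matches_cov <- isTRUE(all.equal(sigma_manual, cov(z_demo), check.attributes = FALSE))
matches_cor <- isTRUE(all.equal(cov(z_demo), cor(x_demo), check.attributes = FALSE))
```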

Get the eigenvalues and eigenvectors of our covariance matrix. The eigenvectors represent the principal component directions of our input dataset, and each eigenvalue is the variance along the corresponding direction.

# Store the result under a different name so we don't mask the eigen function
eig <- eigen(sigma)
eigenvalues  <- eig$values
eigenvectors <- eig$vectors
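
To illustrate what eigen returns, here is a small check on a made-up 2 x 2 symmetric matrix (the names `sigma_small` and `eig_small` are hypothetical). Each eigenvector v with eigenvalue lambda satisfies sigma %*% v = lambda * v, and eigen returns the eigenvalues in decreasing order.

```r
# Hypothetical symmetric matrix with eigenvalues 3 and 1
sigma_small <- matrix(c(2, 1, 1, 2), nrow = 2)
eig_small <- eigen(sigma_small)

# Left side: sigma applied to every eigenvector
lhs <- sigma_small %*% eig_small$vectors
# Right side: every eigenvector multiplied by its eigenvalue
rhs <- eig_small$vectors %*% diag(eig_small$values)

# For a symmetric matrix the eigenvectors are orthonormal
gram_small <- crossprod(eig_small$vectors)
```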

Choosing the number of principal components

cumulative_sum <- cumsum(eigenvalues) / sum(eigenvalues)
# We want to account for 80% of the variability in our input image dataset, so we'll set the threshold criterion to 0.8
thres <- min(which(cumulative_sum > 0.80))
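
A short worked example of the threshold rule, using made-up eigenvalues rather than the ones from the shoe data (the names `eig_vals`, `cum_share` and `k` are hypothetical):

```r
eig_vals  <- c(6, 2.5, 1, 0.5)                   # hypothetical eigenvalues
cum_share <- cumsum(eig_vals) / sum(eig_vals)    # 0.60, 0.85, 0.95, 1.00
k <- min(which(cum_share > 0.80))                # first component past 80%: 2
```

Note that the comparison is strict, so a cumulative share of exactly 0.80 would not meet the criterion and one more component would be kept.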

Creating a scree plot of our eigenvalues, with the component number on the x-axis and the eigenvalue on the y-axis. The “turning point/elbow” of this graph suggests the number of eigenvalues/eigenvectors we should be using.

library(ggplot2)
pca_data <- data.frame(component = seq_along(eigenvalues), eigenvalues, cumulative_sum)
ggplot(data=pca_data, aes(x = component, y = eigenvalues)) + geom_line()

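Projecting the scaled images onto the first thres eigenvectors gives the eigenshoes. Dividing by the square root of each eigenvalue and by sqrt(n - 1) normalizes every eigenshoe to unit length.

# nrow = thres and drop = FALSE keep the shapes correct even when thres is 1
scaling <- diag(eigenvalues[1:thres] ^ (-1/2), nrow = thres) / (sqrt(nrow(scaled)-1))
eigenshoes <- scaled %*% eigenvectors[, 1:thres, drop = FALSE] %*% scaling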
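
As a sanity check on the scaling step, the sketch below repeats the same computation on random made-up data and confirms that the resulting basis vectors are orthonormal. The data and the names `x_pix`, `eig_pix`, `k_keep`, `scaling_k` and `basis` are assumptions for illustration, not the shoe dataset.

```r
set.seed(42)
# Hypothetical data: 200 "pixels" (rows) by 5 "images" (columns), centered and scaled
x_pix <- scale(matrix(rnorm(200 * 5), nrow = 200))

eig_pix <- eigen(cov(x_pix))
k_keep <- 3                                   # hypothetical number of components

# Same scaling as above: eigenvalue^(-1/2) on the diagonal, divided by sqrt(n - 1)
scaling_k <- diag(eig_pix$values[1:k_keep]^(-1/2), nrow = k_keep) / sqrt(nrow(x_pix) - 1)
basis <- x_pix %*% eig_pix$vectors[, 1:k_keep, drop = FALSE] %*% scaling_k

# If the basis is orthonormal, t(basis) %*% basis is the identity matrix
gram <- crossprod(basis)
```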

# Reshape the second eigenshoe into a 60 x 125 x 3 image array and plot it
eigenshoe <- array(eigenshoes[,2],  c(60,125,3))
imageShow(eigenshoe)

References

Duggal, Nikita. 2023. Difference Between Covariance and Correlation: A Definitive Guide. https://www.simplilearn.com/covariance-vs-correlation-article.
Fulton, Larry. 2020. Eigenshoes. https://rpubs.com/R-Minator/eigenshoes.
Herrero, Diego. 2019. Face Recognition Using Eigenfaces. https://rpubs.com/dherrero12/543854.
Jaadi, Zakaria. 2022. A Step-by-Step Explanation of Principal Component Analysis (PCA). https://builtin.com/data-science/step-step-explanation-principal-component-analysis.