With the attached data file, build and visualize eigenimagery that accounts for 80% of the variability. Provide full R code and discussion.

The “jpg” folder, which contains 17 images, is kept in the working directory. The first image is loaded to obtain its dimensions, 1200 x 2500 x 3, so each image flattens to a vector of 1200 x 2500 x 3 = 9,000,000 values. The image data for all 17 files can therefore be stored in a 17 by 9,000,000 matrix.

library(OpenImageR)
library(jpeg)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Path to the "jpg" folder
path <- file.path(getwd(), 'jpg')

# List all image files with ".jpg" extension
image_files <- list.files(path = path, pattern = "\\.jpg$", full.names = TRUE)  # pattern is a regular expression, not a glob

# Load the first image to get its dimensions
image <- readJPEG(image_files[1])
image_dimensions <- dim(image)  # This will be a vector containing the dimensions c(1200, 2500, 3)


# Plot jpeg images
plot_jpeg <- function(path, add = FALSE) {           # define the plot_jpeg function
  jpg <- readJPEG(path, native = TRUE)  # read the file
  res <- dim(jpg)[2:1]  # get the resolution, [x, y]
  if (!add)  # initialize an empty plot area if add == FALSE
    plot(1, 1, xlim = c(1, res[1]), ylim = c(1, res[2]), asp = 1, type = 'n', 
         xaxs = 'i', yaxs = 'i', xaxt = 'n', yaxt = 'n', xlab = '', ylab = '', bty = 'n')
  rasterImage(jpg, 1, 1, res[1], res[2])
}

# Set up plot parameters
par(mfrow = c(3, 6))
par(mai = c(0.01, 0.01, 0.01, 0.01))

# Loop through each image file and plot it
for (i in seq_along(image_files)) {
  plot_jpeg(image_files[i])
}

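Each of the 17 images in the “jpg” folder is flattened into a vector of its R, G, and B components and stored as one column of a matrix. The matrix is transposed to 17 rows by 9,000,000 columns, and transposing it back when building the data frame gives 9,000,000 rows and 17 columns.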

height <- 1200
width <- 2500
scale <- 20
# Create an empty matrix to store the image data
num_images <- length(image_files)
image_matrix <- matrix(0, nrow = prod(dim(readJPEG(image_files[1]))), ncol = num_images)


# Load and convert images to vectors
for (i in 1:num_images) {
  img <- readJPEG(image_files[i])
  img_vector <- as.vector(img)
  image_matrix[, i] <- img_vector
}

# Transpose image_matrix so that each row is one image and each column is one pixel component (17 x 9,000,000)
image_matrix <- t(image_matrix)

# Create the image data frame; transposing back gives one row per pixel component and one column per image
df <- as.data.frame(t(image_matrix))

The shoes data frame has 9,000,000 observations and 17 variables, with each variable representing one image.
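The flattening step can be illustrated on a small scale. This is a minimal sketch with two made-up 2 x 3 RGB arrays standing in for the real images; the names `img_a`, `img_b`, `toy_images`, and `toy_matrix` are hypothetical and do not appear in the analysis above.

```r
# Toy stand-ins for two 2 x 3 RGB images (values are made up for illustration)
img_a <- array(seq(0, 1, length.out = 18), dim = c(2, 3, 3))
img_b <- array(0.5, dim = c(2, 3, 3))
toy_images <- list(img_a, img_b)

# Flatten each image; sapply() binds the equal-length vectors as columns,
# giving one row per pixel component and one column per image
toy_matrix <- sapply(toy_images, as.vector)
dim(toy_matrix)

# A column can be folded back into the original image shape
restored <- array(toy_matrix[, 1], dim = dim(img_a))
all.equal(restored, img_a)
```

Because `as.vector()` and `array()` both use column-major order, a flattened column folds back into the same image, which is what later makes it possible to display an eigenimage with the original dimensions.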

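Center the data by subtracting each column's mean, and scale it by dividing by each column's standard deviation. Then compute the covariance matrix, a square 17 x 17 matrix in which element (i, j) is the covariance between the i-th and j-th variables. Because the data are standardized, this covariance matrix is the correlation matrix of the 17 images.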

# Center (subtract the mean) and scale (divide by the standard deviation), then compute the covariance matrix
centered_shoes <- scale(df, center = TRUE, scale = TRUE)
sigma <- cov(centered_shoes)
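As a small check on this step, the sketch below uses random toy data (not the shoe images) to show that the covariance of standardized columns matches the correlation of the raw columns. The names `toy` and `standardized` are hypothetical.

```r
set.seed(1)

# Toy data: 10 observations of 4 variables (illustrative only)
toy <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Standardize each column, as done for the image data frame
standardized <- scale(toy, center = TRUE, scale = TRUE)

# Covariance of the standardized data vs. correlation of the raw data
all.equal(cov(standardized), cor(toy), check.attributes = FALSE)

# The diagonal is all ones, so the eigenvalues sum to the number of variables
sum(diag(cov(standardized)))
```

One consequence, under the same assumption of standardized columns, is that the eigenvalues of `sigma` should sum to 17, the number of images.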

The eigenvalues and eigenvectors of the covariance matrix are the basis of Principal Component Analysis (PCA), which is used for dimensionality reduction and feature extraction. The cumulative proportion of variance explained by the eigenvalues is then calculated to determine how many components are needed to reach the variability threshold.

# Compute the eigenvalues and eigenvectors of the covariance matrix
eig <- eigen(sigma)
eigenvalues <- eig$values
eigenvectors <- eig$vectors

cum_var <- cumsum(eigenvalues) / sum(eigenvalues) # Calculate the cumulative variance 
cum_var
##  [1] 0.6833138 0.7824740 0.8353528 0.8629270 0.8825040 0.8996099 0.9144723
##  [8] 0.9271856 0.9374462 0.9472860 0.9561859 0.9647964 0.9732571 0.9804242
## [15] 0.9874038 0.9941511 1.0000000
cum_var_df <- data.frame( 
  num = 1:length(cum_var),
  cum_var = cum_var
)

# Plot the cumulative variance
ggplot(cum_var_df, aes(x = num, y = cum_var)) +
  geom_line() +
  geom_point() +
  xlab("Number of Principal Components") + 
  ylab("Cumulative Variance") +  
  ggtitle("Cumulative Variance Explained by Principal Components")  

The plot shows how quickly the cumulative variance approaches a desired level and how much each successive eigenvalue (principal component) adds to the total variance.
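The eigenvalue-based cumulative variance can be cross-checked against base R's `prcomp()`. The sketch below is not part of the original analysis; it uses random toy data, and the names `toy`, `eig_vals`, and `pca` are hypothetical.

```r
set.seed(7)

# Toy data: 50 observations of 4 variables (illustrative only)
toy <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)

# Route 1: eigenvalues of the covariance matrix of standardized data
eig_vals <- eigen(cov(scale(toy, center = TRUE, scale = TRUE)))$values

# Route 2: squared standard deviations of the principal components
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Both routes should agree up to numerical tolerance
all.equal(eig_vals, pca$sdev^2)

# Cumulative proportion of variance, computed the same way as cum_var
cumsum(eig_vals) / sum(eig_vals)
```

For the image data, `prcomp(df, center = TRUE, scale. = TRUE)` would be expected to give the same proportions, although running it on 9,000,000 rows may be slower than the 17 x 17 eigendecomposition used here.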

threshold = min(which(cum_var > .80))
threshold
## [1] 3

The index of the first cumulative variance that exceeds 80% is stored in the “threshold” variable; it is the number of principal components required to explain at least 80% of the variability. The result is 3: the first three components together account for about 83.5% of the variance.
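The threshold lookup can be reproduced from the printed output alone. The sketch below copies the first four cumulative proportions from the `cum_var` output; the names `cum_var_head` and `first_over_80` are hypothetical.

```r
# First four cumulative proportions, copied from the printed cum_var output
cum_var_head <- c(0.6833138, 0.7824740, 0.8353528, 0.8629270)

# Proportion of variance contributed by each component individually
round(diff(c(0, cum_var_head)), 4)

# Index of the first component whose cumulative share exceeds 80%
first_over_80 <- which(cum_var_head > 0.80)[1]
first_over_80  # 3
```

Two components reach only about 78.2%, so the third is needed to pass 80%. Using `>` rather than `>=` would matter only if a cumulative value landed exactly on 0.80, which is not the case here.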

Compute Eigen Images

# Compute scaling factor
scaling_shoes <- diag(eig$values[1:threshold]^(-1/2)) / (sqrt(nrow(centered_shoes)-1))

# Multiply the centered matrix by the top `threshold` (3) components, then by the scaling factor
eigenshoes <- centered_shoes %*% eig$vectors[,1:threshold] %*% scaling_shoes

# Convert to dimensions of original images, and select one image to display
eigenimage <- array(eigenshoes[,1], dim(image))
imageShow(eigenimage)

#Display the second eigenimage (principal component) from the eigenshoes
eigenimage2 <- array(eigenshoes[,2], dim(image))
imageShow(eigenimage2)
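The scaling factor is what gives each eigenimage unit length. Dividing the projected data by the square root of the eigenvalue and by sqrt(n - 1) makes the resulting columns orthonormal. The sketch below checks this on random toy data with the same formula as above; the names `toy`, `toy_centered`, `toy_eig`, `k`, `toy_scaling`, and `toy_eigenimages` are hypothetical.

```r
set.seed(42)

# Toy data: 200 "pixel" rows and 5 "image" columns (illustrative only)
toy <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
toy_centered <- scale(toy, center = TRUE, scale = TRUE)

# Eigendecomposition of the covariance matrix of the standardized data
toy_eig <- eigen(cov(toy_centered))

# Keep the top k components and build the same scaling factor as above
k <- 3
toy_scaling <- diag(toy_eig$values[1:k]^(-1/2)) / sqrt(nrow(toy_centered) - 1)

# Project the data onto the components and rescale
toy_eigenimages <- toy_centered %*% toy_eig$vectors[, 1:k] %*% toy_scaling

# The cross-product should be the k x k identity matrix
round(crossprod(toy_eigenimages), 10)
```

If the same property holds for the image data, `crossprod(eigenshoes)` should be close to a 3 x 3 identity matrix, which is a quick sanity check before displaying the eigenimages.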