Week_4_Assignment

Load the necessary libraries:

Yes! I am going to copy your code professor, the only thing I am going to add is an explanation of each R code chunk/line to understand and refer to later:

According to ChatGPT,

“The doParallel package in R provides a parallel backend for the foreach package, allowing you to execute loops in parallel using multiple processors or cores on your system”
“The foreach package provides a looping construct for executing R code iteratively”
“The jpeg package is used for reading and writing JPEG image files in R.”
“The EBImage package is an image processing toolbox for R.”

library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

library(foreach)
library(jpeg)
library(EBImage)
files=list.files(path="/Users/salouadaouki/Desktop/Data605/jpg/")

View Shoes:

height=1200
width=2500
scale=20
plot_jpeg = function(path, add=FALSE) #initialize function named plot_jpeg with specified parameter path where the JPEG file is located
{
  require('jpeg') #this is to load jpeg package
  jpg <- readJPEG(path, native=TRUE) # read the file pf images while preserving the original colors
  res <- dim(jpg)[2:1] # get the resolution, [x is 2, y is 1] (getting the dimensions of the images)
  if (!add) # initialize an empty plot area if add==FALSE
    plot(1,1,xlim=c(1,res[1]),ylim=c(1,res[2]), #set the X Limits by size
         asp=1, #aspect ratio
         type='n', #don't plot
         xaxs='i',yaxs='i',#prevents expanding axis windows +6% as normal
         xaxt='n',yaxt='n',xlab='',ylab='', # no axes or labels
         bty='n') # no box around graph
  rasterImage(jpg,1,1,res[1],res[2]) #image, xleft,ybottom,xright,ytop, this function is to plot the JPEG image onto the initialized plot area.
}

Load the Data into an Array:

#initialize array with zeros.
im=array(rep(0,length(files)*height/scale*width/scale*3),
         #set dimension to N, x, y, 3 colors, 4D array)
         dim=c(length(files), height/scale, width/scale,3)) 
         #this represent the number of files, the height and the width of the images respectively

for (i in 1:length(files)){
  #define file to be read
  tmp=paste0('/Users/salouadaouki/Desktop/Data605/jpg/', files[i])
  #read the file, resize the images to the specified dimensions using the resize function from EBImage package  
  temp=EBImage::resize(readJPEG(tmp),height/scale, width/scale)
  #assign the resized image to the array
  im[i,,,]=array(temp,dim=c(1, height/scale, width/scale,3))
}

Actual Plots:

I changed the third line (the line for the loop) of this code chunk because the original one gave me this error Error in im[i, , , ] : subscript out of bounds; the “i” was exceeding the number of images in the array. The new code is determining the number of images to be plotted to be the up to 81 images or the totla number of images that are available.

par(mfrow=c(3,3)) #set graphics to 3 x 3 table
par(mai=c(.3,.3,.3,.3)) #set margins 
for (i in 1:min(81, dim(im)[1])){  #plot the first images only
plot_jpeg(writeJPEG(im[i,,,]))
}

Generate Principal Components:

TThis step is to normalize the data, according to Datacamp, normalization of the data “ensures that each attribute has the same level of contribution, preventing one variable from dominating others. For each variable, normalization is done by subtracting its mean and dividing by its standard deviation.”

# this step is to normalize the data 
height=1200
width=2500
scale=20
newdata=im
dim(newdata)=c(length(files),height*width*3/scale^2)
mypca=princomp(t(as.matrix(newdata)), scores=TRUE, cor=TRUE)
sum(mypca$sdev^2/sum(mypca$sdev^2)) #verify that sum of variance=1

## [1] 1

When I used this code on the range of [1:19], the output was NA, so I used the code line “number of elements in the vector” and it checked out that there are only 17. So I modified the code to range [1:17].

# Check the length of PCA variances
num_components <- length(mypca$sdev)  # Number of principal components

# If the length is less than 19, select all elements; otherwise, select the first 19 elements
if (num_components < 19) {
    selected_components <- mypca$sdev
} else {
    selected_components <- mypca$sdev[1:19]
}

# Calculate the sum of selected components
sum_selected_components <- sum(selected_components)

When I used the code was provided by the professor,

mycomponents=mypca$sdev^2/sum(mypca$sdev^2)
sum(mycomponents[1:19]) #first 19 component

## [1] NA

sum(mycomponents[1:79]) #first 79 components account for 90% of variability

## [1] NA

as you can see, the outputs were NA, so I had to check which numbers provide the account for 80% and 90% using the code chunk below:

# Sort the proportions of variance explained in descending order
sorted_variances <- sort(mycomponents, decreasing = TRUE)

# Calculate the cumulative sum of proportions
cumulative_variances <- cumsum(sorted_variances)

# Find the index where cumulative variance exceeds or equals 80%
index_80_percent <- which.max(cumulative_variances >= 0.8)

# Find the index where cumulative variance exceeds or equals 90%
index_90_percent <- which.max(cumulative_variances >= 0.9)

# Number of components required to account for 80% of the variability
num_components_80_percent <- index_80_percent

# Number of components required to account for 90% of the variability
num_components_90_percent <- index_90_percent

# Print the result
cat("Number of components required to account for 80% of the variability:", num_components_80_percent, "\n")

## Number of components required to account for 80% of the variability: 3

cat("Number of components required to account for 90% of the variability:", num_components_90_percent, "\n")

## Number of components required to account for 90% of the variability: 6

Here is the explanation of each line in the code above:

Sorting: The sort() function sorts the proportions of variance explained (mycomponents) in descending order. Cumulative Sum: The cumsum() function calculates the cumulative sum of the sorted variances. Finding Index: The which.max() function finds the index where the cumulative variance exceeds or equals 80% (and 90%) in separate lines. Number of Components: The result is the number of principal components required to account for 80% (and 90%) of the variability.

mycomponents=mypca$sdev^2/sum(mypca$sdev^2)
sum(mycomponents[1:3]) #first 3 components account for 80% of variability

## [1] 0.8451073

sum(mycomponents[1:6]) #first 6 components account for 90% of variability

## [1] 0.9076338

Eigenshoes:

To avoid the error from earlier “Error in mypca2[i, , , ] : subscript out of bounds” I adjusted the range in the code from [1:81] to [1:min(81, dim(im)[1]] to ensure that the loop iterates over which is smaller; 81 or the number of images I have in the dataset. I also modified the output to be a \(3*3\) table to accommodate the number of plots I have instead of 81.

mypca2 <- t(mypca$scores)
dim(mypca2) <- c(length(files), height/scale, width/scale, 3)
par(mfrow=c(3, 3))  # Adjusted to accommodate 17 plots
par(mai=c(.001, .001, .001, .001))
for (i in 1:min(81, dim(im)[1])) {  # Plot the first 81 Eigenshoes only
    plot_jpeg(writeJPEG(mypca2[i,,,], quality=1, bg="white"))
}

The entire code had to be modified, especially in the range, because I only have 17 images in the zip file I downloaded from Blackboard. Even though I copied the code from the .rmd file that Pr. Fulton provided us with, it was a struggle to make it work, which indicates that there is a succesfull learning process.