Yes! I am going to copy your code professor, the only thing I am going to add is an explanation of each R code chunk/line to understand and refer to later:
According to ChatGPT,
“The doParallel
package in R
provides a parallel backend for the
foreach
package, allowing you to execute
loops in parallel using multiple processors or cores on your
system”
“The foreach
package provides a
looping construct for executing R code iteratively”
“The jpeg
package is used for
reading and writing JPEG image files in R.”
“The EBImage
package is an image
processing toolbox for R.”
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(foreach)
library(jpeg)
library(EBImage)
files=list.files(path="/Users/salouadaouki/Desktop/Data605/jpg/")
height=1200
width=2500
scale=20
plot_jpeg = function(path, add=FALSE) #initialize function named plot_jpeg with specified parameter path where the JPEG file is located
{
require('jpeg') #this is to load jpeg package
jpg <- readJPEG(path, native=TRUE) # read the file pf images while preserving the original colors
res <- dim(jpg)[2:1] # get the resolution, [x is 2, y is 1] (getting the dimensions of the images)
if (!add) # initialize an empty plot area if add==FALSE
plot(1,1,xlim=c(1,res[1]),ylim=c(1,res[2]), #set the X Limits by size
asp=1, #aspect ratio
type='n', #don't plot
xaxs='i',yaxs='i',#prevents expanding axis windows +6% as normal
xaxt='n',yaxt='n',xlab='',ylab='', # no axes or labels
bty='n') # no box around graph
rasterImage(jpg,1,1,res[1],res[2]) #image, xleft,ybottom,xright,ytop, this function is to plot the JPEG image onto the initialized plot area.
}
#initialize array with zeros.
im=array(rep(0,length(files)*height/scale*width/scale*3),
#set dimension to N, x, y, 3 colors, 4D array)
dim=c(length(files), height/scale, width/scale,3))
#this represent the number of files, the height and the width of the images respectively
for (i in 1:length(files)){
#define file to be read
tmp=paste0('/Users/salouadaouki/Desktop/Data605/jpg/', files[i])
#read the file, resize the images to the specified dimensions using the resize function from EBImage package
temp=EBImage::resize(readJPEG(tmp),height/scale, width/scale)
#assign the resized image to the array
im[i,,,]=array(temp,dim=c(1, height/scale, width/scale,3))
}
I changed the third line (the line for the loop) of this code chunk
because the original one gave me this error
Error in im[i, , , ] : subscript out of bounds
;
the “i” was exceeding the number of images in the
array. The new code is determining the number of images to be plotted to
be the up to 81 images or the totla number of images that are
available.
par(mfrow=c(3,3)) #set graphics to 3 x 3 table
par(mai=c(.3,.3,.3,.3)) #set margins
for (i in 1:min(81, dim(im)[1])){ #plot the first images only
plot_jpeg(writeJPEG(im[i,,,]))
}
TThis step is to normalize the data, according to Datacamp, normalization of the data “ensures that each attribute has the same level of contribution, preventing one variable from dominating others. For each variable, normalization is done by subtracting its mean and dividing by its standard deviation.”
# this step is to normalize the data
height=1200
width=2500
scale=20
newdata=im
dim(newdata)=c(length(files),height*width*3/scale^2)
mypca=princomp(t(as.matrix(newdata)), scores=TRUE, cor=TRUE)
sum(mypca$sdev^2/sum(mypca$sdev^2)) #verify that sum of variance=1
## [1] 1
When I used this code on the range of [1:19], the output was NA, so I used the code line “number of elements in the vector” and it checked out that there are only 17. So I modified the code to range [1:17].
# Check the length of PCA variances
num_components <- length(mypca$sdev) # Number of principal components
# If the length is less than 19, select all elements; otherwise, select the first 19 elements
if (num_components < 19) {
selected_components <- mypca$sdev
} else {
selected_components <- mypca$sdev[1:19]
}
# Calculate the sum of selected components
sum_selected_components <- sum(selected_components)
When I used the code was provided by the professor,
mycomponents=mypca$sdev^2/sum(mypca$sdev^2)
sum(mycomponents[1:19]) #first 19 component
## [1] NA
sum(mycomponents[1:79]) #first 79 components account for 90% of variability
## [1] NA
as you can see, the outputs were NA, so I had to check which numbers provide the account for 80% and 90% using the code chunk below:
# Sort the proportions of variance explained in descending order
sorted_variances <- sort(mycomponents, decreasing = TRUE)
# Calculate the cumulative sum of proportions
cumulative_variances <- cumsum(sorted_variances)
# Find the index where cumulative variance exceeds or equals 80%
index_80_percent <- which.max(cumulative_variances >= 0.8)
# Find the index where cumulative variance exceeds or equals 90%
index_90_percent <- which.max(cumulative_variances >= 0.9)
# Number of components required to account for 80% of the variability
num_components_80_percent <- index_80_percent
# Number of components required to account for 90% of the variability
num_components_90_percent <- index_90_percent
# Print the result
cat("Number of components required to account for 80% of the variability:", num_components_80_percent, "\n")
## Number of components required to account for 80% of the variability: 3
cat("Number of components required to account for 90% of the variability:", num_components_90_percent, "\n")
## Number of components required to account for 90% of the variability: 6
Here is the explanation of each line in the code above:
Sorting: The sort() function sorts the proportions of variance explained (mycomponents) in descending order. Cumulative Sum: The cumsum() function calculates the cumulative sum of the sorted variances. Finding Index: The which.max() function finds the index where the cumulative variance exceeds or equals 80% (and 90%) in separate lines. Number of Components: The result is the number of principal components required to account for 80% (and 90%) of the variability.
mycomponents=mypca$sdev^2/sum(mypca$sdev^2)
sum(mycomponents[1:3]) #first 3 components account for 80% of variability
## [1] 0.8451073
sum(mycomponents[1:6]) #first 6 components account for 90% of variability
## [1] 0.9076338
To avoid the error from earlier “Error in mypca2[i, , , ] : subscript out of bounds” I adjusted the range in the code from [1:81] to [1:min(81, dim(im)[1]] to ensure that the loop iterates over which is smaller; 81 or the number of images I have in the dataset. I also modified the output to be a \(3*3\) table to accommodate the number of plots I have instead of 81.
mypca2 <- t(mypca$scores)
dim(mypca2) <- c(length(files), height/scale, width/scale, 3)
par(mfrow=c(3, 3)) # Adjusted to accommodate 17 plots
par(mai=c(.001, .001, .001, .001))
for (i in 1:min(81, dim(im)[1])) { # Plot the first 81 Eigenshoes only
plot_jpeg(writeJPEG(mypca2[i,,,], quality=1, bg="white"))
}
The entire code had to be modified, especially in the range, because I only have 17 images in the zip file I downloaded from Blackboard. Even though I copied the code from the .rmd file that Pr. Fulton provided us with, it was a struggle to make it work, which indicates that there is a succesfull learning process.