1.Introduction

This paper is written for Unsupervised Machine Learning class at my Master’s of Data Science studies at the University of Warsaw. Basically, Principal Component Analysis (PCA) helps us to get the most important of the data out of huge datasets. In other words, PCA is reducing the number of irrelevant variables and we can get a much readable dataset to derive some patterns. On the other hand, the more we reduce our data, we get less precise perspectives.

1.1 Dataset and needed packages

# install.packages("jpeg")
# install.packages("factoextra")
# install.packages("gridExtra")
# install.packages("ggplot2")
# install.packages("magick")
# install.packages("imgpalr")
library(imgpalr)
library(jpeg)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
library(ggplot2)
library(magick)
## Linking to ImageMagick 6.9.11.57
## Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fontconfig, x11

As an image, I am going to use random flower photo taken from google. By using rasterImage() function I will give a frame to the photo.

photo<-readJPEG("~\\db\\flower.jpg")
plot(1, type="n") # plotting the rasterImage - colour photo
rasterImage(photo, 0.6, 0.6, 1.4, 1.4)

In the color photo, we get three matrices pixel by pixel and each of these matrices is for one component of RGB color. Moreover, it’s an easy way to convert to grayscale is to sum up RGB shades and divide by max value to scale up to max. 1.

2. Principal Component Analysis

photo.sum<-photo[,,1]+photo[,,2]+photo[,,3] # summing up RGB shades
photo.bw<-photo.sum/max(photo.sum)          # dividing by max
plot(1, type="n")                             # plotting the rasterImage - black & white photo
rasterImage(photo.bw, 0.6, 0.6, 1.4, 1.4)

writeJPEG(photo.bw, "photo_bw.jpg")

Principal Component Analysis (PCA) for pictures is to run individual PCA on each of shades, which are R, G, and B. So it gives us an eigenvector of shades. The first eigenvector shows the majority of variance of shades and the next vectors a little bit less. We have to integrate new shades into the picture. And this new value comes from multiplying “x” and “rotation” components of PCA. Here each color scale (R, G, B) gets its own matrix and PCA since we work on matrices.

r<-photo[,,1]
# individual matrix of R color component

g<-photo[,,2]
# individual matrix of G color component

b<-photo[,,3]
# individual matrix of B color component

Now, we use Principal Component Analysis (PCA) for each color components and will merge them into one object

r.pca<-prcomp(r, center=FALSE, scale.=FALSE)
# PCA for R color component
g.pca<-prcomp(g, center=FALSE, scale.=FALSE)
# PCA for G color component
b.pca<-prcomp(b, center=FALSE, scale.=FALSE)
# PCA for B color component

rgb.pca<-list(r.pca, g.pca, b.pca)
# merging all PCA into one object

Since we got our results, we can see the importance of Principal Components(PC)

library(gridExtra)
f1<-fviz_eig(r.pca, main="Red", barfill="red", ncp=5, addlabels=TRUE)
f2<-fviz_eig(g.pca, main="Green", barfill="green", ncp=5, addlabels=TRUE)
f3<-fviz_eig(b.pca, main="Blue", barfill="blue", ncp=5, addlabels=TRUE)
grid.arrange(f1, f2, f3, ncol=3)

Based on the results, we see here relevance. What we can do here is to create the loop for different photos. All and all, it appears that the number of pixels in the photo defines the number of principal components (PC) because it’s about the level of the complexity of the picture. Here in the code below, we make 9 photos, with min. 3 & max equal to the number of PCs. Then we multiply x * rotation and implement it on the existing pixels grid. As a final step, we save all photos into the eworking directory as objects.

vec<-seq.int(3, round(nrow(photo)), length.out=9)
for(i in vec){
photo.pca<-sapply(rgb.pca, function(j) {
    new.RGB<-j$x[,1:i] %*% t(j$rotation[,1:i])}, simplify="array")
assign(paste("photo_", round(i,0), sep=""), photo.pca) # saving as object
writeJPEG(photo.pca, paste("photo_", round(i,0), "_princ_comp.jpg", sep=""))
}
# easy plotting of new photo
plot(image_read(photo_3))

Let’s check the number of PC

round(vec,0) 
## [1]   3  31  58  86 114 142 170 197 225

As we see we have 9 PCs.

collective plotting 9 images - mechanism of retrieving names of saved files

par(mfrow=c(3,3)) 
par(mar=c(1,1,1,1))
plot(image_read(get(paste("photo_", round(vec[1],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[2],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[3],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[4],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[5],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[6],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[7],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[8],0), sep=""))))
plot(image_read(get(paste("photo_", round(vec[9],0), sep=""))))

We see a blur and different versions of the picture thanks to the loop that we created.

library(abind)
pp=10 # how many principal components should be included
photo.pca2<-abind(r.pca$x[,1:pp] %*% t(r.pca$rotation[,1:pp]),
                  g.pca$x[,1:pp] %*% t(g.pca$rotation[,1:pp]),
                  b.pca$x[,1:pp] %*% t(b.pca$rotation[,1:pp]),
                  along=3)
plot(image_read(photo.pca2))

Let’s check how the size of the photo decreases under this PS compression. I will check how the size of the photo decreases under Principal Component compression. The code below will give us the table of results.

#install.packages("Metrics")
library(Metrics)

sizes<-matrix(0, nrow=9, ncol=4)
colnames(sizes)<-c("Number of PC", "Photo size", "Compression ratio", "MSE-Mean Squared Error")
sizes[,1]<-round(vec,0)
for(i in 1:9) {
path<-paste("photo_", round(vec[i],0), "_princ_comp.jpg", sep="")
sizes[i,2]<-file.info(path)$size 
photo_mse<-readJPEG(path)
sizes[i,4]<-mse(photo, photo_mse) # from Metrics::
}
sizes[,3]<-round(as.numeric(sizes[,2])/as.numeric(sizes[9,2]),3)
sizes
##       Number of PC Photo size Compression ratio MSE-Mean Squared Error
##  [1,]            3       6908             0.547            0.031869333
##  [2,]           31      11744             0.931            0.008339967
##  [3,]           58      12666             1.004            0.005451213
##  [4,]           86      12877             1.020            0.004205556
##  [5,]          114      12877             1.020            0.003495540
##  [6,]          142      12788             1.013            0.002922083
##  [7,]          170      12599             0.998            0.001967686
##  [8,]          197      12619             1.000            0.001715570
##  [9,]          225      12619             1.000            0.001715038

3. Interpretation of results

3.1 Photo file size

It is the most important one. The aim of this is to reduce the size of the image. To know this, we have to check the final result of the image.

3.2 Photo Principal Components vs Mean Squared Error - MSE

Mean Squared Error is the measure of the difference between original and compressed photos. As the number of Principal Components lower, the more image simplifies.

References

  1. Jacek Lewkowicz. Unsupervised Learning. Presentation from the classes.

  2. https://github.com/majagarbulinska/PCAGreyImageReconstruction

  3. Image: https://ibb.co/106r3Fh