Introduction

Principal Component Analysis, hereafter referred to as PCA, is a dimensionality reduction technique often used on large data sets: it transforms a large set of variables into a smaller one without losing most of the information in the original data.

Image compression is a practical example of dimensionality reduction with PCA: the image data is reduced in dimension but still produces a similar image, so the core information is not lost.

Libraries and Setup

library(jpeg) #for readJPEG() and writeJPEG()
library(magick) #for image_read()
library(factoextra) #for fviz_eig()
library(ggplot2)
library(gridExtra) #for grid.arrange()
library(abind) #for abind()
library(knitr) #for kable()

Image Data Preparation

In this paper I will use the image of a sunflower from Kaggle.

sunflower <- readJPEG('data_input/sunflower.jpg')
image_read(sunflower)

The jpeg package also converts the image into an array representation. Let’s now take a look at the dimensions of that array.

dim(sunflower)
#> [1] 330 500   3

PCA

What we need to do now is treat the image as an array of three 330 x 500 matrices, one per channel of the RGB color scheme, and extract each color matrix so that PCA can be performed on each one separately.

red <- sunflower[,,1]
green <- sunflower[,,2]
blue <- sunflower[,,3]
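To make the array layout concrete, here is a minimal sketch on a small synthetic array rather than the sunflower image. The names toy_img, toy_red, toy_green, toy_blue and rebuilt are made up for this illustration; the only assumption is that the array has the same rows x columns x channels layout as the one returned by readJPEG().

```r
# Synthetic stand-in for an image: 4 rows x 5 columns x 3 channels,
# with values in [0, 1] like the array returned by readJPEG().
set.seed(1)
toy_img <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))

# Each slice along the third dimension is one color channel (a 4 x 5 matrix).
toy_red   <- toy_img[, , 1]
toy_green <- toy_img[, , 2]
toy_blue  <- toy_img[, , 3]

dim(toy_img)  # 4 5 3
dim(toy_red)  # 4 5

# Stacking the three matrices again gives back the original array.
rebuilt <- array(c(toy_red, toy_green, toy_blue), dim = dim(toy_img))
identical(rebuilt, toy_img)  # TRUE
```

The same slicing and re-stacking idea is what lets each channel be compressed on its own and then recombined into one image.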

PCA favors features with higher variance. Scaling a feature by a coefficient > 1 gives it more influence than it had before, while a coefficient < 1 gives it less. Here, however, every pixel intensity already lies on the same scale (readJPEG() returns values between 0 and 1), so no feature dominates simply because of its units. In other words, as long as the variables are of the same order of magnitude, centering and scaling may not be necessary for image compression.

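Note: I tried centering and scaling, but the compressed image came out darker than the original. This is most likely because the reconstruction used later multiplies the scores by the transposed rotation matrix without adding the means back.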

red.pca <- prcomp(red, center = FALSE, scale. = FALSE)
green.pca <- prcomp(green, center = FALSE, scale. = FALSE)
blue.pca <- prcomp(blue, center = FALSE, scale. = FALSE)
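The effect of centering can be illustrated with a small sketch on a synthetic matrix (not the sunflower data). The names m, pca_centered, recon_no_mean and recon_with_mean are made up for this example. It assumes the reconstruction is done as scores times the transposed rotation matrix, and shows that with center = TRUE the column means have to be added back to recover the original intensities.

```r
set.seed(42)
# Toy "channel": a 6 x 4 matrix of intensities in [0, 1].
m <- matrix(runif(24), nrow = 6, ncol = 4)

pca_centered <- prcomp(m, center = TRUE, scale. = FALSE)

# Reconstruction without restoring the column means:
# this reproduces the centered data, not the original intensities.
recon_no_mean <- pca_centered$x %*% t(pca_centered$rotation)

# Reconstruction with the column means added back.
recon_with_mean <- sweep(recon_no_mean, 2, pca_centered$center, "+")

mean(m)              # average intensity of the original
mean(recon_no_mean)  # approximately 0, i.e. much darker on a [0, 1] scale
all.equal(recon_with_mean, m, check.attributes = FALSE)  # TRUE
```

With center = FALSE, as used in this post, there are no means to restore, which keeps the reconstruction step simple.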

We then collect the three results in a list so that the RGB channels can be handled together.

list.sunflower.pca <- list(red.pca, green.pca, blue.pca)

Principal Components Representation

The next step is to look at the eigenvalue and the percentage of variance explained by each PC for each color.

The eigenvalues are obtained by squaring the standard deviations stored in pca$sdev. For example:

#red.pca standard deviations
red.pca$sdev[1:7]
#> [1] 11.989829  3.885665  2.140566  1.581951  1.445437  1.362872  1.225419

#red.pca eigenvalues
(red.pca$sdev[1:7])^2
#> [1] 143.756006  15.098395   4.582023   2.502569   2.089289   1.857419   1.501651
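As a quick check of this relationship, the sketch below uses a small synthetic matrix (the names m, pca, eig_from_sdev and eig_direct are made up for the illustration). It assumes center = FALSE, as in this post, in which case the squared standard deviations should match the eigenvalues of t(m) %*% m / (n - 1).

```r
set.seed(7)
# Toy data: 10 observations of 3 variables.
m <- matrix(runif(30), nrow = 10, ncol = 3)
pca <- prcomp(m, center = FALSE, scale. = FALSE)

# Eigenvalues as computed in the text: squared standard deviations.
eig_from_sdev <- pca$sdev^2

# With center = FALSE, prcomp() works from the SVD of the raw matrix,
# so the same values come from the eigenvalues of t(m) %*% m / (n - 1).
eig_direct <- eigen(crossprod(m) / (nrow(m) - 1), symmetric = TRUE)$values

all.equal(eig_from_sdev, eig_direct)  # TRUE
```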

To make this easier to see, we plot the eigenvalues using fviz_eig(). I display only the first seven (7) principal components (PCs). The bar colors in the plot below indicate which RGB channel the eigenvalues belong to.

f1 <- fviz_eig(red.pca, choice = 'eigenvalue', main = "Red", barfill = "red", ncp = 7, addlabels = TRUE)
f2 <- fviz_eig(green.pca, choice = 'eigenvalue', main = "Green", barfill = "green", ncp = 7, addlabels = TRUE)
f3 <- fviz_eig(blue.pca, choice = 'eigenvalue', main = "Blue", barfill = "blue", ncp = 7, addlabels = TRUE)

grid.arrange(f1, f2, f3, ncol=3)

Let us now take a look at the percentage of variance explained by these principal components.

f11 <- fviz_eig(red.pca, main = "Red", barfill = "red", ncp = 7)
f22 <- fviz_eig(green.pca, main = "Green", barfill = "green", ncp = 7)
f33 <- fviz_eig(blue.pca, main = "Blue", barfill = "blue", ncp = 7)

grid.arrange(f11, f22, f33, ncol = 3)

The scree plots above show that the first principal component of each color explains the largest share of the variance in that channel, about 30%. We could also consider keeping up to the fourth principal component. From the fifth component onward, each one explains very little variance.
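One way to turn a scree plot into a concrete choice is to accumulate the explained variance and pick the first component count that reaches a target share. The sketch below is only an illustration: components_for() is a hypothetical helper, and toy_sdev is made-up input rather than the values from the sunflower image.

```r
# Hypothetical helper: smallest number of PCs whose cumulative share of
# variance reaches the target (sdev is a vector like pca$sdev).
components_for <- function(sdev, target = 0.9) {
  explained <- sdev^2 / sum(sdev^2)
  which(cumsum(explained) >= target)[1]
}

# Made-up standard deviations, for illustration only.
toy_sdev <- c(4, 2, 1, 1)

toy_sdev^2 / sum(toy_sdev^2)               # share of variance per component
components_for(toy_sdev, target = 0.90)   # 2
components_for(toy_sdev, target = 0.95)   # 3
```

Applied to the real data, the same call would look like components_for(red.pca$sdev, 0.9), assuming red.pca has been computed as above.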

The effect of the number of PCs on the compression result is examined in the Compression of the Image section.

Compression of the Image

#create/write JPEG
#pca$x -> scores: the value of each PC for every row
#pca$rotation -> rotation matrix, whose columns are the eigenvectors

for (i in c(10,15,30,60,90,120,150,180)) { # number of principal components kept
  # rebuild each channel from its first i PCs, then stack the channels along the 3rd dimension
  new_image <- abind(red.pca$x[,1:i] %*% t(red.pca$rotation[,1:i]),
                     green.pca$x[,1:i] %*% t(green.pca$rotation[,1:i]),
                     blue.pca$x[,1:i] %*% t(blue.pca$rotation[,1:i]),
                     along = 3)
  writeJPEG(new_image, paste0('Compressed_image_with_',i, '_components.jpg'))
}
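The reconstruction step can also be wrapped in a small function, sketched here on a synthetic matrix. reconstruct() is a hypothetical helper, not part of any package, and it assumes the PCA was run with center = FALSE and scale. = FALSE as in this post. It additionally clamps the result to [0, 1], the valid range for pixel intensities, and uses drop = FALSE so that k = 1 also works.

```r
# Hypothetical helper: rebuild a matrix from its first k principal components.
# Assumes prcomp() was called with center = FALSE and scale. = FALSE.
reconstruct <- function(pca, k) {
  approx <- pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
  # Keep values inside the valid intensity range [0, 1].
  pmin(pmax(approx, 0), 1)
}

set.seed(3)
# Toy "channel": an 8 x 6 matrix of intensities in [0, 1].
m <- matrix(runif(8 * 6), nrow = 8, ncol = 6)
pca <- prcomp(m, center = FALSE, scale. = FALSE)

# Root-mean-square reconstruction error for k = 1, ..., 6 components.
errors <- sapply(1:6, function(k) sqrt(mean((m - reconstruct(pca, k))^2)))
errors
```

With all six components the toy matrix is recovered up to rounding error, while a single component leaves a visible error; this is the same trade-off the loop above explores on the real image.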

# create a function to plot an image
image_plot <- function(path, plot_name) {
  img <- jpeg::readJPEG(path)
  d <- dim(img) # d[1] = height (rows), d[2] = width (columns)
  plot(0,0,xlim=c(0,d[2]),ylim=c(0,d[1]),xaxt='n',yaxt='n',xlab='',ylab='',bty='n')
  title(plot_name, line = -0.5)
  rasterImage(img,0,0,d[2],d[1])
}

# plot the images using the function
par(mfrow = c(3,3), mar = c(0,0,1,0))
for (i in c(10,15,30,60,90,120,150,180)) {
  image_plot(paste0('Compressed_image_with_',i, '_components.jpg'), 
             paste0(round(i,0), ' Components'))
}

As the images above show, the quality of the image keeps improving as the number of principal components increases. Note that the last image uses only 180 of the 330 principal components available for the original image (a 330 x 500 matrix has at most 330 components). This is a little over half of the components, yet the quality of the two images is almost the same.
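For a rough sense of scale, the sketch below counts how many numbers a rank-k approximation of one 330 x 500 channel needs. It assumes, purely for illustration, that one would store the scores (330 x k) and the loadings (500 x k) directly; this post instead writes the reconstructed image back to a JPEG file, so these counts are not the file sizes. stored_values() is a hypothetical helper.

```r
n_rows <- 330
n_cols <- 500
full_values <- n_rows * n_cols  # numbers in one full channel: 165000

# Hypothetical helper: numbers needed to keep k components of one channel,
# i.e. k columns of scores (n_rows each) plus k columns of loadings (n_cols each).
stored_values <- function(k) k * (n_rows + n_cols)

k <- c(10, 15, 30, 60, 90, 120, 150, 180)
data.frame(
  components    = k,
  values_stored = stored_values(k),
  share_of_full = round(stored_values(k) / full_values, 3)
)
```

Under this assumption the count grows linearly with k, which is why small values of k are where a factor-based representation would pay off most.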

Now, let's compare the file sizes in kilobytes to see whether they change significantly, and so support the point that PCA-based image compression saves disk space without a large sacrifice in quality, since it only removes the less significant components.

#create empty table (named size_table to avoid masking base::table)
size_table <- matrix(0,9,4)
colnames(size_table) <- c("Number of components", 
                          "Image size (kilobytes)", 
                          "Saved Disk Space (kilobytes)", 
                          "Saved Disk Space (%)")
#the "Original" label turns the whole matrix into character, hence as.numeric() below
size_table[,1] <- c(10,15,30,60,90,120,150,180,"Original")

#input the original image info
size_table[9,2:4] <- round(c(file.info('data_input/sunflower.jpg')$size/1024, 0, 0),2)

#input the compressed images' info
for (i in 1:8) {
  path <- paste0('Compressed_image_with_',size_table[i,1], '_components.jpg')
  size_table[i,2] <- round(file.info(path)$size/1024,2)
  size_table[i,3] <- round(as.numeric(size_table[9,2]) - as.numeric(size_table[i,2]),2)
  size_table[i,4] <- round((as.numeric(size_table[i,3])/as.numeric(size_table[9,2]))*100, 2)
}

kable(size_table)
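|Number of components |Image size (kilobytes) |Saved Disk Space (kilobytes) |Saved Disk Space (%) |
|:--------------------|:----------------------|:----------------------------|:--------------------|
|10                   |24.1                   |115.79                       |82.77                |
|15                   |27.4                   |112.49                       |80.41                |
|30                   |32.58                  |107.31                       |76.71                |
|60                   |36.67                  |103.22                       |73.79                |
|90                   |38.71                  |101.18                       |72.33                |
|120                  |39.19                  |100.7                        |71.99                |
|150                  |39.33                  |100.56                       |71.89                |
|180                  |39.01                  |100.88                       |72.11                |
|Original             |139.89                 |0                            |0                    |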
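The arithmetic behind the last two columns can be checked with a short sketch. saved_space() is a hypothetical helper written for this illustration; the inputs are the original size and two of the compressed sizes reported in the table above.

```r
# Hypothetical helper: disk space saved relative to the original file.
saved_space <- function(original_kb, compressed_kb) {
  saved_kb <- original_kb - compressed_kb
  data.frame(
    size_kb   = compressed_kb,
    saved_kb  = round(saved_kb, 2),
    saved_pct = round(saved_kb / original_kb * 100, 2)
  )
}

# Original image: 139.89 KB; compressed images with 10 and 180 components.
res <- saved_space(139.89, c(24.1, 39.01))
res
# saved_kb: 115.79 and 100.88; saved_pct: 82.77 and 72.11
```

Working on numeric vectors in a data frame avoids the character conversion that the matrix version needs.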

Conclusion

Image compression with principal component analysis reduced the file size by roughly 72% to 83%, depending on the number of components kept. With 180 components, the file is about 72% smaller with little or no visible loss in image quality. Besides saving disk space, smaller files are also easier and more efficient to transmit between locations.