Introduction

Nowadays, data is collected almost at a permanent pace, simultaneously creating huge, multi-dimensional datasets. On the other hand, the vast number of variables can make a reliable analysis very complex to perform. For this reason, in many cases, dimensionality reduction is necessary and recommended to be applied.

Principal Component Analysis (PCA) is one of the most popular method to deal with mentioned issue. It is used in order to reduce the data dimension by creating its simpler version with minimum loss of information. Over time, however, other aplications of PCA were also found. It is applied, among others, in fields such as computer vision (including face recognition), theory of system control, pattern recognition or signal processing.

In the paper below I will look at yet another use of this method, which is image compression. Despite describing and presenting an example of usage of this concept, I will also show how important this application is. Finally, I will present a new approach to image compression which aims to use PCA in creating works of art in the impressionism and blur styles.

PCA

Principal Component Analysis (PCA) is a statistical technique of obtaining the most relevant features from a dataset. Mathematically it is defined as an orthogonal linear transformation which converts a potentially correlated set of variables into uncorrelated set of variables called principal components (PCs). The number of principal components cannot exceed the inital number of variables. What is more, they are sorted in terms of maintained variance, i.e. the first principal component captures the maximum possbile variance in the data, and every next component captures the next highest variance.

As a result of that approach, it is possible to safely compress the data, i.e. the redundant number of variables can be reduced losing as little information as possible. Furthermore, the new features emphasize more significantly the differences and similarities between data.

Image representation

Using readJPEG() function from jpeg package, one can convert the image from .jpg form into its 3-level matrix representation. In order to show basic PCA transformation and later image compression, I will use the following image of a squirrel.

library(factoextra)
library(jpeg)
library(magick)
sqrl <- readJPEG('squirrel.jpg')
image_read(sqrl)
Image source
dim(sqrl)
## [1] 485 728   3

As one can see, the above image is represented as three-level matrix of size 485x728. Each level corresponds to one of the primary colors (red, green, blue sequentially). Thus, in order to perform PCA, it is necessary to perform it on each level seperately. For this reason, I split the 3-level array into two-dimensional matrices named by the colors they correspond to.

red <- sqrl[,,1]
green <- sqrl[,,2]
blue <- sqrl[,,3]

Worth noticing is that in case of image compression we are not going to analyze the interpretation part of PCA transformation. The influence of particular variables on each PC does not really matter here. Mostly, we are interested in the number of principal components and the variance that they jointly capture.

For this reason, there is no need to center the data. What is more, it is even strongly not recommended to do while analyzing images. This is because after centering data and performing PCA, the image matrix simply would not have the correct RGB values. Each variable probably has different mean that must be substracted in centering and it all would just lead to a huge mess in data.

pca_red <- prcomp(red, center = FALSE)
pca_green <- prcomp(green, center = FALSE)
pca_blue <- prcomp(blue, center = FALSE)

What is this compression really about?

Image compression using PCA removes unnecessary pixels from each color matrix. However, in order not to distort the image, other values should be inserted in their place - preferably different pixels already used in this image. Needless to say, such elimination cannot be overdone at the cost of image quality. Therefore, apart from lowering the dimension of the variables, one should also take into account the graphical representation of the data.

The number of principal components

There are several approaches on how to choose the appropriate number of principal components in PCA. Based on Kaiser-Guttman criterion, one should retain components with eigenvalues greater than 1. The results of this method can be checked on the basis of the graphs below, where for each color several highest eigenvalues are presented in descending order.

library(gridExtra)

f1 <- fviz_eig(pca_red, choice = 'eigenvalue', main = "Red", barfill = "red", ncp = 7, addlabels = TRUE)
f2 <- fviz_eig(pca_green, choice = 'eigenvalue', main = "Green", barfill = "green", ncp = 7, addlabels = TRUE)
f3 <- fviz_eig(pca_blue, choice = 'eigenvalue', main = "Blue", barfill = "blue", ncp = 7, addlabels = TRUE)

grid.arrange(f1, f2, f3, ncol = 3)

As we can see, for each color the eigenvalue of one component is dominantly greater than the others. Moreover, in case of red color eigenvalues of 6 components are above 1, in case of green color of 5 components, but in case of blue color there is only one eigenvalue greater than 1.

The other well-known approach is to choose the number of components based on the percentage of jointly explained variance. It is frequently said that it is desirable when it exceeds 70%, or even 80%.

The first three graphs below are scree plots of the proportion of explained variance for each color, while next three graphs represent cumulative sums of those proportions of variance.

f1 <- fviz_eig(pca_red, main = "Red", barfill = "red", ncp = 7)
f2 <- fviz_eig(pca_green, main = "Green", barfill = "green", ncp = 7)
f3 <- fviz_eig(pca_blue, main = "Blue", barfill = "blue", ncp = 7)

grid.arrange(f1, f2, f3, ncol = 3)

par(mfrow = c(1,3))

plot(get_eigenvalue(pca_red)[1:10,3], type = "b", main = "Red", xlab = "Dimensions", ylab = "Cumulative percentage of explained variance", col = "red", ylim = c(50,100), panel.first = grid())
plot(get_eigenvalue(pca_green)[1:10,3],type = "b", main = "Green", xlab = "Dimensions", ylab = "Cumulative percentage of explained variance", col = "green", ylim = c(50,100), panel.first = grid())
plot(get_eigenvalue(pca_blue)[1:10,3],type = "b", main = "Blue", xlab = "Dimensions", ylab = "Cumulative percentage of explained variance", col = "blue", ylim = c(50,100), panel.first = grid())

In each case, the first principal component explains over 60% of the variance in the data. What is more, for red and green it is almost 70%. For red and green color the cumulative percentage of explained variance grows similarly - using first three principal components it exceeds 80% while using first 10 components over 90% of variance is explained. This cumulative percentage is slightly lower for blue color. It is generally smaller than the previous two by about 10 percentage points.

The compressed images

Combining the information provided by the above-mentioned criteria for selecting the number of principal components, I decided to present the squirrel image compression using 6 components for each color level. This number allows to explain more than 85% of the variance for red and green color, and more than 75% for blue. What is more, in case of red color, all those components have eigenvalues greater than 1.

In the below formula I use abind() function from abind package, which is analogous to rbind() and cbind() functions with the proviso that it combines multi-dimensional arrays.

library(abind)
library(ggplot2)
new_image <- abind(pca_red$x[,1:6] %*% t(pca_red$rotation[,1:6]),
                   pca_green$x[,1:6] %*% t(pca_green$rotation[,1:6]),
                   pca_blue$x[,1:6] %*% t(pca_blue$rotation[,1:6]),
                   along = 3)

plot(image_read(new_image))
title("Compressed image with 6 Components")

The obtained image may be a bit of a surprise to some. After all, the choice of the number of components was carried out in accordance with theoretical criteria, taking into account the high value of the explained variance. On the other hand, it is worth emphasizing that we have lowered the dimension from 485 to just 6, that is a lot. Thus, the image lacks many significant values. Moreover, as already mentioned, in the case of image compression, the visual evaluation of image quality is much more important than the theoretical evaluation of eigenvalues.

In order to find visually satisfying result, 8 images are presented below with the increasing number of principal components.

for (i in c(10,15,20,30,40,50,60,80)) {
  new_image <- abind(pca_red$x[,1:i] %*% t(pca_red$rotation[,1:i]),
                     pca_green$x[,1:i] %*% t(pca_green$rotation[,1:i]),
                     pca_blue$x[,1:i] %*% t(pca_blue$rotation[,1:i]),
                     along = 3)
  writeJPEG(new_image, paste0('Compressed_image_with_',i, '_components.jpg'))
}
# Auxiliary function for plotting the images
image_plot <- function(path, plot_name) {
  require('jpeg')
  img <- readJPEG(path)
  d <- dim(img)
  plot(0,0,xlim=c(0,d[2]),ylim=c(0,d[2]),xaxt='n',yaxt='n',xlab='',ylab='',bty='n')
  title(plot_name, line = -7)
  rasterImage(img,0,0,d[2],d[1])
}

par(mfrow = c(1,2), mar = c(0,0,0,0))
for (i in c(10,15,20,30,40,50,60,80)) {
  image_plot(paste0('Compressed_image_with_',i, '_components.jpg'), 
             paste0(round(i,0), ' Components'))
}

By using only up to 20 principal components, the obtained image remains blurry. Later, as their number increases, we get images of better and better quality until the last one, when for 80 components the image seems to be of a quality quite similar to the original image. This last compression does a great job by removing unneeded pixels and creating a visually satisfying image. What is more, as noted below, for each color by using 80 components we capture at least 95% of explained variance.

## 80 principal components capture:
## 98.78406 % of joint explained variance for red color.
## 98.22344 % of joint explained variance for green color.
## 95.72043 % of joint explained variance for blue color.

What is the purpose of it all?

We live in an era of constant information flow. Huge amounts of data, including visualization, are shared over the Internet from various sources and devices. High-quality images require much space and long transmission time. Representing images in the form of multidimensional data spaces is a real challenge when sharing and downloading files. Efficient in real time image transmission has become an extremely important necessity for information technology.

The basic concept of dealing with this problem is to reduce the number of bits used in the image to only those that are most needed to maintain image quality. It is for this purpose, among others, that the PCA can then be carried out. Using it, one has to find a compromise between the file size and its visual quality.

The table below shows the size (in bits) and memory ratio for each of the compressed images presented above. Memory ratio is defined as the memory required for the compressed image divided by the memory required for the original image.

library(knitr)

table <- matrix(0,9,3)
colnames(table) <- c("Number of components", "Image size", "Memory ratio")
table[,1] <- c("original image",80,60,50,40,30,20,15,10)
table[1,2:3] <- c(file.info('squirrel.jpg')$size/1000, 1)
for (i in c(2:9)) {
  path <- paste0('Compressed_image_with_',table[i,1], '_components.jpg')
  table[i,2] <- file.info(path)$size/1000
  table[i,3] <- round(as.numeric(table[i,2]) / as.numeric(table[1,2]),3)
}

kable(table)
Number of components Image size Memory ratio
original image 84.618 1
80 48.76 0.576
60 46.574 0.55
50 45.163 0.534
40 43.203 0.511
30 40.484 0.478
20 36.55 0.432
15 34.18 0.404
10 30.563 0.361

Compressed images with the fewest principal components are almost three times smaller (in terms of file size) than the original image. On the other hand, the image created with the use of 80 components whose quality was found to be satisfactory, takes almost twice less memory space than the original image. While this may seem like a minor difference, it is actually very significant.

Art with PCA

As I announced in the introduction, I will try in this paper to use the PCA to create from existing images several “works of art”. Obviously, it is a kind of fun, but on the other hand it shows an interesting and somehow original usage of image compression.

Blurred images

Blurring the image, although simple in general, often leads to interesting visualizations, especially with the right frame and lighting. Although the methods of blurring may be different, if one uses PCA for this, the blurring will be rather related to the stretching the image colors.

Three paintings with their compressed “works of art” are presented below. The number of principal components was selected exogenously by the author with care for the best artistic effect.

Street view

street <- readJPEG('street.jpg')

red_street <- street[,,1]
green_street <- street[,,2]
blue_street <- street[,,3]

pca_red_street <- prcomp(red_street, center = FALSE)
pca_green_street <- prcomp(green_street, center = FALSE)
pca_blue_street <- prcomp(blue_street, center = FALSE)

compressed_street <- abind(pca_red_street$x[,1:10] %*% t(pca_red_street$rotation[,1:10]),
                           pca_green_street$x[,1:10] %*% t(pca_green_street$rotation[,1:10]),
                           pca_blue_street$x[,1:10] %*% t(pca_blue_street$rotation[,1:10]),
                           along = 3)

par(mfrow = c(1,2), mar = c(0,0,3,0))

plot(image_read(compressed_street))
title("Compressed image \n (10 Components)")

plot(image_read(street))
title("Original image")
Original image source

City view

city <- readJPEG('city.jpg')

red_city <- city[,,1]
green_city <- city[,,2]
blue_city <- city[,,3]

pca_red_city <- prcomp(red_city, center = FALSE)
pca_green_city <- prcomp(green_city, center = FALSE)
pca_blue_city <- prcomp(blue_city, center = FALSE)

compressed_city <- abind(pca_red_city$x[,1:12] %*% t(pca_red_city$rotation[,1:12]),
                         pca_green_city$x[,1:12] %*% t(pca_green_city$rotation[,1:12]),
                         pca_blue_city$x[,1:12] %*% t(pca_blue_city$rotation[,1:12]),
                         along = 3)
plot(image_read(compressed_city))
title("Compressed image \n (12 Components)")

plot(image_read(city))
title("Original image")

Original image source

Nature view

nature <- readJPEG('nature.jpg')

red_nature <- nature[,,1]
green_nature <- nature[,,2]
blue_nature <- nature[,,3]

pca_red_nature <- prcomp(red_nature, center = FALSE)
pca_green_nature <- prcomp(green_nature, center = FALSE)
pca_blue_nature <- prcomp(blue_nature, center = FALSE)

compressed_nature <- abind(pca_red_nature$x[,1:15] %*% t(pca_red_nature$rotation[,1:15]),
                        pca_green_nature$x[,1:15] %*% t(pca_green_nature$rotation[,1:15]),
                        pca_blue_nature$x[,1:15] %*% t(pca_blue_nature$rotation[,1:15]),
                        along = 3)
plot(image_read(compressed_nature))
title("Compressed image \n (15 Components)")

plot(image_read(nature))
title("Original image")
Original image source

Next page (Part 2)

This paper had to be split into two parts due to too large memory volume.