Taking pictures is part of everyday life, but high-quality images can quickly fill the storage on our computers and devices. Professional cameras and even phones now take pictures at very high resolution. The more pixels an image has, the more detail it holds and the larger the file becomes.
For example, a 12000x9000 picture taken with a Samsung Galaxy Ultra takes around 14 megabytes of disk space.
What can we do to save disk space without giving up much of the image quality?
This is where Principal Component Analysis (PCA) comes in. This technique transforms a high-dimensional dataset into a lower-dimensional one (in this case, the color channels of the image) while preserving most of the relationships between features. PCA keeps the principal components of the dataset, that is, the directions that carry most of its variation, hence the name.
With PCA we go from many features to fewer, without dropping what makes the dataset representative (here, the quality of the image).
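To make the idea concrete before touching the image, here is a minimal sketch on synthetic data using only base R. The matrix `m` and the helper objects `base` and `mix` are made up for illustration: five correlated features are generated from two hidden factors, and we keep two of the five principal components.

```r
# Hypothetical toy data: 100 observations of 5 correlated features.
set.seed(1)
base <- matrix(rnorm(100 * 2), nrow = 100)   # two hidden factors
mix  <- matrix(rnorm(2 * 5), nrow = 2)       # how the factors map to 5 features
m    <- base %*% mix + matrix(rnorm(100 * 5, sd = 0.01), nrow = 100)

pca <- prcomp(m, center = TRUE, scale. = FALSE)

# Keep only the first 2 of the 5 components.
reduced <- pca$x[, 1:2]
dim(reduced)

# Share of the total variance carried by the 2 components we kept.
kept_share <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
kept_share
```

Because the toy data really has only two underlying factors, two components should carry almost all of the variance. Real images are less tidy, which is why we look at the variance plots further on.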
Next we will use the “jpeg” and “magick” packages to read the following image and then proceed to the PCA.
library("jpeg")
library("magick")
## Linking to ImageMagick 6.9.12.3
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11
dog <- readJPEG("/Users/ayaxdiaz/Desktop/UL/Dimension Reduction/Chihuahua_glass.jpeg")
image_read(dog)
dim(dog)
## [1] 1200 1200 3
As the output shows, the image is stored as a 1200 x 1200 x 3 array: 1200 rows and 1200 columns of pixels, with one layer per color channel.
Every color image has RGB (red, green and blue) components. We first extract each channel of this Chihuahua image as its own matrix, and then run PCA on each one to see how much of its information the leading components carry.
red <- dog[,,1]
green <- dog[,,2]
blue <- dog[,,3]
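The same slicing can be tried without the image file. The sketch below uses a made-up 4 x 5 array called `toy` (a hypothetical stand-in for `dog`) with values between 0 and 1, and checks that the three channel matrices can be stacked back into the original array.

```r
# Hypothetical 4 x 5 "image" with 3 channels, values in [0, 1].
set.seed(42)
toy <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))

# Each channel is a plain matrix taken from the third dimension.
toy_red   <- toy[, , 1]
toy_green <- toy[, , 2]
toy_blue  <- toy[, , 3]

dim(toy)            # 4 5 3
dim(toy_red)        # 4 5
is.matrix(toy_red)  # TRUE

# Stacking the channels again gives back the original array.
restacked <- array(c(toy_red, toy_green, toy_blue), dim = dim(toy))
identical(restacked, toy)  # TRUE
```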
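This image does not need scaling or centering, because all three channels are already on the same scale: readJPEG returns every intensity as a value between 0 and 1. Scaling would divide each pixel column by its own standard deviation, so columns with little variation would be stretched by a larger coefficient than columns with a lot of variation, giving them more influence than they had before, and vice versa. Leaving the data uncentered also means the image can later be rebuilt directly from the scores and the rotation matrix, with no column means to add back.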
red.pca <- prcomp(red, center=FALSE, scale.=FALSE)
green.pca <- prcomp(green, center=FALSE, scale.=FALSE)
blue.pca <- prcomp(blue, center=FALSE, scale.=FALSE)
####list.dog.pca <- list(red.pca, green.pca, blue.pca)
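A small check of what `center = FALSE` buys us, as a sketch on a made-up matrix named `channel` (a hypothetical stand-in for one color channel). With uncentered PCA the scores times the transposed rotation give back the matrix directly. With centered PCA the column means have to be added back afterwards.

```r
# Hypothetical stand-in for one color channel: values in [0, 1].
set.seed(7)
channel <- matrix(runif(30 * 8), nrow = 30)

# Uncentered PCA: scores %*% t(rotation) rebuilds the matrix as is.
p0 <- prcomp(channel, center = FALSE, scale. = FALSE)
rebuilt0 <- p0$x %*% t(p0$rotation)
err0 <- max(abs(rebuilt0 - channel))
err0

# Centered PCA: the column means must be added back after the product.
p1 <- prcomp(channel, center = TRUE, scale. = FALSE)
rebuilt1 <- sweep(p1$x %*% t(p1$rotation), 2, p1$center, "+")
err1 <- max(abs(rebuilt1 - channel))
err1
```

Both errors should be zero up to floating-point noise when all components are kept. The difference only matters for how the reconstruction is written.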
The next graph shows the eigenvalues of the first 10 principal components (dimensions) for each of the three RGB channels.
library("factoextra")
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library("gridExtra")
library("ggplot2")
f1 <- fviz_eig(red.pca, choice = 'eigenvalue', main = "Red", barfill = "red", ncp = 10, addlabels = TRUE)
f2 <- fviz_eig(green.pca, choice = 'eigenvalue', main = "Green", barfill = "green", ncp = 10, addlabels = TRUE)
f3 <- fviz_eig(blue.pca, choice = 'eigenvalue', main = "Blue", barfill = "blue", ncp = 10, addlabels = TRUE)
grid.arrange(f1, f2, f3, ncol=3)
Now we plot the same information as percentages, to see how much of the variance in each RGB channel is explained by each component. This helps us decide which components contribute little enough to discard.
In other words, each channel is a matrix of intensities in which many pixel columns vary together. PCA summarizes that shared variation in a few components; we keep the components that capture the most variation and discard the others. The same happens with green and blue.
Let's take a look at the graph and analyze it.
f11 <- fviz_eig(red.pca, main = "Red", barfill = "red", ncp = 10)
f22 <- fviz_eig(green.pca, main = "Green", barfill = "green", ncp = 10)
f33 <- fviz_eig(blue.pca, main = "Blue", barfill = "blue", ncp = 10)
grid.arrange(f11, f22, f33, ncol = 3)
From what we see in the graph, the first component of the red channel explains about 89% of the variance. The second component is also important, while the remaining ones each contribute little and differ only slightly from one another, so they are “not that important” and are candidates for discarding.
For green, the first component explains about 87.5% of the variance, the second component is also important, and the others appear much less important than the first two.
The same holds for blue: the first component explains around 86% of the variance, the second component is also important, and the third through tenth contribute little each.
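The percentages in such a plot can be computed by hand from the `sdev` element that `prcomp` returns. The sketch below assumes the plotted percentage is each squared `sdev` divided by the sum of all of them, and uses a made-up matrix `channel` instead of the image. Note that with `center = FALSE` these are uncentered sums of squares rather than variances in the strict sense.

```r
# Hypothetical stand-in for one color channel.
set.seed(3)
channel <- matrix(runif(50 * 12), nrow = 50)
pca <- prcomp(channel, center = FALSE, scale. = FALSE)

eigenvalues <- pca$sdev^2
pct     <- 100 * eigenvalues / sum(eigenvalues)  # share per component
cum_pct <- cumsum(pct)                           # running total

round(head(pct, 10), 1)

# Smallest number of components whose cumulative share reaches 95%.
n_needed <- which(cum_pct >= 95)[1]
n_needed
```

Looking at the cumulative share is a useful complement to the bar chart: it answers directly how many components are needed to reach a chosen threshold. The 95% used here is an arbitrary illustration, not a value taken from the analysis above.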
Next we rebuild the image from different numbers of components to see how our test image looks. Reducing the number of components lowers the quality of the image somewhat and, with it, the size of the file. The following code keeps 10, 15, 30, 60, 80, 120, 180 and 210 components for each of the red, green and blue channels.
library(abind)
library(ggplot2)
for (i in c(10,15,30,60,80,120,180,210)) {
  # Rebuild each channel from its first i components, then stack the three into one RGB array
  test_image <- abind(red.pca$x[,1:i] %*% t(red.pca$rotation[,1:i]),
                      green.pca$x[,1:i] %*% t(green.pca$rotation[,1:i]),
                      blue.pca$x[,1:i] %*% t(blue.pca$rotation[,1:i]),
                      along = 3)
  writeJPEG(test_image, paste0('Compressed_image_with_',i, '_components.jpg'))
}
test_plot <- function(path, plot_name) {
  require('jpeg')
  img <- readJPEG(path)
  d <- dim(img)  # d[1] = height (rows), d[2] = width (columns)
  plot(0,0,xlim=c(0,d[2]),ylim=c(0,d[1]),xaxt='n',yaxt='n',xlab='',ylab='',bty='n')
  title(plot_name, line = -0.5)
  rasterImage(img,0,0,d[2],d[1])
}
par(mfrow = c(1,2), mar = c(0,0,1,1))
for (i in c(10,15,30,60,80,120,180,210)) {
  test_plot(paste0('Compressed_image_with_',i, '_components.jpg'),
            paste0(round(i,0), ' Components'))
}
Comparing the images, we can clearly see that the fewer components we keep, the blurrier the image becomes and the smaller the file. As we include more components, the image becomes sharper and the file grows.
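The visual impression can be backed by a number. The sketch below is an illustration on a made-up matrix `channel`, not on the Chihuahua image: it rebuilds the matrix from the first k components with the same product used in the loop above and reports the root mean squared error for several values of k. The helper `rebuild` and the chosen values in `ks` are hypothetical.

```r
# Hypothetical stand-in for one color channel.
set.seed(11)
channel <- matrix(runif(60 * 40), nrow = 60)
pca <- prcomp(channel, center = FALSE, scale. = FALSE)

# Rebuild the matrix from the first k components only.
rebuild <- function(pca, k) {
  pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
}

ks <- c(1, 5, 10, 20, 40)
rmse <- sapply(ks, function(k) sqrt(mean((rebuild(pca, k) - channel)^2)))
data.frame(components = ks, rmse = round(rmse, 4))
```

The error shrinks as k grows and reaches zero (up to rounding) when all 40 components are used, which mirrors what we see in the pictures.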
The following table lists the number of components used for each of the previous images, the size of each image file in kilobytes, and how many kilobytes we save compared with the original.
library(knitr)
table <- matrix(0,9,3)
colnames(table) <- c("Number of components", "Image size (kilobytes)", "Saved Disk Space (kilobytes)")
table[,1] <- c(10,15,30,60,80,120,180,210, "Original Chihuahua image")
table[9,2:3] <- round(c(file.info('Chihuahua_glass.jpeg')$size/1024, 0),2)
for (i in c(1:8)) {
  path <- paste0('Compressed_image_with_',table[i,1], '_components.jpg')
  table[i,2] <- round(file.info(path)$size/1024,2)
  table[i,3] <- round(as.numeric(table[9,2]) - as.numeric(table[i,2]),2)
}
kable(table)
| Number of components | Image size (kilobytes) | Saved Disk Space (kilobytes) |
|---|---|---|
| 10 | 68.01 | 104.91 |
| 15 | 74.38 | 98.54 |
| 30 | 80.4 | 92.52 |
| 60 | 88.33 | 84.59 |
| 80 | 91.99 | 80.93 |
| 120 | 97.11 | 75.81 |
| 180 | 102.08 | 70.84 |
| 210 | 104.11 | 68.81 |
| Original Chihuahua image | 172.92 | 0 |
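To read the savings as percentages, the figures from the table can be typed in and divided by the original size. The vector names below are made up for this sketch; the numbers are copied from the table.

```r
# Values copied from the table above.
components  <- c(10, 15, 30, 60, 80, 120, 180, 210)
size_kb     <- c(68.01, 74.38, 80.4, 88.33, 91.99, 97.11, 102.08, 104.11)
original_kb <- 172.92

saved_kb  <- original_kb - size_kb
saved_pct <- round(100 * saved_kb / original_kb, 1)
data.frame(components, size_kb, saved_kb, saved_pct)
```

With 210 components the saving is about 39.8% of the original file, and with 10 components it is about 60.7%.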
Image reduction with PCA helps reduce the size of image files with little loss of quality. In our case, keeping 210 components saved about 40% of the original file size (68.81 of 172.92 kilobytes) while preserving the visible quality of the image.