Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs

Click the Publish button
Select RPubs
Follow the prompts for creating an account on RPubs and publishing your solution

Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(HSAUR2)

Loading required package: tools

data(USairpollution)

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

X <- as.matrix(scale(USairpollution, center = TRUE, scale = FALSE))
sv <- svd(X)
D <- diag(sv$d)
D

         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]    [,7]
[1,] 5053.601   0.0000   0.0000  0.00000  0.00000  0.00000 0.00000
[2,]    0.000 769.7282   0.0000  0.00000  0.00000  0.00000 0.00000
[3,]    0.000   0.0000 167.5661  0.00000  0.00000  0.00000 0.00000
[4,]    0.000   0.0000   0.0000 90.55428  0.00000  0.00000 0.00000
[5,]    0.000   0.0000   0.0000  0.00000 68.32413  0.00000 0.00000
[6,]    0.000   0.0000   0.0000  0.00000  0.00000 21.96092 0.00000
[7,]    0.000   0.0000   0.0000  0.00000  0.00000  0.00000 7.61237

The matrix is a 7 × 7 diagonal matrix that contains the singular values of the centered U.S. air pollution data. Only the main diagonal has nonzero entries and all other elements are zero. The diagonal values represent the strength or variance captured by each corresponding singular vector. These singular values are sorted in descending order, showing that the first few components capture most of the variation in the data, while the later ones contribute much less.

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

X <- as.matrix(scale(USairpollution, center = TRUE, scale = FALSE))


sv <- svd(X)
U <- sv$u
D <- diag(sv$d)
V <- sv$v

X_recon <- U %*% D %*% t(V)

plot(as.vector(X), as.vector(X_recon),
     xlab = "Entries of X",
     ylab = "Entries of UDV^T",
     main = "Verification of X = UDV^T",
     pch = 16, col = "blue")
abline(0, 1, col = "red", lwd = 2)

Since all points lie precisely on the red 45° line, this confirms that the singular value decomposition perfectly reconstructs the original data.

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

X  <- as.matrix(scale(USairpollution, center = TRUE, scale = FALSE))
d  <- svd(X)$d

cumprop <- sqrt(cumsum(d^2) / sum(d^2))
k_min   <- which(cumprop >= 0.9)[1]

k_min

[1] 1

cumprop

[1] 0.9878146 0.9992071 0.9997438 0.9999005 0.9999897 0.9999989 1.0000000

The fewest number of dimensions required to achieve a correlation of at least 0.9 is 1. This shows that most of the variability in U.S. air pollution data is explained by the first principal direction, with very little additional information gained from higher dimensions.

D)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

The eigenvectors of \(\Sigma\) equal the columns of \(V\)
The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)

X <- as.matrix(scale(USairpollution, center = TRUE, scale = FALSE))
sv <- svd(X); U <- sv$u; d <- sv$d; V <- sv$v
n <- nrow(X)


Sigma <- cov(X)


eg <- eigen(Sigma)    


val_cmp <- cbind(eig = eg$values, d2_over = d^2/(n-1))
print(val_cmp)

              eig      d2_over
[1,] 6.384720e+05 6.384720e+05
[2,] 1.481204e+04 1.481204e+04
[3,] 7.019599e+02 7.019599e+02
[4,] 2.050019e+02 2.050019e+02
[5,] 1.167047e+02 1.167047e+02
[6,] 1.205705e+01 1.205705e+01
[7,] 1.448704e+00 1.448704e+00

cat("max |eigenvals - d^2/(n-1)| =",
    max(abs(eg$values - d^2/(n-1))), "\n")

max |eigenvals - d^2/(n-1)| = 9.313226e-10

s <- sign(diag(t(V) %*% eg$vectors))
V_aligned <- eg$vectors %*% diag(s)
cat("max |V - aligned_eigvecs| =", max(abs(V - V_aligned)), "\n")

max |V - aligned_eigvecs| = 5.02709e-13

Based upon the code outputs, both statements are upheld and are true. SVD and eigen-decomposition describe the same structure of the data. SVD works directly on the data matrix, while eigen-decomposition works on its covariance matrix.

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

load('Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

library(magick)

Linking to ImageMagick 6.9.13.29
Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
Disabled features: fftw, ghostscript, x11

show_svd_image <- function(u, d, v) {
  
  X_approx <- u %*% diag(d) %*% t(v)
  
  
  X_scaled <- (X_approx - min(X_approx)) / (max(X_approx) - min(X_approx))
  
 
  img_raster <- as.raster(X_scaled, max = 1)
  
 
  plot(as.raster(img_raster))
  
  invisible(X_scaled)
}


show_svd_image(mysteryU4, mysteryD4, mysteryV4)

This is not enough to show who the villain is, as the image is blurred and distorted.

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')



show_svd_image <- function(u, d, v) {
  X_approx <- u %*% diag(d) %*% t(v)
  X_scaled <- (X_approx - min(X_approx)) / (max(X_approx) - min(X_approx))
  plot(as.raster(X_scaled, max = 1))
}


show_svd_image(mysteryU10, mysteryD10, mysteryV10)

show_svd_image(mysteryU50, mysteryD50, mysteryV50)

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?

At K = 10, I am able to see the formation of the outline of the “mystery villain’s” face and also some of the outline of the features (eyes, nose, mouth). Nonetheless, the person still isn’t recognizable at this point. However, at k = 50, I can fully see the description and features of the “mystery villain”, as well the background of the image as well. So, k = 50 is the point in which I can tell who the villain is.

C)

How many numbers need to be stored in memory for each of the following:

A full \(700\times 600\) image?

700 x 600 = 420,000 numbers

SVD approximation = 700k + 600k + k = k(700 + 600 + 1) = 1301k
A 4-dimensional approximation?

1301(4) = 5,204 numbers
A 10-dimensional approximation?

1301(10) = 13,010 numbers
A 50-dimensional approximation?

1301(50) = 65,050 numbers