Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  1. Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.3
Warning: package 'tibble' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.3
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'dplyr' was built under R version 4.4.3
Warning: package 'stringr' was built under R version 4.4.3
Warning: package 'forcats' was built under R version 4.4.3
Warning: package 'lubridate' was built under R version 4.4.3
library(HSAUR2)
Warning: package 'HSAUR2' was built under R version 4.4.3
data(USairpollution)

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

X <- as.matrix(USairpollution)




components <- svd(X)
U <- components$u
D <- components$d
V <- components$v


D
[1] 7051.94936  931.12109  540.46297   92.70909   85.23724   52.94650   10.14090

D is a 7*7 diagonal matrix with those singular values on the diagonal in descending order, zeros elsewhere. the values are related to the amount each component captures.

The first singular value dominates the decomposition: highest variance.

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

X<- (U %*% diag(D) %*% t(V)) %>% round(1)
k <- 2
Xtilde2 <- U[,1:k] %*% diag(D[1:k]) %*% t(V[,1:k]) %>% round(1)
k <- 3
Xtilde3 <- U[,1:k] %*% diag(D[1:k]) %*% t(V[,1:k]) %>% round(1)
plot(X, Xtilde2); abline(0,1)

plot(X, Xtilde3); abline(0,1)

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

cor(as.vector(X), as.vector(Xtilde2))
[1] 0.9965504
cor(as.vector(X), as.vector(Xtilde3))
[1] 0.9997697

The 2-dimensional approximation has a correlation of 0.997 with the original data, and the 3-dimensional approximation has 0.9998. Since both are above 0.9, 2 dimensions are enough.

D)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

  • The eigenvectors of \(\Sigma\) equal the columns of \(V\)
  • The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)
Xc <- scale(X, center = TRUE, scale = FALSE)
n <- nrow(Xc)

Sigma <- cov(Xc)


eig <- eigen(Sigma)
eig_values <- eig$values
eig_vectors <- eig$vectors

V %>% round(2)
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,] -0.03  0.00  0.21 -0.82 -0.54  0.03  0.01
[2,] -0.03  0.20  0.30  0.50 -0.67 -0.39 -0.11
[3,] -0.65 -0.71  0.26  0.09  0.00 -0.01  0.00
[4,] -0.75  0.57 -0.33 -0.07  0.01  0.01  0.00
[5,] -0.01  0.03  0.05  0.04 -0.04 -0.11  0.99
[6,] -0.02  0.13  0.24  0.24 -0.22  0.91  0.06
[7,] -0.07  0.34  0.79 -0.11  0.47 -0.12 -0.04
abs_cor <- abs(cor(eig_vectors, V))
round(abs_cor,4)
       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]
[1,] 0.9951 0.2474 0.5325 0.0266 0.2576 0.0762 0.2329
[2,] 0.1117 0.9126 0.5421 0.0084 0.0158 0.0122 0.0216
[3,] 0.3182 0.3711 0.7105 0.2660 0.6257 0.1210 0.2230
[4,] 0.1377 0.0681 0.1900 0.8734 0.4676 0.0764 0.0615
[5,] 0.3476 0.0166 0.0914 0.4018 0.6279 0.6186 0.2044
[6,] 0.0891 0.1384 0.2481 0.2644 0.3739 0.7837 0.3973
[7,] 0.3319 0.0319 0.1870 0.1173 0.0215 0.3127 0.9475
D2_over_n1 <- (D^2) / (n-1)
round(cbind(eig_values, D2_over_n1),4)
      eig_values   D2_over_n1
[1,] 638471.9651 1243249.7439
[2,]  14812.0394   21674.6621
[3,]    701.9728    7302.5054
[4,]    205.0138     214.8744
[5,]    116.6895     181.6347
[6,]     12.0676      70.0833
[7,]      1.4480       2.5709

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

library(magick)
Warning: package 'magick' was built under R version 4.4.3
Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
url=load('Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

recompose_villain <- function(U, D, V, k) {

  X_approx <- U[, 1:k] %*% diag(D[1:k]) %*% t(V[, 1:k])
  
 
  X_min <- min(X_approx)
  X_max <- max(X_approx)
  X_scaled <- (X_approx - X_min) / (X_max - X_min)
  

  img <- image_read(as.raster(X_scaled))
  return(img)
}

img4 <- recompose_villain(mysteryU4, diag(mysteryD4), mysteryV4, 4)

image_resize(img4, geometry_size_pixels(width = 200))

Using 4D approximation to this image is not enough to tell who the mystery villain is. It’s very blurry.

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?

img10 <- recompose_villain(mysteryU10, mysteryD10, mysteryV10, 10)

image_resize(img10, geometry_size_pixels(width = 200))

Here is is a bit blurry but I can tell who it is.

img50 <- recompose_villain(mysteryU50, mysteryD50, mysteryV50, 50)

image_resize(img50, geometry_size_pixels(width = 200))

Yes, now it clear.

C)

How many numbers need to be stored in memory for each of the following:

  • A full \(700\times 600\) image?
  • A 4-dimensional approximation?
  • A 10-dimensional approximation?
  • A 50-dimensional approximation?

A full 700 x 600 image =700×600 = 420,000

k=4

4D= (700 × 4) + 4 + 4^2 = 2,820

k= 10

10D = (700 x 10) +10 +10^2 = 7,110

k = 50

50D= (700 x 50 ) 50 + 50^2 = 37,550