Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  1. Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(HSAUR2)
Warning: package 'HSAUR2' was built under R version 4.4.3
Loading required package: tools
data(USairpollution)

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
svd_air <- svd(scale(USairpollution))

D_air <- svd_air$d

D_air %>% round(2)
[1] 10.45  7.78  7.47  5.97  3.72  2.00  1.01

It looks like a list of decreasing numbers. Potentially eigenvalues.

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

U_air <- svd_air$u
V_air <- svd_air$v


recomposed_x <- U_air %*% diag(D_air) %*% t(V_air) %>% round(2)

plot(as.matrix(scale(USairpollution)), recomposed_x); abline(0,1)

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

X <- scale(USairpollution)

k <- 2
head(Xtilde2 <- U_air[,1:k] %*% diag(D_air[1:k]) %*% t(V_air[,1:k])) %>% round(1)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  0.1 -0.1 -0.2 -0.3  0.0  0.6  0.6
[2,] -1.1  0.8 -0.4 -0.1 -0.6 -1.7 -2.3
[3,] -0.2  0.1 -0.5 -0.5 -0.1  0.4  0.3
[4,]  0.4 -0.3  0.4  0.4  0.2  0.0  0.2
[5,]  0.5 -0.4  0.1  0.0  0.3  1.0  1.3
[6,] -0.4  0.2 -0.9 -0.9 -0.2  0.9  0.7
cor(as.vector(X), as.vector(Xtilde2))
[1] 0.7783182
k <- 3
head(Xtilde3 <- U_air[,1:k] %*% diag(D_air[1:k]) %*% t(V_air[,1:k])) %>% round(1)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  0.0 -1.0 -0.6 -0.7  0.5 -0.1  0.7
[2,] -1.1  0.0 -0.7 -0.5 -0.2 -2.3 -2.2
[3,] -0.2  0.8 -0.2 -0.1 -0.4  0.9  0.2
[4,]  0.4  0.0  0.5  0.5  0.1  0.2  0.2
[5,]  0.5 -1.6 -0.4 -0.7  0.9  0.1  1.5
[6,] -0.4  0.3 -0.8 -0.9 -0.2  0.9  0.7
cor(as.vector(X), as.vector(Xtilde3))
[1] 0.897252
k <- 4
head(Xtilde4 <- U_air[,1:k] %*% diag(D_air[1:k]) %*% t(V_air[,1:k])) %>% round(1)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  0.5 -1.2 -0.6 -0.9 -0.5 -0.3  0.9
[2,] -1.0 -0.1 -0.7 -0.5 -0.5 -2.3 -2.2
[3,] -0.3  0.8 -0.2 -0.1 -0.3  0.9  0.1
[4,]  0.4  0.0  0.5  0.5  0.0  0.2  0.2
[5,] -0.2 -1.3 -0.4 -0.5  2.3  0.4  1.3
[6,]  0.4 -0.1 -0.9 -1.1 -1.9  0.6  0.9
cor(as.vector(X), as.vector(Xtilde4))
[1] 0.9656544

We need at least 4 dimensions to yield a correlation of at least 0.9.

D)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

  • The eigenvectors of \(\Sigma\) equal the columns of \(V\)
  • The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)
cov(X)
                SO2        temp        manu       popul        wind      precip
SO2      1.00000000 -0.43360020  0.64476873  0.49377958  0.09469045  0.05429434
temp    -0.43360020  1.00000000 -0.19004216 -0.06267813 -0.34973963  0.38625342
manu     0.64476873 -0.19004216  1.00000000  0.95526935  0.23794683 -0.03241688
popul    0.49377958 -0.06267813  0.95526935  1.00000000  0.21264375 -0.02611873
wind     0.09469045 -0.34973963  0.23794683  0.21264375  1.00000000 -0.01299438
precip   0.05429434  0.38625342 -0.03241688 -0.02611873 -0.01299438  1.00000000
predays  0.36956363 -0.43024212  0.13182930  0.04208319  0.16410559  0.49609671
            predays
SO2      0.36956363
temp    -0.43024212
manu     0.13182930
popul    0.04208319
wind     0.16410559
precip   0.49609671
predays  1.00000000
eigenSigma <- eigen(cov(X))
(eigenSigma$vectors + V_air) %>% round(2)
     [,1]  [,2]  [,3]  [,4] [,5]  [,6]  [,7]
[1,]    0 -0.17 -0.03  0.81    0 -0.37 -0.30
[2,]    0  0.18 -1.35 -0.37    0 -1.22  0.05
[3,]    0  0.45 -0.53 -0.05    0  0.09  1.49
[4,]    0  0.56 -0.69 -0.23    0  0.18 -1.30
[5,]    0 -0.11  0.62 -1.72    0 -0.30 -0.03
[6,]    0 -1.25 -0.98 -0.37    0  1.11  0.02
[7,]    0 -1.36  0.22  0.22    0 -1.01 -0.02
(eigenSigma$vectors - V_air) %>% round(2)
      [,1] [,2] [,3] [,4]  [,5] [,6] [,7]
[1,]  0.98    0    0    0  1.46    0    0
[2,] -0.63    0    0    0  0.32    0    0
[3,]  1.08    0    0    0 -0.33    0    0
[4,]  0.98    0    0    0 -0.70    0    0
[5,]  0.50    0    0    0  0.54    0    0
[6,]  0.00    0    0    0  0.32    0    0
[7,]  0.52    0    0    0 -0.88    0    0

The eigenvectors of \(\Sigma\) equal the columns of \(V\). Some columns have their signs flipped but that’s ok.

n <- 41
(eigenSigma$values - svd(X)$d^2/(41-1)) %>% round(1)
[1] 0 0 0 0 0 0 0

The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\), proved by showing the difference is equal to 0.

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

load('Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

library(magick)
Warning: package 'magick' was built under R version 4.4.3
Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
recompose <- \(k) {
  recompose_image <- mysteryU4[,1:k] %*% diag(mysteryD4[1:k]) %*% t(mysteryV4[,1:k])
  R <- max(recompose_image)-min(recompose_image)
  recomposed_scaled_data <- (recompose_image-min(recompose_image))/R
  recomposed_image <- (recomposed_scaled_data
                       %>% as.raster
                       %>% image_read
                       )
  return(recomposed_image)
}
recompose(4)

4 dimensions is not enough to make out the character.

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?

recompose <- \(k) {
  recompose_image <- mysteryU10[,1:k] %*% diag(mysteryD10[1:k]) %*% t(mysteryV10[,1:k])
  R <- max(recompose_image)-min(recompose_image)
  recomposed_scaled_data <- (recompose_image-min(recompose_image))/R
  recomposed_image <- (recomposed_scaled_data
                       %>% as.raster
                       %>% image_read
                       )
  return(recomposed_image)
}
recompose(10)

I can tell that it’s a man with glasses, but nothing really more.

recompose <- \(k) {
  recompose_image <- mysteryU50[,1:k] %*% diag(mysteryD50[1:k]) %*% t(mysteryV50[,1:k])
  R <- max(recompose_image)-min(recompose_image)
  recomposed_scaled_data <- (recompose_image-min(recompose_image))/R
  recomposed_image <- (recomposed_scaled_data
                       %>% as.raster
                       %>% image_read
                       )
  return(recomposed_image)
}
recompose(50)

Is that Professor Iverson? We can see the full image pretty clearly when we use 50 dimensional data.

C)

How many numbers need to be stored in memory for each of the following:

  • A full \(700\times 600\) image?
700*600
[1] 420000

420,000 numbers

  • A 4-dimensional approximation?
(700*4+4+4^2)
[1] 2820

2,820 numbers

  • A 10-dimensional approximation?
(700*10+10+10^2)
[1] 7110

7,110 numbers

  • A 50-dimensional approximation?
(700*50+50+50^2)
[1] 37550

37,550 numbers