Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  1. Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(HSAUR2)
Warning: package 'HSAUR2' was built under R version 4.4.3
Loading required package: tools

Attaching package: 'HSAUR2'

The following object is masked from 'package:tidyr':

    household
data(USairpollution)

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

head(USairpollution)
            SO2 temp manu popul wind precip predays
Albany       46 47.6   44   116  8.8  33.36     135
Albuquerque  11 56.8   46   244  8.9   7.77      58
Atlanta      24 61.5  368   497  9.1  48.34     115
Baltimore    47 55.0  625   905  9.6  41.31     111
Buffalo      11 47.1  391   463 12.4  36.11     166
Charleston   31 55.2   35    71  6.5  40.75     148
svd_air <- svd(scale(USairpollution, center = TRUE, scale = FALSE))
x_a <- as.matrix(scale(USairpollution, center = TRUE, scale = FALSE))

U_air <- svd_air$u
D_air <- svd_air$d
V_air <- svd_air$v

D_air
[1] 5053.60055  769.72821  167.56610   90.55428   68.32413   21.96092    7.61237

This matrix is 7 values. Considering there are 7 dimensions in the original data set, we would expect to see these 7 values.

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

svd_est <- U_air %*% diag(D_air) %*% t(V_air)

plot(x_a, svd_est); abline(0,1)

cor(as.vector(x_a), as.vector(svd_est))
[1] 1

There is agreement along the 0/1 line.

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

svd_est_temp <- U_air[,1:2] %*% diag(D_air[1:2]) %*% t(V_air[,1:2])

plot(x_a, svd_est_temp); abline(0,1)

cor(as.vector(x_a), as.vector(svd_est_temp))
[1] 0.9992071

When we use a dimension of 2, we can see that the correlation is almost one. The correlation is 0.9965. There are some technical difficulties when we try and go down to 1 dimension, so 2 will suffice.

## D)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

  • The eigenvectors of \(\Sigma\) equal the columns of \(V\)
  • The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)
eigen_sigma <- eigen(cov(scale(USairpollution, center = TRUE, scale = FALSE)))
eigen_sigma$vectors
              [,1]         [,2]         [,3]        [,4]         [,5]
[1,]  0.0168607518 -0.099835625  0.208775573  0.95883106 -0.152191203
[2,] -0.0011417794  0.025814390 -0.071600745 -0.11014784 -0.477854201
[3,]  0.6968327936 -0.710249079 -0.067182201 -0.07319788 -0.009643654
[4,]  0.7170284512  0.692912523  0.056666935  0.04906669  0.010735457
[5,]  0.0004067530 -0.001011680  0.005386606 -0.01506609  0.025401917
[6,] -0.0004336922  0.001225937  0.265807619 -0.16261712 -0.832729325
[7,]  0.0028836950 -0.069155051  0.934279828 -0.18459052  0.232812295
             [,6]          [,7]
[1,] -0.053952911 -2.704138e-02
[2,] -0.852534945 -1.640507e-01
[3,] -0.002153023  1.136208e-03
[4,]  0.002915751 -5.682006e-05
[5,]  0.176541256 -9.838347e-01
[6,]  0.453206155  6.376788e-02
[7,] -0.183568735 -1.891454e-02
svd_air$v
              [,1]         [,2]         [,3]        [,4]         [,5]
[1,] -0.0168607518  0.099835625  0.208775573 -0.95883106  0.152191203
[2,]  0.0011417794 -0.025814390 -0.071600745  0.11014784  0.477854201
[3,] -0.6968327936  0.710249079 -0.067182201  0.07319788  0.009643654
[4,] -0.7170284512 -0.692912523  0.056666935 -0.04906669 -0.010735457
[5,] -0.0004067530  0.001011680  0.005386606  0.01506609 -0.025401917
[6,]  0.0004336922 -0.001225937  0.265807619  0.16261712  0.832729325
[7,] -0.0028836950  0.069155051  0.934279828  0.18459052 -0.232812295
             [,6]          [,7]
[1,] -0.053952911 -2.704138e-02
[2,] -0.852534945 -1.640507e-01
[3,] -0.002153023  1.136208e-03
[4,]  0.002915751 -5.682006e-05
[5,]  0.176541256 -9.838347e-01
[6,]  0.453206155  6.376788e-02
[7,] -0.183568735 -1.891454e-02
eigen_sigma$values
[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
(svd_air$d)^2/(nrow(USairpollution)-1)
[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00

The eigen vectors of Sigma and the columns of V match. They are negative version of each other, but they are vectors, so they are pointing in the same direction. Additionally, the eigenvalues of Sigma match the diagonals of D squared and then divided by n-1.

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

load('./Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

library(magick)
Warning: package 'magick' was built under R version 4.4.3
Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
recomposed_4 <- mysteryU4 %*% diag(mysteryD4) %*% t(mysteryV4)

R <- max(recomposed_4)-min(recomposed_4)
recomposed_scaled_4 <- (recomposed_4-min(recomposed_4))/R
(recomposed_scaled_4
  %>% as.raster
  %>% image_read
 )

The 4 dimensional representation will not be enough to properly view the image.

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('./Data/mystery_person_k10.Rdata')
load('./Data/mystery_person_k50.Rdata')

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?

recomposed_10 <- mysteryU10 %*% diag(mysteryD10) %*% t(mysteryV10)

R_10 <- max(recomposed_10)-min(recomposed_10)
recomposed_scaled_10 <- (recomposed_10-min(recomposed_10))/R_10

(recomposed_scaled_10
  %>% as.raster
  %>% image_read
 )

recomposed_50 <- mysteryU50 %*% diag(mysteryD50) %*% t(mysteryV50)

R_50 <- max(recomposed_50)-min(recomposed_50)
recomposed_scaled_50 <- (recomposed_50-min(recomposed_50))/R_50

(recomposed_scaled_50
  %>% as.raster
  %>% image_read
 )

The villain is certain DSCI 326 Professor! You can start to make out this villain after 10, but only those who are very familiar with him could make him out. The image is clear with 50.

C)

How many numbers need to be stored in memory for each of the following:

  • A full \(700\times 600\) image?
  • A 4-dimensional approximation?
  • A 10-dimensional approximation?
  • A 50-dimensional approximation?
700*4 + 4 + 4^2
[1] 2820
700*10 + 10 + 10^2
[1] 7110
700*50 + 50 + 50^2
[1] 37550

As follows:

  • 4-dimensions: 2820
  • 10-dimensions: 7110
  • 50-dimensions: 37550