Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs

Click the Publish button
Select RPubs
Follow the prompts for creating an account on RPubs and publishing your solution

Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(HSAUR2)

Loading required package: tools

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data(USairpollution)

dim(USairpollution)

[1] 41  7

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

# Extract numeric columns only
X <- as.matrix(USairpollution[, sapply(USairpollution, is.numeric)])

#X_scaled <- scale(X)

# Perform SVD
svd_result <- svd(X)

U <- svd_result$u
V <- svd_result$v

# Diagonal matrix D of singular values
D <- diag(svd_result$d)
D

         [,1]     [,2]    [,3]     [,4]     [,5]    [,6]    [,7]
[1,] 7051.949   0.0000   0.000  0.00000  0.00000  0.0000  0.0000
[2,]    0.000 931.1211   0.000  0.00000  0.00000  0.0000  0.0000
[3,]    0.000   0.0000 540.463  0.00000  0.00000  0.0000  0.0000
[4,]    0.000   0.0000   0.000 92.70909  0.00000  0.0000  0.0000
[5,]    0.000   0.0000   0.000  0.00000 85.23724  0.0000  0.0000
[6,]    0.000   0.0000   0.000  0.00000  0.00000 52.9465  0.0000
[7,]    0.000   0.0000   0.000  0.00000  0.00000  0.0000 10.1409

This is a p * p matrix, where p is the amount of numeric variables are in the data (in this case, 7.)

The diagonal non-zero values are the single values σ, sorted from greatest to least. It measures the amount of variation along the i-th singular vector.

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

X_reconstructed <- svd_result$u %*% D %*% t(svd_result$v)

# Flatten matrices to vectors
X_vec <- as.vector(X)
X_recon_vec <- as.vector(X_reconstructed)

# Scatter plot
plot(X_vec, X_recon_vec, 
     xlab = "Original X", 
     ylab = "Reconstructed UDV^T",
     main = "SVD",
     pch = 16, col = "blue")

# Add the 0/1 line (y = x)
abline(0, 1, col = "red", lwd = 2)

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

# Numeric columns
X <- as.matrix(USairpollution[, sapply(USairpollution, is.numeric)])

# SVD
components <- svd(X)
U <- components$u   
D <- components$d
V <- components$v


k <- 2

(Xtilde2 <- U[,1:k] %*% diag(D[1:k]) %*% t(V[,1:k])) %>% round(1)

       [,1]  [,2]   [,3]   [,4] [,5] [,6]  [,7]
 [1,]   3.5  23.8   17.2  151.5  3.6 15.2  42.1
 [2,]   5.9  35.8   45.7  244.5  5.5 22.8  63.7
 [3,]  17.3  37.8  352.3  516.8  6.2 24.6  73.1
 [4,]  30.4  63.4  629.6  900.0 10.5 41.3 123.5
 [5,]  17.1  32.9  363.0  498.0  5.5 21.5  64.8
 [6,]   2.4  20.0    0.5  115.4  3.0 12.7  35.1
 [7,] 131.5  80.9 3371.0 3334.4 17.1 56.6 208.0
 [8,]  18.2  20.7  433.8  488.4  3.8 13.8  44.9
 [9,]  34.5  -2.1  962.8  807.5  1.1  0.1  15.2
[10,]  16.2  56.9  260.0  548.2  9.0 36.6 104.9
[11,]  29.3  50.8  643.7  839.4  8.6 33.3 101.6
[12,]  19.2  26.4  443.0  528.5  4.6 17.4  54.9
[13,]   6.3  26.2   86.1  223.8  4.1 16.8  47.7
[14,]  50.9  96.7 1088.6 1481.9 16.2 63.2 190.8
[15,]  11.3 -15.3  362.9  221.1 -1.8 -9.2 -19.7
[16,]  38.8  98.3  746.5 1199.8 16.0 63.6 186.9
[17,]  22.1  73.6  368.8  736.7 11.7 47.4 136.3
[18,]  13.5  71.1  138.3  526.0 11.0 45.4 127.3
[19,]  17.7  35.8  369.9  520.6  6.0 23.4  70.1
[20,]   4.6  18.8   65.6  163.8  2.9 12.1  34.2
[21,]  17.7  60.9  289.1  595.9  9.6 39.2 112.5
[22,]  19.2  58.8  336.5  624.1  9.4 37.9 109.7
[23,]  11.0  36.3  183.1  364.5  5.8 23.4  67.3
[24,]  25.5  43.8  559.9  728.1  7.4 28.7  87.7
[25,]  28.5  32.6  680.9  766.8  5.9 21.8  70.5
[26,]  14.5  42.3  261.2  465.3  6.8 27.3  79.3
[27,]  11.4  38.9  186.7  382.2  6.2 25.0  71.8
[28,]   8.3  44.1   83.6  324.4  6.8 28.2  78.9
[29,]  10.6  37.2  171.2  359.4  5.9 23.9  68.6
[30,]  71.6  81.4 1712.1 1924.8 14.8 54.4 176.3
[31,]  15.9  62.2  233.5  555.9  9.8 39.9 113.6
[32,]  17.4  45.1  331.6  541.4  7.3 29.2  85.6
[33,]  10.4  -4.0  301.8  233.9 -0.2 -2.1  -1.1
[34,]  10.0  30.3  177.0  324.6  4.8 19.5  56.6
[35,]   6.4  17.3  118.4  200.1  2.8 11.2  32.7
[36,]  23.2  53.9  462.5  703.5  8.8 35.0 103.6
[37,]  18.2  44.1  358.0  558.0  7.2 28.6  84.3
[38,]  27.5   4.6  745.0  660.4  1.8  4.0  22.6
[39,]  23.7  65.8  438.8  751.3 10.6 42.5 123.9
[40,]   8.1  33.2  115.3  288.9  5.2 21.3  60.5
[41,]   3.4  13.3   49.9  118.9  2.1  8.5  24.3

# Plot original vs reconstructed
plot(X, Xtilde2)
abline(0,1)

# Correlation
cor(as.vector(X), as.vector(Xtilde2))

[1] 0.9965504

From this chart, we can get as low as 2 dimensions, and still have a pretty high correlation, at ~0.996.

D)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

The eigenvectors of \(\Sigma\) equal the columns of \(V\)
The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)

n <- nrow(X)

# Center the data, DON'T SCALE
Xc <- scale(X, center = TRUE, scale = FALSE)

# SVD
svd_out <- svd(Xc)
U <- svd_out$u
D <- svd_out$d
V <- svd_out$v

# Covariance matrix
Sigma <- cov(Xc)

# Eigen-decomposition of Sigma
eig3n <- eigen(Sigma)

# Check the relationships
round(eig3n$values, 5)

[1] 638471.96204  14812.03788    701.95994    205.00194    116.70468
[6]     12.05705      1.44870

round(D^2 / (n - 1), 5)     # eigenvalues ≈ D^2/(n−1)

[1] 638471.96204  14812.03788    701.95994    205.00194    116.70468
[6]     12.05705      1.44870

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

img<-load('Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

library(magick)

Warning: package 'magick' was built under R version 4.5.2

Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11

U <- mysteryU4
D <- mysteryD4
V <- mysteryV4


k <- 4

# Reconstruct low-rank approximation
Xtilde <- U[, 1:k, drop = FALSE] %*% diag(D[1:k]) %*% t(V[, 1:k, drop = FALSE])

# Scale to [0,1]
Xtilde <- Xtilde - min(Xtilde)
Xtilde <- Xtilde / max(Xtilde)

# Convert to raster and grayscale PNG
img <- Xtilde %>%
  as.raster() %>%
  image_read() %>%
  image_convert(colorspace = "Gray")

print(img)

# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG      600    700 Gray       TRUE         0 72x72

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('Data/mystery_person_k10.Rdata')

U10 <- mysteryU10
D10 <- mysteryD10
V10 <- mysteryV10

k <- 10

# Reconstruct low-rank approximation
Xtilde10 <- U10[, 1:k, drop = FALSE] %*% diag(D10[1:k]) %*% t(V10[, 1:k, drop = FALSE])

# Scale to [0,1]
Xtilde10 <- Xtilde10 - min(Xtilde10)
Xtilde10 <- Xtilde10 / max(Xtilde10)

# Convert to raster and grayscale PNG
img10 <- Xtilde10 %>%
  as.raster() %>%
  image_read() %>%
  image_convert(colorspace = "Gray")

print(img10)

# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG      600    700 Gray       TRUE         0 72x72

load('Data/mystery_person_k50.Rdata')

U50 <- mysteryU50
D50 <- mysteryD50
V50 <- mysteryV50


k <- 50

# Reconstruct low-rank approximation
Xtilde50 <- U50[, 1:k, drop = FALSE] %*% diag(D50[1:k]) %*% t(V50[, 1:k, drop = FALSE])

# Scale to [0,1]
Xtilde50 <- Xtilde50 - min(Xtilde50)
Xtilde50 <- Xtilde50 / max(Xtilde50)

# Convert to raster and grayscale PNG
img50 <- Xtilde50 %>%
  as.raster() %>%
  image_read() %>%
  image_convert(colorspace = "Gray")


print(img50)

# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG      600    700 Gray       TRUE         0 72x72

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?
k=50 is a very clear image, but you can reduce k to about 20, and can make out with relative ease who the person is.

C)

How many numbers need to be stored in memory for each of the following:

A full \(700\times 600\) image?
A 4-dimensional approximation?
A 10-dimensional approximation?
A 50-dimensional approximation?
Part A is simple, you need a dimension for each pixel. 700x600=420,000
FORMULA FOR APPROXIMATION:(MxK)+k+(NxK)
For Part B, we get 2800+4+2400= 5204
For Part C, we get 7000+10+6000 = 13010 For Part D, we get 35000 + 50 + 30000 = 65050