Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  1. Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(HSAUR2)
Loading required package: tools
library(ggplot2)

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

library(HSAUR2)
library(ggplot2)

data("USairpollution", package = "HSAUR2")

X <- as.matrix(USairpollution[, sapply(USairpollution, is.numeric)])

svd_result <- svd(X)

U <- svd_result$u
D_values <- svd_result$d
V <- svd_result$v

D <- diag(D_values)

print("Matrix D:")
[1] "Matrix D:"
print(D)
         [,1]     [,2]    [,3]     [,4]     [,5]    [,6]    [,7]
[1,] 7051.949   0.0000   0.000  0.00000  0.00000  0.0000  0.0000
[2,]    0.000 931.1211   0.000  0.00000  0.00000  0.0000  0.0000
[3,]    0.000   0.0000 540.463  0.00000  0.00000  0.0000  0.0000
[4,]    0.000   0.0000   0.000 92.70909  0.00000  0.0000  0.0000
[5,]    0.000   0.0000   0.000  0.00000 85.23724  0.0000  0.0000
[6,]    0.000   0.0000   0.000  0.00000  0.00000 52.9465  0.0000
[7,]    0.000   0.0000   0.000  0.00000  0.00000  0.0000 10.1409
ggplot(data.frame(SingularValue = D_values, Index = seq_along(D_values)),
aes(x = Index, y = SingularValue)) +
geom_point(size = 3, color = "blue") +
geom_line() +
labs(title = "Singular Values from SVD (US Air Pollution Data)",
x = "Index",
y = "Singular Value") +
theme_minimal()

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

library(HSAUR2)
library(ggplot2)

data("USairpollution", package = "HSAUR2")

X <- as.matrix(USairpollution[, sapply(USairpollution, is.numeric)])

svd_result <- svd(X)

U <- svd_result$u
D_values <- svd_result$d
V <- svd_result$v

D <- diag(D_values)

X_reconstructed <- U[, 1:length(D_values)] %*% D %*% t(V[, 1:length(D_values)])

df_compare <- data.frame(
Original = as.vector(X),
Reconstructed = as.vector(X_reconstructed)
)

ggplot(df_compare, aes(x = Original, y = Reconstructed)) +
geom_point(color = "blue", alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1) +
labs(title = "Verification of X = UDVᵀ (US Air Pollution Data)",
x = "Original X Entries",
y = "Reconstructed X Entries") +
theme_minimal()

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

library(HSAUR2)
library(ggplot2)

data("USairpollution", package = "HSAUR2")

X <- as.matrix(USairpollution[, sapply(USairpollution, is.numeric)])

svd_result <- svd(X)
U <- svd_result$u
D_values <- svd_result$d
V <- svd_result$v

r <- length(D_values)
orig_vec <- as.vector(X)

cors <- numeric(r)
for (k in 1:r) {
Dk <- diag(D_values[1:k], nrow = k, ncol = k)
Xk <- U[, 1:k] %*% Dk %*% t(V[, 1:k])
cors[k] <- cor(orig_vec, as.vector(Xk), use = "complete.obs")
}

min_k <- which(cors >= 0.9)[1]
if (is.na(min_k)) min_k <- NA

print(min_k)
[1] 1
print(cors)
[1] 0.9863996 0.9965504 0.9997697 0.9998753 0.9999643 0.9999987 1.0000000
df <- data.frame(k = 1:r, correlation = cors)
ggplot(df, aes(x = k, y = correlation)) +
geom_point() + geom_line() +
geom_hline(yintercept = 0.9, linetype = "dashed") +
labs(title = "Correlation vs Rank k", x = "k (dimensions)", y = "cor(X, X_tilde)")

D)

(For 1D, to verify that the eigenvectors and values of Sigma match the corresponding ingredients of the SVD, you will need to first mean-center (but not scale!) your data matrix.  You can do this with scale(df, center = TRUE, scale = FALSE).  Then SVD this mean-centered data.)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

  • The eigenvectors of \(\Sigma\) equal the columns of \(V\)

  • The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)

X_centered <- scale(X, center = TRUE, scale = FALSE)

n <- nrow(X_centered)

Sigma <- cov(X_centered) # (1/(n-1)) * t(X_centered) %*% X_centered

eig <- eigen(Sigma)
eig_values <- eig$values
eig_vectors <- eig$vectors

svd_res <- svd(X_centered)
U <- svd_res$u
D_vals <- svd_res$d
V <- svd_res$v

vec_cor <- sapply(seq_len(ncol(V)), function(j) {
abs(cor(eig_vectors[, j], V[, j]))
})

svd_eig_equiv <- (D_vals^2) / (n - 1)
val_diff <- eig_values - svd_eig_equiv

print(vec_cor)
[1] 1 1 1 1 1 1 1
print("Eigenvalues (from cov) and D^2/(n-1):")
[1] "Eigenvalues (from cov) and D^2/(n-1):"
print(data.frame(eigen_from_Sigma = eig_values, Dsq_over_nminus1 = svd_eig_equiv, diff = val_diff))
  eigen_from_Sigma Dsq_over_nminus1          diff
1     6.384720e+05     6.384720e+05 -6.984919e-10
2     1.481204e+04     1.481204e+04 -5.456968e-11
3     7.019599e+02     7.019599e+02  1.136868e-13
4     2.050019e+02     2.050019e+02  1.733724e-12
5     1.167047e+02     1.167047e+02  1.398348e-11
6     1.205705e+01     1.205705e+01  4.447998e-11
7     1.448704e+00     1.448704e+00  1.619149e-12
print(paste("Max absolute difference in eigenvalues:", max(abs(val_diff))))
[1] "Max absolute difference in eigenvalues: 6.98491930961609e-10"

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

load('Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

library(magick)
Warning: package 'magick' was built under R version 4.5.2
Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
render_image_from_svd <- function(u, d, v, height = 700, width = 600) {

D <- diag(d)
X_approx <- u %*% D %*% t(v)
X_rescaled <- (X_approx - min(X_approx)) / (max(X_approx) - min(X_approx))
img <- image_read(as.raster(X_rescaled))

img <- image_resize(img, paste0(width, "x", height, "!"))
print(img)

return(img)
}
load("Data/mystery_person_k4.Rdata")

render_image_from_svd(mysteryU4, mysteryD4, mysteryV4, height = 700, width = 600)
# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG      600    700 sRGB       TRUE         0 72x72  

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is? Answer: Knowing you, I could’ve guessed that it was going to be Todd. We can clearly see it by k=50, and we can maybe kinda see it from k=10.

render_image_from_svd <- function(u, d, v) {
k <- length(d)
D <- diag(d)

X_approx <- u[, 1:k] %*% D %*% t(v[, 1:k])

X_rescaled <- (X_approx - min(X_approx)) / (max(X_approx) - min(X_approx))

X_rescaled <- t(X_rescaled)

grid::grid.raster(X_rescaled)

return(X_rescaled)
}

img10 <- render_image_from_svd(mysteryU10, mysteryD10, mysteryV10)
img50 <- render_image_from_svd(mysteryU50, mysteryD50, mysteryV50)

C)

How many numbers need to be stored in memory for each of the following:

  • A full \(700\times 600\) image?
  • A 4-dimensional approximation?
  • A 10-dimensional approximation?
  • A 50-dimensional approximation?

Full image: 100×100=10,000

4-dimensional approximation: k=4 4⋅(100+100+1)=4⋅201=804

10-dimensional approximation: k=10 10⋅(100+100+1)=10⋅201=2,010

50-dimensional approximation: k=50 50⋅(100+100+1)=50⋅201=10,050