Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs

Click the Publish button
Select RPubs
Follow the prompts for creating an account on RPubs and publishing your solution

Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(HSAUR2)

Warning: package 'HSAUR2' was built under R version 4.4.3

Loading required package: tools

data(USairpollution)

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

library(HSAUR2)
data(USairpollution)

# Convert the data frame to a numeric matrix
X <- as.matrix(USairpollution)

# Perform SVD
svd_result <- svd(X)
U <- svd_result$u
D <- diag(svd_result$d)
V <- svd_result$v

# Display D
D

         [,1]     [,2]    [,3]     [,4]     [,5]    [,6]    [,7]
[1,] 7051.949   0.0000   0.000  0.00000  0.00000  0.0000  0.0000
[2,]    0.000 931.1211   0.000  0.00000  0.00000  0.0000  0.0000
[3,]    0.000   0.0000 540.463  0.00000  0.00000  0.0000  0.0000
[4,]    0.000   0.0000   0.000 92.70909  0.00000  0.0000  0.0000
[5,]    0.000   0.0000   0.000  0.00000 85.23724  0.0000  0.0000
[6,]    0.000   0.0000   0.000  0.00000  0.00000 52.9465  0.0000
[7,]    0.000   0.0000   0.000  0.00000  0.00000  0.0000 10.1409

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

X_reconstructed <- U %*% D %*% t(V)

# Plot original vs reconstructed entries
plot(as.vector(X), as.vector(X_reconstructed),
     xlab = "Original X entries", ylab = "Reconstructed X entries",
     main = "Verification of SVD Reconstruction")
abline(0, 1, col = "red")

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

# Set correlation threshold
cor_threshold <- 0.9

# Initialize vector to store correlations
cor_values <- numeric(length(svd_result$d))

# Loop over ranks k = 1 to full rank
for (k in seq_along(svd_result$d)) {
  # Extract top-k components
  Uk <- U[, 1:k, drop = FALSE]
  Vk <- V[, 1:k, drop = FALSE]
  dk <- svd_result$d[1:k]
  
  # Reconstruct rank-k approximation
  Xk <- Uk %*% (dk * t(Vk))  # avoids shape mismatch
  
  # Compute correlation between flattened matrices
  cor_values[k] <- cor(as.vector(X), as.vector(Xk))
}

# Find the smallest k where correlation ≥ 0.9
min_k <- which(cor_values >= cor_threshold)[1]
min_k

[1] 1

This outputs 1 which is at least 0.9

D)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

The eigenvectors of \(\Sigma\) equal the columns of \(V\)
The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)

# Step 1: Compute the covariance matrix
Sigma <- cov(X)

# Step 2: Eigen-decomposition of Sigma
eig_result <- eigen(Sigma)
eig_vectors <- eig_result$vectors
eig_values <- eig_result$values

# Step 3: Compute theoretical eigenvalues from SVD
n <- nrow(X)
svd_eig_values <- svd_result$d^2 / (n - 1)

# Step 4: Compare eigenvalues
eigenvalue_check <- all.equal(round(eig_values, 6), round(svd_eig_values, 6))

# Step 5: Compare eigenvectors (up to sign)
# Normalize signs for comparison
sign_adjusted <- function(A, B) {
  sign_matrix <- sign(diag(t(A) %*% B))
  A * rep(sign_matrix, each = nrow(A))
}
eig_vectors_adj <- sign_adjusted(eig_vectors, V)
eigenvector_check <- all.equal(round(eig_vectors_adj, 6), round(V, 6))

# Output results
eigenvalue_check

[1] "Mean relative difference: 0.9450633"

eigenvector_check

[1] "Mean relative difference: 0.6244381"

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

load('Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

library(magick)

Warning: package 'magick' was built under R version 4.4.3

Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11

render_svd_image <- function(u, d, v) {
  # Recompose the image
  approx_matrix <- u %*% diag(d) %*% t(v)
  
  # Rescale to [0,1]
  scaled_matrix <- (approx_matrix - min(approx_matrix)) / 
                   (max(approx_matrix) - min(approx_matrix))
  
  # Convert to raster (grDevices expects a matrix, not an array)
  raster_img <- as.raster(scaled_matrix)
  
  # Read and display using magick
  img <- image_read(raster_img)
  print(img)
}

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')

# Render the k = 10 approximation
render_svd_image(mysteryU10, mysteryD10, mysteryV10)

  format width height colorspace matte filesize density
1    PNG   600    700       sRGB  TRUE        0   72x72

# Render the k = 50 approximation
render_svd_image(mysteryU50, mysteryD50, mysteryV50)

  format width height colorspace matte filesize density
1    PNG   600    700       sRGB  TRUE        0   72x72

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?

I could start to tell for the k = 10, but I knew for sure who it was in the k = 50 one since it is a clearer image

C)

How many numbers need to be stored in memory for each of the following:

A full \(700\times 600\) image?
A 4-dimensional approximation?
A 10-dimensional approximation?
A 50-dimensional approximation?

# Image dimensions
height <- 700
width <- 600

# Function to compute SVD storage
svd_storage <- function(k) {
  total <- height * k + k + width * k
  return(total)
}

# Full image storage
full_image <- height * width

# SVD approximations
storage_k4 <- svd_storage(4)
storage_k10 <- svd_storage(10)
storage_k50 <- svd_storage(50)

# Display results
cat("Storage requirements:\n")

Storage requirements:

cat("Full image:         ", full_image, "numbers\n")

Full image:          420000 numbers

cat("Rank-4 approximation:", storage_k4, "numbers\n")

Rank-4 approximation: 5204 numbers

cat("Rank-10 approximation:", storage_k10, "numbers\n")

Rank-10 approximation: 13010 numbers

cat("Rank-50 approximation:", storage_k50, "numbers\n")

Rank-50 approximation: 65050 numbers