library(HSAUR2)Warning: package 'HSAUR2' was built under R version 4.4.3
Loading required package: tools
data(USairpollution)SUBMISSION INSTRUCTIONS
Reconsider the US air pollution data set:
library(HSAUR2)Warning: package 'HSAUR2' was built under R version 4.4.3
Loading required package: tools
data(USairpollution)Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.
# singular value decomposition of matrix.
svd_air <- svd(as.matrix(USairpollution))
# matrices
U <- svd_air$u
D <- diag(svd_air$d)
V <- svd_air$v
# view
svd_air$d[1] 7051.94936 931.12109 540.46297 92.70909 85.23724 52.94650 10.14090
round(D, 3) [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 7051.949 0.000 0.000 0.000 0.000 0.000 0.000
[2,] 0.000 931.121 0.000 0.000 0.000 0.000 0.000
[3,] 0.000 0.000 540.463 0.000 0.000 0.000 0.000
[4,] 0.000 0.000 0.000 92.709 0.000 0.000 0.000
[5,] 0.000 0.000 0.000 0.000 85.237 0.000 0.000
[6,] 0.000 0.000 0.000 0.000 0.000 52.946 0.000
[7,] 0.000 0.000 0.000 0.000 0.000 0.000 10.141
Diagonal matrix containing the values (5053.6, 769.7, 167.6, 90.6, 68.3, 22.0, 7.6). These values decrease from top left to bottom right and represent the variation. All non diagonal entries are zero.
Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.
X <- as.matrix(USairpollution)
X_reconstructed <- U %*% D %*% t(V)
plot(as.vector(X), as.vector(X_reconstructed),
xlab = "Original X entries",
ylab = "Reconstructed X (UDV') entries",
main = "Verification of X = UDV'",
pch = 19, col = "steelblue")
abline(0, 1, col = "red", lwd = 2)Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?
# low-dimensional approximations of the data matrix
correlation_by_k <- function(k) {
U_k <- U[, 1:k, drop = FALSE]
D_k <- diag(svd_air$d[1:k], nrow = k, ncol = k)
V_k <- V[, 1:k, drop = FALSE]
X_tilde <- U_k %*% D_k %*% t(V_k)
cor(as.vector(X), as.vector(X_tilde))
}
k_values <- 1:ncol(X)
cors <- sapply(k_values, correlation_by_k)
cors[1] 0.9863996 0.9965504 0.9997697 0.9998753 0.9999643 0.9999987 1.0000000
# smallest k where correlation ≥ 0.9
min_k <- min(k_values[cors >= 0.9])
min_k[1] 1
The correlation and its one-dimensional approximation is 0.9878, meaning only one dimension is required to achieve a correlation of at least 0.9.
Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that
# mean-center
X_centered <- scale(USairpollution, center = TRUE, scale = FALSE)
# singular value decomposition
svd_centered <- svd(X_centered)
# covariance matrix
Sigma <- cov(X_centered)
# eigen-decomposition
eig_Sigma <- eigen(Sigma)
# compare eigenvectors
round(eig_Sigma$vectors, 4) [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.0169 -0.0998 0.2088 0.9588 -0.1522 -0.0540 -0.0270
[2,] -0.0011 0.0258 -0.0716 -0.1101 -0.4779 -0.8525 -0.1641
[3,] 0.6968 -0.7102 -0.0672 -0.0732 -0.0096 -0.0022 0.0011
[4,] 0.7170 0.6929 0.0567 0.0491 0.0107 0.0029 -0.0001
[5,] 0.0004 -0.0010 0.0054 -0.0151 0.0254 0.1765 -0.9838
[6,] -0.0004 0.0012 0.2658 -0.1626 -0.8327 0.4532 0.0638
[7,] 0.0029 -0.0692 0.9343 -0.1846 0.2328 -0.1836 -0.0189
round(svd_centered$v, 4) [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] -0.0169 0.0998 0.2088 -0.9588 0.1522 -0.0540 -0.0270
[2,] 0.0011 -0.0258 -0.0716 0.1101 0.4779 -0.8525 -0.1641
[3,] -0.6968 0.7102 -0.0672 0.0732 0.0096 -0.0022 0.0011
[4,] -0.7170 -0.6929 0.0567 -0.0491 -0.0107 0.0029 -0.0001
[5,] -0.0004 0.0010 0.0054 0.0151 -0.0254 0.1765 -0.9838
[6,] 0.0004 -0.0012 0.2658 0.1626 0.8327 0.4532 0.0638
[7,] -0.0029 0.0692 0.9343 0.1846 -0.2328 -0.1836 -0.0189
# compare eigenvalues
round(eig_Sigma$values, 4)[1] 638471.9620 14812.0379 701.9599 205.0019 116.7047 12.0570
[7] 1.4487
round((svd_centered$d^2) / (nrow(X_centered) - 1), 4)[1] 638471.9620 14812.0379 701.9599 205.0019 116.7047 12.0570
[7] 1.4487
They are identical confirming that for the mean centered data the eigen decomposition contains the same information about the direction and magnitude of the variance in the data set.
In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.
.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.
load('Data/mystery_person_k4.Rdata')Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].
library(magick)Warning: package 'magick' was built under R version 4.4.3
Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
# svd image render
render_svd_image <- function(u, d, v) {
img_matrix <- u %*% diag(d) %*% t(v)
# rescale pixel values to [0, 1]
img_matrix <- (img_matrix - min(img_matrix)) / (max(img_matrix) - min(img_matrix))
img <- image_read(as.raster(img_matrix))
print(img)
}
# 4D approximation
render_svd_image(mysteryU4, mysteryD4, mysteryV4) format width height colorspace matte filesize density
1 PNG 600 700 sRGB TRUE 0 72x72
I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:
load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')# reconstruct image matrix from SVD
reconstruct_image <- function(u, d, v) {
img_matrix <- u %*% diag(d) %*% t(v)
img_matrix <- (img_matrix - min(img_matrix)) / (max(img_matrix) - min(img_matrix))
return(img_matrix)
}
# reconstructed images
img10 <- reconstruct_image(mysteryU10, mysteryD10, mysteryV10)
img50 <- reconstruct_image(mysteryU50, mysteryD50, mysteryV50)
image_read(as.raster(img10))image_read(as.raster(img50))Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?
T odd Iversion at K = 50
How many numbers need to be stored in memory for each of the following:
# dimensions of the image
n <- 700
m <- 600
# full image
full_image <- n * m
# k-dimensional SVD approximation
storage_needed <- function(k) {
(n * k) + k + (m * k)
}
# compute for each case
k_values <- c(4, 10, 50)
storage <- sapply(k_values, storage_needed)
# combine results
results <- data.frame(
k = c("Full (700x600)", "4D", "10D", "50D"),
Numbers_Stored = c(full_image, storage)
)
results k Numbers_Stored
1 Full (700x600) 420000
2 4D 5204
3 10D 13010
4 50D 65050
Full = 420000 | 4D = 5204 | 10D = 13010 | 50D = 65050 |