library(HSAUR2)Loading required package: tools
data(USairpollution)SUBMISSION INSTRUCTIONS
Reconsider the US air pollution data set:
library(HSAUR2)Loading required package: tools
data(USairpollution)Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.
library(MASS)
mu <- c(4,3)
sig1 <- sqrt(2)
sig2 <- sqrt(2)
rho <- 0.8
n <- 40
Sigma <- matrix(c(sig1^2, rho*sig1*sig2,
rho*sig1*sig2, sig2^2), 2, 2)
set.seed(1)
X <- mvrnorm(n, mu = mu, Sigma = Sigma)
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)
U <- sv$u
V <- sv$v
D <- diag(sv$d)
D [,1] [,2]
[1,] 10.54392 0.000000
[2,] 0.00000 3.546389
This matrix looks like 2x2 diagonal matrix. The diagonal entries are the singular values.
Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.
library(MASS)
mu <- c(4,3)
sig1 <- sqrt(2)
sig2 <- sqrt(2)
rho <- 0.8
n <- 40
Sigma <- matrix(c(sig1^2, rho*sig1*sig2,
rho*sig1*sig2, sig2^2), 2, 2)
set.seed(1)
X <- mvrnorm(n, mu = mu, Sigma = Sigma)
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)
U <- sv$u
V <- sv$v
D <- diag(sv$d)
D [,1] [,2]
[1,] 10.54392 0.000000
[2,] 0.00000 3.546389
plot(c(Xc), c(U %*% D %*% t(V)))
abline(0, 1, col= "magenta", lwd = 2)Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?
corrs <- sapply(1:ncol(V), function(k){
Xk <- U[, 1:k, drop = FALSE] %*% D[1:k, 1:k, drop = FALSE] %*% t(V[, 1:k, drop = FALSE])
cor(c(Xc), c(Xk))
})
corrs[1] 0.9478236 1.0000000
which(corrs >= 0.9) [1][1] 1
The fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9 is 1.
Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that
n <- nrow(USairpollution)
USairpollution_centered <- scale(USairpollution, center = TRUE, scale = FALSE)
Sigma <- cov(USairpollution)
eigen_decomp <- eigen(Sigma)
eigenvalues_from_Sigma <- eigen_decomp$values
eigenvectors_from_Sigma <- eigen_decomp$vectors
eigenvalues_from_Sigma[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
eigenvectors_from_Sigma [,1] [,2] [,3] [,4] [,5]
[1,] 0.0168607518 -0.099835625 0.208775573 0.95883106 -0.152191203
[2,] -0.0011417794 0.025814390 -0.071600745 -0.11014784 -0.477854201
[3,] 0.6968327936 -0.710249079 -0.067182201 -0.07319788 -0.009643654
[4,] 0.7170284512 0.692912523 0.056666935 0.04906669 0.010735457
[5,] 0.0004067530 -0.001011680 0.005386606 -0.01506609 0.025401917
[6,] -0.0004336922 0.001225937 0.265807619 -0.16261712 -0.832729325
[7,] 0.0028836950 -0.069155051 0.934279828 -0.18459052 0.232812295
[,6] [,7]
[1,] -0.053952911 -2.704138e-02
[2,] -0.852534945 -1.640507e-01
[3,] -0.002153023 1.136208e-03
[4,] 0.002915751 -5.682006e-05
[5,] 0.176541256 -9.838347e-01
[6,] 0.453206155 6.376788e-02
[7,] -0.183568735 -1.891454e-02
svd_decomp <- svd(USairpollution_centered)
D_values <- svd_decomp$d
V <- svd_decomp$v
D_values[1] 5053.60055 769.72821 167.56610 90.55428 68.32413 21.96092 7.61237
print(all.equal(abs(eigenvectors_from_Sigma), abs(V)))[1] TRUE
eigenvalues_from_SVD <- (D_values^2) / (n - 1)
eigenvalues_from_Sigma[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
eigenvalues_from_SVD[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.
.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.
load('Data/mystery_person_k4.Rdata')Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].
library(magick)Linking to ImageMagick 6.9.13.29
Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
Disabled features: fftw, ghostscript, x11
show_svd_image <- function(u, d, v) {
approximate <- u %*% diag(d) %*% t(v)
scaled <- (approximate - min(approximate)) / (max(approximate) - min(approximate))
img_raster <- as.raster(scaled, max = 1)
plot(as.raster(img_raster))
}
show_svd_image(mysteryU4, mysteryD4, mysteryV4)I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:
Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?
library(magick)
show_svd_image <- function(u, d, v) {
approximate <- u %*% diag(d) %*% t(v)
scaled <- (approximate - min(approximate)) / (max(approximate) - min(approximate))
img_raster <- as.raster(scaled, max = 1)
plot(img_raster)}
load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')
show_svd_image(mysteryU10, mysteryD10, mysteryV10)show_svd_image(mysteryU50, mysteryD50, mysteryV50)I can tell who it when k = 10, but it is slightly blurry but when k = 50 I can tell it’s obviously the villain.
How many numbers need to be stored in memory for each of the following:
A full \(700\times 600\) image? 700 * 600 = 420,000 k(700 + 600 + 1) = 1301k
A 4-dimensional approximation? 1301 * 4 = 5,204
A 10-dimensional approximation? 1301 * 10 = 13,010
A 50-dimensional approximation? 1301 * 50 = 65,050