library(HSAUR2)Loading required package: tools
data(USairpollution)SUBMISSION INSTRUCTIONS
Reconsider the US air pollution data set:
library(HSAUR2)Loading required package: tools
data(USairpollution)library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.
X <- as.matrix (USairpollution)
components <- svd(X)
U <- components$u
d <- components$d
V <- components$v
d %>% round (2)[1] 7051.95 931.12 540.46 92.71 85.24 52.95 10.14
X_scaled <- scale(X)
X_svd <- svd(X_scaled)
U <- X_svd$u
D<- diag(X_svd$d)
V <- X_svd$vThe matrix D is pxp diagonal matrix of the singular values. It’s the lower - dimension representation of what X looks like, and requires fewer numbers stored in memory.
Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.
r <- length (d)
D <- diag(d)
X_reconstructed <- U[, 1:r] %*% D %*% t(V[, 1:r])UDVt_df <- data.frame(U %% diag(D) %% t(V)) base_canvas + geom_point (aes(x = X1, y= X2), data = UDVt_df, size = 0.2)
Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?
Answer: 3.
Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that
Sigma <- cov(X)
Sigma SO2 temp manu popul wind
SO2 550.947561 -73.560671 8527.7201 6711.9945 3.1753049
temp -73.560671 52.239878 -773.9713 -262.3496 -3.6113537
manu 8527.720122 -773.971341 317502.8902 311718.8140 191.5481098
popul 6711.994512 -262.349634 311718.8140 335371.8939 175.9300610
wind 3.175305 -3.611354 191.5481 175.9301 2.0410244
precip 15.001799 32.862988 -215.0199 -178.0529 -0.2185311
predays 229.929878 -82.426159 1968.9598 645.9860 6.2143902
precip predays
SO2 15.0017988 229.92988
temp 32.8629884 -82.42616
manu -215.0199024 1968.95976
popul -178.0528902 645.98598
wind -0.2185311 6.21439
precip 138.5693840 154.79290
predays 154.7929024 702.59024
eigen(cor(X))$vector[,1][1] 0.4896988171 -0.3153706901 0.5411687028 0.4875881115 0.2498749284
[6] 0.0001873122 0.2601790729
X_scaled <- scale(X)
V <- svd(X_scaled)$v
V[,1][1] -0.4896988171 0.3153706901 -0.5411687028 -0.4875881115 -0.2498749284
[6] -0.0001873122 -0.2601790729
^^ both are matching - eigenvectors of sigma = columns of V except there’s negatives in the columns of V, not sure if I did the code correct.
PCs <- U %*% D
head (PCs, 3) [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 22.86095 -107.60561 96.69918 17.522931 -4.903813 -0.9515431 -1.3335429
[2,] 1211.42817 326.75315 86.23811 4.439285 -4.622188 -7.0685433 -0.2032449
[3,] 412.46535 -75.45218 -70.63299 -3.005908 -2.535158 2.5848154 0.7051583
PCs <- X_scaled %*% V
head(PCs, 3) [,1] [,2] [,3] [,4] [,5] [,6]
Albany 0.03386467 -0.8988407 1.3365024 1.1290021 -0.2142696 -0.03599522
Albuquerque 1.79452779 2.7294026 1.1919175 0.2860230 -0.2019642 -0.26739069
Atlanta 0.61099829 -0.6302598 -0.9762353 -0.1936706 -0.1107725 0.09777906
[,7]
Albany -0.13284885
Albuquerque -0.02024746
Atlanta 0.07024856
(PCs
%>% data.frame()
%>% summarize (across(everything(), var))
) X1 X2 X3 X4 X5 X6 X7
1 2.72812 1.512335 1.394973 0.8919913 0.3467787 0.1002876 0.02551493
R <-cor(X)
eigen(R)$values[1] 2.72811968 1.51233485 1.39497299 0.89199129 0.34677866 0.10028759 0.02551493
EigenValues and Varience of PC’s are exactly equal.
In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.
.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.
load('/Users/rosagomez/Desktop/DSCI 415/Data/mystery_person_k4.Rdata')Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].
library (magick)Linking to ImageMagick 6.9.13.29
Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
Disabled features: fftw, ghostscript, x11
show_svd_image <- function(u, d, v, k = length(d),
img_height = nrow(u),
img_width = nrow(v),
main_title = NULL) {
Dk <- diag(d[1:k], nrow = k, ncol = k)
Xk <- u[, 1:k] %*% Dk %*% t(v[, 1:k])
Xk_rescaled <- (Xk - min(Xk)) / (max(Xk) - min(Xk))
image(t(Xk_rescaled[nrow(Xk_rescaled):1, ]),
col = gray(seq(0, 1, length = 256)),
axes = FALSE,
main = if (is.null(main_title)) {
paste("k =", k, "approximation")
} else main_title)
}show_svd_image(mysteryU4, mysteryD4, mysteryV4, k=4,
main_title= "4D SVD approx. ")I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:
load('/Users/rosagomez/Desktop/DSCI 415/Data/mystery_person_k10.Rdata')
load('/Users/rosagomez/Desktop/DSCI 415/Data/mystery_person_k50.Rdata')Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?
show_svd_image(mysteryU10, mysteryD10, mysteryV10, k=10,
main_title =" K10 approx")show_svd_image(mysteryU50, mysteryD50, mysteryV50,
k=50,
main_title = "Dr. Todd?")How many numbers need to be stored in memory for each of the following:
full 700x600 image =
700 * 600 = 420,000 numbers
4D = 700 *4 = 2800,
600 *4 = 2400
2800 + 2400 + 4 = 5204 numbers
10D =
700 * 10 = 7000
600 * 10 = 6000
6000 + 7000 + 10 = 13,010 numbers
50D =
700 * 50 = 35,000
600 * 50 = 30,000
35,000 + 30,000 + 50 = 65,050 numbers