SUBMISSION INSTRUCTIONS
Render to html
Publish your html to RPubs
Click the Publish button
Select RPubs
Follow the prompts for creating an account on RPubs and publishing your solution
Submit a link to your published solutions
Problem 1
Reconsider the US air pollution data set:
Warning: package 'HSAUR2' was built under R version 4.4.3
Loading required package: tools
Warning: package 'dplyr' was built under R version 4.4.3
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Warning: package 'tidyverse' was built under R version 4.4.3
Warning: package 'ggplot2' was built under R version 4.4.3
Warning: package 'tibble' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.3
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'stringr' was built under R version 4.4.3
Warning: package 'forcats' was built under R version 4.4.3
Warning: package 'lubridate' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ ggplot2 4.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library (ggplot2)
data (USairpollution)
SO2 temp manu popul wind precip predays
Albany 46 47.6 44 116 8.8 33.36 135
Albuquerque 11 56.8 46 244 8.9 7.77 58
Atlanta 24 61.5 368 497 9.1 48.34 115
Baltimore 47 55.0 625 905 9.6 41.31 111
Buffalo 11 47.1 391 463 12.4 36.11 166
Charleston 31 55.2 35 71 6.5 40.75 148
A)
Perform singular value decomposition of this data matrix. Then create the matrix \(D\) . Describe what this matrix looks like.
X <- as.matrix (USairpollution)
decompose_USairpollution <- svd (USairpollution)
u <- decompose_USairpollution$ u
d <- decompose_USairpollution$ d
v <- decompose_USairpollution$ v
d %>% head
[1] 7051.94936 931.12109 540.46297 92.70909 85.23724 52.94650
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 7051.949 0.0000 0.000 0.00000 0.00000 0.0000 0.0000
[2,] 0.000 931.1211 0.000 0.00000 0.00000 0.0000 0.0000
[3,] 0.000 0.0000 540.463 0.00000 0.00000 0.0000 0.0000
[4,] 0.000 0.0000 0.000 92.70909 0.00000 0.0000 0.0000
[5,] 0.000 0.0000 0.000 0.00000 85.23724 0.0000 0.0000
[6,] 0.000 0.0000 0.000 0.00000 0.00000 52.9465 0.0000
[7,] 0.000 0.0000 0.000 0.00000 0.00000 0.0000 10.1409
# Matrix D is a diagonal patrix of the singular values of the decomposed USairpollution. These singular values are in a diagonal pattern, decreasing as they go from top-left to bottom-right. This shows the strength of each dimension in the data. The larger singular values, like 1,1 and 2,2, show the directions in which the data has the most variance.
B) Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.
X_reconstructed <- u %*% diag (d) %*% t (v)
X_reconstructed
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 46 47.6 44 116 8.8 33.36 135
[2,] 11 56.8 46 244 8.9 7.77 58
[3,] 24 61.5 368 497 9.1 48.34 115
[4,] 47 55.0 625 905 9.6 41.31 111
[5,] 11 47.1 391 463 12.4 36.11 166
[6,] 31 55.2 35 71 6.5 40.75 148
[7,] 110 50.6 3344 3369 10.4 34.44 122
[8,] 23 54.0 462 453 7.1 39.04 132
[9,] 65 49.7 1007 751 10.9 34.99 155
[10,] 26 51.5 266 540 8.6 37.01 134
[11,] 9 66.2 641 844 10.9 35.94 78
[12,] 17 51.9 454 515 9.0 12.95 86
[13,] 17 49.0 104 201 11.2 30.85 103
[14,] 35 49.9 1064 1513 10.1 30.96 129
[15,] 56 49.1 412 158 9.0 43.37 127
[16,] 10 68.9 721 1233 10.8 48.19 103
[17,] 28 52.3 361 746 9.7 38.74 121
[18,] 14 68.4 136 529 8.8 54.47 116
[19,] 14 54.5 381 507 10.0 37.00 99
[20,] 13 61.0 91 132 8.2 48.52 100
[21,] 30 55.6 291 593 8.3 43.11 123
[22,] 10 61.6 337 624 9.2 49.10 105
[23,] 10 75.5 207 335 9.0 59.80 128
[24,] 16 45.7 569 717 11.8 29.07 123
[25,] 29 43.5 699 744 10.6 25.94 137
[26,] 18 59.4 275 448 7.9 46.00 119
[27,] 9 68.3 204 361 8.4 56.77 113
[28,] 31 59.3 96 308 10.6 44.68 116
[29,] 14 51.5 181 347 10.9 30.18 98
[30,] 69 54.6 1692 1950 9.6 39.93 115
[31,] 10 70.3 213 582 6.0 7.05 36
[32,] 61 50.4 347 520 9.4 36.22 147
[33,] 94 50.0 343 179 10.6 42.75 125
[34,] 26 57.8 197 299 7.6 42.59 115
[35,] 28 51.0 137 176 8.7 15.17 89
[36,] 12 56.7 453 716 8.7 20.66 67
[37,] 29 51.1 379 531 9.4 38.79 164
[38,] 56 55.9 775 622 9.5 35.89 105
[39,] 29 57.3 434 757 9.3 38.89 111
[40,] 8 56.6 125 277 12.7 30.58 82
[41,] 36 54.0 80 80 9.0 40.25 114
plot (as.vector (X), as.vector (X_reconstructed),
xlab = "X" ,
ylab = "UDVT" ,
main = "Verification Plot" )
abline (0 , 1 , col = "red" )
C)
Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?
k <- 2
(Xtilde2 <- u[,1 : k] %*% diag (d[1 : k]) %*% t (v[,1 : k])) %>% round (1 )
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 3.5 23.8 17.2 151.5 3.6 15.2 42.1
[2,] 5.9 35.8 45.7 244.5 5.5 22.8 63.7
[3,] 17.3 37.8 352.3 516.8 6.2 24.6 73.1
[4,] 30.4 63.4 629.6 900.0 10.5 41.3 123.5
[5,] 17.1 32.9 363.0 498.0 5.5 21.5 64.8
[6,] 2.4 20.0 0.5 115.4 3.0 12.7 35.1
[7,] 131.5 80.9 3371.0 3334.4 17.1 56.6 208.0
[8,] 18.2 20.7 433.8 488.4 3.8 13.8 44.9
[9,] 34.5 -2.1 962.8 807.5 1.1 0.1 15.2
[10,] 16.2 56.9 260.0 548.2 9.0 36.6 104.9
[11,] 29.3 50.8 643.7 839.4 8.6 33.3 101.6
[12,] 19.2 26.4 443.0 528.5 4.6 17.4 54.9
[13,] 6.3 26.2 86.1 223.8 4.1 16.8 47.7
[14,] 50.9 96.7 1088.6 1481.9 16.2 63.2 190.8
[15,] 11.3 -15.3 362.9 221.1 -1.8 -9.2 -19.7
[16,] 38.8 98.3 746.5 1199.8 16.0 63.6 186.9
[17,] 22.1 73.6 368.8 736.7 11.7 47.4 136.3
[18,] 13.5 71.1 138.3 526.0 11.0 45.4 127.3
[19,] 17.7 35.8 369.9 520.6 6.0 23.4 70.1
[20,] 4.6 18.8 65.6 163.8 2.9 12.1 34.2
[21,] 17.7 60.9 289.1 595.9 9.6 39.2 112.5
[22,] 19.2 58.8 336.5 624.1 9.4 37.9 109.7
[23,] 11.0 36.3 183.1 364.5 5.8 23.4 67.3
[24,] 25.5 43.8 559.9 728.1 7.4 28.7 87.7
[25,] 28.5 32.6 680.9 766.8 5.9 21.8 70.5
[26,] 14.5 42.3 261.2 465.3 6.8 27.3 79.3
[27,] 11.4 38.9 186.7 382.2 6.2 25.0 71.8
[28,] 8.3 44.1 83.6 324.4 6.8 28.2 78.9
[29,] 10.6 37.2 171.2 359.4 5.9 23.9 68.6
[30,] 71.6 81.4 1712.1 1924.8 14.8 54.4 176.3
[31,] 15.9 62.2 233.5 555.9 9.8 39.9 113.6
[32,] 17.4 45.1 331.6 541.4 7.3 29.2 85.6
[33,] 10.4 -4.0 301.8 233.9 -0.2 -2.1 -1.1
[34,] 10.0 30.3 177.0 324.6 4.8 19.5 56.6
[35,] 6.4 17.3 118.4 200.1 2.8 11.2 32.7
[36,] 23.2 53.9 462.5 703.5 8.8 35.0 103.6
[37,] 18.2 44.1 358.0 558.0 7.2 28.6 84.3
[38,] 27.5 4.6 745.0 660.4 1.8 4.0 22.6
[39,] 23.7 65.8 438.8 751.3 10.6 42.5 123.9
[40,] 8.1 33.2 115.3 288.9 5.2 21.3 60.5
[41,] 3.4 13.3 49.9 118.9 2.1 8.5 24.3
k <- 3
(Xtilde3 <- u[,1 : k] %*% diag (d[1 : k]) %*% t (v[,1 : k])) %>% round (1 )
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 27.4 58.3 46.0 114.5 9.8 42.3 131.8
[2,] 5.9 35.7 45.6 244.5 5.5 22.8 63.5
[3,] 29.6 55.5 367.1 497.7 9.4 38.5 119.3
[4,] 27.9 59.8 626.6 903.9 9.9 38.5 114.1
[5,] 39.5 65.1 390.0 463.4 11.3 46.8 148.8
[6,] 31.3 61.6 35.4 70.6 10.5 45.5 143.5
[7,] 109.1 48.7 3343.9 3369.1 11.3 31.2 123.9
[8,] 40.5 52.8 460.7 453.9 9.5 39.1 128.4
[9,] 70.9 50.2 1006.7 751.2 10.5 41.3 151.5
[10,] 22.1 65.4 267.1 539.1 10.5 43.3 127.0
[11,] 25.2 44.7 638.7 845.9 7.5 28.5 85.9
[12,] 27.3 38.0 452.8 515.9 6.7 26.6 85.4
[13,] 20.9 47.2 103.7 201.2 7.9 33.3 102.4
[14,] 31.6 68.9 1065.3 1511.8 11.2 41.3 118.5
[15,] 51.9 43.1 411.9 158.2 8.7 36.8 132.5
[16,] 16.9 66.9 720.2 1233.6 10.3 39.0 105.2
[17,] 16.9 66.1 362.5 744.8 10.3 41.5 116.8
[18,] 11.6 68.3 136.0 529.1 10.5 43.2 120.0
[19,] 25.8 47.6 379.7 508.0 8.1 32.6 100.7
[20,] 24.3 47.0 89.3 133.4 8.0 34.3 107.8
[21,] 20.2 64.5 292.1 592.1 10.3 42.0 121.7
[22,] 18.8 58.2 336.0 624.8 9.3 37.4 108.1
[23,] 28.9 62.1 204.7 336.8 10.4 43.6 134.4
[24,] 32.4 53.8 568.3 717.3 9.2 36.6 113.8
[25,] 43.2 53.8 698.7 744.0 9.7 38.4 125.7
[26,] 25.3 57.9 274.3 448.5 9.6 39.6 119.9
[27,] 24.2 57.2 202.1 362.5 9.5 39.4 119.6
[28,] 19.2 59.8 96.8 307.5 9.6 40.5 119.8
[29,] 18.4 48.4 180.6 347.3 7.9 32.7 97.7
[30,] 55.9 58.7 1693.1 1949.2 10.7 36.5 117.2
[31,] -1.5 37.2 212.5 582.8 5.2 20.2 48.4
[32,] 32.7 67.1 350.0 517.7 11.3 46.5 143.0
[33,] 47.3 49.0 346.3 176.9 9.4 39.6 137.0
[34,] 26.4 53.8 196.7 299.3 9.1 38.1 117.9
[35,] 21.9 39.7 137.1 176.0 6.8 28.8 91.0
[36,] 14.6 41.6 452.2 716.8 6.6 25.3 71.4
[37,] 36.1 69.8 379.6 530.3 11.8 48.8 151.3
[38,] 51.9 39.7 774.4 622.6 8.1 31.7 114.1
[39,] 20.4 61.0 434.8 756.4 9.7 38.7 111.4
[40,] 15.2 43.3 123.8 278.0 7.0 29.2 86.8
[41,] 28.6 49.5 80.3 79.9 8.6 37.0 118.6
cor (as.vector (X), as.vector (Xtilde2))
cor (as.vector (X), as.vector (Xtilde3))
#After running the code with a dimension of k = 2, I got a correlation of 0.996. This means k = 2 is the fewest number of dimensions required to yield a correlation between the entries of at least 0.9.
D)
Find \(\Sigma\) , the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that
The eigenvectors of \(\Sigma\) equal the columns of \(V\)
The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)
#For 1D, to verify that the eigenvectors and values of Sigma match the corresponding ingredients of the SVD, you will need to first mean-center (but not scale!) your data matrix. You can do this with scale(df, center = TRUE, scale = FALSE). Then SVD this mean-centered data.
Xc <- scale (USairpollution, center = TRUE , scale = FALSE )
svdX <- svd (Xc)
Sigma <- cov (Xc)
eigenSigma <- eigen (Sigma)
SO2 temp manu popul wind
SO2 550.947561 -73.560671 8527.7201 6711.9945 3.1753049
temp -73.560671 52.239878 -773.9713 -262.3496 -3.6113537
manu 8527.720122 -773.971341 317502.8902 311718.8140 191.5481098
popul 6711.994512 -262.349634 311718.8140 335371.8939 175.9300610
wind 3.175305 -3.611354 191.5481 175.9301 2.0410244
precip 15.001799 32.862988 -215.0199 -178.0529 -0.2185311
predays 229.929878 -82.426159 1968.9598 645.9860 6.2143902
precip predays
SO2 15.0017988 229.92988
temp 32.8629884 -82.42616
manu -215.0199024 1968.95976
popul -178.0528902 645.98598
wind -0.2185311 6.21439
precip 138.5693840 154.79290
predays 154.7929024 702.59024
t (Xc) %*% Xc / (nrow (Xc) - 1 )
SO2 temp manu popul wind
SO2 550.947561 -73.560671 8527.7201 6711.9945 3.1753049
temp -73.560671 52.239878 -773.9713 -262.3496 -3.6113537
manu 8527.720122 -773.971341 317502.8902 311718.8140 191.5481098
popul 6711.994512 -262.349634 311718.8140 335371.8939 175.9300610
wind 3.175305 -3.611354 191.5481 175.9301 2.0410244
precip 15.001799 32.862988 -215.0199 -178.0529 -0.2185311
predays 229.929878 -82.426159 1968.9598 645.9860 6.2143902
precip predays
SO2 15.0017988 229.92988
temp 32.8629884 -82.42616
manu -215.0199024 1968.95976
popul -178.0528902 645.98598
wind -0.2185311 6.21439
precip 138.5693840 154.79290
predays 154.7929024 702.59024
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0168607518 -0.099835625 0.208775573 0.95883106 -0.152191203
[2,] -0.0011417794 0.025814390 -0.071600745 -0.11014784 -0.477854201
[3,] 0.6968327936 -0.710249079 -0.067182201 -0.07319788 -0.009643654
[4,] 0.7170284512 0.692912523 0.056666935 0.04906669 0.010735457
[5,] 0.0004067530 -0.001011680 0.005386606 -0.01506609 0.025401917
[6,] -0.0004336922 0.001225937 0.265807619 -0.16261712 -0.832729325
[7,] 0.0028836950 -0.069155051 0.934279828 -0.18459052 0.232812295
[,6] [,7]
[1,] -0.053952911 -2.704138e-02
[2,] -0.852534945 -1.640507e-01
[3,] -0.002153023 1.136208e-03
[4,] 0.002915751 -5.682006e-05
[5,] 0.176541256 -9.838347e-01
[6,] 0.453206155 6.376788e-02
[7,] -0.183568735 -1.891454e-02
[,1] [,2] [,3] [,4] [,5]
[1,] -0.0168607518 0.099835625 0.208775573 -0.95883106 0.152191203
[2,] 0.0011417794 -0.025814390 -0.071600745 0.11014784 0.477854201
[3,] -0.6968327936 0.710249079 -0.067182201 0.07319788 0.009643654
[4,] -0.7170284512 -0.692912523 0.056666935 -0.04906669 -0.010735457
[5,] -0.0004067530 0.001011680 0.005386606 0.01506609 -0.025401917
[6,] 0.0004336922 -0.001225937 0.265807619 0.16261712 0.832729325
[7,] -0.0028836950 0.069155051 0.934279828 0.18459052 -0.232812295
[,6] [,7]
[1,] -0.053952911 -2.704138e-02
[2,] -0.852534945 -1.640507e-01
[3,] -0.002153023 1.136208e-03
[4,] 0.002915751 -5.682006e-05
[5,] 0.176541256 -9.838347e-01
[6,] 0.453206155 6.376788e-02
[7,] -0.183568735 -1.891454e-02
[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
(svdX$ d^ 2 ) / (nrow (Xc) - 1 )
[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
Problem 2
In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.
.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.
load ('Data/mystery_person_k4.Rdata' )
A)
Write a function that takes SVD ingredients u, Dd and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].
Warning: package 'magick' was built under R version 4.4.3
Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
k <- 4
recompose_mystery <- mysteryU4[, 1 : k] %*% diag (mysteryD4[1 : k]) %*% t (mysteryV4[, 1 : k])
R <- max (recompose_mystery) - min (recompose_mystery)
recompose_scaled <- (recompose_mystery - min (recompose_mystery)) / R
image_mystery <- image_read (as.raster (recompose_scaled))
image_mystery
#The image that k = 4 produced was very blurry and I have no idea who the mystery villain could be. A 4-dimensional approximation is not enough to tell.
B)
I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\) , respectively) in the objects below:
load ('Data/mystery_person_k10.Rdata' )
load ('Data/mystery_person_k50.Rdata' )
Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?
k <- 10
recompose_mystery <- mysteryU10[, 1 : k] %*% diag (mysteryD10[1 : k]) %*% t (mysteryV10[, 1 : k])
R <- max (recompose_mystery) - min (recompose_mystery)
recompose_scaled <- (recompose_mystery - min (recompose_mystery)) / R
image_mystery <- image_read (as.raster (recompose_scaled))
image_mystery
#The k=10 image was much clearer. I'm pretty sure I can tell that it's Dr. Iverson, but I can't say with complete certainty yet.
k <- 50
recompose_mystery <- mysteryU50[, 1 : k] %*% diag (mysteryD50[1 : k]) %*% t (mysteryV50[, 1 : k])
R <- max (recompose_mystery) - min (recompose_mystery)
recompose_scaled <- (recompose_mystery - min (recompose_mystery)) / R
image_mystery <- image_read (as.raster (recompose_scaled))
image_mystery
#The k=50 image was very clear, and at this point I can say with confidence that it is Dr. Iverson.
C)
How many numbers need to be stored in memory for each of the following:
A full \(700\times 600\) image?
A 4-dimensional approximation?
A 10-dimensional approximation?
A 50-dimensional approximation?
#Full 700 x 600 image = 420,000
#4-dimensional = (700x4+4+4^2) = 2820
#10-dimensional = (700x10+10+10^2) = 7110
#50-dimensional = (700x50+50+50^2) = 37,550