Activity 3.1 - SVD

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  1. Submit a link to your published solutions

Problem 1

Reconsider the US air pollution data set:

library(HSAUR2)
Loading required package: tools
data(USairpollution)
head(USairpollution)
            SO2 temp manu popul wind precip predays
Albany       46 47.6   44   116  8.8  33.36     135
Albuquerque  11 56.8   46   244  8.9   7.77      58
Atlanta      24 61.5  368   497  9.1  48.34     115
Baltimore    47 55.0  625   905  9.6  41.31     111
Buffalo      11 47.1  391   463 12.4  36.11     166
Charleston   31 55.2   35    71  6.5  40.75     148

A)

Perform singular value decomposition of this data matrix. Then create the matrix \(D\). Describe what this matrix looks like.

components <- svd(USairpollution)
U <- components$u
D <- components$d
V <- components$v
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
U %>% round(2)
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
 [1,] -0.02  0.10  0.21 -0.25  0.01 -0.07 -0.04
 [2,] -0.03  0.15  0.00  0.04 -0.19 -0.40  0.03
 [3,] -0.09  0.08  0.11  0.11 -0.06  0.13 -0.02
 [4,] -0.16  0.13 -0.02 -0.19 -0.11  0.10  0.07
 [5,] -0.09  0.06  0.20  0.11  0.44 -0.11  0.15
 [6,] -0.01  0.09  0.25 -0.05  0.09 -0.04 -0.37
 [7,] -0.67 -0.44 -0.20  0.01 -0.04  0.05 -0.08
 [8,] -0.09 -0.01  0.19  0.16  0.12 -0.02 -0.28
 [9,] -0.18 -0.24  0.32  0.03  0.08 -0.12 -0.01
[10,] -0.08  0.19  0.05 -0.14  0.14 -0.01 -0.10
[11,] -0.15  0.07 -0.04  0.29 -0.13 -0.03  0.16
[12,] -0.10  0.01  0.07  0.13 -0.01 -0.35 -0.03
[13,] -0.03  0.10  0.13  0.04  0.02 -0.07  0.28
[14,] -0.26  0.17 -0.17 -0.17  0.21 -0.06  0.00
[15,] -0.06 -0.15  0.35  0.02 -0.12  0.08  0.03
[16,] -0.20  0.26 -0.19  0.10 -0.01  0.14  0.09
[17,] -0.11  0.24 -0.05 -0.19  0.07  0.05  0.07
[18,] -0.07  0.28 -0.02  0.01 -0.07  0.21 -0.08
[19,] -0.09  0.07  0.07  0.16  0.00  0.02  0.14
[20,] -0.02  0.07  0.17  0.22 -0.12  0.15 -0.03
[21,] -0.09  0.20  0.02 -0.14  0.01  0.09 -0.09
[22,] -0.10  0.18  0.00  0.13 -0.02  0.18  0.03
[23,] -0.06  0.12  0.16  0.29 -0.06  0.18 -0.17
[24,] -0.13  0.06  0.06  0.07  0.24 -0.10  0.24
[25,] -0.15 -0.02  0.13  0.03  0.26 -0.17  0.07
[26,] -0.08  0.13  0.09  0.09  0.01  0.10 -0.14
[27,] -0.06  0.13  0.11  0.25 -0.07  0.22 -0.10
[28,] -0.04  0.18  0.10 -0.09 -0.10  0.09  0.15
[29,] -0.06  0.12  0.07  0.05  0.01 -0.08  0.24
[30,] -0.37 -0.05 -0.14 -0.13 -0.07  0.10 -0.02
[31,] -0.08  0.22 -0.15  0.06 -0.37 -0.44 -0.32
[32,] -0.09  0.12  0.13 -0.38  0.00 -0.04 -0.06
[33,] -0.05 -0.09  0.32 -0.39 -0.38  0.10  0.21
[34,] -0.05  0.09  0.14  0.04 -0.06  0.06 -0.15
[35,] -0.03  0.05  0.14 -0.02 -0.10 -0.31 -0.02
[36,] -0.12  0.13 -0.08  0.10 -0.12 -0.19  0.02
[37,] -0.09  0.11  0.16 -0.08  0.29 -0.06 -0.15
[38,] -0.14 -0.16  0.21  0.08 -0.21 -0.03  0.02
[39,] -0.12  0.19 -0.03 -0.10 -0.03  0.04  0.01
[40,] -0.04  0.12  0.06  0.15 -0.09 -0.08  0.43
[41,] -0.02  0.05  0.22 -0.03 -0.12  0.04  0.03
D %>% round(2)
[1] 7051.95  931.12  540.46   92.71   85.24   52.95   10.14
V %>% round(2)
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,] -0.03  0.00  0.21 -0.82 -0.54  0.03  0.01
[2,] -0.03  0.20  0.30  0.50 -0.67 -0.39 -0.11
[3,] -0.65 -0.71  0.26  0.09  0.00 -0.01  0.00
[4,] -0.75  0.57 -0.33 -0.07  0.01  0.01  0.00
[5,] -0.01  0.03  0.05  0.04 -0.04 -0.11  0.99
[6,] -0.02  0.13  0.24  0.24 -0.22  0.91  0.06
[7,] -0.07  0.34  0.79 -0.11  0.47 -0.12 -0.04
D_matrix <- diag(D)
D_matrix
         [,1]     [,2]    [,3]     [,4]     [,5]    [,6]    [,7]
[1,] 7051.949   0.0000   0.000  0.00000  0.00000  0.0000  0.0000
[2,]    0.000 931.1211   0.000  0.00000  0.00000  0.0000  0.0000
[3,]    0.000   0.0000 540.463  0.00000  0.00000  0.0000  0.0000
[4,]    0.000   0.0000   0.000 92.70909  0.00000  0.0000  0.0000
[5,]    0.000   0.0000   0.000  0.00000 85.23724  0.0000  0.0000
[6,]    0.000   0.0000   0.000  0.00000  0.00000 52.9465  0.0000
[7,]    0.000   0.0000   0.000  0.00000  0.00000  0.0000 10.1409

The matrix D has 7 columns and where the diagonal numbers in the middle are the values produced by the components. All other values in the matrix are 0.

B)

Verify that \(X=UDV^T\) by plotting all the entries of \(X\) versus all the entries of \(UDV^T\) with the 0/1 line.

(U %*% diag(D) %*% t(V)) %>% round(1)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
 [1,]   46 47.6   44  116  8.8 33.4  135
 [2,]   11 56.8   46  244  8.9  7.8   58
 [3,]   24 61.5  368  497  9.1 48.3  115
 [4,]   47 55.0  625  905  9.6 41.3  111
 [5,]   11 47.1  391  463 12.4 36.1  166
 [6,]   31 55.2   35   71  6.5 40.7  148
 [7,]  110 50.6 3344 3369 10.4 34.4  122
 [8,]   23 54.0  462  453  7.1 39.0  132
 [9,]   65 49.7 1007  751 10.9 35.0  155
[10,]   26 51.5  266  540  8.6 37.0  134
[11,]    9 66.2  641  844 10.9 35.9   78
[12,]   17 51.9  454  515  9.0 12.9   86
[13,]   17 49.0  104  201 11.2 30.8  103
[14,]   35 49.9 1064 1513 10.1 31.0  129
[15,]   56 49.1  412  158  9.0 43.4  127
[16,]   10 68.9  721 1233 10.8 48.2  103
[17,]   28 52.3  361  746  9.7 38.7  121
[18,]   14 68.4  136  529  8.8 54.5  116
[19,]   14 54.5  381  507 10.0 37.0   99
[20,]   13 61.0   91  132  8.2 48.5  100
[21,]   30 55.6  291  593  8.3 43.1  123
[22,]   10 61.6  337  624  9.2 49.1  105
[23,]   10 75.5  207  335  9.0 59.8  128
[24,]   16 45.7  569  717 11.8 29.1  123
[25,]   29 43.5  699  744 10.6 25.9  137
[26,]   18 59.4  275  448  7.9 46.0  119
[27,]    9 68.3  204  361  8.4 56.8  113
[28,]   31 59.3   96  308 10.6 44.7  116
[29,]   14 51.5  181  347 10.9 30.2   98
[30,]   69 54.6 1692 1950  9.6 39.9  115
[31,]   10 70.3  213  582  6.0  7.0   36
[32,]   61 50.4  347  520  9.4 36.2  147
[33,]   94 50.0  343  179 10.6 42.7  125
[34,]   26 57.8  197  299  7.6 42.6  115
[35,]   28 51.0  137  176  8.7 15.2   89
[36,]   12 56.7  453  716  8.7 20.7   67
[37,]   29 51.1  379  531  9.4 38.8  164
[38,]   56 55.9  775  622  9.5 35.9  105
[39,]   29 57.3  434  757  9.3 38.9  111
[40,]    8 56.6  125  277 12.7 30.6   82
[41,]   36 54.0   80   80  9.0 40.2  114
head(USairpollution)
            SO2 temp manu popul wind precip predays
Albany       46 47.6   44   116  8.8  33.36     135
Albuquerque  11 56.8   46   244  8.9   7.77      58
Atlanta      24 61.5  368   497  9.1  48.34     115
Baltimore    47 55.0  625   905  9.6  41.31     111
Buffalo      11 47.1  391   463 12.4  36.11     166
Charleston   31 55.2   35    71  6.5  40.75     148

When looking at the points, it is clear to see that the values are very close to the origianal values in the data set.

USairpollution_recomposed2 <- (U %*% diag(D) %*% t(V)) %>% round(1)
k <- 7
(Xtilde2 <- U[,1:k] %*% diag(D_matrix[1:k]) %*% t(V[,1:k])) %>% round(1)
       [,1]  [,2]   [,3]   [,4] [,5]  [,6]  [,7]
 [1,]   3.6   4.5   84.5   98.0  0.8   3.0   9.6
 [2,]   6.1   7.7  143.6  166.5  1.4   5.1  16.3
 [3,]  17.4  21.9  407.6  472.7  3.9  14.5  46.4
 [4,]  30.5  38.5  716.5  830.9  6.9  25.6  81.5
 [5,]  17.1  21.6  402.4  466.7  3.8  14.4  45.8
 [6,]   2.5   3.2   59.2   68.7  0.6   2.1   6.7
 [7,] 131.0 165.3 3076.9 3568.4 29.4 109.8 350.1
 [8,]  18.2  22.9  426.3  494.4  4.1  15.2  48.5
 [9,]  34.3  43.2  804.7  933.3  7.7  28.7  91.6
[10,]  16.4  20.7  386.1  447.8  3.7  13.8  43.9
[11,]  29.4  37.1  691.2  801.6  6.6  24.7  78.6
[12,]  19.2  24.2  450.6  522.5  4.3  16.1  51.3
[13,]   6.4   8.0  149.5  173.4  1.4   5.3  17.0
[14,]  51.1  64.5 1200.8 1392.6 11.5  42.8 136.6
[15,]  11.1  14.0  260.7  302.4  2.5   9.3  29.7
[16,]  39.1  49.3  917.4 1063.9  8.8  32.7 104.4
[17,]  22.4  28.3  526.8  611.0  5.0  18.8  59.9
[18,]  13.9  17.5  325.3  377.3  3.1  11.6  37.0
[19,]  17.7  22.4  416.8  483.3  4.0  14.9  47.4
[20,]   4.7   5.9  110.5  128.1  1.1   3.9  12.6
[21,]  18.0  22.7  422.4  489.9  4.0  15.1  48.1
[22,]  19.4  24.5  456.1  528.9  4.4  16.3  51.9
[23,]  11.1  14.0  260.9  302.6  2.5   9.3  29.7
[24,]  25.6  32.2  600.2  696.0  5.7  21.4  68.3
[25,]  28.5  36.0  669.2  776.1  6.4  23.9  76.1
[26,]  14.7  18.5  344.2  399.2  3.3  12.3  39.2
[27,]  11.6  14.6  271.4  314.8  2.6   9.7  30.9
[28,]   8.5  10.7  199.9  231.8  1.9   7.1  22.7
[29,]  10.8  13.6  253.4  293.9  2.4   9.0  28.8
[30,]  71.6  90.3 1681.0 1949.5 16.1  60.0 191.3
[31,]  16.1  20.4  379.3  439.8  3.6  13.5  43.2
[32,]  17.5  22.1  411.8  477.6  3.9  14.7  46.9
[33,]  10.3  13.0  242.4  281.2  2.3   8.7  27.6
[34,]  10.1  12.8  238.0  276.0  2.3   8.5  27.1
[35,]   6.4   8.1  150.5  174.5  1.4   5.4  17.1
[36,]  23.3  29.4  548.0  635.5  5.2  19.6  62.3
[37,]  18.4  23.2  431.0  499.9  4.1  15.4  49.0
[38,]  27.3  34.4  640.9  743.3  6.1  22.9  72.9
[39,]  24.0  30.2  562.8  652.6  5.4  20.1  64.0
[40,]   8.3  10.5  194.7  225.7  1.9   6.9  22.1
[41,]   3.5   4.4   81.1   94.0  0.8   2.9   9.2
plot(USairpollution_recomposed2, Xtilde2); abline(0,1)

cor(as.vector(USairpollution_recomposed2), as.vector(Xtilde2))
[1] 0.9863995

C)

Consider low-dimensional approximations of the data matrix. What is the fewest number of dimensions required to yield a correlation between the entries of \(X\) and \(\tilde X\) of at least 0.9?

k <- 2
(Xtilde2 <- U[,1:k] %*% diag(D_matrix[1:k]) %*% t(V[,1:k])) %>% round(1)
       [,1]  [,2]   [,3]   [,4] [,5]  [,6]  [,7]
 [1,]   3.6   4.5   84.5   98.0  0.8   3.0   9.6
 [2,]   6.1   7.7  143.6  166.5  1.4   5.1  16.3
 [3,]  17.4  21.9  407.6  472.7  3.9  14.5  46.4
 [4,]  30.5  38.5  716.5  830.9  6.9  25.6  81.5
 [5,]  17.1  21.6  402.4  466.7  3.8  14.4  45.8
 [6,]   2.5   3.2   59.2   68.7  0.6   2.1   6.7
 [7,] 131.0 165.3 3076.9 3568.4 29.4 109.8 350.1
 [8,]  18.2  22.9  426.3  494.4  4.1  15.2  48.5
 [9,]  34.3  43.2  804.7  933.3  7.7  28.7  91.6
[10,]  16.4  20.7  386.1  447.8  3.7  13.8  43.9
[11,]  29.4  37.1  691.2  801.6  6.6  24.7  78.6
[12,]  19.2  24.2  450.6  522.5  4.3  16.1  51.3
[13,]   6.4   8.0  149.5  173.4  1.4   5.3  17.0
[14,]  51.1  64.5 1200.8 1392.6 11.5  42.8 136.6
[15,]  11.1  14.0  260.7  302.4  2.5   9.3  29.7
[16,]  39.1  49.3  917.4 1063.9  8.8  32.7 104.4
[17,]  22.4  28.3  526.8  611.0  5.0  18.8  59.9
[18,]  13.9  17.5  325.3  377.3  3.1  11.6  37.0
[19,]  17.7  22.4  416.8  483.3  4.0  14.9  47.4
[20,]   4.7   5.9  110.5  128.1  1.1   3.9  12.6
[21,]  18.0  22.7  422.4  489.9  4.0  15.1  48.1
[22,]  19.4  24.5  456.1  528.9  4.4  16.3  51.9
[23,]  11.1  14.0  260.9  302.6  2.5   9.3  29.7
[24,]  25.6  32.2  600.2  696.0  5.7  21.4  68.3
[25,]  28.5  36.0  669.2  776.1  6.4  23.9  76.1
[26,]  14.7  18.5  344.2  399.2  3.3  12.3  39.2
[27,]  11.6  14.6  271.4  314.8  2.6   9.7  30.9
[28,]   8.5  10.7  199.9  231.8  1.9   7.1  22.7
[29,]  10.8  13.6  253.4  293.9  2.4   9.0  28.8
[30,]  71.6  90.3 1681.0 1949.5 16.1  60.0 191.3
[31,]  16.1  20.4  379.3  439.8  3.6  13.5  43.2
[32,]  17.5  22.1  411.8  477.6  3.9  14.7  46.9
[33,]  10.3  13.0  242.4  281.2  2.3   8.7  27.6
[34,]  10.1  12.8  238.0  276.0  2.3   8.5  27.1
[35,]   6.4   8.1  150.5  174.5  1.4   5.4  17.1
[36,]  23.3  29.4  548.0  635.5  5.2  19.6  62.3
[37,]  18.4  23.2  431.0  499.9  4.1  15.4  49.0
[38,]  27.3  34.4  640.9  743.3  6.1  22.9  72.9
[39,]  24.0  30.2  562.8  652.6  5.4  20.1  64.0
[40,]   8.3  10.5  194.7  225.7  1.9   6.9  22.1
[41,]   3.5   4.4   81.1   94.0  0.8   2.9   9.2
cor(as.vector(USairpollution_recomposed2), as.vector(Xtilde2))
[1] 0.9863995
cor(as.vector(USairpollution_recomposed2), as.vector(Xtilde2))
[1] 0.9863995

Using 2 dimensions, I got a correlation of 0.986.

D)

Find \(\Sigma\), the covariance matrix of this data set. Then perform eigen-decomposition of this matrix. Verify that

  • The eigenvectors of \(\Sigma\) equal the columns of \(V\)
  • The eigenvalues of \(\Sigma\) equal the diagonals of \(D^2/(n-1)\)
n <- nrow(USairpollution)
USairpollution_centered <- scale(USairpollution, center = TRUE, scale = FALSE)
Sigma <- cov(USairpollution)
eigen_decomp <- eigen(Sigma)
eigenvalues_from_Sigma <- eigen_decomp$values
eigenvectors_from_Sigma <- eigen_decomp$vectors

eigenvalues_from_Sigma
[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
eigenvectors_from_Sigma
              [,1]         [,2]         [,3]        [,4]         [,5]
[1,]  0.0168607518 -0.099835625  0.208775573  0.95883106 -0.152191203
[2,] -0.0011417794  0.025814390 -0.071600745 -0.11014784 -0.477854201
[3,]  0.6968327936 -0.710249079 -0.067182201 -0.07319788 -0.009643654
[4,]  0.7170284512  0.692912523  0.056666935  0.04906669  0.010735457
[5,]  0.0004067530 -0.001011680  0.005386606 -0.01506609  0.025401917
[6,] -0.0004336922  0.001225937  0.265807619 -0.16261712 -0.832729325
[7,]  0.0028836950 -0.069155051  0.934279828 -0.18459052  0.232812295
             [,6]          [,7]
[1,] -0.053952911 -2.704138e-02
[2,] -0.852534945 -1.640507e-01
[3,] -0.002153023  1.136208e-03
[4,]  0.002915751 -5.682006e-05
[5,]  0.176541256 -9.838347e-01
[6,]  0.453206155  6.376788e-02
[7,] -0.183568735 -1.891454e-02
svd_decomp <- svd(USairpollution_centered)
D_values <- svd_decomp$d
V <- svd_decomp$v

D_values
[1] 5053.60055  769.72821  167.56610   90.55428   68.32413   21.96092    7.61237
print(all.equal(abs(eigenvectors_from_Sigma), abs(V)))
[1] TRUE
eigenvalues_from_SVD <- (D_values^2) / (n - 1)

eigenvalues_from_Sigma
[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
eigenvalues_from_SVD
[1] 6.384720e+05 1.481204e+04 7.019599e+02 2.050019e+02 1.167047e+02
[6] 1.205705e+01 1.448704e+00
print(all.equal(eigenvalues_from_Sigma, eigenvalues_from_SVD))
[1] TRUE

Problem 2

In this problem we explore how “high” a low-dimensional SVD approximation of an image has to be before you can recognize it.

.Rdata objects are essentially R workspace memory snapshots that, when loaded, load any type of R object that you want into your global environment. The command below, when executed, will load three objects into your memory: mysteryU4, mysteryD4, and mysteryV4. These are the first \(k\) vectors and singular values of an SVD I performed on a 700-pixels-tall \(\times\) 600-pixels-wide image of a well-known villain.

load('Data/mystery_person_k4.Rdata')

A)

Write a function that takes SVD ingredients u, d and v and renders the \(700 \times 600\) image produced by this approximation using functions from the magick package. Use your function to determine whether a 4-dimensional approximation to this image is enough for you to tell who the mystery villain is. Recall that you will likely need to rescale your recomposed approximation so that all pixels are in [0,1].

library(magick)
Linking to ImageMagick 6.9.13.29
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
render_svd_image <- function(u, d, v) {
  X_approx <- u %*% diag(d) %*% t(v)
  X_rescaled <- (X_approx - min(X_approx)) / (max(X_approx) - min(X_approx))
  m_height <- dim(u)[1]
  n_width <- dim(v)[1]
  img_array <- array(X_rescaled, dim = c(m_height, n_width, 1))
  img <- image_read(img_array)
  return(img)
}
image_k4 <- render_svd_image(mysteryU4, mysteryD4, mysteryV4)
print(image_k4)
# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG      600    700 Gray       FALSE        0 72x72  

Using only 4 dimensions is not enough to be able to tell what the image is.

B)

I’m giving you slightly higher-dimensional approximations (\(k=10\) and \(k=50\), respectively) in the objects below:

load('Data/mystery_person_k10.Rdata')
load('Data/mystery_person_k50.Rdata')

Create both of the images produced by these approximations. At what point can you tell who the mystery villain is?

render_svd_image <- function(u, d, v) {
  X_approx <- u %*% diag(d) %*% t(v)
  X_rescaled <- (X_approx - min(X_approx)) / (max(X_approx) - min(X_approx))
  m_height <- dim(u)[1]
  n_width <- dim(v)[1]
  img_array <- array(X_rescaled, dim = c(m_height, n_width, 1))
  img <- image_read(img_array)
  return(img)
}


image_k10 <- render_svd_image(mysteryU10, mysteryD10, mysteryV10)
print(image_k10)
# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG      600    700 Gray       FALSE        0 72x72  

image_k50 <- render_svd_image(mysteryU50, mysteryD50, mysteryV50)
print(image_k50)
# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG      600    700 Gray       FALSE        0 72x72  

No, when k = 10, you cannot determine who the picture is of. When k = 50 you can see it is Todd Iverson.

C)

How many numbers need to be stored in memory for each of the following:

  • A full \(700\times 600\) image? You would need 600 * 700, so 42000.
  • A 4-dimensional approximation? (700 * k) + k + (600 * k) = k * 1301 1301 * 4 = 5204
  • A 10-dimensional approximation? 1301 * 10 = 13010
  • A 50-dimensional approximation? 1301 * 50 = 60050