3.1 - Singular value decomposition

Motivation

  • We are moving towards principal component analysis (PCA), arguably the most popular form of dimension reduction.
  • Goal of PCA: create a low-dimensional representation of data matrix \(X\) that well-approximates \(X\)
  • Singular value decomposition (SVD) is the best tool for performing PCA, so we need to understand it first.
  • In these slides:
    • What is SVD?
    • How do we use SVD to form a lower-dimensional approximation of \(X\)?
    • How “good” is the low-dimensional approximation?

Singular value decomposition

Given \(n\times p\) data matrix \(X\), singular value decomposition (SVD) decomposes \(X\) into 3 matrices:

\[X = UDV^T\]

  • \(dim(U) = n\times p\)
  • \(dim(D) = p\times p\) diagonal matrix of the singular values
  • \(dim(V) = p\times p\)

Notes:

  • \(X\) must contain entirely numeric values
  • The above dimensions hold when \(p < n\) (the most common case); the decomposition is still defined when \(p > n\), but the dimensions differ (see the sketch below).
  • There’s a close relationship between SVD and eigen analysis of \(Cov(X)\): more later.
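To make the \(p>n\) caveat concrete, here is a quick sketch (using random matrices, not the slides’ running example) of the dimensions svd() returns in each case:

# p < n: U is n x p, D has p values, V is p x p
s1 <- svd(matrix(rnorm(9*5), nrow = 9, ncol = 5))
dim(s1$u); length(s1$d); dim(s1$v)   # 9x5, 5, 5x5

# p > n: only min(n, p) = n singular values exist,
# so U is n x n and V is p x n
s2 <- svd(matrix(rnorm(3*6), nrow = 3, ncol = 6))
dim(s2$u); length(s2$d); dim(s2$v)   # 3x3, 3, 6x3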

What does this have to do with dimension reduction?

  • An \(n\times p\) matrix \(X\) contains \(np\) numbers
  • A lower-dimensional representation of \(X\) has the same apparent dimensions, but requires storing (potentially many) fewer numbers in memory

Example: full recomposition

Suppose \(n=9\) and \(p=5\):

\[X_{n\times p} = U_{n\times p}D_{p\times p}V_{p\times p}^T\] \[\begin{pmatrix} X_{11} & X_{12} & X_{13} & X_{14} & X_{15} \\ X_{21} & X_{22} & X_{23} & X_{24} & X_{25} \\ X_{31} & X_{32} & X_{33} & X_{34} & X_{35} \\ X_{41} & X_{42} & X_{43} & X_{44} & X_{45} \\ X_{51} & X_{52} & X_{53} & X_{54} & X_{55} \\ X_{61} & X_{62} & X_{63} & X_{64} & X_{65} \\ X_{71} & X_{72} & X_{73} & X_{74} & X_{75} \\ X_{81} & X_{82} & X_{83} & X_{84} & X_{85} \\ X_{91} & X_{92} & X_{93} & X_{94} & X_{95} \\ \end{pmatrix} = \begin{pmatrix} U_{11} & U_{12} & U_{13} & U_{14} & U_{15} \\ U_{21} & U_{22} & U_{23} & U_{24} & U_{25} \\ U_{31} & U_{32} & U_{33} & U_{34} & U_{35} \\ U_{41} & U_{42} & U_{43} & U_{44} & U_{45} \\ U_{51} & U_{52} & U_{53} & U_{54} & U_{55} \\ U_{61} & U_{62} & U_{63} & U_{64} & U_{65} \\ U_{71} & U_{72} & U_{73} & U_{74} & U_{75} \\ U_{81} & U_{82} & U_{83} & U_{84} & U_{85} \\ U_{91} & U_{92} & U_{93} & U_{94} & U_{95} \end{pmatrix}\begin{pmatrix} D_1 & 0 & 0 & 0 & 0 \\ 0 & D_2 & 0 & 0 & 0 \\ 0 & 0 & D_3 & 0 & 0 \\ 0 & 0 & 0 & D_4 & 0 \\ 0 & 0 & 0 & 0 & D_5 \end{pmatrix} \begin{pmatrix} V_{11} & V_{12} & V_{13} & V_{14} & V_{15} \\ V_{21} & V_{22} & V_{23} & V_{24} & V_{25} \\ V_{31} & V_{32} & V_{33} & V_{34} & V_{35} \\ V_{41} & V_{42} & V_{43} & V_{44} & V_{45} \\ V_{51} & V_{52} & V_{53} & V_{54} & V_{55} \end{pmatrix}^T\]

  • Note, \(p=5\) is the “full dimensional” representation of \(X\).
  • \(X\) has 45 numbers

Low-dimensional approximation

  • We can approximate \(X\) by using the first \(k\) columns of \(U\) and \(V\) and the first \(k\) singular values of \(D\).
  • Visual for \(k=2\):

\[\color{red}{\tilde X_{n\times p}} = \begin{pmatrix} \color{red} {U_{11}} & \color{red} {U_{12}} & U_{13} & U_{14} & U_{15} \\ \color{red} {U_{21}} & \color{red} {U_{22}} & U_{23} & U_{24} & U_{25} \\ \color{red} {U_{31}} & \color{red} {U_{32}} & U_{33} & U_{34} & U_{35} \\ \color{red} {U_{41}} & \color{red} {U_{42}} & U_{43} & U_{44} & U_{45} \\ \color{red} {U_{51}} & \color{red} {U_{52}} & U_{53} & U_{54} & U_{55} \\ \color{red} {U_{61}} & \color{red} {U_{62}} & U_{63} & U_{64} & U_{65} \\ \color{red} {U_{71}} & \color{red} {U_{72}} & U_{73} & U_{74} & U_{75} \\ \color{red} {U_{81}} & \color{red} {U_{82}} & U_{83} & U_{84} & U_{85} \\ \color{red} {U_{91}} & \color{red} {U_{92}} & U_{93} & U_{94} & U_{95} \end{pmatrix}\begin{pmatrix} \color{red}{D_1} & 0 & 0 & 0 & 0 \\ 0 & \color{red}{D_2} & 0 & 0 & 0 \\ 0 & 0 & D_3 & 0 & 0 \\ 0 & 0 & 0 & D_4 & 0 \\ 0 & 0 & 0 & 0 & D_5 \end{pmatrix}\begin{pmatrix} \color{red} {V_{11}} & \color{red} {V_{12}} & V_{13} & V_{14} & V_{15} \\ \color{red} {V_{21}} & \color{red} {V_{22}} & V_{23} & V_{24} & V_{25} \\ \color{red} {V_{31}} & \color{red} {V_{32}} & V_{33} & V_{34} & V_{35} \\ \color{red} {V_{41}} & \color{red} {V_{42}} & V_{43} & V_{44} & V_{45} \\ \color{red} {V_{51}} & \color{red} {V_{52}} & V_{53} & V_{54} & V_{55} \end{pmatrix}^T\]

  • Note that \(\color{red}{\tilde X_{n\times p}}\) still contains 45 numbers
  • However, only \((9\times 2)+2+ (2\times 5) = 30\) numbers need to be stored in memory to create this approximation
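In general, a \(k\)-dimensional approximation stores \(nk + k + pk\) numbers instead of \(np\). A minimal sketch of this bookkeeping (svd_storage is a hypothetical helper, not part of the slides’ code):

# count entries stored: n*k for U, k for D, p*k for V
svd_storage <- function(n, p, k) {
  c(full = n * p, rank_k = n * k + k + p * k)
}
svd_storage(n = 9, p = 5, k = 2)   # full = 45, rank_k = 30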

SVD in R

Consider the example below:

library(tidyverse)
set.seed(83025)
X <- matrix(sample(0:9, 9*5, replace=TRUE), 
            nrow=9, ncol=5)
X
      [,1] [,2] [,3] [,4] [,5]
 [1,]    9    4    9    5    1
 [2,]    5    2    8    3    9
 [3,]    0    4    0    5    9
 [4,]    4    2    1    9    7
 [5,]    8    5    5    6    9
 [6,]    7    4    5    0    3
 [7,]    3    5    0    0    6
 [8,]    5    0    7    6    6
 [9,]    0    0    7    8    9
components <- svd(X)
U <- components$u
D <- components$d
V <- components$v
U %>% round(2)
       [,1]  [,2]  [,3]  [,4]  [,5]
 [1,] -0.35  0.60  0.16 -0.32 -0.54
 [2,] -0.39  0.06  0.05  0.58  0.26
 [3,] -0.26 -0.49 -0.30  0.00 -0.42
 [4,] -0.33 -0.30  0.01 -0.65  0.30
 [5,] -0.45  0.04 -0.28 -0.15  0.22
 [6,] -0.24  0.41 -0.28  0.16  0.04
 [7,] -0.18 -0.08 -0.60  0.17 -0.11
 [8,] -0.35  0.06  0.34  0.02  0.41
 [9,] -0.37 -0.36  0.50  0.25 -0.37
D %>% round(2)
[1] 33.15 12.78  9.36  6.50  2.75
V %>% round(2)
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] -0.42  0.60 -0.28 -0.31  0.53
[2,] -0.24  0.11 -0.64 -0.11 -0.72
[3,] -0.46  0.45  0.52  0.47 -0.32
[4,] -0.45 -0.33  0.42 -0.70 -0.17
[5,] -0.60 -0.56 -0.26  0.43  0.28

Putting \(X\) back together

X
      [,1] [,2] [,3] [,4] [,5]
 [1,]    9    4    9    5    1
 [2,]    5    2    8    3    9
 [3,]    0    4    0    5    9
 [4,]    4    2    1    9    7
 [5,]    8    5    5    6    9
 [6,]    7    4    5    0    3
 [7,]    3    5    0    0    6
 [8,]    5    0    7    6    6
 [9,]    0    0    7    8    9
(U %*% diag(D) %*% t(V)) %>% round(1)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    9    4    9    5    1
 [2,]    5    2    8    3    9
 [3,]    0    4    0    5    9
 [4,]    4    2    1    9    7
 [5,]    8    5    5    6    9
 [6,]    7    4    5    0    3
 [7,]    3    5    0    0    6
 [8,]    5    0    7    6    6
 [9,]    0    0    7    8    9

Approximating \(X\) with \(k=2\) singular values

X
      [,1] [,2] [,3] [,4] [,5]
 [1,]    9    4    9    5    1
 [2,]    5    2    8    3    9
 [3,]    0    4    0    5    9
 [4,]    4    2    1    9    7
 [5,]    8    5    5    6    9
 [6,]    7    4    5    0    3
 [7,]    3    5    0    0    6
 [8,]    5    0    7    6    6
 [9,]    0    0    7    8    9
k <- 2
(Xtilde2 <- U[,1:k] %*% diag(D[1:k]) %*% t(V[,1:k])) %>% round(1)
      [,1] [,2] [,3] [,4] [,5]
 [1,]  9.6  3.7  8.7  2.7  2.7
 [2,]  5.9  3.2  6.2  5.6  7.3
 [3,] -0.2  1.4  1.1  6.0  8.6
 [4,]  2.3  2.2  3.2  6.2  8.6
 [5,]  6.6  3.7  7.0  6.5  8.6
 [6,]  6.5  2.5  5.9  1.8  1.8
 [7,]  1.9  1.4  2.3  3.1  4.2
 [8,]  5.3  2.9  5.6  4.9  6.5
 [9,]  2.4  2.4  3.5  7.0  9.8

Approximating \(X\) with \(k=3\) singular values

X
      [,1] [,2] [,3] [,4] [,5]
 [1,]    9    4    9    5    1
 [2,]    5    2    8    3    9
 [3,]    0    4    0    5    9
 [4,]    4    2    1    9    7
 [5,]    8    5    5    6    9
 [6,]    7    4    5    0    3
 [7,]    3    5    0    0    6
 [8,]    5    0    7    6    6
 [9,]    0    0    7    8    9
k <- 3
(Xtilde3 <- U[,1:k] %*% diag(D[1:k]) %*% t(V[,1:k])) %>% round(1)
      [,1] [,2] [,3] [,4] [,5]
 [1,]  9.1  2.7  9.5  3.3  2.3
 [2,]  5.8  2.9  6.5  5.7  7.2
 [3,]  0.6  3.2 -0.4  4.8  9.3
 [4,]  2.2  2.1  3.3  6.2  8.6
 [5,]  7.4  5.3  5.7  5.4  9.2
 [6,]  7.3  4.2  4.6  0.7  2.5
 [7,]  3.5  4.9 -0.6  0.7  5.6
 [8,]  4.4  0.8  7.3  6.3  5.6
 [9,]  1.0 -0.6  5.9  8.9  8.6

\(X\) vs \(\tilde X\)

\(k=2\):

plot(X, Xtilde2); abline(0,1)

cor(as.vector(X), as.vector(Xtilde2))
[1] 0.8203752

\(k=3\):

plot(X, Xtilde3); abline(0,1)

cor(as.vector(X), as.vector(Xtilde3))
[1] 0.9387363
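Correlation is one yardstick; another (a quick sketch, output not shown) is the relative Frobenius-norm error, which shrinks as \(k\) grows:

# relative Frobenius-norm error of each approximation
norm(X - Xtilde2, type = "F") / norm(X, type = "F")
norm(X - Xtilde3, type = "F") / norm(X, type = "F")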

SVD application: image compression

  • Using SVD to approximate a data matrix in lower dimensions has application in image compression
  • Grayscale images can be represented as \(n\times p\) matrices of decimals
    • 1 = full white
    • 0 = full black
  • Full-color images, when represented in RGB format, are simply three \(n\times p\) matrices whose entries give the amount of red, green, and blue in each pixel; together these form an \(n \times p \times 3\) array.
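A tiny illustration of this encoding (a sketch, not from the slides): base R’s as.raster() interprets a numeric matrix with values in \([0,1]\) as exactly this kind of grayscale image.

# a 2 x 2 grayscale "image": 1 = white, 0 = black, in-between = grey
(m <- matrix(c(1, 0.7, 0.3, 0), nrow = 2))
as.raster(m)   # converts grey levels to pixel colors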

Images as matrices

  • The magick package is excellent for image analysis in R
  • We’ll work with a Creative Commons-licensed image from Wikimedia Commons (URL in the code below)
library(magick)
url <- 'https://upload.wikimedia.org/wikipedia/commons/8/8f/Taylor_Swift_GMA_2012.jpg'
(color_taylor <- image_read(url)
      %>% image_resize(geometry = geometry_size_pixels(width=200))
)

(grey_taylor <- color_taylor
      %>% image_convert(type='grayscale')
)

Image \(\rightarrow\) data matrix

Note that this is an \(n=275\)-by-\(p=200\) matrix!

print(grey_taylor)
# A tibble: 1 × 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 JPEG     200    275 Gray       FALSE        0 72x72  

(taylor_data <- grey_taylor
    %>% image_data
    %>% as.numeric
    %>% matrix(nrow = 275, ncol = 200)
)[1:5,1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 0.5529412 0.5647059 0.5607843 0.5411765 0.5215686
[2,] 0.5607843 0.5568627 0.5490196 0.5294118 0.5137255
[3,] 0.5647059 0.5568627 0.5411765 0.5372549 0.5176471
[4,] 0.5568627 0.5490196 0.5372549 0.5294118 0.5176471
[5,] 0.5529412 0.5607843 0.5490196 0.5450980 0.5137255

Total number of entries: \(275 \times 200 = 55,000\)

Compression with \(k=2\)

  • SVD the data matrix:
decompose_taylor <- svd(taylor_data)
U <- decompose_taylor$u
D <- decompose_taylor$d
V <- decompose_taylor$v
  • Low dimensional approximation:
k <- 2
recompose_taylor <- U[,1:k] %*% diag(D[1:k]) %*% t(V[,1:k])
  • Approximations may produce negative entries or entries \(>1\); valid grayscale values must lie in \([0,1]\), so rescale to that range:
R <- max(recompose_taylor)-min(recompose_taylor)
recomposed_scaled_taylor <- (recompose_taylor-min(recompose_taylor))/R

Data matrix \(\rightarrow\) image

From data matrix to image:

(recomposed_scaled_taylor
  %>% as.raster
  %>% image_read
 )

…okay, so pretty bad!

But remember this was only a 2-dimensional representation of 200-dimensional “data”! (And even here, you can kind of see a smeared eye/nose/mouth…)

Higher-dimensional approximations

Write a function to recompose the image for various \(k\) (remember, original image required 55,000 numbers in memory):

recompose <- \(k) {
  recompose_taylor <- U[,1:k] %*% diag(D[1:k]) %*% t(V[,1:k])
  R <- max(recompose_taylor)-min(recompose_taylor)
  recomposed_scaled_data <- (recompose_taylor-min(recompose_taylor))/R
  recomposed_image <- (recomposed_scaled_data
                       %>% as.raster
                       %>% image_read
                       )
  return(recomposed_image)
}

\(k = 5\)
\((275\times 5 + 5 + 200\times 5 = 2380)\)

recompose(5)

\(k = 10\)
\((275\times 10 + 10 + 200\times 10 = 4760)\)

recompose(10)

\(k = 20\)
\((275\times 20 + 20 + 200\times 20 = 9520)\)

recompose(20)
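The same storage arithmetic as above, looped over \(k\) (a small sketch): even \(k=20\) needs well under a fifth of the original 55,000 numbers.

# fraction of the original 275*200 = 55,000 numbers needed for each k
k <- c(2, 5, 10, 20)
round((275*k + k + 200*k) / (275*200), 3)   # 0.017 0.043 0.087 0.173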

What’s with \(U\), \(D\), and \(V\)?

  • We’ve seen how:
    • We can decompose \(X\) into ingredients \(U\), \(D\), and \(V\)
    • We can approximate \(X\) by using the first few columns of \(U\) and \(V\) and singular values of \(D\).
  • But what are these matrices actually doing?
  • Roles:
    • \(U\): projects the data onto a \(p\)-dimensional uncorrelated “soccer” ball (sorry, rest of world)
    • \(D\): stretches it into a \(p\)-dimensional ellipse (“American” football) whose major/minor axes are parallel to the coordinate axes
    • \(V\): rotates the football
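The “ball” claim can be checked directly (a quick sketch, recomputing the SVD of the small \(X\) from earlier): the columns of \(U\) are orthonormal and \(V\) is orthogonal, so \(U^TU\) and \(V^TV\) are both identity matrices.

s <- svd(X)
round(t(s$u) %*% s$u, 10)   # 5 x 5 identity: columns of U are orthonormal
round(t(s$v) %*% s$v, 10)   # 5 x 5 identity: V is a rotation/reflection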

Simulate 2D data

  • To investigate, let’s simulate some 2-dimensional data, and mean-center it:
library(MASS)

# parameters
mu <- c(4, 3)
sig1 <- sqrt(2)
sig2 <- sqrt(2)
rho <- 0.8
n <- 40

# Covariance matrix
Sigma <- matrix(c(sig1^2, rho*sig1*sig2,
                  rho*sig1*sig2, sig2^2), 2, 2)

# Simulate n observations
set.seed(322)  
(df <- mvrnorm(n=n, mu=mu, Sigma=Sigma)
  %>% data.frame
  %>% scale(center=TRUE, scale = FALSE)
) %>% head
             X1         X2
[1,] -1.3481310  1.1575286
[2,] -1.1442426 -1.5656080
[3,] -3.3689315 -2.2161506
[4,]  1.7510125  1.7360408
[5,] -2.0321339 -0.8724724
[6,] -0.1900881 -0.3201564

Plot of simulated data

lim <- c(-5,5)
sz <- 3   # point size used in the scatterplots below

base_canvas <- ggplot() +
  scale_x_continuous(breaks=seq(min(lim), max(lim), by=1),limits=lim) +
    scale_y_continuous(breaks=seq(min(lim), max(lim), by=1),limits=lim) + 
  geom_vline(aes(xintercept = 0)) +  geom_hline(aes(yintercept = 0)) +  
  theme_minimal(base_size = 18) + 
  theme(panel.grid.minor = element_blank()) + 
  labs(x='', y='')


base_canvas + 
  geom_point(aes(x = X1, y = X2), data = df, size = sz) 

Decomposition

decompose_df <- svd(df)

(U <- decompose_df$u) %>% head
            [,1]        [,2]
[1,] -0.02896418  0.45460900
[2,] -0.14872553 -0.14191092
[3,] -0.32127917  0.07790525
[4,]  0.19544748  0.08050787
[5,] -0.17117414  0.14316095
[6,] -0.02763367 -0.03601667
(D <- decompose_df$d)
[1] 12.517035  3.826384
(V <- decompose_df$v)
          [,1]       [,2]
[1,] 0.7925337 -0.6098282
[2,] 0.6098282  0.7925337

Plot of \(U\)

U_df <- data.frame(U)
base_canvas + 
  geom_point(aes(x = X1, y = X2), data = U_df, size = sz) 

Plot of \(UD\)

UD_df <- data.frame(U%*% diag(D))
base_canvas + 
  geom_point(aes(x = X1, y = X2), data = UD_df, size = sz) 

Plot of \(UDV^T\)

UDVt_df <- data.frame(U%*% diag(D)%*%t(V))
base_canvas + 
  geom_point(aes(x = X1, y = X2), data = UDVt_df, size = sz) 

Following the story for a few points

  • \(U\): Uncorrelated ball
  • Also shown, vectors \(\color{purple}V\):
V
          [,1]       [,2]
[1,] 0.7925337 -0.6098282
[2,] 0.6098282  0.7925337

  • \(D = (12.5, 3.8)\); multiply:
    • horizontal coordinates by 12.5
    • vertical coordinates by 3.8
  • Result: an axis-aligned (still uncorrelated) ellipse

  • \(\color{purple}{V}\): direction in which to rotate the data
    • Major axis points in \(\color{purple}{V[,1]}\) direction
    • Minor axis points in \(\color{purple}{V[,2]}\) direction

An alternative perspective

We could also consider \(DV^T\) as simultaneous instructions for both rotation and stretching:

“Stretch a lot along the purple axis, and a little along the green axis.”
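Numerically (a quick sketch with the 2D components above): \(DV^T\) is a single \(2\times 2\) linear map, and applying it to the ball \(U\) recovers the centered data.

# DV^T packages stretching (D) and rotation (V) into one 2 x 2 map
DVt <- diag(D) %*% t(V)
DVt
# applying it to the "ball" U reproduces the centered data
max(abs(U %*% DVt - df))   # effectively zero (floating-point error)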

3D example

  • We can still visualize this process for 3D data
  • First step: simulate some!
# parameters
mu <- c(4, 3, 3)
sig1 <- sqrt(2) 
sig2 <- sqrt(4)
sig3 <- sqrt(3)
rho <- 0.8
n <- 40
rho13 <- .6
# Covariance matrix
Sigma <- matrix(c(sig1^2, rho*sig1*sig2,rho13*sig1*sig3,
                  rho*sig2*sig1, sig2^2, rho*sig2*sig3,
                  rho13*sig3*sig1, rho*sig3*sig2, sig3^2),
                  3, 3)
# Simulate n observations
set.seed(322)  
(df3 <- mvrnorm(n=n, mu=mu, Sigma=Sigma)
  %>% data.frame
  %>% scale(center=TRUE, scale = FALSE)
) %>% head(4)
             X1         X2        X3
[1,] -2.2216923 -0.3097486  1.747199
[2,] -0.8786427 -1.8954720 -1.899526
[3,] -2.6118794 -5.0582880 -1.856903
[4,]  1.9405064  2.1099303  2.234284

3D decomposition

decompose_3d  <- svd(df3)
U <- decompose_3d$u
D <- decompose_3d$d
V <- decompose_3d$v

Plot of \(U\)

Plot of \(UD\)

In which direction (\(UD[,1]\), \(UD[,2]\), or \(UD[,3]\)) is the variability greatest? Least?
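A quick numerical answer (a sketch using the 3D components above): the sample variance along column \(i\) of \(UD\) is \(d_i^2/(n-1)\), so variability is greatest along \(UD[,1]\) and least along \(UD[,3]\).

# sample variance of each UD column equals d_i^2 / (n-1)
apply(U %*% diag(D), 2, var)
D^2 / (n - 1)   # same three numbers, in decreasing order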

Plot of \(UDV^T\)

Relationship between SVD and eigendecomposition

  • Hopefully, this idea of rotating/stretching rings some bells from our discussion of eigendecomposition!
  • Turns out there’s a direct relationship between SVD of data matrix \(X\) and eigendecomposition of \(\Sigma\), its covariance.
  • In general:
    • \(V\) contains the eigenvectors of \(X^TX\)
    • the squared singular values \(D^2\) are the eigenvalues of \(X^T X\)
  • Assuming \(X\) is mean-centered, \(\Sigma = \frac{1}{n-1}X^TX\)
    • Eigenvectors of \(\Sigma\) span the same axes as the columns of \(V\) (signs may flip)
    • Eigenvalues of \(\Sigma\) are the squared singular values scaled by \(\frac{1}{n-1}\): \(\lambda_i = d_i^2/(n-1)\)

Relationships

Check \(\Sigma = \frac{1}{n-1}X^T X\):

cov(df3)
         X1       X2       X3
X1 2.902583 3.120883 1.809516
X2 3.120883 4.420379 2.561156
X3 1.809516 2.561156 2.537923
t(df3)%*%df3/(n-1)
         X1       X2       X3
X1 2.902583 3.120883 1.809516
X2 3.120883 4.420379 2.561156
X3 1.809516 2.561156 2.537923

SVD \(X\) and eigendecompose \(\Sigma\):

svdX <- svd(df3)
eigenSigma <- eigen(cov(df3))

Verify that the eigenvectors of \(\Sigma\) span the same axes as \(V\) (note the flipped signs):

eigenSigma$vectors
           [,1]       [,2]       [,3]
[1,] -0.5400369 -0.4958369  0.6800779
[2,] -0.7019966 -0.1803834 -0.6889576
[3,] -0.4642853  0.8494749  0.2506621
svdX$v
          [,1]       [,2]       [,3]
[1,] 0.5400369  0.4958369 -0.6800779
[2,] 0.7019966  0.1803834  0.6889576
[3,] 0.4642853 -0.8494749 -0.2506621

Verify that the eigenvalues of \(\Sigma\) equal \(d_i^2/(n-1)\):

eigenSigma$values
[1] 8.5151266 0.9378591 0.4079004
svdX$d^2/(n-1)
[1] 8.5151266 0.9378591 0.4079004