Creating Vectors in R
Vectors can be created and analyzed in a few different ways in R.
One way is to create the vector yourself, and there are a few different means by which you can do this.
The rep() command in R creates a vector that repeats the same element a prescribed number of times, while the seq() command creates a vector that follows a prescribed pattern. A vector can also be created manually by using the c() command, which stands for "combine".
# Creating three 3's and four 4's, respectively
rep(3, 3)
## [1] 3 3 3
rep(4, 4)
## [1] 4 4 4 4
# Creating a vector with the first three even numbers and the first three odd numbers
seq(2, 6, by = 2)
## [1] 2 4 6
seq(1, 5, by = 2)
## [1] 1 3 5
# Re-creating the previous four vectors using the 'c' command
c(3, 3, 3)
## [1] 3 3 3
c(4, 4, 4, 4)
## [1] 4 4 4 4
c(2, 4, 6)
## [1] 2 4 6
c(1, 3, 5)
## [1] 1 3 5
Awesome! You’ve created vectors using some commands in R.
The Algebra of Vectors
Vectors can be combined to make new vectors, much in the same way that numbers can. To add two vectors together in R, simply use the + command. Vectors can also be scaled via multiplication by a real number (or scalar), using the * command. Multiplication of two vectors is a trickier topic theoretically, but in R, component-wise multiplication of two vectors is legal and accomplished using the * command.
The vectors x, y, and z have been loaded for you.
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(2, 4, 6, 8, 10, 12, 14)
z <- c(1, 1, 2)
# Add x to y and print
print(x + y)
## [1] 3 6 9 12 15 18 21
# Multiply z by 2 and print
print(2*z)
## [1] 2 2 4
# Multiply x and y by each other and print
print(x*y)
## [1] 2 8 18 32 50 72 98
# Add x to z, if possible, and print
print(x + z)
## Warning in x + z: longer object length is not a multiple of shorter object
## length
## [1] 2 3 5 5 6 8 8
Great! Vector operations are a useful way to create other vectors. When the vectors are not the same length, R recycles elements of the shorter vector to obtain the given result (with a warning when the longer length is not a multiple of the shorter).
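When the longer vector's length is an exact multiple of the shorter one's, the recycling happens silently. Here is a minimal sketch (with made-up numbers) of that behavior:
# The shorter vector is recycled three times, with no warning
c(1, 2, 3, 4, 5, 6) + c(10, 20)
## [1] 11 22 13 24 15 26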
Creating Matrices in R
Matrices can be created and analyzed in a few different ways in R.
One way is to create the matrix yourself. There are a few different ways you can do this.
The matrix(a, nrow = b, ncol = c) command in R creates a matrix that repeats the element a in a matrix with b rows and c columns. A matrix can be manually created by using the c() command as well.
A <- matrix(1, nrow = 2, ncol = 2)
# Create matrices of all 1's and all 2's that are 2 by 3 and 3 by 2, respectively
matrix(1, nrow = 2, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
print(matrix(2, nrow = 3, ncol = 2))
## [,1] [,2]
## [1,] 2 2
## [2,] 2 2
## [3,] 2 2
# Create a matrix, changing the byrow designation
B <- matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = FALSE)
B <- matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = TRUE)
# Add A to the previously-created matrix
A + B
## [,1] [,2]
## [1,] 2 3
## [2,] 4 3
Good work! You created matrices and added them to existing ones!
For matrix-vector multiplication, the vector needs to have the same number of elements as the matrix has columns.
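As a quick sketch of this rule, consider a hypothetical 2 x 3 matrix:
# A 2 x 3 matrix can multiply a length-3 vector, since length(x) must equal ncol(A)
A <- matrix(1:6, nrow = 2, ncol = 3)
x <- c(1, 0, -1)
A%*%x    # conformable: one result entry per row of A
# A%*%c(1, 0) would fail with a "non-conformable arguments" error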
Matrix Multiplication as a Transformation
Matrices can be viewed as a way to transform collections of vectors into other vectors.
These transformations can take many forms, but the simplest ones in two dimensions are stretches or shrinkages (in either coordinate), reflections (e.g. about the x-axis, y-axis, origin, the line y = x), and rotations (clockwise, counter-clockwise).
Multiplication of a vector by a matrix is accomplished using the %*% command.
A <- matrix(c(4, 0, 0, 1), 2, 2)
B <- matrix(c(1, 0, 0.0, 2/3), 2, 2)
b <- c(1,1)
A
## [,1] [,2]
## [1,] 4 0
## [2,] 0 1
B
## [,1] [,2]
## [1,] 1 0.0000000
## [2,] 0 0.6666667
b
## [1] 1 1
# Multiply A by b
A%*%b
## [,1]
## [1,] 4
## [2,] 1
# Multiply B by b
B%*%b
## [,1]
## [1,] 1.0000000
## [2,] 0.6666667
You’ve seen how multiplication by a matrix can alter a vector.
Reflections
In the last exercise, you looked at stretching or shrinking components of a vector.
In this one, you'll apply a reflection matrix to the vector b <- c(2, 1).
A <- matrix(c(-1, 0, 0, 1), 2, 2)
B <- matrix(c(1, 0, 0, -1), 2, 2)
C <- matrix(c(-4, 0, 0, -2), 2, 2)
b <- c(2, 1)
A
## [,1] [,2]
## [1,] -1 0
## [2,] 0 1
B
## [,1] [,2]
## [1,] 1 0
## [2,] 0 -1
C
## [,1] [,2]
## [1,] -4 0
## [2,] 0 -2
b
## [1] 2 1
# Multiply A by b
A%*%b
## [,1]
## [1,] -2
## [2,] 1
# Multiply B by b
B%*%b
## [,1]
## [1,] 2
## [2,] -1
# Multiply C by b
C%*%b
## [,1]
## [1,] -8
## [2,] -2
Awesome! You’ve seen how multiplication by a matrix can alter a vector.
Note that component-wise multiplication (*) is different from matrix multiplication (%*%).
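A small sketch with arbitrary 2 x 2 matrices makes the difference concrete:
# Component-wise multiplication pairs up matching entries
A <- matrix(c(1, 2, 3, 4), 2, 2)
B <- matrix(c(5, 6, 7, 8), 2, 2)
A*B      # entries: 5 21 / 12 32
# Matrix multiplication takes dot products of rows with columns
A%*%B    # entries: 23 31 / 34 46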
Matrix Multiplication - Order Matters
In the last lesson, we studied how matrices act on vectors (stretches, shrinkages, reflections, rotations, etc.) and transform vectors into new vectors.
The successive application of these matrices can produce complex transformations, but because matrix multiplication is not commutative, the order of these transformations matters.
> A
[,1] [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068 0.7071068
represents rotation of a 2-dimensional vector by 45 degrees counterclockwise.
> B
[,1] [,2]
[1,] 1 0
[2,] 0 -1
represents a reflection about the x (first) axis.
A <- matrix(c(0.7071068, 0.7071068, -0.7071068, 0.7071068), nrow = 2, ncol = 2)
B <- matrix(c(1, 0, 0, -1), nrow = 2, ncol = 2)
b <- c(1, 1)
# Multiply A by B
A%*%B
## [,1] [,2]
## [1,] 0.7071068 0.7071068
## [2,] 0.7071068 -0.7071068
# Multiply A on the right of B
B%*%A
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] -0.7071068 -0.7071068
# Multiply the product of A and B by the vector b
A%*%B%*%b
## [,1]
## [1,] 1.414214
## [2,] 0.000000
# Multiply A on the right of B, and then by the vector b
B%*%A%*%b
## [,1]
## [1,] 0.000000
## [2,] -1.414214
Great Work! You’ve demonstrated that matrix multiplication is not commutative.
Intro to The Matrix Inverse
We talked briefly about the identity matrix in the video. Another important concept to understand in matrix multiplication is that of the matrix inverse.
For any number \(a\) (aside from \(0\)), there’s always a number \(\frac{1}{a}\) that can be used to “undo” multiplication by \(a\).
For matrices, this is not always true. However, when such a matrix exists, we call the matrix that, when multiplied by \(A\), yields the identity matrix \(I\) the inverse of \(A\).
The solve() function in R will find the inverse of a matrix if it exists and produce an error if it does not.
A <- matrix(c(1, -1, 2, 2), nrow = 2, ncol = 2)
A
## [,1] [,2]
## [1,] 1 2
## [2,] -1 2
# Take the inverse of the 2 by 2 identity matrix
solve(diag(2))
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
# Take the inverse of the matrix A
Ainv <- solve(A)
# Multiply A inverse by A
Ainv%*%A
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
# Multiply A by its inverse
A%*%Ainv
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
Awesome. You’ve discovered how to compute the inverse of a matrix, if it exists.
A matrix \(A\) multiplied by a vector \(\vec{x}\) is a linear combination of the columns of \(A\), weighted by the elements of \(\vec{x}\).
In other words, the elements of \(\vec{x}\) are the coefficients used when one creates \(\vec{b} = A\vec{x}\) from the columns of \(A\)!
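Here is a minimal sketch of that fact, reusing the matrix A from above with a hypothetical vector x:
A <- matrix(c(1, -1, 2, 2), nrow = 2, ncol = 2)
x <- c(3, 5)
# The matrix-vector product...
A%*%x
# ...matches the columns of A weighted by the elements of x
# (same numbers; one prints as a matrix, the other as a vector)
x[1]*A[, 1] + x[2]*A[, 2]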
Exploring WNBA Data
In this chapter, you will work with a matrix-vector model for team strength in the Women’s National Basketball Association (WNBA) at the conclusion of the 2017 season. These team strengths can be used to predict who will win a match between any two teams, for example, using machine learning.
The WNBA has 12 teams, so your Massey Matrix \(M\) is built from a 12x12 system of games played, and will initially be loaded as a data frame. The vector of point differentials has one entry per team. (Note: the CSV files loaded below already include the extra adjustment row, column, and 0 entry discussed later in this chapter, so both objects print with 13 rows.)
library(readr)
M <- read_csv("_data/WNBA_Data_2017_M.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Atlanta = col_double(),
## Chicago = col_double(),
## Connecticut = col_double(),
## Dallas = col_double(),
## Indiana = col_double(),
## `Los Angeles` = col_double(),
## Minnesota = col_double(),
## `New York` = col_double(),
## Phoenix = col_double(),
## `San Antonio` = col_double(),
## Seattle = col_double(),
## Washington = col_double(),
## WNBA = col_double()
## )
f <- read_csv("_data/WNBA_Data_2017_f.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Differential = col_double()
## )
# Print the Massey Matrix M
print(M)
## # A tibble: 13 x 13
## Atlanta Chicago Connecticut Dallas Indiana `Los Angeles` Minnesota `New York`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 33 -4 -2 -3 -3 -3 -3 -3
## 2 -4 33 -3 -3 -3 -3 -2 -3
## 3 -2 -3 34 -3 -3 -3 -3 -4
## 4 -3 -3 -3 34 -3 -4 -3 -3
## 5 -3 -3 -3 -3 33 -3 -3 -3
## 6 -3 -3 -3 -4 -3 41 -8 -3
## 7 -3 -2 -3 -3 -3 -8 41 -3
## 8 -3 -3 -4 -3 -3 -3 -3 34
## 9 -3 -3 -4 -2 -3 -6 -4 -3
## 10 -3 -3 -3 -3 -3 -3 -3 -2
## 11 -3 -3 -3 -3 -2 -2 -3 -3
## 12 -3 -3 -3 -4 -4 -3 -6 -4
## 13 1 1 1 1 1 1 1 1
## # ... with 5 more variables: Phoenix <dbl>, `San Antonio` <dbl>, Seattle <dbl>,
## # Washington <dbl>, WNBA <dbl>
# Print the vector of point differentials f
print(f)
## # A tibble: 13 x 1
## Differential
## <dbl>
## 1 -135
## 2 -171
## 3 152
## 4 -104
## 5 -308
## 6 292
## 7 420
## 8 83
## 9 -4
## 10 -213
## 11 -5
## 12 -7
## 13 0
# Find the sum of the first column of M
sum(M[, 1])
## [1] 1
# Find the sum of the vector f
sum(f)
## [1] 0
Great! Notice that, since every point scored by one team is allowed by another, the vector f of point differentials needs to sum to zero!
The WNBA Massey matrix is not (computationally) invertible, so an adjustment needs to be made to our model.
One way to fix this is to add a row of 1's to the bottom of the matrix \(M\), a column of -1's to the far right of \(M\), and a 0 to the bottom of the vector of point differentials \(\vec{f}\).
In the setting of rating teams, that row of 1's stipulates (via the final equation of the system) that the ratings for the entire league sum to zero. This condition is good not only mathematically, but also for consistency in the rating system!
Adjusting the Massey Matrix
For our WNBA Massey Matrix model, some adjustments need to be made for a solution to our rating problem to exist and be unique. This is because the matrix we currently have is not (computationally) invertible.
In this exercise, you will actually add all of these adjustments to the matrix \(M\). (Since the CSV loaded here already includes the WNBA column, the rep() vectors below get recycled to fit, which triggers the warnings you see.)
M <- read_csv("_data/WNBA_Data_2017_M.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Atlanta = col_double(),
## Chicago = col_double(),
## Connecticut = col_double(),
## Dallas = col_double(),
## Indiana = col_double(),
## `Los Angeles` = col_double(),
## Minnesota = col_double(),
## `New York` = col_double(),
## Phoenix = col_double(),
## `San Antonio` = col_double(),
## Seattle = col_double(),
## Washington = col_double(),
## WNBA = col_double()
## )
M <- as.matrix(M)
# Add a row of 1's
M_2 <- rbind(M, rep(1, 12))
## Warning in rbind(M, rep(1, 12)): number of columns of result is not a multiple
## of vector length (arg 2)
# Add a column of -1's
M_3 <- cbind(M_2, rep(-1, 13))
## Warning in cbind(M_2, rep(-1, 13)): number of rows of result is not a multiple
## of vector length (arg 2)
# Change the element in the lower-right corner of the matrix
M_3[13, 13] <- 1
# Print M_3
print(M_3)
## Atlanta Chicago Connecticut Dallas Indiana Los Angeles Minnesota New York
## [1,] 33 -4 -2 -3 -3 -3 -3 -3
## [2,] -4 33 -3 -3 -3 -3 -2 -3
## [3,] -2 -3 34 -3 -3 -3 -3 -4
## [4,] -3 -3 -3 34 -3 -4 -3 -3
## [5,] -3 -3 -3 -3 33 -3 -3 -3
## [6,] -3 -3 -3 -4 -3 41 -8 -3
## [7,] -3 -2 -3 -3 -3 -8 41 -3
## [8,] -3 -3 -4 -3 -3 -3 -3 34
## [9,] -3 -3 -4 -2 -3 -6 -4 -3
## [10,] -3 -3 -3 -3 -3 -3 -3 -2
## [11,] -3 -3 -3 -3 -2 -2 -3 -3
## [12,] -3 -3 -3 -4 -4 -3 -6 -4
## [13,] 1 1 1 1 1 1 1 1
## [14,] 1 1 1 1 1 1 1 1
## Phoenix San Antonio Seattle Washington WNBA
## [1,] -3 -3 -3 -3 -1 -1
## [2,] -3 -3 -3 -3 -1 -1
## [3,] -4 -3 -3 -3 -1 -1
## [4,] -2 -3 -3 -4 -1 -1
## [5,] -3 -3 -2 -4 -1 -1
## [6,] -6 -3 -2 -3 -1 -1
## [7,] -4 -3 -3 -6 -1 -1
## [8,] -3 -2 -3 -4 -1 -1
## [9,] 38 -3 -4 -3 -1 -1
## [10,] -3 32 -4 -2 -1 -1
## [11,] -4 -4 33 -3 -1 -1
## [12,] -3 -2 -3 38 -1 -1
## [13,] 1 1 1 1 1 -1
## [14,] 1 1 1 1 1 -1
Awesome! You’re learning how to manipulate matrices in R!
Inverting the Massey Matrix
Now that you’ve updated the Massey Matrix \(M\), you’re almost ready to find out how the teams stacked up at the end of the 2017 season.
First, we need to make sure that the updated \(M\) has an inverse.
# Find the inverse of M
solve(M)
## [,1] [,2] [,3] [,4] [,5]
## Atlanta 0.032449804 0.005402927 0.003876665 0.004630004 0.004629590
## Chicago 0.005402927 0.032446789 0.004608094 0.004626913 0.004628272
## Connecticut 0.003876665 0.004608094 0.031714805 0.004613451 0.004629714
## Dallas 0.004630004 0.004626913 0.004613451 0.031707219 0.004649172
## Indiana 0.004629590 0.004628272 0.004629714 0.004649172 0.032447936
## Los Angeles 0.004626242 0.004554829 0.004676789 0.005214940 0.004652111
## Minnesota 0.004611109 0.003985203 0.004651940 0.004727810 0.004678479
## New York 0.004609212 0.004627729 0.005362761 0.004647832 0.004649262
## Phoenix 0.004610546 0.004608018 0.005295038 0.004013187 0.004613089
## San Antonio 0.004630254 0.004631081 0.004608596 0.004609009 0.004587382
## Seattle 0.004629212 0.004631185 0.004646217 0.004595132 0.003854641
## Washington 0.004627769 0.004582295 0.004649264 0.005298666 0.005313685
## WNBA -0.083333333 -0.083333333 -0.083333333 -0.083333333 -0.083333333
## [,6] [,7] [,8] [,9] [,10]
## Atlanta 0.004626242 0.004611109 0.004609212 0.004610546 0.004630254
## Chicago 0.004554829 0.003985203 0.004627729 0.004608018 0.004631081
## Connecticut 0.004676789 0.004651940 0.005362761 0.005295038 0.004608596
## Dallas 0.005214940 0.004727810 0.004647832 0.004013187 0.004609009
## Indiana 0.004652111 0.004678479 0.004649262 0.004613089 0.004587382
## Los Angeles 0.027807608 0.007319076 0.004637275 0.006363490 0.004606288
## Minnesota 0.007319076 0.027810474 0.004677632 0.005388578 0.004578013
## New York 0.004637275 0.004677632 0.031716432 0.004648253 0.003835528
## Phoenix 0.006363490 0.005388578 0.004648253 0.029212019 0.004646110
## San Antonio 0.004606288 0.004578013 0.003835528 0.004646110 0.033267202
## Seattle 0.004032687 0.004573214 0.004607331 0.005265228 0.005427397
## Washington 0.004841998 0.006331805 0.005314087 0.004669776 0.003906474
## WNBA -0.083333333 -0.083333333 -0.083333333 -0.083333333 -0.083333333
## [,11] [,12] [,13]
## Atlanta 0.004629212 0.004627769 8.333333e-02
## Chicago 0.004631185 0.004582295 8.333333e-02
## Connecticut 0.004646217 0.004649264 8.333333e-02
## Dallas 0.004595132 0.005298666 8.333333e-02
## Indiana 0.003854641 0.005313685 8.333333e-02
## Los Angeles 0.004032687 0.004841998 8.333333e-02
## Minnesota 0.004573214 0.006331805 8.333333e-02
## New York 0.004607331 0.005314087 8.333333e-02
## Phoenix 0.005265228 0.004669776 8.333333e-02
## San Antonio 0.005427397 0.003906474 8.333333e-02
## Seattle 0.032485332 0.004585756 8.333333e-02
## Washington 0.004585756 0.029211757 8.333333e-02
## WNBA -0.083333333 -0.083333333 2.220446e-16
Awesome! You found the inverse of a real-world matrix!
2017 WNBA Ratings!
Now that we have our Massey matrix in a form that is invertible, we can use it to find the ratings for the WNBA teams at the conclusion of the 2017 season!
M and an updated version of f are loaded for you.
f <- as.matrix(f)
# Solve for r and rename column
r <- solve(M)%*%f
colnames(r) <- "Rating"
# Print r
print(r)
## Rating
## Atlanta -4.012938e+00
## Chicago -5.156260e+00
## Connecticut 4.309525e+00
## Dallas -2.608129e+00
## Indiana -8.532958e+00
## Los Angeles 7.850327e+00
## Minnesota 1.061241e+01
## New York 2.541565e+00
## Phoenix 8.979110e-01
## San Antonio -6.181574e+00
## Seattle -2.666953e-01
## Washington 5.468121e-01
## WNBA 1.043610e-14
Awesome! You solved a matrix-vector equation in a real-world setting!
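As a sanity check, the row of 1's appended to \(M\) forces the 12 team ratings to sum to zero, so the sum below should be on the order of machine precision:
# Sum the 12 team ratings (excluding the WNBA bookkeeping entry)
sum(r[1:12, 1])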
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
as.data.frame(r) %>% arrange(desc(Rating))
## Rating
## Minnesota 1.061241e+01
## Los Angeles 7.850327e+00
## Connecticut 4.309525e+00
## New York 2.541565e+00
## Phoenix 8.979110e-01
## Washington 5.468121e-01
## WNBA 1.043610e-14
## Seattle -2.666953e-01
## Dallas -2.608129e+00
## Atlanta -4.012938e+00
## Chicago -5.156260e+00
## San Antonio -6.181574e+00
## Indiana -8.532958e+00
The best team was Minnesota.
Alternatives to the Regular Matrix Inverse
In the last video, we talked briefly about generalized or pseudo-inverses.
In this exercise, you’ll use the Moore-Penrose generalized inverse in the `MASS’ package and find that it produces the regular inverse if the matrix you’re working with is already invertible!
The command ginv()
computes the Moore-Penrose generalized inverse in R.
For more information on the Moore-Penrose inverse, see the following link.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Print M
print(M)
## Atlanta Chicago Connecticut Dallas Indiana Los Angeles Minnesota New York
## [1,] 33 -4 -2 -3 -3 -3 -3 -3
## [2,] -4 33 -3 -3 -3 -3 -2 -3
## [3,] -2 -3 34 -3 -3 -3 -3 -4
## [4,] -3 -3 -3 34 -3 -4 -3 -3
## [5,] -3 -3 -3 -3 33 -3 -3 -3
## [6,] -3 -3 -3 -4 -3 41 -8 -3
## [7,] -3 -2 -3 -3 -3 -8 41 -3
## [8,] -3 -3 -4 -3 -3 -3 -3 34
## [9,] -3 -3 -4 -2 -3 -6 -4 -3
## [10,] -3 -3 -3 -3 -3 -3 -3 -2
## [11,] -3 -3 -3 -3 -2 -2 -3 -3
## [12,] -3 -3 -3 -4 -4 -3 -6 -4
## [13,] 1 1 1 1 1 1 1 1
## Phoenix San Antonio Seattle Washington WNBA
## [1,] -3 -3 -3 -3 -1
## [2,] -3 -3 -3 -3 -1
## [3,] -4 -3 -3 -3 -1
## [4,] -2 -3 -3 -4 -1
## [5,] -3 -3 -2 -4 -1
## [6,] -6 -3 -2 -3 -1
## [7,] -4 -3 -3 -6 -1
## [8,] -3 -2 -3 -4 -1
## [9,] 38 -3 -4 -3 -1
## [10,] -3 32 -4 -2 -1
## [11,] -4 -4 33 -3 -1
## [12,] -3 -2 -3 38 -1
## [13,] 1 1 1 1 1
# Find the rating vector the conventional way
r <- solve(M)%*%f
colnames(r) <- "Rating"
print(r)
## Rating
## Atlanta -4.012938e+00
## Chicago -5.156260e+00
## Connecticut 4.309525e+00
## Dallas -2.608129e+00
## Indiana -8.532958e+00
## Los Angeles 7.850327e+00
## Minnesota 1.061241e+01
## New York 2.541565e+00
## Phoenix 8.979110e-01
## San Antonio -6.181574e+00
## Seattle -2.666953e-01
## Washington 5.468121e-01
## WNBA 1.043610e-14
# Find the rating vector using ginv
r <- ginv(M)%*%f
colnames(r) <- "Rating"
print(r)
## Rating
## [1,] -4.012938e+00
## [2,] -5.156260e+00
## [3,] 4.309525e+00
## [4,] -2.608129e+00
## [5,] -8.532958e+00
## [6,] 7.850327e+00
## [7,] 1.061241e+01
## [8,] 2.541565e+00
## [9,] 8.979110e-01
## [10,] -6.181574e+00
## [11,] -2.666953e-01
## [12,] 5.468121e-01
## [13,] 5.773160e-14
Great work! You have found the final 2017 WNBA ratings in two ways!
Scaling Different Axes
Suppose that you wanted to transform a two-dimensional vector so that the first element doubled in size, while the second element was cut by a third.
You can do this with matrix multiplication, as the matrix A, with R output:
[,1] [,2]
[1,] 2 0.0000000
[2,] 0 0.6666667
achieves the desired result.
A <- matrix(c(2, 0, 0.00, 2/3), nrow = 2, ncol = 2)
# Multiply A by the given vector
print(A%*%c(1, 1))
## [,1]
## [1,] 2.0000000
## [2,] 0.6666667
Awesome! You’ve multiplied by a matrix that contracts and stretches elements of a vector!
Eigenvectors can be scaled freely (they are about direction, not magnitude).
An eigenvector is mapped to a scalar multiple of itself ("own") when multiplied by \(A\)!
Finding Eigenvalues in R
In the next video we will explicitly see how to find the eigenpairs for a matrix \(A\), but right now we can at least show that a pair is an eigenpair for a matrix \(A\). We can do this by showing that the difference between \(A\vec{v}\) and \(\lambda\vec{v}\) results in a vector of zeros.
The matrix A, with R output:
[,1] [,2] [,3]
[1,] -1 2 4
[2,] 0 7 12
[3,] 0 0 -4
is loaded for you.
A <- matrix(c(-1,0,0,2,7,0,4,12,-4), nrow = 3, ncol = 3)
# Show that 7 is an eigenvalue for A
A%*%c(0.2425356, 0.9701425, 0) - 7*c(0.2425356, 0.9701425, 0)
## [,1]
## [1,] 2e-07
## [2,] 0e+00
## [3,] 0e+00
# Show that -4 is an eigenvalue for A
A%*%c(-0.3789810, -0.6821657, 0.6253186) - -4*c(-0.3789810, -0.6821657, 0.6253186)
## [,1]
## [1,] -2.220446e-16
## [2,] 5.000000e-07
## [3,] 0.000000e+00
# Show that -1 is an eigenvalue for A
A%*%c(1, 0, 0) - -1*c(1, 0, 0)
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
Awesome! You’ve learned how to determine if a given pair is an eigenpair of a matrix. Note that there are often rounding errors that occur in R and, in this case, what you’re seeing is an approximation of the zero vector.
Scalar Multiplies of Eigenvectors are Eigenvectors
As we said in the videos, an eigenvector associated with a matrix \(A\) can be scaled to meet the needs of the problem at hand. For example, for Markov models, having all of the elements sum to 1 means that the elements are probabilities, and thus have a clear interpretation.
In this exercise, we will be working with the first eigenpair in the previous exercise. For matrix \(A\), this eigenpair has an eigenvalue \(\lambda = 7\) and eigenvector:
[,1]
[1,] 0.2425356
[2,] 0.9701425
[3,] 0.0000000
# Show that double an eigenvector is still an eigenvector
A%*%((2)*c(0.2425356, 0.9701425, 0)) - 7*(2)*c(0.2425356, 0.9701425, 0)
## [,1]
## [1,] 4e-07
## [2,] 0e+00
## [3,] 0e+00
# Show half of an eigenvector is still an eigenvector
A%*%((0.5)*c(0.2425356, 0.9701425, 0)) - 7*(0.5)*c(0.2425356, 0.9701425, 0)
## [,1]
## [1,] 1e-07
## [2,] 0e+00
## [3,] 0e+00
Awesome! You’ve shown that a scalar multiple of an eigenvector is still an eigenvector of a matrix! Note that often times there are rounding errors that occur in R, and in this case what you’re seeing is an approximation of the zero vector.
Verifying the Math on Eigenvalues
In this exercise you’ll find the eigenvalues \(\lambda\) of a matrix \(A\), and show that they satisfy the property that the matrix \(\lambda I - A\) is not invertible, with determinant equal to zero.
A <- matrix(c(1,1,2,1), nrow = 2, ncol = 2)
# Compute the eigenvalues of A and store in Lambda
Lambda <- eigen(A)
# Print eigenvalues
print(Lambda$values[1])
## [1] 2.414214
print(Lambda$values[2])
## [1] -0.4142136
# Verify that these numbers satisfy the conditions of being an eigenvalue
det(Lambda$values[1]*diag(2) - A)
## [1] -3.140185e-16
det(Lambda$values[2]*diag(2) - A)
## [1] -3.140185e-16
Great! You showed that eigenvalues indeed satisfy the conditions they are meant to. Note that rounding errors often occur in R, and in this case what you're seeing is an approximation of zero.
Computing Eigenvectors in R
In this exercise you’ll find the eigenvectors of a matrix, and show that they satisfy the properties discussed in the lecture.
# Find the eigenvectors of A and store them in Lambda
Lambda <- eigen(A)
# Print eigenvectors
print(Lambda$vectors[, 1])
## [1] 0.8164966 0.5773503
print(Lambda$vectors[, 2])
## [1] -0.8164966 0.5773503
# Verify that these eigenvectors & their associated eigenvalues satisfy lambda*v - A*v = 0
Lambda$values[1]*Lambda$vectors[, 1] - A%*%Lambda$vectors[, 1]
## [,1]
## [1,] 0
## [2,] 0
Lambda$values[2]*Lambda$vectors[, 2] - A%*%Lambda$vectors[, 2]
## [,1]
## [1,] -1.110223e-16
## [2,] 8.326673e-17
Awesome! You showed that eigenvectors indeed satisfy the conditions they are meant to. Note that rounding errors often occur in R, and in this case what you're seeing is an approximation of the zero vector.
Eigenvalue Ordering
R displays the eigenvalues in decreasing order of magnitude because the largest eigenvalue(s) increasingly dominate the action of the matrix the more times it is applied.
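A quick sketch of this dominance, using a hypothetical diagonal matrix with eigenvalues 2 and 0.5: repeatedly applying the matrix to a generic vector lines it up with the eigenvector of the largest eigenvalue.
A <- matrix(c(2, 0, 0, 0.5), 2, 2)
x <- c(1, 1)
# Apply A twenty times, normalizing so the vector stays a manageable size
for (j in 1:20) {x <- A%*%x; x <- x/sqrt(sum(x^2))}
print(x)    # approximately (1, 0), the direction of the leading eigenvector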
Markov Models for Allele Frequencies
In the lecture, you saw that the leading eigenvalue of the Markov matrix \(M\), whose R output is:
[,1] [,2] [,3] [,4]
[1,] 0.980 0.005 0.005 0.010
[2,] 0.005 0.980 0.010 0.005
[3,] 0.005 0.010 0.980 0.005
[4,] 0.010 0.005 0.005 0.980
produced an eigenvector modeling a situation where the alleles are represented equally (each with probability 0.25).
In this exercise, we use a for-loop to iterate the process of mutation from an initial allele distribution of:
[1] 1 0 0 0
and show that this is indeed what happens - that the eigenvector yields the same information without the need for the for-loop.
For more on Markov Processes, see this link.
# This code iterates mutation 1000 times
M <- matrix(c(.98,.005,.005,.01,.005,.98,.01,.005,.005,.01,.98,.005,.01,.005,.005,.98), nrow = 4, ncol = 4)
x <- c(1, 0, 0, 0)
for (j in 1:1000) {x <- M%*%x}
# Print x
print(x)
## [,1]
## [1,] 0.25
## [2,] 0.25
## [3,] 0.25
## [4,] 0.25
# Print and scale first eigenvector of M
Lambda <- eigen(M)
v1 <- Lambda$vectors[, 1]/sum(Lambda$vectors[, 1])
# Print v1
print(v1)
## [1] 0.25 0.25 0.25 0.25
Great! You have seen how eigenanalyses can help us understand genomics!
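One more check worth knowing: the columns of a Markov (stochastic) matrix each sum to 1, which forces its leading eigenvalue to be 1. The sketch below should confirm this for M, up to floating-point rounding:
# Each column of M sums to 1, so the largest eigenvalue should be 1
colSums(M)
eigen(M)$values[1]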
Finding Redundancies
One of the important things that principal component analysis can do is reduce redundancy in your dataset. In its simplest manifestation, redundancy occurs when two variables are correlated.
The Pearson correlation coefficient is a number between -1 and 1. Coefficients near zero indicate that two variables have little to no linear relationship, while coefficients near -1 or 1 indicate that two variables are strongly linearly related.
The dataset combine has been loaded for you.
combine <- read.csv("_data/DataCampCombine.csv", stringsAsFactors = TRUE)
# Print the first 6 observations of the dataset
head(combine, n = 6)
## player position school year height weight forty vertical
## 1 Jaire Alexander CB Louisville 2018 71 192 4.38 35.0
## 2 Brian Allen C Michigan St. 2018 73 298 5.34 26.5
## 3 Mark Andrews TE Oklahoma 2018 77 256 4.67 31.0
## 4 Troy Apke S Penn St. 2018 74 198 4.34 41.0
## 5 Dorance Armstrong EDGE Kansas 2018 76 257 4.87 30.0
## 6 Ade Aruna DE Tulane 2018 78 262 4.60 38.5
## bench broad_jump three_cone shuttle
## 1 14 127 6.71 3.98
## 2 27 99 7.81 4.71
## 3 17 113 7.34 4.38
## 4 16 131 6.56 4.03
## 5 20 118 7.12 4.23
## 6 18 128 7.53 4.48
## drafted
## 1 Green Bay Packers / 1st / 18th pick / 2018
## 2 Los Angeles Rams / 4th / 111th pick / 2018
## 3 Baltimore Ravens / 3rd / 86th pick / 2018
## 4
## 5
## 6 Minnesota Vikings / 6th / 218th pick / 2018
# Find the correlation between variables forty and three_cone
cor(combine$forty, combine$three_cone)
## [1] 0.8315171
# Find the correlation between variables vertical and broad_jump
cor(combine$vertical, combine$broad_jump)
## [1] 0.8163375
There are at least two redundancies in this dataset, and PCA can help us properly handle these.
First, subtract the mean from each column.
In the video's example, the first eigenvalue contains all of the information (variance) in the matrix. As we can see, the first column is simply 0.5 times the second column, so the two are perfectly correlated.
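A toy sketch of that example, with made-up data: when one column is exactly 0.5 times another, the covariance matrix has only one nonzero eigenvalue.
# Column 1 is 0.5 times column 2, so the two are perfectly correlated
X <- cbind(0.5*(1:10), 1:10)
# Subtract the mean from each column, then build the covariance matrix
X <- apply(X, 2, function(x) x - mean(x))
C <- t(X)%*%X/(nrow(X) - 1)
eigen(C)$values    # the second eigenvalue is (numerically) zero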
Standardizing Your Data
In the lecture, we saw that you can learn a lot about a dataset by creating the matrix \(A^TA\) from it. In this exercise, you'll do that with athletic data for players entering the National Football League draft. The dataset combine is loaded for you.
# Extract columns 5-12 of combine
A <- combine[, 5:12]
# Make A into a matrix
A <- as.matrix(A)
# Subtract the mean of each column
A <- apply(A, 2, function(x) x - mean(x))
Terrific! You've successfully pre-processed the combine dataset.
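An equivalent shortcut, in case it's useful: R's built-in scale() performs the same column-centering when its scaling step is turned off.
# center = TRUE subtracts each column mean; scale = FALSE skips dividing by the SDs
A_centered <- scale(as.matrix(combine[, 5:12]), center = TRUE, scale = FALSE)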
Variance/Covariance Calculations
In the last exercise you pre-processed the combine data into a matrix A. We now use A to create the variance-covariance matrix of this dataset.
# Create the variance-covariance matrix B = t(A) %*% A / (n - 1)
B <- t(A)%*%A/(nrow(A) - 1)
# Compare 1st element of the 1st column of B to the variance of the first column of A
B[1,1]
## [1] 7.159794
var(A[, 1])
## [1] 7.159794
# Compare B[1, 2] and B[2, 1] to the covariance between the first two columns of A
B[1, 2]
## [1] 90.78808
B[2, 1]
## [1] 90.78808
cov(A[, 1], A[, 2])
## [1] 90.78808
Great stuff! You’re understanding the structure of the variance-covariance matrix of a dataset.
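As a cross-check, R's built-in cov() computes this entire variance-covariance matrix in one call, so the difference below should be zero up to floating-point error:
# Compare the hand-built B to R's built-in covariance matrix
max(abs(B - cov(A)))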
Eigenanalyses of Combine Data
Now that you have looked at and understand the contents of B, you can study its eigenvalues and eigenvectors to see if there is a chance to reduce the dimension of your data down from eight columns to something smaller. B is loaded for you.
# Find eigenvalues of B
V <- eigen(B)
# Print eigenvalues
V$values
## [1] 2.187628e+03 4.403246e+01 2.219205e+01 5.267129e+00 2.699702e+00
## [6] 6.317016e-02 1.480866e-02 1.307283e-02
Great work. You’ve extracted a great deal of context about this dataset!
A great deal of the variability in the athletic profile of future NFL players can be attributed to one linear combination of the data!
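To quantify that claim, divide the leading eigenvalue by the total. From the values printed above, the first component accounts for roughly 97% of the total variance:
# Proportion of total variance captured by the first eigenvalue
V$values[1]/sum(V$values)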
Scaling Data Before PCA
When dealing with data whose features have different scales, it's often important to scale the data first. This is because features with larger values can dominate the analysis even when they have relatively little variability.
The combine data frame is loaded for you.
# Scale columns 5-12 of combine
B <- scale(combine[, 5:12])
# Print the first 6 rows of the data
head(B)
## height weight forty vertical bench broad_jump
## [1,] -1.11844839 -1.30960025 -1.3435337 0.5624657 -1.1089286 1.45502476
## [2,] -0.37100257 1.00066356 1.6449741 -1.4281627 0.9238361 -1.49512459
## [3,] 1.12388907 0.08527601 -0.4407553 -0.3743006 -0.6398290 -0.02004991
## [4,] 0.00272034 -1.17883060 -1.4680548 1.9676151 -0.7961955 1.87647467
## [5,] 0.75016616 0.10707096 0.1818505 -0.6084922 -0.1707295 0.50676247
## [6,] 1.49761199 0.21604566 -0.6586673 1.3821362 -0.4834625 1.56038724
## three_cone shuttle
## [1,] -1.38083506 -1.5879750
## [2,] 1.16888714 1.1170258
## [3,] 0.07946038 -0.1057828
## [4,] -1.72852445 -1.4027010
## [5,] -0.43048406 -0.6616049
## [6,] 0.51986694 0.2647653
# Summarize the principal component analysis
summary(prcomp(B))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3679 0.9228 0.78904 0.61348 0.46811 0.37178 0.34834
## Proportion of Variance 0.7009 0.1064 0.07782 0.04704 0.02739 0.01728 0.01517
## Cumulative Proportion 0.7009 0.8073 0.88514 0.93218 0.95957 0.97685 0.99202
## PC8
## Standard deviation 0.25266
## Proportion of Variance 0.00798
## Cumulative Proportion 1.00000
Awesome! You can perform PCA in R.
Summarizing PCA in R
As we saw in the video, there was a categorical variable (position) in our data that seemed to identify itself with clusters in the first two principal components. Even when scaling the data, these two PCs still explain a great deal of variation in the data. What if we looked at only one position at a time?
# Subset combine only to "WR"
combine_WR <- subset(combine, position == "WR")
# Scale columns 5-12 of combine_WR
B <- scale(combine_WR[, 5:12])
# Print the first 6 rows of the data
head(B)
## height weight forty vertical bench broad_jump
## 7 1.4022982 0.88324903 1.20674474 -0.3430843 -0.3223377 0.07414249
## 17 0.5575402 -0.09700717 -0.80129388 -0.4969965 -0.7938424 -0.95388361
## 18 0.9799192 1.58343202 0.88968601 1.0421255 0.8564239 1.61618163
## 25 0.9799192 1.16332222 1.41811723 -1.5743819 -0.7938424 -1.29655897
## 29 -1.1319757 -1.56739147 -0.80129388 -0.1891721 -0.0865854 -1.29655897
## 46 0.1351613 0.11304773 0.04419607 0.2725645 -1.0295947 0.24548017
## three_cone shuttle
## 7 0.712845019 0.02833449
## 17 -1.098542478 0.84141123
## 18 -1.853287268 -1.46230619
## 25 -1.148858797 0.50262926
## 29 0.008416548 -0.64922946
## 46 0.109049187 0.84141123
# Summarize the principal component analysis
summary(prcomp(B))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5425 1.4255 1.0509 0.9603 0.77542 0.63867 0.59792
## Proportion of Variance 0.2974 0.2540 0.1380 0.1153 0.07516 0.05099 0.04469
## Cumulative Proportion 0.2974 0.5514 0.6894 0.8047 0.87987 0.93085 0.97554
## PC8
## Standard deviation 0.44235
## Proportion of Variance 0.02446
## Cumulative Proportion 1.00000
Once a major variable is removed, there’s a lot more structure to the data’s principal components!
Great work! You’ve used PCA to wrangle data, and then applied PCA again!
Wrap-up