[Video]
# Creating three 3's and four 4's, respectively
rep(3, 3)
## [1] 3 3 3
rep(4, 4)
## [1] 4 4 4 4
# Creating a vector with the first three even numbers and the first three odd numbers
seq(2, 6, by = 2)
## [1] 2 4 6
seq(1, 5, by = 2)
## [1] 1 3 5
# Re-creating the previous four vectors using the 'c' command
c(3, 3, 3)
## [1] 3 3 3
c(4, 4, 4, 4)
## [1] 4 4 4 4
c(2, 4, 6)
## [1] 2 4 6
c(1, 3, 5)
## [1] 1 3 5
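The chunks that follow operate on vectors x, y, and z that were preloaded by the exercise. Judging from the printed results (including the recycling warning further down), they could have been created as follows (an assumption inferred from the outputs, not shown in the original):
# Reconstructed from the outputs below (an assumption):
x <- 1:7           # 1 2 3 4 5 6 7
y <- 2 * (1:7)     # 2 4 6 8 10 12 14
z <- c(1, 1, 2)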
# Add x to y and print
print(x + y)
## [1] 3 6 9 12 15 18 21
# Multiply z by 2 and print
print(2*z)
## [1] 2 2 4
# Multiply x and y by each other and print
print(x*y)
## [1] 2 8 18 32 50 72 98
# Add x to z, if possible, and print
print(x + z)
## Warning in x + z: longer object length is not a multiple of shorter object
## length
## [1] 2 3 5 5 6 8 8
# Create a matrix of all 1's and all 2's that are 2 by 3 and 3 by 2, respectively
matrix(1, nrow = 2, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
print(matrix(2, nrow = 3, ncol = 2))
## [,1] [,2]
## [1,] 2 2
## [2,] 2 2
## [3,] 2 2
# Create a matrix, changing the byrow designation
B <- matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = FALSE)
B <- matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = TRUE)
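The sum below also involves a matrix A that was preloaded by the exercise; for the printed result to hold, A would have to be the 2-by-2 matrix of all 1's (an assumption inferred from the output):
# Inferred from the printed sum below (an assumption, not in the original):
A <- matrix(1, nrow = 2, ncol = 2)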
# Add A to the previously-created matrix
A + B
## [,1] [,2]
## [1,] 2 3
## [2,] 4 3
[Video]
Consider the matrix A created by the R code:
A = matrix(c(1, 2, 3, -1, 0, 3), nrow = 2, ncol = 3, byrow = TRUE)
Which of the following vectors b can be multiplied by A to create Ab?
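Since A has two rows and three columns, A %*% b is defined only for a vector b with three entries, and the product is a 2 x 1 matrix. (The particular b, B, and C used in the chunks below were preloaded by the exercise, with different values across chunks.) A quick check with a hypothetical b:
# Hypothetical length-3 vector, for illustration only
b <- c(1, 0, 1)
A %*% b   # a 2 x 1 matrix: (1 + 0 + 3, -1 + 0 + 3) = (4, 2)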
# Multiply A by b
A%*%b
## [,1]
## [1,] 4
## [2,] 1
# Multiply B by b
B%*%b
## [,1]
## [1,] 0.000000
## [2,] 1.666667
# Multiply A by b
A%*%b
## [,1]
## [1,] -2
## [2,] 1
# Multiply B by b
B%*%b
## [,1]
## [1,] 2
## [2,] -1
# Multiply C by b
C%*%b
## [,1]
## [1,] -8
## [2,] -2
[Video]
The two matrices generated by the R code below are (small) examples of the weight matrices used in neural network models to weight datasets for prediction:
A = matrix(c(1, 3, 2, -1, 0, 1), nrow = 2, ncol = 3)
B = matrix(c(-1, 1, 2, -3), nrow = 2, ncol = 2)
Oftentimes, these collections of weights are applied iteratively, using successive applications of matrix multiplication.
Are A and B compatible in any way in terms of matrix multiplication? Use A%*%B and B%*%A in the console to check. What are the dimensions of the resulting matrix?
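Note that with the A (2 x 3) and B (2 x 2) defined above, B %*% A is conformable (a 2 x 2 times a 2 x 3 gives a 2 x 3 matrix), while A %*% B is not. The 0.7071068 entries printed below equal 1/sqrt(2), so those chunks evidently ran against a different pair of 2 x 2 matrices loaded in the exercise environment. A sketch with the definitions as literally written:
# With the definitions above:
A <- matrix(c(1, 3, 2, -1, 0, 1), nrow = 2, ncol = 3)
B <- matrix(c(-1, 1, 2, -3), nrow = 2, ncol = 2)
try(A %*% B)   # error: non-conformable arguments
B %*% A        # works: a 2 x 3 matrix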
# Multiply A by B
A%*%B
## [,1] [,2]
## [1,] 0.7071068 0.7071068
## [2,] 0.7071068 -0.7071068
# Multiply A on the right of B
B%*%A
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] -0.7071068 -0.7071068
# Multiply the product of A and B by the vector b
A%*%B%*%b
## [,1]
## [1,] 1.414214
## [2,] 0.000000
# Multiply A on the right of B, and then by the vector b
B%*%A%*%b
## [,1]
## [1,] 0.000000
## [2,] -1.414214
# Take the inverse of the 2 by 2 identity matrix
solve(diag(2))
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
# Take the inverse of the matrix A
Ainv <- solve(A)
# Multiply A inverse by A
Ainv%*%A
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
# Multiply A by its inverse
A%*%Ainv
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
[Video]
A great deal of applied mathematics and statistics, as well as data science, ends in a matrix-vector equation of the form:
Ax = b
Which of the following is the most correct way to describe what solving this equation for x is trying to accomplish?
- Find the x that, upon some mysterious transformation, makes b.
- Find the x that is a linear combination of the elements of b.
- Create b using a linear combination of the columns of A.
- Create b using a linear combination of the rows of A.
# Print the Massey Matrix M
print(M)
## Atlanta Chicago Connecticut Dallas Indiana Los.Angeles Minnesota New.York
## 1 33 -4 -2 -3 -3 -3 -3 -3
## 2 -4 33 -3 -3 -3 -3 -2 -3
## 3 -2 -3 34 -3 -3 -3 -3 -4
## 4 -3 -3 -3 34 -3 -4 -3 -3
## 5 -3 -3 -3 -3 33 -3 -3 -3
## 6 -3 -3 -3 -4 -3 41 -8 -3
## 7 -3 -2 -3 -3 -3 -8 41 -3
## 8 -3 -3 -4 -3 -3 -3 -3 34
## 9 -3 -3 -4 -2 -3 -6 -4 -3
## 10 -3 -3 -3 -3 -3 -3 -3 -2
## 11 -3 -3 -3 -3 -2 -2 -3 -3
## 12 -3 -3 -3 -4 -4 -3 -6 -4
## Phoenix San.Antonio Seattle Washington
## 1 -3 -3 -3 -3
## 2 -3 -3 -3 -3
## 3 -4 -3 -3 -3
## 4 -2 -3 -3 -4
## 5 -3 -3 -2 -4
## 6 -6 -3 -2 -3
## 7 -4 -3 -3 -6
## 8 -3 -2 -3 -4
## 9 38 -3 -4 -3
## 10 -3 32 -4 -2
## 11 -4 -4 33 -3
## 12 -3 -2 -3 38
# Print the vector of point differentials f
print(f)
## Differential
## 1 -135
## 2 -171
## 3 152
## 4 -104
## 5 -308
## 6 292
## 7 420
## 8 83
## 9 -4
## 10 -213
## 11 -5
## 12 -7
## 13 0
# Find the sum of the first column of M
sum(M[, 1])
## [1] 0
# Find the sum of the vector f
sum(f)
## [1] 0
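In fact, every column of M sums to zero, not only the first: each team's games-played count on the diagonal is exactly cancelled by the games-against counts in the rest of the column. Similarly, sum(f) is zero because every point scored by one team is a point allowed by another. The zero column sums mean the rows of M are linearly dependent, which is why (as the next exercise shows) M has no inverse. A quick check, assuming the M printed above:
# All twelve column sums are zero, so M cannot be inverted
colSums(M)
qr(as.matrix(M))$rank   # expect 11, one short of full rank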
[Video]
For our WNBA Massey Matrix model, some adjustments need to be made for a solution to our rating problem to exist and be unique.
To see this, notice that the following code produces an error:
# Attempt to invert the 12 x 12 Massey matrix M printed above
solve(M)
## Error in solve.default(M) : system is computationally singular: reciprocal
## condition number = 3.06615e-17
Which of the conditions does M explicitly violate in this case?
In two dimensions, the solution structure of a system of two equations in two unknowns can be understood in a straightforward way via pictures, with the two equations representing lines (this is why it’s called linear algebra) in the x-y (or x1-x2) plane. A solution is any point (x, y) (equivalently, (x1, x2)) where the two lines intersect.
Which of the following three graphs is that of a linear system of two equations with two unknowns that has no solutions?
For our WNBA Massey Matrix model, some adjustments need to be made for a solution to our rating problem to exist and be unique. This is because the matrix M, printed in a previous exercise, usually does not (computationally) have an inverse, as shown by the error produced from running solve(M).
One way we can change this is to add a row of 1’s on the bottom of the matrix M, a column of -1’s to the far right of M, and a 0 to the bottom of the vector of point differentials f⃗.
What does that row of 1’s represent in the setting of rating teams? In other words, what does the final equation stipulate?
# Add a row of 1's
M_2 <- rbind(M, rep(1, 12))
# Add a column of -1's
M_3 <- cbind(M_2, rep(-1, 13))
# Change the element in the lower-right corner of the matrix
M_3[13, 13] <- 1
# Print M_3
print(M_3)
## Atlanta Chicago Connecticut Dallas Indiana Los.Angeles Minnesota New.York
## 1 33 -4 -2 -3 -3 -3 -3 -3
## 2 -4 33 -3 -3 -3 -3 -2 -3
## 3 -2 -3 34 -3 -3 -3 -3 -4
## 4 -3 -3 -3 34 -3 -4 -3 -3
## 5 -3 -3 -3 -3 33 -3 -3 -3
## 6 -3 -3 -3 -4 -3 41 -8 -3
## 7 -3 -2 -3 -3 -3 -8 41 -3
## 8 -3 -3 -4 -3 -3 -3 -3 34
## 9 -3 -3 -4 -2 -3 -6 -4 -3
## 10 -3 -3 -3 -3 -3 -3 -3 -2
## 11 -3 -3 -3 -3 -2 -2 -3 -3
## 12 -3 -3 -3 -4 -4 -3 -6 -4
## 13 1 1 1 1 1 1 1 1
## Phoenix San.Antonio Seattle Washington rep(-1, 13)
## 1 -3 -3 -3 -3 -1
## 2 -3 -3 -3 -3 -1
## 3 -4 -3 -3 -3 -1
## 4 -2 -3 -3 -4 -1
## 5 -3 -3 -2 -4 -1
## 6 -6 -3 -2 -3 -1
## 7 -4 -3 -3 -6 -1
## 8 -3 -2 -3 -4 -1
## 9 38 -3 -4 -3 -1
## 10 -3 32 -4 -2 -1
## 11 -4 -4 33 -3 -1
## 12 -3 -2 -3 38 -1
## 13 1 1 1 1 1
# Find the inverse of the adjusted matrix M_3
solve(M_3)
## [,1] [,2] [,3] [,4] [,5]
## Atlanta 0.032449804 0.005402927 0.003876665 0.004630004 0.004629590
## Chicago 0.005402927 0.032446789 0.004608094 0.004626913 0.004628272
## Connecticut 0.003876665 0.004608094 0.031714805 0.004613451 0.004629714
## Dallas 0.004630004 0.004626913 0.004613451 0.031707219 0.004649172
## Indiana 0.004629590 0.004628272 0.004629714 0.004649172 0.032447936
## Los.Angeles 0.004626242 0.004554829 0.004676789 0.005214940 0.004652111
## Minnesota 0.004611109 0.003985203 0.004651940 0.004727810 0.004678479
## New.York 0.004609212 0.004627729 0.005362761 0.004647832 0.004649262
## Phoenix 0.004610546 0.004608018 0.005295038 0.004013187 0.004613089
## San.Antonio 0.004630254 0.004631081 0.004608596 0.004609009 0.004587382
## Seattle 0.004629212 0.004631185 0.004646217 0.004595132 0.003854641
## Washington 0.004627769 0.004582295 0.004649264 0.005298666 0.005313685
## rep(-1, 13) -0.083333333 -0.083333333 -0.083333333 -0.083333333 -0.083333333
## [,6] [,7] [,8] [,9] [,10]
## Atlanta 0.004626242 0.004611109 0.004609212 0.004610546 0.004630254
## Chicago 0.004554829 0.003985203 0.004627729 0.004608018 0.004631081
## Connecticut 0.004676789 0.004651940 0.005362761 0.005295038 0.004608596
## Dallas 0.005214940 0.004727810 0.004647832 0.004013187 0.004609009
## Indiana 0.004652111 0.004678479 0.004649262 0.004613089 0.004587382
## Los.Angeles 0.027807608 0.007319076 0.004637275 0.006363490 0.004606288
## Minnesota 0.007319076 0.027810474 0.004677632 0.005388578 0.004578013
## New.York 0.004637275 0.004677632 0.031716432 0.004648253 0.003835528
## Phoenix 0.006363490 0.005388578 0.004648253 0.029212019 0.004646110
## San.Antonio 0.004606288 0.004578013 0.003835528 0.004646110 0.033267202
## Seattle 0.004032687 0.004573214 0.004607331 0.005265228 0.005427397
## Washington 0.004841998 0.006331805 0.005314087 0.004669776 0.003906474
## rep(-1, 13) -0.083333333 -0.083333333 -0.083333333 -0.083333333 -0.083333333
## [,11] [,12] [,13]
## Atlanta 0.004629212 0.004627769 8.333333e-02
## Chicago 0.004631185 0.004582295 8.333333e-02
## Connecticut 0.004646217 0.004649264 8.333333e-02
## Dallas 0.004595132 0.005298666 8.333333e-02
## Indiana 0.003854641 0.005313685 8.333333e-02
## Los.Angeles 0.004032687 0.004841998 8.333333e-02
## Minnesota 0.004573214 0.006331805 8.333333e-02
## New.York 0.004607331 0.005314087 8.333333e-02
## Phoenix 0.005265228 0.004669776 8.333333e-02
## San.Antonio 0.005427397 0.003906474 8.333333e-02
## Seattle 0.032485332 0.004585756 8.333333e-02
## Washington 0.004585756 0.029211757 8.333333e-02
## rep(-1, 13) -0.083333333 -0.083333333 2.220446e-16
[Video]
As we saw in the video, solving matrix-vector equations is as simple as multiplying both sides of the equation by A’s inverse, A⁻¹, should it exist. The analogy with solving linear equations like 5x = 7 is a good one.
If A⁻¹ doesn’t exist, this does not work. The equivalent analogy for linear equations would be a situation in which the coefficient in front of the x were 0, which is the only real number that does not have an inverse. Which of the following does NOT analogize in this situation?
# Solve for r and rename column
r <- solve(M)%*%f
colnames(r) <- "Rating"
# Print r
print(r)
## Rating
## Atlanta -4.012938e+00
## Chicago -5.156260e+00
## Connecticut 4.309525e+00
## Dallas -2.608129e+00
## Indiana -8.532958e+00
## Los.Angeles 7.850327e+00
## Minnesota 1.061241e+01
## New.York 2.541565e+00
## Phoenix 8.979110e-01
## San.Antonio -6.181574e+00
## Seattle -2.666953e-01
## Washington 5.468121e-01
## WNBA 1.043610e-14
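As an aside, you rarely need the explicit inverse: solve() can take the right-hand side directly, which avoids forming solve(M) and is faster and more numerically stable. A minimal sketch, assuming M and f are the numeric matrices used above:
# Solve M r = f in one step, without forming the inverse explicitly
r_direct <- solve(M, f)
head(r_direct)   # same ratings as above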
The dplyr package has been loaded for you, as has the solution to the previous question. The arrange() function in dplyr lets you reorder the rows of a data frame by the values of one of its columns.
In the previous exercise, you rated the teams at the end of the 2017 WNBA season using the solution to a matrix-vector equation.
Using the syntax
arrange(r, -Rating)
we can see which team was the best in the WNBA in 2017 (the negative sign (“-”) in front of the ordering variable Rating puts the values in descending order, as opposed to the ascending order used when Rating appears alone).
Which team was the best?
# arrange(r, -Rating)
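Since r is a matrix, arrange() needs a data frame to work with. A minimal sketch, assuming the r computed above:
# Convert the rating matrix to a data frame, then sort descending
library(dplyr)
r_df <- data.frame(Team = rownames(r), Rating = r[, "Rating"])
arrange(r_df, -Rating)   # Minnesota (rating ~10.6) comes out on top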
[Video]
Which of the following was NOT proposed as a method to solve matrix-vector equations with non-square matrices?
# Print M
print(M)
## Atlanta Chicago Connecticut Dallas Indiana Los.Angeles Minnesota New.York
## [1,] 33 -4 -2 -3 -3 -3 -3 -3
## [2,] -4 33 -3 -3 -3 -3 -2 -3
## [3,] -2 -3 34 -3 -3 -3 -3 -4
## [4,] -3 -3 -3 34 -3 -4 -3 -3
## [5,] -3 -3 -3 -3 33 -3 -3 -3
## [6,] -3 -3 -3 -4 -3 41 -8 -3
## [7,] -3 -2 -3 -3 -3 -8 41 -3
## [8,] -3 -3 -4 -3 -3 -3 -3 34
## [9,] -3 -3 -4 -2 -3 -6 -4 -3
## [10,] -3 -3 -3 -3 -3 -3 -3 -2
## [11,] -3 -3 -3 -3 -2 -2 -3 -3
## [12,] -3 -3 -3 -4 -4 -3 -6 -4
## [13,] 1 1 1 1 1 1 1 1
## Phoenix San.Antonio Seattle Washington WNBA
## [1,] -3 -3 -3 -3 -1
## [2,] -3 -3 -3 -3 -1
## [3,] -4 -3 -3 -3 -1
## [4,] -2 -3 -3 -4 -1
## [5,] -3 -3 -2 -4 -1
## [6,] -6 -3 -2 -3 -1
## [7,] -4 -3 -3 -6 -1
## [8,] -3 -2 -3 -4 -1
## [9,] 38 -3 -4 -3 -1
## [10,] -3 32 -4 -2 -1
## [11,] -4 -4 33 -3 -1
## [12,] -3 -2 -3 38 -1
## [13,] 1 1 1 1 1
# Find the rating vector the conventional way
r <- solve(M)%*%f
colnames(r) <- "Rating"
print(r)
## Rating
## Atlanta -4.012938e+00
## Chicago -5.156260e+00
## Connecticut 4.309525e+00
## Dallas -2.608129e+00
## Indiana -8.532958e+00
## Los.Angeles 7.850327e+00
## Minnesota 1.061241e+01
## New.York 2.541565e+00
## Phoenix 8.979110e-01
## San.Antonio -6.181574e+00
## Seattle -2.666953e-01
## Washington 5.468121e-01
## WNBA 1.043610e-14
# Find the rating vector using ginv() from the MASS package
r <- ginv(M)%*%f
colnames(r) <- "Rating"
print(r)
## Rating
## [1,] -4.012938e+00
## [2,] -5.156260e+00
## [3,] 4.309525e+00
## [4,] -2.608129e+00
## [5,] -8.532958e+00
## [6,] 7.850327e+00
## [7,] 1.061241e+01
## [8,] 2.541565e+00
## [9,] 8.979110e-01
## [10,] -6.181574e+00
## [11,] -2.666953e-01
## [12,] 5.468121e-01
## [13,] 5.773160e-14
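The pseudoinverse from ginv() also handles genuinely non-square systems, returning the least-squares solution. A small sketch with a hypothetical overdetermined system (A_tall and b_tall are made up for illustration):
library(MASS)
# Three equations in two unknowns: no exact solution in general
A_tall <- matrix(c(1, 1, 1, 0, 1, 2), nrow = 3, ncol = 2)
b_tall <- c(1, 2, 2)
ginv(A_tall) %*% b_tall    # least-squares solution (approx. 1.167 and 0.5)
qr.solve(A_tall, b_tall)   # the same solution via QR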
[Video]
Matrix-Vector Multiplications
Scalar multiplication: a scalar c times a vector \(\vec{x}\), written \(c\vec{x}\).
[Video]
In data science, what does the “big” in the term “big data” generally refer to?
# Print the first 6 observations of the dataset
head(combine)
## player position school year height weight forty vertical
## 1 Jaire Alexander CB Louisville 2018 71 192 4.38 35.0
## 2 Brian Allen C Michigan St. 2018 73 298 5.34 26.5
## 3 Mark Andrews TE Oklahoma 2018 77 256 4.67 31.0
## 4 Troy Apke S Penn St. 2018 74 198 4.34 41.0
## 5 Dorance Armstrong EDGE Kansas 2018 76 257 4.87 30.0
## 6 Ade Aruna DE Tulane 2018 78 262 4.60 38.5
## bench broad_jump three_cone shuttle
## 1 14 127 6.71 3.98
## 2 27 99 7.81 4.71
## 3 17 113 7.34 4.38
## 4 16 131 6.56 4.03
## 5 20 118 7.12 4.23
## 6 18 128 7.53 4.48
## drafted
## 1 Green Bay Packers / 1st / 18th pick / 2018
## 2 Los Angeles Rams / 4th / 111th pick / 2018
## 3 Baltimore Ravens / 3rd / 86th pick / 2018
## 4
## 5
## 6 Minnesota Vikings / 6th / 218th pick / 2018
# Find the correlation between variables forty and three_cone
cor(combine$forty, combine$three_cone)
## [1] 0.8315171
# Find the correlation between variables vertical and broad_jump
cor(combine$vertical, combine$broad_jump)
## [1] 0.8163375
Given the results of the previous parts of the exercise, what can you say about the dataset combine at this point?
- forty and three_cone are the only redundant variables we’ve found so far.
- vertical and broad_jump are the only redundant variables we’ve found so far.
[Video]
If the covariance between two columns of a matrix is positive and large, what can we say?
# Extract columns 5-12 of combine
A <- combine[, 5:12]
# Make A into a matrix
A <- as.matrix(A)
# Subtract the mean of each column
A[, 1] <- A[, 1] - mean(A[, 1])
A[, 2] <- A[, 2] - mean(A[, 2])
A[, 3] <- A[, 3] - mean(A[, 3])
A[, 4] <- A[, 4] - mean(A[, 4])
A[, 5] <- A[, 5] - mean(A[, 5])
A[, 6] <- A[, 6] - mean(A[, 6])
A[, 7] <- A[, 7] - mean(A[, 7])
A[, 8] <- A[, 8] - mean(A[, 8])
# Create the covariance matrix B = t(A) %*% A / (nrow(A) - 1), per the equation in the instructions
B <- t(A)%*%A/(nrow(A) - 1)
# Compare 1st element of the 1st column of B to the variance of the first column of A
B[1,1]
## [1] 7.159794
var(A[, 1])
## [1] 7.159794
# Compare 1st element of 2nd column of B to the 1st element of the 2nd row of B to the covariance between the first two columns of A
B[1, 2]
## [1] 90.78808
B[2, 1]
## [1] 90.78808
cov(A[, 1], A[, 2])
## [1] 90.78808
# Find the eigenvalues (and eigenvectors) of B
V <- eigen(B)
# Print eigenvalues
V$values
## [1] 2.187628e+03 4.403246e+01 2.219205e+01 5.267129e+00 2.699702e+00
## [6] 6.317016e-02 1.480866e-02 1.307283e-02
The eigenvalues of B are, when rounding to four digits,
2187.6283 44.0325 22.1921 5.2671 2.6997 0.0632 0.0148 0.0131
Roughly how much of the variability in the dataset can be explained by the first principal component? (That is, the first eigenvalue as a share of the total: 2187.6283 / 2261.9108 ≈ 97%.)
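The eight centering lines and the matrix product above can be condensed using scale() and cov(), which should reproduce the same covariance matrix (a sketch, assuming the combine data above):
# Mean-center all eight columns at once and form the covariance matrix
A_centered <- scale(as.matrix(combine[, 5:12]), center = TRUE, scale = FALSE)
B_cov <- cov(A_centered)    # equals t(A_centered) %*% A_centered / (n - 1)
eigen(B_cov)$values         # matches the eigenvalues printed above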
[Video]
# Scale columns 5-12 of combine
B <- scale(combine[, 5:12])
# Print the first 6 rows of the data
head(B)
## height weight forty vertical bench broad_jump
## [1,] -1.11844839 -1.30960025 -1.3435337 0.5624657 -1.1089286 1.45502476
## [2,] -0.37100257 1.00066356 1.6449741 -1.4281627 0.9238361 -1.49512459
## [3,] 1.12388907 0.08527601 -0.4407553 -0.3743006 -0.6398290 -0.02004991
## [4,] 0.00272034 -1.17883060 -1.4680548 1.9676151 -0.7961955 1.87647467
## [5,] 0.75016616 0.10707096 0.1818505 -0.6084922 -0.1707295 0.50676247
## [6,] 1.49761199 0.21604566 -0.6586673 1.3821362 -0.4834625 1.56038724
## three_cone shuttle
## [1,] -1.38083506 -1.5879750
## [2,] 1.16888714 1.1170258
## [3,] 0.07946038 -0.1057828
## [4,] -1.72852445 -1.4027010
## [5,] -0.43048406 -0.6616049
## [6,] 0.51986694 0.2647653
# Summarize the principal component analysis
summary(prcomp(B))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3679 0.9228 0.78904 0.61348 0.46811 0.37178 0.34834
## Proportion of Variance 0.7009 0.1064 0.07782 0.04704 0.02739 0.01728 0.01517
## Cumulative Proportion 0.7009 0.8073 0.88514 0.93218 0.95957 0.97685 0.99202
## PC8
## Standard deviation 0.25266
## Proportion of Variance 0.00798
## Cumulative Proportion 1.00000
# Subset combine only to "WR"
combine_WR <- subset(combine, position == "WR")
# Scale columns 5-12 of combine_WR
B <- scale(combine_WR[, 5:12])
# Print the first 6 rows of the data
head(B)
## height weight forty vertical bench broad_jump
## 7 1.4022982 0.88324903 1.20674474 -0.3430843 -0.3223377 0.07414249
## 17 0.5575402 -0.09700717 -0.80129388 -0.4969965 -0.7938424 -0.95388361
## 18 0.9799192 1.58343202 0.88968601 1.0421255 0.8564239 1.61618163
## 25 0.9799192 1.16332222 1.41811723 -1.5743819 -0.7938424 -1.29655897
## 29 -1.1319757 -1.56739147 -0.80129388 -0.1891721 -0.0865854 -1.29655897
## 46 0.1351613 0.11304773 0.04419607 0.2725645 -1.0295947 0.24548017
## three_cone shuttle
## 7 0.712845019 0.02833449
## 17 -1.098542478 0.84141123
## 18 -1.853287268 -1.46230619
## 25 -1.148858797 0.50262926
## 29 0.008416548 -0.64922946
## 46 0.109049187 0.84141123
# Summarize the principal component analysis
summary(prcomp(B))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5425 1.4255 1.0509 0.9603 0.77542 0.63867 0.59792
## Proportion of Variance 0.2974 0.2540 0.1380 0.1153 0.07516 0.05099 0.04469
## Cumulative Proportion 0.2974 0.5514 0.6894 0.8047 0.87987 0.93085 0.97554
## PC8
## Standard deviation 0.44235
## Proportion of Variance 0.02446
## Cumulative Proportion 1.00000
In the last exercise, you looked at the PCA of just the wide receivers in the NFL combine data. The PCA summaries for the whole combine dataset and for the wide-receiver subset are loaded as pca_summary and pca_summary_wr, respectively.
What is true about this data in relation to the dataset as a whole?
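To compare the two fits directly, the importance matrices inside the summary objects can be inspected (a sketch, assuming pca_summary and pca_summary_wr are the loaded summary.prcomp objects):
# Cumulative variance explained by the first two components in each case
pca_summary$importance["Cumulative Proportion", 1:2]      # full combine data
pca_summary_wr$importance["Cumulative Proportion", 1:2]   # wide receivers only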
[Video]
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470