Hi, Team!
Many of you are struggling with the concept of principal components (PCs) & Eigenspace. I am going to break it down, starting with principal components.
In principal component analysis (PCA), we iteratively generate linear combinations of weighted columns in a dataset that 1) capture as much of the variance of the data as possible, in turn, by optimizing the weights, 2) are of unit length, and 3) are orthogonal to one another. For that variance maximization to be well defined, we set the sum of the squared weights for each component equal to 1 (that is exactly what unit length means).
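Concretely, for the two-column example below, writing \(a_{i,j}\) for the weight that PC \(i\) gives to column \(j\) (the same notation used later on), the constraints are:
\[a_1^T a_1 = a_{1,1}^2 + a_{1,2}^2 = 1, \qquad a_2^T a_2 = a_{2,1}^2 + a_{2,2}^2 = 1, \qquad a_1^T a_2 = 0\]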
Let’s define a small dataset with two variables we think may be related: inpatient and outpatient workload in a hospital.
mydf=data.frame(matrix(c(1,3,2,4,4,7,4,9), ncol=2)) # 4 observations, 2 columns
colnames(mydf)=c('Outpatient Workload', 'Inpatient Workload')
mydf
## Outpatient Workload Inpatient Workload
## 1 1 4
## 2 3 7
## 3 2 4
## 4 4 9
In fact, we can see that the variables are indeed correlated. Given this high correlation, we might select a linear combination of the two variables that maximizes variance capture.
(mycor=cor(mydf))
## Outpatient Workload Inpatient Workload
## Outpatient Workload 1.0000000 0.9486833
## Inpatient Workload 0.9486833 1.0000000
To maximize variance capture for our first principal component, we build a constrained optimization problem. Let’s assume that we have weights that will be multiplied by the columns of our dataframe. For the first PC, we have weights \(a_{1,1}\) and \(a_{1,2}\). The first subindex is the principal component number and the second is the column to which it applies, 1=Outpatient and 2=Inpatient.
Then we simply want to do the following, defining our dataframe as \(X\) and our weights for the first PC as the vector \(a_1\). We want to maximize \(Var(Xa_{1})\), which is equal to \(a_{1}^T\Sigma a_{1}\), where \(\Sigma\) is the covariance matrix (or the correlation matrix if the data are standardized). \(\lambda_1\) is a Lagrange multiplier that enforces unit length. We don’t need to worry about orthogonality yet, since this is our first PC. Subsequent ones will have that constraint.
\[Max_{a_{1}, \lambda_1} L=a_{1}^T\Sigma a_{1} -\lambda_1 (a_1^Ta_1-1)\]
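As a quick numerical illustration of \(Var(Xa_1)=a_1^T\Sigma a_1\), here is a minimal sketch: it uses plain scale() so the sample covariance of the standardized data equals the correlation matrix, and the weight vector a below is an arbitrary unit-length example, not the actual PC weights.
a = c(0.6, 0.8)              # arbitrary unit-length weights: 0.6^2 + 0.8^2 = 1
Z = scale(mydf)              # standardize columns (mean 0, sample sd 1)
var(Z %*% a)                 # sample variance of the linear combination
t(a) %*% cor(mydf) %*% a     # the quadratic form a' Sigma a -- same value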
Taking the partial derivatives with respect to \(a_1\) and \(\lambda_1\) and setting them to zero, we have the following.
\[\frac{\partial L}{\partial a_1}=2 \Sigma a_1 -2 \lambda_1 a_1=0\]
\[\frac{\partial L}{\partial \lambda_1}=a_1^T a_1-1=0\]
Solving \(\frac{\partial L}{\partial a_1}=0\)…
\[2 \Sigma a_1 -2 \lambda_1 a_1=0 \implies \Sigma a_1- \lambda_1 a_1=0\]
Wow! What is \(\Sigma a_1- \lambda_1 a_1=0\)? This is the way we calculate an Eigenvalue! And the associated vector is then…wait for it… an Eigenvector!
Nontrivial solutions require the determinant of \(\Sigma - \lambda_1 I\) to be zero.
\[ \det(\Sigma - \lambda_1 I) = \left|\begin{array}{cc} 1.000 -\lambda_1 & 0.9486833 \\ 0.9486833 & 1.000 - \lambda_1 \end{array}\right| = 0 \]
\[1 - 2\lambda_1+\lambda_1 ^2 - 0.9486833^2 = 0\]
\[0.1-2\lambda_1+\lambda_1^2=0\]
\[\lambda_1 = 1 \pm \sqrt{0.9} = 1 \pm 0.9486833\]
The quadratic has two roots, 1.9486833 and 0.0513167; for the first PC we take the larger one, \(\lambda_1 = 1.9486833\).
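As a quick sanity check (a sketch; polyroot() takes the polynomial coefficients in increasing order of power), R recovers the same two roots.
polyroot(c(0.1, -2, 1))   # roots of 0.1 - 2*lambda + lambda^2: 0.0513167 and 1.9486833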
So, the first principal component is based on \(\lambda_1=1.9486833\). Then our weight vector is nothing more than the solution of
\[\Sigma a_1 = \lambda_1 a_1.\]
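Here is a minimal numerical sanity check of that relationship, using the weight values \((0.7071068, 0.7071068)\) that we derive a bit further down.
a1 = c(0.7071068, 0.7071068)   # first Eigenvector (derived below)
cor(mydf) %*% a1               # Sigma a1 ...
1.9486833 * a1                 # ... equals lambda1 a1 (both ~ 1.3779, 1.3779)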
We can verify that R’s PCA functions generate the same variance.
WHOA… \(\lambda_1\) is actually the variance of the PC!!!! Nice finding!
And we can also verify that the Eigenvalue associated with this component is ALSO the same and ALSO the variance.
lambda=1.9486833                      # Eigenvalue computed by hand above
names(lambda)='Lambda Manually'
mypca=princomp(x = mydf, cor = T)     # PCA on the correlation matrix
lambda2=mypca$sdev[1]^2               # variance of the first PC
names(lambda2)="Lambda via princomp()"
lambda3=eigen(cor(mydf))$values[1]    # largest Eigenvalue of the correlation matrix
names(lambda3)="Lambda via eigen()"
print(c(lambda, lambda2, lambda3))
## Lambda Manually Lambda via princomp() Lambda via eigen()
## 1.948683 1.948683 1.948683
To recover the weights, we substitute \(\lambda_1\) back into \((\Sigma - \lambda_1 I)a_1 = 0\), where
\[a_1=[a_{1,1}, a_{1,2}]^T\]
\[(1.000-\lambda_1)a_{1,1} + 0.9486833a_{1,2} =0\]
\[0.9486833a_{1,1} + (1.000-\lambda_1)a_{1,2}=0\]
These equations only tell us that \(a_{1,1}=a_{1,2}\); combining that with the unit-length constraint \(a_{1,1}^2+a_{1,2}^2=1\), we have the following.
\[a_{1,1}=a_{1,2}=1/\sqrt{2}=0.7071068\]
(1-1.948683)*0.7071068+0.9486833*0.7071068 # effectively zero (rounding error)
## [1] 2.12132e-07
0.9486833*0.7071068+(1-1.948683)*0.7071068 # effectively zero (rounding error)
## [1] 2.12132e-07
Obviously, the way this works is that the first PC is generated from the first Eigenvector.
myeig=eigen(cor(mydf))
myeig$values[1]
## [1] 1.948683
myeig$vectors[,1]
## [1] 0.7071068 0.7071068
We can use the weights to compute the new variable that represents our first PC by standardizing our variables (standardizing is literally what using the correlation matrix implies!).
myf=function(x) (x-mean(x))/(sqrt(.75)*sd(x)) # sqrt(.75)=sqrt((n-1)/n) with n=4; converts sample sd to population sd
newdf=apply(mydf, 2, myf)
pc1=newdf[,1]*0.7071068+newdf[,2]*0.7071068
pc1
## [1] -1.6153500 0.6495611 -0.9828945 1.9486833
This exactly matches the PC score vector produced by princomp().
mypca$scores[,1]
## [1] -1.6153500 0.6495611 -0.9828944 1.9486833
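The same scores can be computed a bit more compactly with scale() and matrix multiplication (a sketch; it assumes the population-sd standardization described above and reuses the myeig object created earlier).
Zpop = scale(mydf, center = TRUE, scale = apply(mydf, 2, sd)*sqrt(.75)) # population-sd standardization
as.numeric(Zpop %*% myeig$vectors[, 1])                                 # first PC scores, same as pc1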
For the second PC, we simply add the orthogonality constraint in the nonlinear optimization!
\[Max_{a_{2}, \lambda_2, \beta_2} L= a_{2}^T\Sigma a_{2} -\lambda_2 (a_2^Ta_2-1)-\beta_2(a_1^Ta_2)\]
Taking the partial derivatives for \(a_2\), \(\lambda_2\), and \(\beta_2\) and setting them to zero, we have the following. NOTE: there is no \(\beta_1\) here, and we already know \(a_1\) and \(\lambda_1\).
\[\frac{\partial L}{\partial a_2}=2 \Sigma a_2 -2 \lambda_2 a_2-\beta_2 a_1 =0\]
\[\frac{\partial L}{\partial \lambda_2}=a_2^T a_2-1=0\]
\[\frac{\partial L}{\partial \beta_2}=a_1^Ta_2=0\]
Working with \(\frac{\partial L}{\partial a_2}\), premultiply by \(a_1^T\) and use \(a_1^Ta_2=0\), \(a_1^T\Sigma a_2=\lambda_1 a_1^Ta_2=0\), and \(a_1^Ta_1=1\):
\[2 a_1^T\Sigma a_2 -2 \lambda_2 a_1^Ta_2-\beta_2 a_1^Ta_1 = -\beta_2=0\]
So \(\beta_2=0\), the orthogonality term drops out of \(\frac{\partial L}{\partial a_2}\), and we are left with
\[ \Sigma a_2 - \lambda_2a_2 =0\]
And you should recognize that as the Eigenvalue equation again; the second PC corresponds to the second largest Eigenvalue (and its Eigenvector).
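A minimal sketch verifying the second Eigenvalue and the orthogonality of the two weight vectors, reusing the myeig and mypca objects created above:
myeig$values[2]                              # second Eigenvalue, ~ 0.0513167
mypca$sdev[2]^2                              # variance of the second PC, same value
sum(myeig$vectors[, 1] * myeig$vectors[, 2]) # a1' a2 = 0, so the weight vectors are orthogonal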
The proportion of variance explained by each component is that component’s Eigenvalue divided by the sum of the Eigenvalues.
myeig
## eigen() decomposition
## $values
## [1] 1.9486833 0.0513167
##
## $vectors
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068 0.7071068
In this case, the first component accounts for 1.9486833/(1.9486833+0.0513167) = 97.4 percent of the variance! Why would we need two variables when one component captures nearly all of the information in both?
1.9486833/ (1.9486833+0.0513167)
## [1] 0.9743416
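As a final cross-check (princomp objects report this directly), summary() shows the same proportion of variance for Comp.1.
summary(mypca)   # Proportion of Variance for Comp.1 is ~ 0.974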