Data

The data consists of SAT scores for the 50 states in 1994-1995:

SAT <- read.csv("http://people.reed.edu/~jones/141/sat.csv")
head(SAT[c("State", "salary", "expend")], 2)
##     State salary expend
## 1 Alabama 31.144  4.405
## 2  Alaska 47.951  8.963

Our observations \(\vec{X}\) are in \(\mathbb{R}^2\), where the first coordinate is the variable salary and the second is the variable expend.

Let X be the \(n\times 2\) matrix of \(n\) observations, from which we compute the observed \(2 \times 2\) covariance matrix \(\Sigma\). Note that this is an empirical estimate based on the observations, in the same way that \(\overline{x}\) is an empirical estimate of the true mean \(\mu\).

X <- SAT[, c("salary", "expend")]
Sigma <- cov(X)
round(Sigma, 3)
##        salary expend
## salary 35.299  7.043
## expend  7.043  1.857

The two variables salary and expend are highly correlated (\(\rho = 0.87\)); you can think of them as being largely redundant, i.e. once you know one variable, the other doesn’t provide much additional information. Such collinear variables are a problem in regression because they “steal” each other’s effect, making it difficult to isolate the effect of one from the other. We plot the two variables and also the two variables recentered at \(\vec{0}\).
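
As a quick check on the correlation quoted above, the sketch below recomputes it from the rounded covariance matrix printed earlier. The matrix entries are copied from that output rather than read from the data file, so the result is only accurate to the printed precision. The names `rho` and `R` are chosen here for illustration.

```r
# Covariance matrix of (salary, expend), copied from the rounded output above.
Sigma <- matrix(c(35.299, 7.043,
                   7.043, 1.857),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("salary", "expend"), c("salary", "expend")))

# Correlation = covariance divided by the product of the two standard deviations.
rho <- Sigma["salary", "expend"] /
  sqrt(Sigma["salary", "salary"] * Sigma["expend", "expend"])

# cov2cor() applies the same rescaling to the whole matrix at once.
R <- cov2cor(Sigma)

round(rho, 2)
## [1] 0.87
```

With the full data frame loaded, `cor(X)` should give the same off-diagonal value directly.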

Principal Components

Using R’s built-in linear algebra functionality, we compute the eigenvectors and eigenvalues of \(\Sigma\) and plot the first and second eigenvectors, i.e. the first and second principal components.
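
The code for this step is not shown, so the following is a minimal sketch of one way to do it with base R's `eigen()`. It assumes the covariance values printed by `round(cov(X), 5)` in this document. The name `eigen.vals` matches the object printed later on; `Gamma` is a name introduced here for the matrix of eigenvectors.

```r
# Covariance matrix of (salary, expend), copied from the printed cov(X) output.
Sigma <- matrix(c(35.29863, 7.04260,
                   7.04260, 1.85724),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("salary", "expend"), c("salary", "expend")))

# eigen() returns the eigenvalues in decreasing order and the matching
# unit-length eigenvectors as the columns of $vectors.
e <- eigen(Sigma, symmetric = TRUE)
eigen.vals <- e$values
Gamma <- e$vectors            # illustrative name: column i is the i-th eigenvector

# Sanity checks: the eigenvectors are orthonormal and reproduce Sigma.
all.equal(crossprod(Gamma), diag(2))
all.equal(Gamma %*% diag(eigen.vals) %*% t(Gamma), Sigma, check.attributes = FALSE)

round(eigen.vals, 3)
## [1] 36.721  0.435
```

The sign of each eigenvector is arbitrary, so the columns of `Gamma` may be flipped relative to another implementation without changing the analysis.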

We define the transformed variables via the multiplication \(y_i = \gamma_{(i)}^T (\vec{x} - \vec{\mu})\), where \(\gamma_{(i)}\) is the \(i\)th eigenvector, for all 50 observed \(\vec{x}\), and plot the resulting \(n \times 2\) matrix Y:

We observe that the principal components all have mean 0 and are uncorrelated!
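
The transformation and the two properties just stated can be sketched as follows. The helper `pc_scores()` is a hypothetical function written for this illustration, and the data are simulated toy values, not the SAT data, so the numbers will differ from those in this document. The sample mean stands in for \(\vec{\mu}\).

```r
# Project centered observations onto the eigenvectors of their covariance matrix.
pc_scores <- function(X) {
  X <- as.matrix(X)
  Gamma <- eigen(cov(X), symmetric = TRUE)$vectors  # columns are eigenvectors
  Xc <- sweep(X, 2, colMeans(X))                    # subtract the column means
  Y <- Xc %*% Gamma                                 # row j holds Gamma^T (x_j - xbar)
  colnames(Y) <- paste0("PC", seq_len(ncol(Y)))
  Y
}

# Toy data, made up for illustration: two strongly correlated columns.
set.seed(1)
x1 <- rnorm(50, mean = 35, sd = 6)
toy <- cbind(salary = x1, expend = 0.2 * x1 + rnorm(50, sd = 0.7))

Y <- pc_scores(toy)
round(colMeans(Y), 10)   # both means are 0 up to rounding error
round(cov(Y), 10)        # off-diagonal entries are 0 up to rounding error
```

Applied to the SAT matrix, `pc_scores(X)` would be expected to produce the matrix Y used in the rest of this section, up to the sign of each column.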

Who cares?

The sum of the variances of the components of \(\vec{X}\) equals the sum of the variances of the components of \(\vec{Y}\):

round(cov(X), 5)
##          salary  expend
## salary 35.29863 7.04260
## expend  7.04260 1.85724
round(cov(Y), 5)
##          PC1     PC2
## PC1 36.72125 0.00000
## PC2  0.00000 0.43462
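
The equality of total variance follows because \(\Sigma = \Gamma \Lambda \Gamma^T\) with \(\Gamma\) orthogonal, so the trace of \(\Sigma\) equals the sum of its eigenvalues. A small check of this, using the covariance values printed above (the variable names are chosen here for illustration):

```r
# Covariance matrix of (salary, expend), copied from the printed cov(X) output.
Sigma <- matrix(c(35.29863, 7.04260,
                   7.04260, 1.85724),
                nrow = 2, byrow = TRUE)

total_var_X <- sum(diag(Sigma))                            # var(salary) + var(expend)
total_var_Y <- sum(eigen(Sigma, symmetric = TRUE)$values)  # var(PC1) + var(PC2)

# The two totals agree up to floating-point error.
all.equal(total_var_X, total_var_Y)
## [1] TRUE
```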

Most importantly, the eigenvalues are the variances of the principal components, so their cumulative share of the total gives the proportion of variability explained:

eigen.vals
## [1] 36.7212536  0.4346205
cumsum(eigen.vals)/sum(eigen.vals)
## [1] 0.9883028 1.0000000

We observe that about 99% (98.8%) of the variability of the salary and expenditure variables is explained by the first principal component! So one would be inclined to use the single variable \(Y_1\) and call it “school funding”. Unfortunately, the units of \(Y_1\) don’t make any intuitive sense.