Principal Component Analysis

Principal Component Analysis (PCA) is very useful in finance for reducing the number of risk factors to a manageable dimension.

Alright, what is PCA? Let's have a look at the following examples.

Let X and Y be random variables where \( X \sim N(0,2) \), \( Y \sim N(0,4) \), \( \mathrm{corr}(X,Y)=0.8 \) and \( \mathrm{cov}(X,Y) = \mathrm{corr}(X,Y)\sqrt{2}\sqrt{4} \).

We generate 1000 observations of X and Y in R using the function 'mvrnorm' from the package 'MASS':

library("MASS")
set.seed(7)
mu.x <- 0
mu.y <- 0
sigma.x <- sqrt(2)
sigma.y <- sqrt(4)
corr.xy <- 0.8
cov.xy <- corr.xy * sigma.x * sigma.y
Sigma.xy <- matrix(c(sigma.x^2, cov.xy, cov.xy, sigma.y^2), nrow = 2, byrow = T)
# generate the data
data <- mvrnorm(1000, mu = c(mu.x, mu.y), Sigma = Sigma.xy)
colnames(data) <- c("X", "Y")
head(data)
##            X       Y
## [1,]  2.1630  4.9775
## [2,] -1.0629 -2.6492
## [3,] -1.0180 -1.2755
## [4,] -0.3050 -0.9525
## [5,] -0.7407 -2.2278
## [6,] -1.5636 -1.6265
df = data.frame(data)

Let's have a quick check on the generated data

# mean and variance of X, should be approximately 0 and 2
mean(data[, 1])
## [1] -0.007433
var(data[, 1])
## [1] 1.918

# mean and variance of Y, should be approximately 0 and 4
mean(data[, 2])
## [1] 0.01335
var(data[, 2])
## [1] 3.915

# correlation matrix of X and Y, off-diagonal should be approximately 0.8
cor(data)
##       X     Y
## X 1.000 0.783
## Y 0.783 1.000

Now, let's plot the generated random variables X and Y.

library(ggplot2)
library(grid)

g = ggplot(df, aes(X, Y)) + geom_point() + geom_point(data = NULL, aes(x = 0, 
    y = 0, colour = "red"), show.legend = FALSE)
plot(g)

(Figure: scatter plot of the 1000 simulated (X, Y) pairs, with the origin marked in red.)

Suppose we don't know the underlying distribution of X and Y, and we want to find the common risk factors that drive them. PCA is one solution for this.

The first step of PCA is to compute the covariance matrix \( \Sigma \) of the variables (or the correlation matrix), and then find its eigenvalues and eigenvectors. In R, the covariance matrix of X and Y can be obtained by:

Sigmahat = cov(data)
e = eigen(Sigmahat)

The two eigenvalues (\( \lambda_1 \), \( \lambda_2 \)) and their eigenvectors (\( w_1, w_2 \)) are:

e$values[1]
## [1] 5.282
e$values[2]
## [1] 0.5498

e$vectors[, 1]
## [1] 0.5376 0.8432
e$vectors[, 2]
## [1] -0.8432  0.5376

Next, we rank the eigenvalues from largest to smallest and sort the corresponding eigenvectors to form a matrix \( W \). (Actually, R has already sorted the result for you, so the following code is NOT necessary and is for illustration purposes only.)

s = sort(e$values, decreasing = T, index.return = T)
# find out the dimension of eignvectors
n = length(e$vectors[, 1])
W = matrix(nrow = n, ncol = n)
for (i in 1:n) {
    W[, i] = e$vectors[, s$ix[i]]
}
W
##        [,1]    [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432  0.5376

Principal Component Representation

We collect the observations of X and Y in a matrix \( D \); the principal components of the covariance matrix \( \Sigma \) are defined by: \[ P = DW \]

D <- data
P <- D %*% W

The total variation in the data \( D \) is the sum of the eigenvalues of \( \Sigma \), which is \( \lambda_1+\lambda_2 \). The proportion of variation explained by the \( m \)th principal component is \( \frac{\lambda_m}{\sum_i \lambda_i} \). In our case, the 1st principal component explains about 90.57 percent of the total variation of the data.

e$values[1]/(e$values[1] + e$values[2])
## [1] 0.9057

So, to reduce the dimension of our data, we can reconstruct the data \( D \) by multiplying \( P \) by the inverse of the orthogonal eigenvector matrix \( W \) (\( W^{-1} = W' \)): \[ D = PW^{-1} = PW' \] We take the first \( k \) columns of \( P \) (those with the largest eigenvalues) as \( P^* \) and the corresponding \( k \) columns of \( W \) as \( W^* \): \[ D \approx P^* W^{*'} \]

In our example:

\( D = PW' \): consider the 1st and 2nd rows of \( D \):

\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] = [\begin{array}{cc} 5.3598 & 0.8519 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]

\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] = [\begin{array}{cc} -2.8052 & -0.5279 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]

\( D \approx P^*W^{*'} \): consider the 1st and 2nd rows of \( D \):

\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] \approx [5.3598] \left[\begin{array}{cc} 0.5375 & 0.8432 \end{array} \right] = [\begin{array}{cc} 2.8809 & 4.5194 \end{array} ] \]

\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] \approx [-2.8052] \left[\begin{array}{cc} 0.5375 & 0.8432 \end{array} \right] = [\begin{array}{cc} -1.5078 & -2.3653 \end{array} ] \]

The matrices \( W \) and \( W^* \) are:

W
##        [,1]    [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432  0.5376
as.matrix(W[, 1])
##        [,1]
## [1,] 0.5376
## [2,] 0.8432
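
As a quick check, we can reproduce the rank-1 approximation above directly in R. This is a minimal sketch using the objects already defined (P, W and data):

# keep only the first principal component and its eigenvector
P.star <- P[, 1, drop = FALSE]
W.star <- W[, 1, drop = FALSE]
# rank-1 reconstruction: D is approximately P* W*'
D.approx <- P.star %*% t(W.star)
head(D.approx, 2)
head(data, 2)  # compare with the original observations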

PCA for 10 Random Variables with 1,000 Observations

PCA on 2 variables may not convince you that it is a useful technique, so let's have a look at an example with 10 random variables; you may change your mind :)

Since PCA is most useful for highly correlated data, we generate 10 random variables with random variances and high correlations among them.

library("MASS")
set.seed(7)

# limit sigma to between 2 and 6, mu to between -5 and 5
sigma.10 <- runif(10, 2, 6)
mu.10 <- runif(10, -5, 5)
# correlations are set to be high, from 0.9 to 1
corr.data <- runif((10 * 9)/2, 0.9, 1)


corr.mat <- diag(10)
corr.mat[lower.tri(corr.mat)] <- corr.data
corr.mat <- corr.mat %*% t(corr.mat)  # form a symmetric positive semi-definite matrix from the lower-triangular factor
var.mat <- diag(10) * sigma.10
cov.mat <- var.mat %*% corr.mat %*% var.mat

D <- mvrnorm(1000, mu = mu.10, Sigma = cov.mat)
colnames(D) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

We now have 1000 observations of the random variables X1, X2, …, X10; the first 6 rows are:

head(D)
##            X1      X2      X3       X4      X5       X6     X7       X8
## [1,]  -4.6118  -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,]   3.1745  -2.198  6.5436   0.2722  5.9752   3.1262 -1.500 -14.9713
## [3,]  -0.2836   6.053  9.8119   0.4989  5.4662  -0.3006  2.764   0.9408
## [4,] -10.4104 -15.783 -1.2564  -9.9002 -8.9553 -16.8600 -4.604   3.3858
## [5,]   3.9650  -5.453 -1.1502  -6.4966 -7.8201 -11.3640 -3.997  -7.4090
## [6,]  -7.3194  -6.590 -0.4325  -6.3764  0.4998  -3.2336 -5.920  -9.4978
##         X9       X10
## [1,] 4.446   4.46822
## [2,] 3.341   2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911  -1.26148
## [5,] 3.730  -0.89635
## [6,] 6.476   0.04366

Now, we apply PCA to reduce the dimension of the data

Sigmahat = cov(D)
e = eigen(Sigmahat)
W = e$vectors  # eigenvectors, already sorted by decreasing eigenvalue

Let's examine the eigenvalues

e$values
##  [1] 691.743  58.670  30.988  16.657  11.989   6.396   4.847   3.054
##  [9]   3.007   1.593

With the first 4 principal components, we can explain about 96.27 percent of the variation of the data \( D \).

sum(e$values[1:4])/sum(e$values) * 100
## [1] 96.27

Let's take the first 4 columns of \( P \) and \( W \) as \( P^* \) and \( W^* \) and reconstruct the data:

P = D %*% W

D.hat = P[, 1:4] %*% t(W[, 1:4])
colnames(D.hat) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

Now, let's look at the difference between the original data \( D \) and the reconstructed data \( \hat{D} \):

head(D)
##            X1      X2      X3       X4      X5       X6     X7       X8
## [1,]  -4.6118  -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,]   3.1745  -2.198  6.5436   0.2722  5.9752   3.1262 -1.500 -14.9713
## [3,]  -0.2836   6.053  9.8119   0.4989  5.4662  -0.3006  2.764   0.9408
## [4,] -10.4104 -15.783 -1.2564  -9.9002 -8.9553 -16.8600 -4.604   3.3858
## [5,]   3.9650  -5.453 -1.1502  -6.4966 -7.8201 -11.3640 -3.997  -7.4090
## [6,]  -7.3194  -6.590 -0.4325  -6.3764  0.4998  -3.2336 -5.920  -9.4978
##         X9       X10
## [1,] 4.446   4.46822
## [2,] 3.341   2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911  -1.26148
## [5,] 3.730  -0.89635
## [6,] 6.476   0.04366
head(D.hat)
##             X1     X2     X3     X4      X5      X6     X7      X8      X9
## [1,]  -7.51019 -6.290 -5.389 -5.777  -8.319 -13.837 -7.337  -8.501  0.4882
## [2,]   1.83168  1.640  1.991  2.595   3.742   4.632 -2.706 -13.709 -1.4306
## [3,]   4.55053  3.337  2.220  1.886   2.290   3.053  1.767   1.893 -3.1759
## [4,] -12.21744 -9.699 -7.920 -8.310 -11.048 -15.698 -3.805   4.080  1.6485
## [5,]  -0.02942 -1.028 -2.337 -3.491  -6.366 -12.421 -6.164  -5.554 -0.9276
## [6,]  -7.54581 -5.473 -3.311 -2.639  -2.574  -3.151 -3.451  -8.756 -0.6881
##          X10
## [1,]  5.6071
## [2,]  4.6427
## [3,] -9.4474
## [4,]  0.6940
## [5,]  0.6069
## [6,]  2.6988
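
Eyeballing the rows only goes so far. As a rough check (a sketch using the D and D.hat defined above), we can compute the correlation between each original column and its reconstruction, together with the overall root-mean-square reconstruction error:

# per-variable correlation between original and reconstructed columns
round(diag(cor(D, D.hat)), 3)
# overall root-mean-square reconstruction error
sqrt(mean((D - D.hat)^2))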

And let's plot the first 4 eigenvectors
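
The original figure is not reproduced here; a minimal sketch of one way to draw the loadings with base R, using the W computed above, is:

# one line per eigenvector, showing its loading on each of the 10 variables
matplot(W[, 1:4], type = "b", pch = 1:4, lty = 1, col = 1:4,
        xlab = "variable index (X1 to X10)", ylab = "loading")
legend("topright", legend = paste0("PC", 1:4), pch = 1:4, col = 1:4, lty = 1)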

Final remark

Actually, R has a built-in function 'princomp' for performing PCA, but I think it is worthwhile to learn PCA by programming it by brute force.

p = princomp(D)
plot(p)

(Figure: scree plot of the principal component variances produced by plot(p).)

summary(p)
## Importance of components:
##                         Comp.1  Comp.2  Comp.3  Comp.4  Comp.5   Comp.6
## Standard deviation     26.2879 7.65580 5.56387 4.07930 3.46075 2.527669
## Proportion of Variance  0.8345 0.07078 0.03738 0.02009 0.01446 0.007715
## Cumulative Proportion   0.8345 0.90527 0.94265 0.96274 0.97720 0.984920
##                          Comp.7   Comp.8   Comp.9  Comp.10
## Standard deviation     2.200474 1.746596 1.733065 1.261686
## Proportion of Variance 0.005847 0.003684 0.003627 0.001922
## Cumulative Proportion  0.990767 0.994451 0.998078 1.000000
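
These numbers tie back to our brute-force calculation: 'princomp' computes the covariance matrix with divisor N rather than N - 1, so its component variances should match our eigenvalues up to a factor of (N - 1)/N (a quick sketch, assuming the p and e objects from above):

# component variances from princomp
p$sdev^2
# eigenvalues of cov(D) rescaled by (N - 1)/N for comparison
e$values * (nrow(D) - 1) / nrow(D)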

What we have done so far: we reduced the dimension of the data from 10 to 4 while preserving around 96% of the variability of the original data.

I'll show you how PCA can be used in finance in a future post. Let's take a break now.