Principal Component Analysis (PCA) is very useful in finance for reducing the number of risk factors to a manageable dimension.
Alright, what is PCA? Let's have a look at the following examples.
Let X and Y be random variables where \( X \sim N(0,2) \), \( Y \sim N(0,4) \), \( corr(X,Y)=0.8 \) and \( cov(X,Y) = corr(X,Y)\sqrt{2}\sqrt{4} \).
We generate 1000 observations of X and Y in R using the function 'mvrnorm' from the package 'MASS'.
library("MASS")
set.seed(7)
mu.x <- 0
mu.y <- 0
sigma.x <- sqrt(2)
sigma.y <- sqrt(4)
corr.xy <- 0.8
cov.xy <- corr.xy * sigma.x * sigma.y
Sigma.xy <- matrix(c(sigma.x^2, cov.xy, cov.xy, sigma.y^2), nrow = 2, byrow = T)
# generate the data
data <- mvrnorm(1000, mu = c(mu.x, mu.y), Sigma = Sigma.xy)
colnames(data) <- c("X", "Y")
head(data)
## X Y
## [1,] 2.1630 4.9775
## [2,] -1.0629 -2.6492
## [3,] -1.0180 -1.2755
## [4,] -0.3050 -0.9525
## [5,] -0.7407 -2.2278
## [6,] -1.5636 -1.6265
df = data.frame(data)
Let's do a quick check on the generated data:
# mean and variance of X, should be close to 0 and 2
mean(data[, 1])
## [1] -0.007433
var(data[, 1])
## [1] 1.918
# mean and variance of Y, should be close to 0 and 4
mean(data[, 2])
## [1] 0.01335
var(data[, 2])
## [1] 3.915
# correlation matrix of X and Y; the off-diagonal entries should be close to 0.8
cor(data)
## X Y
## X 1.000 0.783
## Y 0.783 1.000
Now, let's plot the generated random variables X and Y.
library(ggplot2)
library(grid)
g = ggplot(df, aes(X, Y)) + geom_point() + geom_point(data = NULL, aes(x = 0,
y = 0, colour = "red"), show_guide = FALSE)
plot(g)
Suppose we don't know the underlying distribution of X and Y, and we want to find out the common risk factors of X and Y. PCA is one way to do that.
The first step in performing PCA is to compute the covariance matrix \( \Sigma \) of the variables (or their correlation matrix), and then to find its eigenvalues and eigenvectors. In R, the covariance matrix of X and Y can be obtained by:
Sigmahat = cov(data)
e = eigen(Sigmahat)
The two eigenvalues (\( \lambda_1 \), \( \lambda_2 \)) and their eigenvectors (\( w_1, w_2 \)) are:
e$values[1]
## [1] 5.282
e$values[2]
## [1] 0.5498
e$vectors[, 1]
## [1] 0.5376 0.8432
e$vectors[, 2]
## [1] -0.8432 0.5376
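As a quick sanity check, each eigenpair should satisfy the defining relation \( \Sigma w = \lambda w \). A small sketch using the Sigmahat and e computed above:
# check the relation Sigma * w = lambda * w for the first eigenpair
# (the difference should be essentially zero)
max(abs(Sigmahat %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]))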
Next, we rank the eigenvalues from largest to smallest and sort the corresponding eigenvectors to create a matrix \( W \). (Actually, R has already sorted the result for you; the following code is NOT necessary and is for illustration purposes only.)
s = sort(e$values, decreasing = T, index.return = T)
# find out the dimension of the eigenvectors
n = length(e$vectors[, 1])
W = matrix(nrow = n, ncol = n)
for (i in 1:n) {
W[, i] = e$vectors[, s$ix[i]]
}
W
## [,1] [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432 0.5376
We collect the observations of X and Y in a matrix \( D \); the principal components associated with the covariance matrix \( \Sigma \) are then defined by: \[ P = DW \]
D <- data
P <- D %*% W
The total variation in the data \( D \) is the sum of the eigenvalues of \( \Sigma \), i.e. \( \lambda_1+\lambda_2 \). The proportion of variation explained by the m-th principal component is \( \frac{\lambda_m}{\sum_i \lambda_i} \). In our case, the 1st principal component explains about 90.57 percent of the variation in the data.
e$values[1]/(e$values[1] + e$values[2])
## [1] 0.9057
So, to reduce the dimension of our data, note first that we can reconstruct the data \( D \) by multiplying \( P \) by the transpose of the orthogonal eigenvector matrix \( W \) (since \( W^{-1} = W' \)): \[ D = PW^{-1} = PW' \] We then pick the first \( k \) columns of \( P \) as \( P^* \) and the first \( k \) columns of \( W \) as \( W^* \): \[ D \sim P^* W^{*'} \]
In our example:
\( D = PW' \): consider the 1st and 2nd rows of \( D \):
\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] = [\begin{array}{cc} 5.3598 & 0.8519 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]
\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] = [\begin{array}{cc} -2.8052 & -0.5279 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]
\( D \sim P^*W^{*'} \): again consider the 1st and 2nd rows of \( D \):
\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] \sim [5.3598] * [\begin{array}{cc} 0.5375 & 0.8432 \end{array} ] \sim [\begin{array}{cc} 2.8809 & 4.5194 \end{array} ] \]
\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] \sim [-2.8052] * [\begin{array}{cc} 0.5375 & 0.8432 \end{array}] \sim [\begin{array}{cc} -1.5078 & -2.3653 \end{array} ] \]
The matrices \( W \) and \( W^* \) are:
W
## [,1] [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432 0.5376
as.matrix(W[, 1])
## [,1]
## [1,] 0.5376
## [2,] 0.8432
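We can verify these claims directly in R with the objects defined above (a quick sketch, no new data involved): \( W \) is orthogonal, \( PW' \) reproduces \( D \) exactly, and \( P^* W^{*'} \) with \( k = 1 \) gives the approximation shown above.
# W is orthogonal: W' W should be the identity matrix
round(t(W) %*% W, 10)
# exact reconstruction: P W' recovers the original data
head(P %*% t(W), 2)
# rank-1 approximation using only the first principal component
head(as.matrix(P[, 1]) %*% t(as.matrix(W[, 1])), 2)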
PCA on 2 variables may not convince you that it is a useful technique, so let's look at an example with 10 random variables; you may change your mind :)
Since PCA works best on highly correlated data, we generate 10 random variables with random variances and high correlations among them.
library("MASS")
set.seed(7)
# draw sigma between 2 and 6, mu between -5 and 5
sigma.10 <- runif(10, 2, 6)
mu.10 <- runif(10, -5, 5)
# correlations are set to be high, from 0.9 to 1
corr.data <- runif((10 * 9)/2, 0.9, 1)
corr.mat <- diag(10)
corr.mat[lower.tri(corr.mat)] <- corr.data
corr.mat <- corr.mat %*% t(corr.mat)  # make the matrix symmetric and positive semi-definite
var.mat <- diag(10) * sigma.10
cov.mat <- var.mat %*% corr.mat %*% var.mat
D <- mvrnorm(1000, mu = mu.10, Sigma = cov.mat)
colnames(D) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")
We now have 1000 observations of the random variables X1, X2, …, X10; the first 6 rows of the observations are:
head(D)
## X1 X2 X3 X4 X5 X6 X7 X8
## [1,] -4.6118 -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,] 3.1745 -2.198 6.5436 0.2722 5.9752 3.1262 -1.500 -14.9713
## [3,] -0.2836 6.053 9.8119 0.4989 5.4662 -0.3006 2.764 0.9408
## [4,] -10.4104 -15.783 -1.2564 -9.9002 -8.9553 -16.8600 -4.604 3.3858
## [5,] 3.9650 -5.453 -1.1502 -6.4966 -7.8201 -11.3640 -3.997 -7.4090
## [6,] -7.3194 -6.590 -0.4325 -6.3764 0.4998 -3.2336 -5.920 -9.4978
## X9 X10
## [1,] 4.446 4.46822
## [2,] 3.341 2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911 -1.26148
## [5,] 3.730 -0.89635
## [6,] 6.476 0.04366
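Before applying PCA, we can check that the simulated variables really are strongly correlated (a quick sketch; the exact range depends on the simulation):
# range of the realized pairwise correlations among X1, ..., X10
cor.D <- cor(D)
range(cor.D[lower.tri(cor.D)])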
Now, we apply PCA to reduce the dimension of the data
Sigmahat = cov(D)
e = eigen(Sigmahat)
W = cbind(e$vectors)
Let's examine the eigenvalues
e$values
## [1] 691.743 58.670 30.988 16.657 11.989 6.396 4.847 3.054
## [9] 3.007 1.593
With the first 4 eigenvectors, we can explain about 96.27 percent of the variation of the data \( D \).
sum(e$values[1:4])/sum(e$values) * 100
## [1] 96.27
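The same calculation for every number of components can be done in one line (a small sketch using the eigenvalues above):
# cumulative proportion of variance explained by the first k components
cumsum(e$values)/sum(e$values)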
Let's take the first 4 columns of \( P \) and \( W \) as \( P^* \) and \( W^* \) and use them to reconstruct the data:
P = D %*% W
D.hat = P[, 1:4] %*% t(W[, 1:4])
colnames(D.hat) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")
Now, let's look at the difference between the original data \( D \) and the reconstructed data \( \hat{D} \):
head(D)
## X1 X2 X3 X4 X5 X6 X7 X8
## [1,] -4.6118 -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,] 3.1745 -2.198 6.5436 0.2722 5.9752 3.1262 -1.500 -14.9713
## [3,] -0.2836 6.053 9.8119 0.4989 5.4662 -0.3006 2.764 0.9408
## [4,] -10.4104 -15.783 -1.2564 -9.9002 -8.9553 -16.8600 -4.604 3.3858
## [5,] 3.9650 -5.453 -1.1502 -6.4966 -7.8201 -11.3640 -3.997 -7.4090
## [6,] -7.3194 -6.590 -0.4325 -6.3764 0.4998 -3.2336 -5.920 -9.4978
## X9 X10
## [1,] 4.446 4.46822
## [2,] 3.341 2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911 -1.26148
## [5,] 3.730 -0.89635
## [6,] 6.476 0.04366
head(D.hat)
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## [1,] -7.51019 -6.290 -5.389 -5.777 -8.319 -13.837 -7.337 -8.501 0.4882
## [2,] 1.83168 1.640 1.991 2.595 3.742 4.632 -2.706 -13.709 -1.4306
## [3,] 4.55053 3.337 2.220 1.886 2.290 3.053 1.767 1.893 -3.1759
## [4,] -12.21744 -9.699 -7.920 -8.310 -11.048 -15.698 -3.805 4.080 1.6485
## [5,] -0.02942 -1.028 -2.337 -3.491 -6.366 -12.421 -6.164 -5.554 -0.9276
## [6,] -7.54581 -5.473 -3.311 -2.639 -2.574 -3.151 -3.451 -8.756 -0.6881
## X10
## [1,] 5.6071
## [2,] 4.6427
## [3,] -9.4474
## [4,] 0.6940
## [5,] 0.6069
## [6,] 2.6988
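To quantify how close the reconstruction is, we can summarise the residuals (a rough sketch; the exact numbers depend on the simulated data):
# reconstruction error between D and D.hat
sqrt(mean((D - D.hat)^2))  # root-mean-square error
max(abs(D - D.hat))        # largest absolute deviation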
And let's plot the first 4 eigenvectors
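One simple way is to plot the loadings of each of the first 4 eigenvectors against the variable index (a minimal sketch using base graphics):
# loadings of the first 4 eigenvectors, one line per component
matplot(W[, 1:4], type = "b", pch = 1:4, lty = 1, col = 1:4,
        xlab = "variable index", ylab = "loading")
legend("topright", legend = paste0("PC", 1:4), pch = 1:4, col = 1:4)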
Actually, R has a built-in function 'princomp' for performing PCA, but I think it is worthwhile to learn PCA by coding it up by brute force first.
p = princomp(D)
plot(p)
summary(p)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## Standard deviation 26.2879 7.65580 5.56387 4.07930 3.46075 2.527669
## Proportion of Variance 0.8345 0.07078 0.03738 0.02009 0.01446 0.007715
## Cumulative Proportion 0.8345 0.90527 0.94265 0.96274 0.97720 0.984920
## Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 2.200474 1.746596 1.733065 1.261686
## Proportion of Variance 0.005847 0.003684 0.003627 0.001922
## Cumulative Proportion 0.990767 0.994451 0.998078 1.000000
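As a final check, the loadings from princomp should agree with our hand-rolled eigenvector matrix \( W \) (a quick sketch; note that princomp centres the data and divides by n rather than n - 1, so the variances differ slightly and the signs of some columns may be flipped):
# compare princomp's loadings with the eigenvectors of cov(D)
round(unclass(p$loadings)[, 1:4], 4)
round(W[, 1:4], 4)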
What we have done so far: we reduced the dimension of the data while preserving around 96% of the variability of the original data.
I'll show you how PCA can be used in finance in a future post. Let's take a break now.