Statistical modeling is a very popular term nowadays. The main task of any kind of mathematical modeling is to describe a variable of interest in terms of other related variables, or more specifically, to establish a functional relationship among the variables with proper mathematical justification. The variable to be described is called the target or response variable, and the variable(s) through which the target is described is (are) called the predictor variable(s). A mathematical model can be expressed as follows,
\[ y = f(x) + \text{error} \]
where \(f(x)\) can be any function of the predictor \(x\) or of the set of predictors \(\mathbf{x} = (x_1, x_2, \ldots, x_p)'\). Some common choices are linear, polynomial, and logistic functions, among others.
It is very common to think that the predictive power of a model increases as we include more and more predictors. But this has a downside as well: it makes the model harder to interpret. Moreover, including a large number of predictors increases the chance that some of them are correlated, i.e. they provide little additional information about the target variable; this is known as the multicollinearity problem. So using too many predictors can be cumbersome for modeling, which is often referred to as the curse of dimensionality.
This post is about a very popular dimension reduction technique known as Principal Component Analysis (PCA). I will describe the underlying mathematics as well as a step-by-step implementation in R.
If you are familiar with linear algebra, especially linear transformations, the whole matter will be a bit easier to follow, but it is not mandatory.
Let me first show you an example. Suppose we have the marks of 200 students in both English and Mathematics. A simple scatter plot of the marks is given below:
Now, the question is: which subject is the better discriminating factor when comparing the students? You would probably answer English, as the variation along the English axis is higher than that along the Mathematics axis. In data analysis, one of the most important tasks is often to find the axis along which the data has the larger variation.
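Since the figure is not reproduced here, the following is a minimal sketch, using simulated (hypothetical) marks, of the kind of data such a plot shows and of how the variation along each axis can be compared:

set.seed(1)
# hypothetical marks of 200 students: English varies much more than Mathematics
english <- rnorm(200, mean = 60, sd = 15)
maths   <- rnorm(200, mean = 65, sd = 5)
plot(english, maths, xlab = "English", ylab = "Mathematics")
# variation along each original axis
var(english)   # roughly 15^2 = 225
var(maths)     # roughly 5^2 = 25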
Now, what about the case where the scatter plot looks like the one given below?
Surely, the larger variation of the data is along a slanted line with some positive slope, say
\[ Maths = \beta \times English \]
which is actually a linear combination of the two axes with coefficients \(1\) and \(-\beta\) (rewriting it as \(Maths - \beta \times English = 0\)). So the larger variation of the data is along a new axis (obtained by rotating the original coordinate system) which is a linear combination of the existing axes.
So, for choosing the discriminating factor, it is enough to consider only a single variable (in the first case) or an appropriate linear combination of the two variables (in the second case). Hence the dimension reduces from two to one.
PCA tries to identify the axes along which the data varies the most, just by rotating the original coordinate system, or equivalently, by taking linear combinations of the original axes.
For two-dimensional data it is fairly easy to draw the scatter plot and identify the axis along which the variation is largest, but real-life data comes with many more dimensions. In that case it is not possible to see all of this graphically; instead we have to use some metric to identify it.
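To make this concrete, here is a small sketch with simulated (hypothetical) marks: the eigen decomposition of the sample covariance matrix is exactly such a metric, and it is what PCA formalizes in the next section.

set.seed(2)
# simulate positively correlated marks, so the cloud lies along a slanted line
english <- rnorm(200, mean = 60, sd = 15)
maths   <- 10 + 0.8 * english + rnorm(200, sd = 5)
marks   <- cbind(English = english, Maths = maths)
plot(marks, xlab = "English", ylab = "Mathematics")
# eigenvectors of the covariance matrix give the rotated axes,
# and the eigenvalues give the variances along those axes
e <- eigen(cov(marks))
e$vectors[, 1]   # coefficients of the linear combination with the largest variation
e$values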
As described in the last section, the principal components (PCs) are linear combinations of the original axes. Now the task is to compute the coefficients of these linear combinations.
Suppose that initially there are \(p\) predictors in your model, \(X_1, \ldots, X_p\), each with \(n\) records. So the data matrix looks like this:
\[ \mathbf{D} = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1p}\\ x_{21} & x_{22} & \ldots & x_{2p}\\ \vdots & \vdots & \vdots & \vdots\\ x_{n1} & x_{n2} & \ldots & x_{np}\\ \end{pmatrix} \]
Using PCA, we'll generate a new data matrix \(D^{\star}\) which also has \(n\) records but only \(k\) (\(k \ll p\)) variables. It should be noted that this process actually generates \(p\) new variables, but PCA helps to identify the \(k\) of them along which the data varies the most.
Mathematically, PCA uses the linear transformation like below:
\[ \begin{pmatrix} U_1\\ U_2\\ \vdots\\ U_p \end{pmatrix} = \begin{pmatrix} a_{11} & \ldots & a_{1p}\\ a_{21} & \ldots & a_{2p}\\ \vdots & \vdots & \vdots\\ a_{p1} & \ldots & a_{pp}\\ \end{pmatrix} \begin{pmatrix} X_1\\ X_2\\ \vdots\\ X_p \end{pmatrix} \qquad (1) \]
Suppose the \(p\) new variables generated by this transformation are \(U_1, U_2, \ldots, U_p\). There are two conditions behind the construction: each \(U_i\) is chosen to have the largest possible variance, and each \(U_i\) is uncorrelated with the components constructed before it.
First of all, I should mention that I denote the variance-covariance matrix of \(X_1, X_2, \ldots, X_p\) by \(\Sigma\) throughout this post:
\[ \Sigma = \begin{pmatrix} \sigma_{1}^2 & \sigma_{12} & \ldots & \sigma_{1p}\\ \sigma_{21} & \sigma_{2}^2 & \ldots & \sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \ldots & \sigma_{p}^2\\ \end{pmatrix} \]
where \(\sigma_i^2, 1 \leq i \leq p\) is the variance of \(X_i\) and \(\sigma_{ij}, 1 \leq i,j \leq p, i \neq j\) is the covariance between \(X_i\) and \(X_j\).
According to equation \((1)\),
\[ U_1 = a_{11}X_1 + \ldots + a_{1p}X_p \]
So that,
\[ \begin{eqnarray} Var(U_1) &=& \sum_{i=1}^p a_{1i}^2 Var(X_i) + \sum_{k \neq l} a_{1k}a_{1l}Cov(X_k, X_l)\\ &=& \sum_{i=1}^p a_{1i}^2\sigma_i^2 + \sum_{k \neq l} a_{1k}a_{1l}\sigma_{kl}\\ &=& \mathbf{a_1'}\Sigma\mathbf{a_1} \end{eqnarray} \]
Note that the variance of \(U_1\) can be made larger just by increasing the values of \(a_{11}, \ldots, a_{1p}\); if we increase the coefficients without bound, the variance increases without bound as well. To make the problem well defined we have to restrict them, which we do by imposing the constraint \(\sum_{i = 1}^p a_{1i}^2 = \mathbf{a_1'a_1} = 1\), i.e. the squared length of \(\mathbf{a_1}\) is 1.
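Here is a quick numerical check of the identity \(Var(U_1) = \mathbf{a_1'}\Sigma\mathbf{a_1}\), a sketch on arbitrary simulated data:

set.seed(3)
X  <- matrix(rnorm(500 * 3), ncol = 3)   # 500 records on 3 predictors
a1 <- c(1, 2, 2) / 3                     # a unit-norm coefficient vector: a1'a1 = 1
Sigma <- cov(X)
var(X %*% a1)            # sample variance of U1 = a1'X
t(a1) %*% Sigma %*% a1   # a1' Sigma a1 -- the same value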
Now the problem is to maximize \(\mathbf{a_1'}\Sigma\mathbf{a_1}\) subject to the constraint \(\mathbf{a_1'a_1} = 1\). To solve this optimization problem we form the Lagrangian:
\[ L_1 = \mathbf{a_1'}\Sigma\mathbf{a_1} - \lambda_1 \times (\mathbf{a_1'a_1} - 1) \qquad (2) \]
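Differentiating \(L_1\) with respect to \(\mathbf{a_1}\) and equating the derivative to the zero vector gives
\[ \frac{\partial L_1}{\partial \mathbf{a_1}} = 2\Sigma\mathbf{a_1} - 2\lambda_1\mathbf{a_1} = \mathbf{0} \quad\Longrightarrow\quad \Sigma\mathbf{a_1} = \lambda_1\mathbf{a_1}, \]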
so \(\lambda_1\) is one of the eigenvalues of \(\Sigma\) and \(\mathbf{a_1}\) is the corresponding eigenvector. In fact \(\lambda_1\) is the largest eigenvalue, since \(Var(U_1) = \mathbf{a_1'}\Sigma\mathbf{a_1} = \lambda_1\mathbf{a_1'a_1} = \lambda_1\) is the quantity being maximized.
Similarly, for \(U_2\), which must also be uncorrelated with \(U_1\) (i.e. \(\mathbf{a_1'a_2} = 0\)), we formulate the Lagrangian as,
\[ L_2 = \mathbf{a_2'}\Sigma\mathbf{a_2} - \lambda_2 \times (\mathbf{a_2'a_2} - 1) - m_2 \times (\mathbf{a_1'a_2} - 0) \qquad (3) \]
Again, we get \(m_2 = 0\), \(\lambda_2\) as the second largest eigenvalue of \(\Sigma\), and \(\mathbf{a_2}\) as the corresponding eigenvector, and so on.
It can be proved that, \[ Var(U_i) = \lambda_i, 1 \leq i \leq p \] and
\[ \sum_{i=1}^p Var(X_i) = \sum_{i=1}^p \sigma_i^2 = \sum_{i=1}^p \lambda_i = \sum_{i=1}^p Var(U_i) \]
i.e. the total variance is preserved by the transformation.
So the proportion of variance captured by the \(k^{th}\) PC is \(\frac{\lambda_k}{\sum_{i=1}^p \lambda_i}\).
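These identities are easy to verify numerically. Here is a quick sketch on simulated data (the computation on real data follows in the next section):

set.seed(4)
X <- matrix(rnorm(500 * 3), ncol = 3)
Sigma <- cov(X)
e <- eigen(Sigma)
U <- X %*% e$vectors             # all p new variables U1, ..., Up
apply(U, 2, var)                 # variances of the U's ...
e$values                         # ... equal the eigenvalues of Sigma
sum(diag(Sigma)); sum(e$values)  # total variance is preserved
e$values / sum(e$values)         # proportion of variance captured by each PC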
For the implementation below, I'm assuming there are \(p\) predictors in the original model, namely \(X_1, X_2, \ldots, X_p\).
Here we will work with the USArrests data, which has four numeric variables and 50 records (one per US state). Let us make a scatter plot matrix to review the relationships among the variables.
data("USArrests")
library(dplyr)
# Explore the structure of the data
usarrests <- as_tibble(USArrests)
glimpse(usarrests)
## Observations: 50
## Variables: 4
## $ Murder (dbl) 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4,...
## $ Assault (int) 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46,...
## $ UrbanPop (int) 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 6...
## $ Rape (dbl) 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9,...
Next, we standardize the variables to have zero mean and unit standard deviation. This removes the comparability problem among variables measured on different scales.
# standardize a numeric vector to zero mean and unit standard deviation
normalize <- function(x) {
  (x - mean(x)) / sd(x)
}
# create normalized variables
usarrests2 <- usarrests %>%
apply(2, normalize) %>%
as.data.frame()
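As a side note, base R's scale() function performs the same standardization, so the two chunks above can equivalently be written as a one-liner:

# equivalent: scale() centers each column and divides by its standard deviation
usarrests2 <- as.data.frame(scale(usarrests))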
Now, we make a scatter plot matrix,
library(GGally)
ggpairs(data = usarrests2,
columns = 1:4)
Note that Murder and Assault are highly correlated, and so are Assault and Rape. As mentioned earlier, correlated predictors create problems in prediction, which is the multicollinearity problem; a detailed treatment is beyond the scope of this post.
Next, we compute the correlation matrix of the variables and perform an eigen analysis on it.
# correlation matrix
cor.mat <- cor(usarrests2)
eig <- eigen(cor.mat)
# Extract eigen values and eigen vectors
evals <- eig$values; evecs <- eig$vectors
# sort the eigen values in descending order
evals.sorted <- sort(evals, decreasing = T)
# Compute proportion of explained variances
var.exp <- evals.sorted/sum(evals)
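As a cross-check, the built-in prcomp() function performs the same computation; here is a sketch (the signs of its loadings may be flipped relative to the eigenvectors above, which does not affect the interpretation):

# built-in PCA; scale. = TRUE standardizes the variables internally
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)     # proportion of variance explained by each PC
pca$rotation     # loadings, i.e. the coefficient vectors a_1, ..., a_p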
Our next task is to visualize the proportion of explained variance along with its cumulative sum. This will help us decide how many PCs to keep for further computation.
# capture the bar midpoints so the axis labels and the cumulative line align with the bars
bar.mids <- barplot(var.exp, ylim = c(0, 1), col = 'sandybrown',
                    xlab = "Principal Component",
                    ylab = "Explained Variance")
axis(1, at = bar.mids, labels = c('PC1', 'PC2', 'PC3', 'PC4'))
lines(bar.mids, cumsum(var.exp), type = 's', col = 'darkgreen')
legend(x = 2.5, y = 0.5,
       legend = c('Explained Variance', 'Cumulative Explained Variance'),
       pch = c(15, 15), col = c('sandybrown', 'darkgreen'), bty = 'n')
Now it is clear that the first two PCs alone explain more than \(85\%\) of the total variance, so we can use two PCs instead of the four original variables.
So let's make the transformation matrix now and prepare the new data matrix.
# Transformation matrix
trmat <- eig$vectors[,1:2]
# New data
trans.data <- as.matrix(usarrests2) %*% trmat
# Print the partial new data matrix
print(head(trans.data, n = 20))
## [,1] [,2]
## [1,] -0.97566045 1.12200121
## [2,] -1.93053788 1.06242692
## [3,] -1.74544285 -0.73845954
## [4,] 0.13999894 1.10854226
## [5,] -2.49861285 -1.52742672
## [6,] -1.49934074 -0.97762966
## [7,] 1.34499236 -1.07798362
## [8,] -0.04722981 -0.32208890
## [9,] -2.98275967 0.03883425
## [10,] -1.62280742 1.26608838
## [11,] 0.90348448 -1.55467609
## [12,] 1.62331903 0.20885253
## [13,] -1.36505197 -0.67498834
## [14,] 0.50038122 -0.15003926
## [15,] 2.23099579 -0.10300828
## [16,] 0.78887206 -0.26744941
## [17,] 0.74331256 0.94880748
## [18,] -1.54909076 0.86230011
## [19,] 2.37274014 0.37260865
## [20,] -1.74564663 0.42335704
Now, let me plot the two new variables.
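The plot itself is not reproduced here; below is a minimal sketch of how it can be drawn from trans.data, together with the correlation check mentioned next:

plot(trans.data[, 1], trans.data[, 2],
     xlab = "First Principal Component",
     ylab = "Second Principal Component")
# the two new variables are uncorrelated (up to numerical precision)
cor(trans.data)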
The above scatter plot shows that the two new variables are not correlated at all. You can check this using the cor() function in R as well. Hence we have reduced four-dimensional data to two dimensions without losing much variance.
Since we generate new feature variables as functions of the old ones, principal component analysis is sometimes called feature extraction in statistical modeling.