Introduction

Principal Component Analysis (PCA) is a popular technique for deriving a set of low-dimensional features from a larger set of variables. Another popular application of PCA is visualizing high-dimensional data. In this tutorial, we will go over the basics of PCA and apply it to a data set of houses with different features in order to look for clusters.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised machine learning technique that attempts to derive a low-dimensional set of features from a much larger set while still preserving as much variance as possible. Perhaps the two main applications of PCA are

  1. Variable selection
  2. Visualizing High-Dimensional Data

We will be focusing on the visualization part. In this regard, PCA pairs naturally with clustering methods such as k-means: projecting the data onto a few principal components gives a low-dimensional view in which clusters, if present, are much easier to spot.

To start, consider a random vector of features

\[ \begin{aligned} \mathbf{x} = \begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{pmatrix} \end{aligned} \]

with a population variance-covariance matrix

\[ \begin{aligned} \text{var}(\mathbf{x}) = \Sigma = \begin{pmatrix} \sigma_{1}^{2} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{2}^{2} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{p}^{2} \end{pmatrix} \end{aligned} \]
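In practice, \(\Sigma\) is unknown and is estimated by the sample covariance matrix. As a quick illustration (using R's built-in mtcars data as a stand-in, since our housing data is loaded later), this is a single call to cov():

# sample estimate of the variance-covariance matrix for three features
cov(mtcars[, c('mpg', 'disp', 'hp')])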

We can represent the set of features \(x_{1}, \ldots, x_{p}\) as a normalized linear combination of the features

\[ \begin{aligned} Z_{1} = \phi_{11}x_{1} + \phi_{21}x_{2} + \ldots + \phi_{p1}x_{p} \end{aligned} \]

The above linear combination of features is called the first principal component, which we will discuss more at length in the next section. Each principal component is a linear combination of the \(p\) features in the data set.

Principal Components and Their Scores

Suppose we have a data set with \(n\) observations with measurements on a set of \(p\) features \(x_{1}, x_{2}, \ldots, x_{p}\) as part of an exploratory data analysis (EDA). As is typical of any EDA procedure, plotting one feature against another is an ideal way of quickly identifying any patterns that may or may not be present in the data. In such a case, there are \(\binom{p}{2} = p(p-1)/2\) scatterplots. However, examining all of these plots quickly becomes impractical as \(p\) grows. If we have a set of features where \(p = 15\), we are faced with 105 scatterplots! Even if you were to plot each variable against all of the others in a scatterplot matrix, such a task would be cumbersome and laborious, and the result would be nearly impossible to interpret.

Clearly, there must be a better way to visualize a data set when \(p\) is large. Fortunately, this is where PCA comes in handy. The goal is to generate a low-dimensional representation of the data that preserves as much of the information as possible (i.e. maximizes variance). For example, if we had a data set with \(p = 15\) features, we could obtain a two-dimensional representation of the data that preserves most of the variance and then plot the observations in the resulting low-dimensional space.
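As a quick sanity check of that count, the binomial coefficient can be evaluated directly in R:

# number of distinct pairwise scatterplots for p = 15 features
choose(15, 2)  # 15 * 14 / 2 = 105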

The first principal component of a set of features \(x_{1}, x_{2}, \ldots, x_{p}\) is the normalized linear combination of the features

\[ \begin{aligned} Z_{1} = \phi_{11}x_{1} + \phi_{21}x_{2} + \ldots + \phi_{p1}x_{p} \end{aligned} \]

that has the largest variance. When we refer to “normalized,” we mean that \(\sum_{j=1}^{p} \phi_{j1}^{2} = 1\). The elements \(\phi_{11}, \phi_{21}, \ldots , \phi_{p1}\) are the loadings of the first principal component. Together, the loadings comprise the principal component loading vector \(\phi_{1} = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^{T}\).

Given an \(n \times p\) data set \(\mathbf{X}\), we need to formulate a method of computing the first principal component. Since we are only interested in variance, we can assume that every variable in \(\mathbf{X}\) has been centered to have mean \(\mu = 0\). Next, we search for the linear combination of the sample features of the form

\[ \begin{aligned} z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \ldots + \phi_{p1}x_{ip} \end{aligned} \]

that has the largest possible sample variance, subject to \(\sum_{j=1}^{p} \phi_{j1}^{2} = 1\). Mathematically, the first principal component loading vector solves the following optimization problem

\[ \begin{aligned} \max_{\phi_{11}, \ldots, \phi_{p1}}{\Bigg\{\frac{1}{n} \sum_{i=1}^{n} \Bigg(\sum_{j=1}^{p}{\phi_{j1}x_{ij}} \Bigg)^{2} \Bigg\}} \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2} = 1 \end{aligned} \]

Since every feature has been centered, the mean of the scores \(z_{11}, z_{21}, \ldots, z_{n1}\) is also zero, so the quantity being maximized is just the sample variance of the \(n\) values \(z_{i1}\). We refer to \(z_{11}, \ldots, z_{n1}\) as the scores of the first principal component. The optimization problem above can be solved via an eigendecomposition, a standard technique in linear algebra.
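Concretely, if \(\hat{\Sigma}\) denotes the sample covariance matrix of the centered data, the first loading vector is its leading eigenvector:

\[ \begin{aligned} \hat{\Sigma}\phi_{1} = \lambda_{1}\phi_{1}, \qquad \text{var}(z_{1}) = \lambda_{1} \end{aligned} \]

where \(\lambda_{1}\) is the largest eigenvalue of \(\hat{\Sigma}\) and \(\phi_{1}\) is the corresponding unit-length eigenvector (up to the \(1/n\) versus \(1/(n-1)\) convention used for \(\hat{\Sigma}\)). Subsequent principal components correspond to the remaining eigenvalue-eigenvector pairs, taken in decreasing order of eigenvalue.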

Once we have found the first principal component \(Z_{1}\), we can find the second principal component \(Z_{2}\). The second principal component is the linear combination of \(x_{1}, \ldots, x_{p}\) that has the maximal variance out of all linear combinations that are uncorrelated with the first principal component \(Z_{1}\). The second principal component takes the exact same form as the first principal component. That is, it is the linear combination

\[ \begin{aligned} z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \ldots + \phi_{p2}x_{ip} \end{aligned} \]

Once we have computed the principal components, we can begin to plot them against each other in order to generate a low-dimensional view of the data. Typically, we plot the first principal component (PC1) against the second principal component (PC2).

A simple algorithm for PCA is as follows (a short sketch of these steps in R appears after the list):

  1. Normalize the data - This is often necessary if the features in your feature set are measured in different units. A common way of normalizing the features is to convert them to z-scores.
  2. Calculate the covariance matrix of the normalized data.
  3. Compute its eigenvalues and eigenvectors.
  4. Re-orient the data by projecting it onto the eigenvectors.
  5. Plot the data (PC1 against PC2).
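Here is a minimal sketch of these steps in base R on a small hypothetical data frame X; prcomp(), which we use on the housing data below, wraps the same computation:

# a minimal sketch of the algorithm above on a hypothetical numeric data frame X
set.seed(1)
X <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

Z <- scale(X)                # 1. normalize each feature to z-scores
S <- cov(Z)                  # 2. covariance matrix of the normalized data
eig <- eigen(S)              # 3. eigenvalues and eigenvectors
scores <- Z %*% eig$vectors  # 4. re-orient the data (project it onto the eigenvectors)
plot(scores[, 1], scores[, 2], xlab = 'PC1', ylab = 'PC2')  # 5. plot PC1 against PC2

Up to sign, the columns of eig$vectors match prcomp(X, scale. = TRUE)$rotation, and scores matches prcomp(X, scale. = TRUE)$x.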

Clustering Housing Data

Let us now put theory into practice. Our data set is a housing data set containing several features for each home. Let us first load and examine it.

library(foreign)     # read.dta() for Stata files
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe
homes <- read.dta('/Users/czar.yobero/R/Data Sets/stockton2.dta')
kable(head(homes, n = 10), format = 'html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))
 price sqft beds baths age stories vacant
205000 2377    4     3   0       2      1
230000 2325    4     3   0       2      1
117000 1651    3     2   9       2      1
 80000 1213    3     1  63       1      0
116500 1800    4     2  17       1      0
 97000 1197    3     2  32       1      1
125000 1712    3     2  10       1      0
 70000 1032    3     2  33       1      0
121000 1909    3     2   3       1      1
159000 1885    4     2   1       1      0

The column names are pretty self-explanatory. Let's normalize price and sqft, since their values are much larger than those of the other features.

# normalizing price and sqft to z-scores
homes$price <- scale(homes$price)
homes$sqft <- scale(homes$sqft)
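As a quick sanity check, the two rescaled columns should now have mean approximately 0 and standard deviation 1:

# z-scored columns have mean ~0 and standard deviation 1
c(mean(homes$price), sd(homes$price))
c(mean(homes$sqft), sd(homes$sqft))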

The prcomp() function is how we will calculate the principal components and their scores, and consequently visualize the data.

# compute the principal components (all seven of them) and inspect the loadings
homes.pca <- prcomp(homes[, 1:7])
homes.pca.df <- data.frame(homes.pca$rotation)
kable(homes.pca.df, format = 'html') %>%
  kable_styling(bootstrap_options = c('hover', 'striped'))
               PC1        PC2        PC3        PC4        PC5        PC6        PC7
price   -0.0116038  0.5923804 -0.6313033 -0.0899107  0.1673657  0.4621263 -0.0274037
sqft    -0.0100850  0.6470531  0.0817893 -0.0869739  0.0381653 -0.7411454  0.1273320
beds    -0.0102876  0.2997648  0.6344413  0.2532264  0.5788865  0.3164490 -0.0902569
baths   -0.0137342  0.3274115  0.3151956 -0.0907707 -0.6713654  0.2039462 -0.5410175
age      0.9997290  0.0211650  0.0053504 -0.0030552 -0.0012183  0.0047949 -0.0056213
stories -0.0022367  0.1646639  0.2487109 -0.0553828 -0.3692521  0.3005285  0.8254041
vacant  -0.0024843 -0.0760811  0.1761160 -0.9533688  0.2198691  0.0711921 -0.0294257
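Tying this output back to the constraints from earlier: each column of the rotation matrix is a unit-length loading vector, and the scores of different components are uncorrelated. Both are easy to verify:

# each loading vector satisfies sum of squared loadings = 1
colSums(homes.pca$rotation^2)
# scores of different principal components are uncorrelated
round(cor(homes.pca$x), 3)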

Now let us retrieve the scores and plot the data.

library(ggplot2)
theme_set(theme_bw())

homes.pcaScores <- data.frame(homes.pca$x[, 1:2]) # we only need the first two principal components
ggplot(homes.pcaScores, aes(y = PC1, x = PC2)) + geom_point(col = 'tomato2')

From a set of \(p = 7\) features, we have now projected our homes data onto a low-dimensional scatter plot where essentially \(p = 2\). You can see some clusters within this scatter plot, although they are not too apparent.
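To see how much of the total variance the first two components actually retain, summary() on a prcomp object reports the proportion of variance explained by each component:

# proportion of variance explained by each principal component
summary(homes.pca)
# equivalently: homes.pca$sdev^2 / sum(homes.pca$sdev^2)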

Conclusion

Principal Component Analysis derives a low-dimensional feature set from a higher-dimensional feature set while striving to preserve as much information (i.e. variance) as possible. It can be used for feature selection and for visualizing high-dimensional data where the number of features \(p\) is large. It is an unsupervised learning technique that can be used to identify patterns, clusters, and perhaps other latent structure in the data.