Let us first generate a (centered) data set to explain the theoretical properties of principal components.
The motivation behind Principal Component Analysis (PCA) is twofold:
1. The \(1\)st principal component (PC) is the standardized linear combination (SLC) that has the maximum variance among all SLCs, and the \(j\)-th PC has the maximum variance among all SLCs that are uncorrelated with the 1st, \(\ldots\), \((j-1)\)-th PCs, \(j=2,\ldots\).
2. The plane spanned by the first \(k\) PCs has the property that, if we project the data points onto this plane, the average distance between the original points and their projections is minimal compared to any other \(k\)-dimensional hyperplane.
First, let us connect these two notions:
Consider a plane \(P\) passing through \({\bf 0}\); thus \(P\) is a subspace. To project the data points onto \(P\), we express each data point \({\bf x}\) as \({\bf y}+{\bf z}\), where \({\bf y}\) lies on \(P\) and \({\bf z}\) is orthogonal to \(P\); then \({\bf y}\) is the projection of \({\bf x}\) onto \(P\).
Now if \(B^{3\times 2}=[{\bf b}_{1}, {\bf b}_{2}]\) is an orthonormal basis of \(P\), then \({\bf y}=BB^{\prime} {\bf x}\). Being a point on the plane spanned by \(B\), \({\bf y}=u_{1}{\bf b}_{1}+u_{2}{\bf b}_{2}\), where \(u_{i}={\bf b}_{i}^{\prime} {\bf y}={\bf b}_{i}^{\prime} {\bf x}\), \(i=1,2\), which are also standardized linear combinations (SLCs) of \({\bf x}\).
The figure below shows the plane \(P\).
In the figure above, we have considered the SLC corresponding to the mean of the \(3\) variables (i.e., \({\bf b}_{1}=(1,1,1)^{\prime}/\sqrt{3}\)), and another orthogonal SLC, viz., the standardized difference of the first two variables (i.e., \({\bf b}_{2}=(1,-1,0)^{\prime}/\sqrt{2}\)).
Next we plot the projected points \({\bf y}\)s, and try to perceive the variability explained by the pair of SLCs visually. (Note that, if we rotate the 3D plot and visualize from a point which is orthogonal to \(P\), then the visible variation of the points will be due to variability in \((u_{1},u_{2})\) only.)
Observe that the variability of the projected points is considerably smaller than the variability of the original data.
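A minimal R sketch of this projection; the simulated centered data matrix `X` below is an assumption, standing in for the generated data set mentioned at the beginning.

```r
set.seed(1)
n <- 200
X <- matrix(rnorm(3 * n), n, 3) %*% matrix(c(2,   0,   0,
                                             1,   1,   0,
                                             0.5, 0.5, 0.5), 3, 3, byrow = TRUE)
X <- scale(X, center = TRUE, scale = FALSE)              # center the simulated data

B <- cbind(c(1, 1, 1) / sqrt(3), c(1, -1, 0) / sqrt(2))  # orthonormal basis of the plane P

Y <- X %*% B %*% t(B)     # projected points y = B B' x (one per row)
U <- X %*% B              # coordinates (u1, u2) of the projections on P

sum(diag(cov(X)))         # total variability of the data
sum(diag(cov(Y)))         # total variability of the projected points (never larger)
```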
Next, let us consider the first two principal component directions. Observe that the \(2\)-PC plane passes through the middle of the data cloud.
Now consider the projections of the data points on the \(2\)-PC plane.
If we rotate the plot and view it from a direction perpendicular to the \(2\)-PC plane, we can see that the variation of the projected (red) points is much higher than that of the projections on the arbitrary plane \(P\) considered before.
Further, it is also evident that the average distance of the data points from their projections is much smaller now.
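The analogous computation for the 2-PC plane, reusing `X` and `Y` from the sketch above, illustrates both properties: the projections onto the 2-PC plane retain more variance and lie closer to the original points than those on \(P\).

```r
V2   <- eigen(cov(X))$vectors[, 1:2]   # first two PC directions (orthonormal)
Y_pc <- X %*% V2 %*% t(V2)             # projections onto the 2-PC plane

sum(diag(cov(Y_pc)))                   # variance retained: maximal among planes through 0
mean(rowSums((X - Y_pc)^2))            # mean squared distance to the 2-PC plane
mean(rowSums((X - Y)^2))               # mean squared distance to the arbitrary plane P
```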
\[\\[1in]\]
The Iris flower data is a multivariate data set. The data was collected to quantify the morphologic variation of Iris flowers of three related species, viz., Iris setosa, Iris virginica and Iris versicolor. The data set consists of 50 samples from each of these three species. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The following figure shows two-dimensional scatterplots of the data, where Iris setosa, Iris virginica and Iris versicolor are indicated in red, green and blue, respectively.
From the scatter plots, the following features are clearly visible:
- The main source of variability in the data is the fact that flowers of different populations are combined together.
- The variables petal length and petal width differ the most across the populations.
Next we find the spectral decomposition of the variance-covariance matrix \(S_n\) of the Iris data \(X\). Let \(S_n=G\mathrm{Diag}(L)G^{T}\). Then the PC loading matrix \(G\) and the vector of eigenvalues \(L\) are as follows:
| | PC-1 | PC-2 | PC-3 | PC-4 |
|---|---|---|---|---|
| Sepal L. | 0.3614 | -0.6566 | 0.5820 | 0.3155 |
| Sepal W. | -0.0845 | -0.7302 | -0.5979 | -0.3197 |
| Petal L. | 0.8567 | 0.1734 | -0.0762 | -0.4798 |
| Petal W. | 0.3583 | 0.0755 | -0.5458 | 0.7537 |
| Eigenvalue |
|---|
| 4.2282417 |
| 0.2426707 |
| 0.0782095 |
| 0.0238351 |
The PC-transformed data is stored in \(Y=\left(X-{\bf 1}\bar{\bf x}^{T}\right)G\).
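A minimal R sketch of these computations using the built-in `iris` data (the eigenvectors returned by `eigen` may differ from the table above in sign):

```r
X   <- as.matrix(iris[, 1:4])          # Sepal L., Sepal W., Petal L., Petal W.
S_n <- cov(X)                          # variance-covariance matrix

eig <- eigen(S_n)                      # spectral decomposition S_n = G Diag(L) G'
G   <- eig$vectors                     # PC loading matrix
L   <- eig$values                      # eigenvalues

Y <- scale(X, center = TRUE, scale = FALSE) %*% G   # PC-transformed data (X - 1 xbar') G

round(L / sum(L) * 100, 2)             # % of variation explained by each PC
```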
Observe that \[PC_{1}=0.36\times \mathrm{Sepal~ L.} -0.08 \times \mathrm{Sepal~ W.} + 0.86\times \mathrm{Petal~ L.} + 0.36 \times \mathrm{Petal~ W.},\] and it explains \(4.23\times 100/\sum_{j} L_{j}=92.46\%\) of the variation in the data.
Similarly, \[PC_{2}=-0.66\times \mathrm{Sepal~ L.} -0.73 \times \mathrm{Sepal~ W.} + 0.17\times \mathrm{Petal~ L.} + 0.08 \times \mathrm{Petal~ W.},\] and it explains \(5.31\%\) of the variation in the data.
Plotting the data with respect to \((PC_{1},PC_{2})\) coordinates we get:
Observe that \(PC_{1}\) explains the variation across the flower populations (i.e., the flowers of different populations have different \(PC_{1}\) coordinates on average). Further, the variable Petal L. has weight \(0.86\) in \(PC_{1}\). Thus, we can conclude that the variation of the data is highest with respect to the Petal L. variable. This coincides with the fact that Petal L. is one of the most distinguishing features across the flower populations.
The following figure shows the values of \(PC_{1}\), and those of Petal L., for the sample points. Observe that the two plots look almost identical, as Petal L. dominates \(PC_{1}\).
The correlation of \(PC_{1}\) and the variable Petal L. is \(\rho_{1,3}=0.9979\). Thus, \(PC_{1}\) explains more than \(99\%\) of the variation in Petal L. Similarly, it explains \(93\%\) of the variation in Petal W., \(80\%\) of the variation in Sepal L., and only \(16\%\) of the variation in Sepal W. \(PC_{2}\), on the other hand, explains \(68\%\) of the variation in Sepal W. This is consistent with the fact that \(PC_{2}\) captures the variability which is not explained by \(PC_{1}\).
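The correlations quoted above can be computed directly from the objects in the previous sketch; the squared correlations give the proportion of each variable's variation explained by a PC.

```r
corr_PC <- cor(Y, X)               # [j, i] = correlation of PC_j with variable i
round(corr_PC[1, ], 4)             # correlations of PC1 with the four variables
round(corr_PC[1, ]^2, 2)           # proportion of each variable's variation explained by PC1
round(corr_PC[2, ]^2, 2)           # the same for PC2
```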
\[\\[1in]\]
The Boston housing data contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has been used extensively throughout the literature to benchmark algorithms. The data set has 506 cases and 14 covariates, which are
We have previously worked with Boston housing data in the context of regression, where we set MEDV as the response and the remaining covariates as regressors.
In the following analysis we drop the variable CHAS, as it is binary.
Unlike the Iris data, the 13 remaining variables differ widely in scale. The first few rows of the data and boxplots of the variables are given below:
## crim zn indus nox rm age dis rad tax ptratio b lstat medv
## 1 0.00632 18 2.31 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
## 2 0.02731 0 7.07 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
## 3 0.02729 0 7.07 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
## 4 0.03237 0 2.18 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
## 5 0.06905 0 2.18 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
## 6 0.02985 0 2.18 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
From the box plots it is clear that the variable TAX has the highest variance and B the second highest among all the variables. Moreover, this variability is largely due to the difference in scales of the variables. Therefore, if we apply the PC-transformation to this data, the first few PCs will be driven almost entirely by TAX and B.
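The source of the data is not shown here; as a minimal sketch, assuming the `Boston` data from the MASS package (where the column shown as `b` above is named `black`):

```r
library(MASS)                                        # Boston housing data

boston <- Boston[, setdiff(names(Boston), "chas")]   # drop the binary variable CHAS
names(boston)[names(boston) == "black"] <- "b"       # match the column name used above

boxplot(boston, las = 2)                             # boxplots of the 13 remaining variables
sort(sapply(boston, var), decreasing = TRUE)[1:3]    # tax and b have by far the largest variances
```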
| | PC-1 | PC-2 | PC-3 | PC-4 |
|---|---|---|---|---|
| crim | 0.0293 | 0.0067 | -0.0120 | -0.0254 |
| zn | -0.0436 | 0.0011 | 0.6322 | -0.7632 |
| indus | 0.0283 | -0.0049 | -0.0884 | 0.0131 |
| nox | 0.0004 | 0.0000 | -0.0018 | -0.0007 |
| rm | -0.0012 | 0.0004 | 0.0050 | -0.0064 |
| age | 0.0836 | -0.0057 | -0.7525 | -0.6406 |
| dis | -0.0066 | 0.0004 | 0.0447 | -0.0017 |
| rad | 0.0450 | -0.0086 | 0.0033 | 0.0186 |
| tax | 0.9494 | -0.2927 | 0.0952 | 0.0192 |
| ptratio | 0.0056 | -0.0025 | -0.0116 | 0.0329 |
| b | -0.2912 | -0.9561 | -0.0248 | -0.0032 |
| lstat | 0.0230 | 0.0058 | -0.0948 | -0.0399 |
| medv | -0.0255 | -0.0088 | 0.0734 | -0.0544 |
| Eigenvalue |
|---|
| 30910.0128 |
| 6250.8142 |
| 822.4633 |
| 267.2963 |
From the loading matrix (above), observe that \(PC_{1}\), which explains \(80\%\) of the total variability (TV) of the data, is almost entirely driven by TAX (in the SLC the coefficient of TAX is \(0.95\), and that of B is \(-0.29\)), and \(PC_{2}\), which explains \(16\%\) of TV, is almost entirely driven by B. However, a mere change of scales of these two variables may change the entire scenario. Therefore, we standardize the variables of this data set, and then apply the PC-transformation.
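A sketch of the standardized analysis, continuing with the `boston` object assumed above:

```r
Z <- scale(boston)                    # standardize: every variable has mean 0 and variance 1

eigZ  <- eigen(cov(Z))                # spectral decomposition of the correlation matrix
G_std <- eigZ$vectors                 # loadings of the standardized variables
L_std <- eigZ$values                  # eigenvalues (they sum to 13)

Y_std <- Z %*% G_std                  # PC scores of the standardized data
round(L_std / sum(L_std) * 100, 2)    # % of total variation explained by each PC
```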
| | PC-1 | PC-2 | PC-3 | PC-4 |
|---|---|---|---|---|
| crim | 0.2422 | -0.0117 | 0.4087 | -0.0625 |
| zn | -0.2455 | -0.1118 | 0.4343 | -0.3014 |
| indus | 0.3319 | 0.1160 | -0.0876 | 0.0186 |
| nox | 0.3253 | 0.2589 | -0.0980 | -0.1934 |
| rm | -0.2027 | 0.5331 | 0.2477 | 0.1853 |
| age | 0.2971 | 0.2504 | -0.2585 | -0.0753 |
| dis | -0.2983 | -0.3683 | 0.2399 | -0.0234 |
| rad | 0.3034 | 0.0893 | 0.4145 | 0.2131 |
| tax | 0.3240 | 0.0602 | 0.3409 | 0.1442 |
| ptratio | 0.2076 | -0.3293 | 0.0637 | 0.7045 |
| b | -0.1966 | -0.0308 | -0.3630 | 0.4009 |
| lstat | 0.3114 | -0.2458 | -0.1126 | -0.2885 |
| medv | -0.2665 | 0.4929 | 0.0699 | 0.1432 |
| Eigenvalue |
|---|
| 6.5458499 |
| 1.5226618 |
| 1.3357904 |
| 0.8640037 |
Now every variable has roughly equal weight in \(PC_{1}\); B has somewhat less weight than the others.
How many PCs should we consider? The scree plot and the table of percentage of variation explained by the PCs are given below.
| | Eigenvalues | % explained | cumulative % explained |
|---|---|---|---|
| PC1 | 6.5458499 | 50.3526914 | 50.35269 |
| PC2 | 1.5226618 | 11.7127828 | 62.06547 |
| PC3 | 1.3357904 | 10.2753105 | 72.34078 |
| PC4 | 0.8640037 | 6.6461825 | 78.98697 |
| PC5 | 0.6667516 | 5.1288582 | 84.11583 |
| PC6 | 0.5374569 | 4.1342835 | 88.25011 |
| PC7 | 0.4036395 | 3.1049196 | 91.35503 |
| PC8 | 0.2775037 | 2.1346438 | 93.48967 |
| PC9 | 0.2534452 | 1.9495784 | 95.43925 |
| PC10 | 0.2128616 | 1.6373970 | 97.07665 |
| PC11 | 0.1832646 | 1.4097276 | 98.48638 |
| PC12 | 0.1359778 | 1.0459834 | 99.53236 |
| PC13 | 0.0607934 | 0.4676412 | 100.00000 |
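For reference, the scree plot and the percentages in the table above can be reproduced from the eigenvalues computed in the earlier sketch (object `L_std`):

```r
# scree plot of the eigenvalues of the standardized data
plot(L_std, type = "b", xlab = "Principal component", ylab = "Eigenvalue",
     main = "Scree plot")

# % explained and cumulative % explained by each PC
pct <- L_std / sum(L_std) * 100
round(cbind(Eigenvalue = L_std, `% explained` = pct, `cumulative %` = cumsum(pct)), 4)
```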
The first three PCs explain \(72\%\) of the total variation. The next table provides the \(\%\) of variation of each variable explained by the first 3 PCs.
| | PC1 | PC2 | PC3 | TV explained (%) |
|---|---|---|---|---|
| crim | 0.3841135 | 0.0002092 | 0.2231218 | 60.74445 |
| zn | 0.3944870 | 0.0190460 | 0.2519316 | 66.54645 |
| indus | 0.7212056 | 0.0205040 | 0.0102554 | 75.19650 |
| nox | 0.6926610 | 0.1020919 | 0.0128212 | 80.75740 |
| rm | 0.2690198 | 0.4326675 | 0.0819905 | 78.36777 |
| age | 0.5776919 | 0.0954678 | 0.0892449 | 76.24047 |
| dis | 0.5824077 | 0.2065645 | 0.0768488 | 86.58210 |
| rad | 0.6026166 | 0.0121513 | 0.2294577 | 84.42255 |
| tax | 0.6872189 | 0.0055205 | 0.1552696 | 84.80091 |
| ptratio | 0.2820251 | 0.1650755 | 0.0054192 | 45.25199 |
| b | 0.2530094 | 0.0014443 | 0.1759736 | 43.04273 |
| lstat | 0.6345641 | 0.0919926 | 0.0169226 | 74.34793 |
| medv | 0.4648292 | 0.3699268 | 0.0065335 | 84.12894 |
Observe that about \(40\%\)–\(90\%\) of the variation of each variable is explained by \(PC_{1}\)–\(PC_{3}\). Thus, it seems reasonable to consider the first three PCs.
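The entries of this table are squared correlations between the PCs and the (standardized) variables; a minimal sketch, reusing `Y_std` and `Z` from the earlier code:

```r
# squared correlations: proportion of each variable's variation explained by each PC
R2 <- cor(Y_std, Z)^2                # rows = PCs, columns = variables
round(t(R2[1:3, ]), 4)               # contributions of PC1-PC3 to each variable
round(colSums(R2[1:3, ]) * 100, 2)   # total % of each variable explained by PC1-PC3
```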
Next, we plot the correlations of the covariates with the first 2 PCs.
Almost every variable is highly correlated with PC1. PC1 is positively correlated with CRIM, AGE, NOX, RAD, TAX, PTRATIO, LSTAT, INDUS, and negatively correlated with MEDV, RM, B, ZN, DIS. The smallest correlation (in absolute value) is about \(0.5\). The first PC axis could be interpreted as a quality-of-life and housing indicator (lower is better), as low values of CRIM, NOX, TAX, PTRATIO and high values of MEDV, RM indicate a high quality of living. Similarly, the second PC can be interpreted as a social factor.
Finally, we demonstrate the appropriateness of principal component regression (PCR) using this data set. To do so, we drop the variable MEDV from the data and calculate the PCs.
We split the samples of houses into two parts based on the variable MEDV:
- houses with higher prices (with price higher than the median price, indicated in red),
- houses with lower prices (with price lower than the median price, indicated in blue).
The following plot explains the association of MEDV with the first two PCs.
Observe in the first plot that almost all the blue (red) points are in the positive (negative) part of the \(PC_{1}\) axis. This indicates that the first PC has a high (negative) correlation with MEDV, since the average price of houses is higher in locations where the quality of life is higher.
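A sketch of this demonstration, again using the assumed `boston` object; note that the sign of a PC is arbitrary, so the red and blue clouds may appear on either side of the \(PC_{1}\) axis.

```r
medv <- boston$medv
Z0   <- scale(boston[, names(boston) != "medv"])    # drop MEDV, standardize the rest

scores <- Z0 %*% eigen(cov(Z0))$vectors             # PC scores computed without MEDV

grp <- ifelse(medv > median(medv), "red", "blue")   # higher- vs lower-priced houses
plot(scores[, 1], scores[, 2], col = grp, xlab = "PC1", ylab = "PC2",
     main = "Houses in the first two PC coordinates")
```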
\[\\[1in]\]
Next, we demonstrate the PC projection using the USPS data.
USPS data: The US Postal Service (USPS) handwritten digit dataset is derived from a project on recognizing handwritten digits on envelopes. The digits were downscaled to \(16\times 16\) pixels at a 1:1 aspect ratio.
There are 11000 samples of handwritten digits in the USPS data. Each sample is \(16\times 16=256\)-dimensional, the variables indicating the gray scale index (GSI, ranging from 0 to 255) of the pixels: the first variable gives the GSI of the \((1,1)\)-th pixel, the second variable that of the \((1,2)\)-th pixel, and so on. Further, GSI \(=0\) indicates black and GSI \(=255\) indicates white.
The following figure shows 4 samples from USPS data.
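The USPS data is not bundled with R, so the object name `usps` below (an \(n\times 256\) matrix with one sample per row) is an assumption; a minimal sketch for rendering one sample as an image:

```r
# 'usps' is assumed to be an n x 256 matrix of gray scale indices (0 = black, 255 = white)
show_digit <- function(x) {
  img <- matrix(x, nrow = 16, ncol = 16, byrow = TRUE)   # back to a 16 x 16 pixel grid
  # transpose and flip the rows so the digit is displayed upright
  image(t(img)[, 16:1], col = gray(seq(0, 1, length.out = 256)), axes = FALSE)
}
show_digit(usps[1, ])        # display the first sample
```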
The dimension of the USPS data is high, and therefore we apply PCA to reduce the dimension of the data in such a way that the least information is lost.
Note that the samples of each digit correspond to a separate population. In the following part we only consider the samples corresponding to the digit \(5\).
Note that the variables are on the same scale, so we do not standardize the data.
First we choose a suitable number of PCs, \(k\).
From the scree plot, it is clear that the first \(50\) PCs explain a large part of the variance. The following table makes this clearer.
| | Eigenvalues | % explained | cumulative % explained |
|---|---|---|---|
| PC1 | 196840.441 | 12.7347946 | 12.73479 |
| PC2 | 156401.493 | 10.1185553 | 22.85335 |
| PC3 | 129618.194 | 8.3857823 | 31.23913 |
| PC4 | 70855.329 | 4.5840583 | 35.82319 |
| PC5 | 66893.686 | 4.3277558 | 40.15095 |
| PC6 | 58473.028 | 3.7829726 | 43.93392 |
| PC7 | 55338.250 | 3.5801650 | 47.51408 |
| PC8 | 48515.359 | 3.1387510 | 50.65283 |
| PC9 | 41523.842 | 2.6864276 | 53.33926 |
| PC10 | 36178.801 | 2.3406247 | 55.67989 |
| PC11 | 35664.586 | 2.3073570 | 57.98724 |
| PC12 | 29132.953 | 1.8847863 | 59.87203 |
| PC13 | 28226.146 | 1.8261195 | 61.69815 |
| PC14 | 26636.820 | 1.7232964 | 63.42145 |
| PC15 | 24795.064 | 1.6041421 | 65.02559 |
| PC16 | 23883.069 | 1.5451397 | 66.57073 |
| PC17 | 21401.286 | 1.3845782 | 67.95531 |
| PC18 | 18605.880 | 1.2037265 | 69.15903 |
| PC19 | 18351.184 | 1.1872487 | 70.34628 |
| PC20 | 16911.933 | 1.0941349 | 71.44042 |
| PC21 | 16139.204 | 1.0441424 | 72.48456 |
| PC22 | 14666.421 | 0.9488592 | 73.43342 |
| PC23 | 14069.703 | 0.9102539 | 74.34367 |
| PC24 | 12801.424 | 0.8282013 | 75.17187 |
| PC25 | 12720.581 | 0.8229711 | 75.99484 |
| PC26 | 11922.122 | 0.7713139 | 76.76616 |
| PC27 | 11221.789 | 0.7260052 | 77.49216 |
| PC28 | 10893.494 | 0.7047658 | 78.19693 |
| PC29 | 10320.463 | 0.6676929 | 78.86462 |
| PC30 | 10030.269 | 0.6489186 | 79.51354 |
| PC31 | 9858.062 | 0.6377774 | 80.15132 |
| PC32 | 9474.849 | 0.6129851 | 80.76430 |
| PC33 | 8755.733 | 0.5664611 | 81.33076 |
| PC34 | 8562.854 | 0.5539827 | 81.88475 |
| PC35 | 8016.228 | 0.5186181 | 82.40337 |
| PC36 | 7634.492 | 0.4939213 | 82.89729 |
| PC37 | 7458.022 | 0.4825044 | 83.37979 |
| PC38 | 7158.446 | 0.4631230 | 83.84291 |
| PC39 | 6837.811 | 0.4423792 | 84.28529 |
| PC40 | 6643.433 | 0.4298037 | 84.71510 |
| PC41 | 6203.069 | 0.4013140 | 85.11641 |
| PC42 | 6182.841 | 0.4000052 | 85.51642 |
| PC43 | 5879.942 | 0.3804089 | 85.89683 |
| PC44 | 5749.885 | 0.3719947 | 86.26882 |
| PC45 | 5658.287 | 0.3660687 | 86.63489 |
| PC46 | 5345.777 | 0.3458505 | 86.98074 |
| PC47 | 5207.048 | 0.3368753 | 87.31761 |
| PC48 | 5051.979 | 0.3268429 | 87.64446 |
| PC49 | 4682.811 | 0.3029593 | 87.94742 |
| PC50 | 4610.168 | 0.2982596 | 88.24568 |
We set \(k=30\), as it explains about \(80\%\) of the total variation. The PC projections of the sample points, defined as \[\hat{\bf x}^{*}=\boldsymbol{\mu}+\sum_{j=1}^{k} {\bf v}_{(j)} {\bf v}_{(j)}^{T} \left( {\bf x}-\boldsymbol{\mu}\right), \] where \(\boldsymbol{\mu}\) is the mean vector and \({\bf v}_{(j)}\) is the \(j\)-th eigenvector of the sample covariance matrix, are stored in the \(n\times p\) matrix \(\hat{\bf X}^{*}\).
Note that \(\hat{\bf x}^{*}\), defined above, is the projection of \({\bf x}\) onto a \(k\)-dimensional (affine) subspace. However, the point \(\hat{\bf x}^{*}\) itself lies in the ambient space, i.e., in the \(p=256\)-dimensional space. So it is possible to compare the original points \(({\bf x})\) with the projected points \((\hat{\bf x}^{*})\).
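A sketch of this projection for \(k=30\); the object name `X5` (the samples of the digit 5 in its rows) is an assumption, as the data-loading code is not shown here.

```r
# X5: n x 256 matrix containing the samples of the digit 5 (assumed to be loaded already)
k  <- 30
mu <- colMeans(X5)
V  <- eigen(cov(X5))$vectors[, 1:k]                    # first k eigenvectors v_(1), ..., v_(k)

Xc     <- sweep(X5, 2, mu)                             # centered data
X5_hat <- sweep(Xc %*% V %*% t(V), 2, mu, FUN = "+")   # xhat* = mu + V V' (x - mu), row-wise
```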
The following figure shows four different samples (in upper panel) and their projections (lower panel).
Observe that, even after dropping \(226\) dimensions, the main features of the data are still present in the projected points. So, the digits can be recognized clearly.
\[\\[1in]\]
The AMSST data set contains 15048 rows, representing that many spatial locations at a resolution of 0.25° by 0.25°, and 38 columns, representing the years 1982-2019. The \((s,t)\)-th entry gives the annual mean surface temperature at the \(s\)-th spatial location in the \(t\)-th year.
We first consider the SVD of the data and obtain the empirical orthogonal functions (EOFs) and the time-PCs.
The first PC explains more than \(50\%\) of the variability of the data.
## [1] "The MSE of the rank-1 approximation is 0.153641585580673"
Next, we choose the number of principal components \(K\) in such a way that at least \(80\%\) of the total variability in the data is explained.
## [1] "The MSE of the rank-5 approximation is 0.103208089192851"
Let us now understand the main sources of variability, i.e., the major directions in which the data is most variable.