Synthetic Data Analysis

Motivation of PCA

Let us first generate a (centered) data set to explain the theoretical properties of principal components.
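A minimal sketch of how such a data set might be generated in R is given below, assuming a trivariate normal with covariance \(\Sigma=VDV^{\prime}\), \(D=\mathrm{Diag}(10,5,1)\), an arbitrary orthogonal \(V\), and sample size \(n=500\) (the same setup used in the later sections).

```r
# Sketch only: generate n = 500 centered observations from N(0, Sigma),
# where Sigma = V D V' with D = diag(10, 5, 1) and V an arbitrary orthogonal matrix.
set.seed(1)
n <- 500
D <- diag(c(10, 5, 1))
V <- qr.Q(qr(matrix(rnorm(9), 3, 3)))        # an orthogonal 3 x 3 matrix
Sigma <- V %*% D %*% t(V)
X <- MASS::mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
X <- scale(X, center = TRUE, scale = FALSE)  # center the sample
```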

The motivation of Principal Component Analysis (PCA) is twofold:

  1. The \(1\)st principal component (PC) is the standardized linear combination (SLC) that has the maximum variance among all SLCs, and the \(j\)-th PC has the maximum variance among all SLCs that are uncorrelated with the \(1\)st, \(\ldots\), \((j-1)\)-th PCs, \(j=2,\ldots\).

  2. The plane spanned by the first \(k\) PCs has the property that if we project the data points onto it, the average (squared) distance between the original points and their projections is minimized over all \(k\)-dimensional hyperplanes.

First, let us connect these two notions:

  • Consider a plane \(P\) passing through \({\bf 0}\); thus, \(P\) is a subspace. To project the data points on \(P\), we express each data point \({\bf x}\) as \({\bf y}+{\bf z}\) where \({\bf y}\) is on \(P\) and \({\bf z}\) is orthogonal to \(P\), and \({\bf y}\) is the projection of \({\bf x}\) on \(P\).

  • Now if \(B^{3\times 2}=[{\bf b}_{1}, {\bf b}_{2}]\) is an orthonormal basis of \(P\), then \({\bf y}=BB^{\prime} {\bf x}\). Being a point on the plane spanned by \(B\), \({\bf y}=u_{1}{\bf b}_{1}+u_{2}{\bf b}_{2}\), where \(u_{i}={\bf b}_{i}^{\prime} {\bf y}={\bf b}_{i}^{\prime} {\bf x}\), \(i=1,2\), which are also standardized linear combinations (SLCs) of \({\bf x}\).

The figure below shows the plane \(P\).

In the figure above, we have considered the SLC corresponding to the mean of the \(3\) variables (i.e., \({\bf b}_{1}=(1,1,1)^{\prime}/\sqrt{3}\)), and another orthogonal SLC, viz., \(x-y\) (i.e., \({\bf b}_{2}=(1,-1,0)^{\prime}/\sqrt{2}\)).
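The projection described above can be computed directly. The following sketch (reusing the centered data matrix \(X\) generated earlier) forms \(B=[{\bf b}_{1},{\bf b}_{2}]\) and obtains both the projections \({\bf y}=BB^{\prime}{\bf x}\) and the plane coordinates \((u_{1},u_{2})\).

```r
b1 <- c(1, 1, 1) / sqrt(3)
b2 <- c(1, -1, 0) / sqrt(2)
B  <- cbind(b1, b2)          # orthonormal basis of the plane P
Yp <- X %*% B %*% t(B)       # projected points y, in the original 3-d coordinates
U  <- X %*% B                # coordinates (u1, u2) of the projections on P
```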

Next we plot the projected points \({\bf y}\)s, and try to perceive the variability explained by the pair of SLCs visually. (Note that, if we rotate the 3D plot and visualize from a point which is orthogonal to \(P\), then the visible variation of the points will be due to variability in \((u_{1},u_{2})\) only.)

Observe that the variability of the projected points is considerably smaller than the variability of the original data.

Next, let us consider the first two principal component directions. Observe that the \(2\)-PC plane passes through the middle of the data cloud.

Now consider the projections of the data points on the \(2\)-PC plane.

If we rotate the plot and view it from a direction perpendicular to the \(2\)-PC plane, we can see that the variation of the projected (red) points is much higher than that of the projections onto the arbitrary plane \(P\) considered before.

Further, it is also evident that the average distance of the data points from their projections is much smaller now.



Asymptotic properties of the PCs

In the previous section, we considered a random sample of size \(n=500\), and projected the sample points onto the population 2-PC plane. Next we will see how the sample 2-PC plane differs from sample to sample, and how it converges to the population 2-PC plane as \(n\rightarrow\infty\).

Note that, in this synthetic framework, the population eigenvalues are all distinct and positive.

Sampling fluctuations:

To visualize the sampling fluctuations we consider \(3\) different samples of size \(20\) from the normal population with mean \((0,0,0)^{\prime}\) and \(\Sigma= VDV^{\prime}\) where \(D=\mathrm{Diag}(10,5,1)\) and \(V\) is an appropriate orthogonal matrix.

The population 2-PC plane is indicated in green, and the \(3\) samples and their corresponding sample 2-PC planes are indicated in red, blue and orange, respectively.

We calculate the sample PCs and superimpose the sample 2-PC planes.
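A sketch of this computation, assuming the \(\Sigma\) constructed earlier: each sample 2-PC plane is spanned by the first two eigenvectors of the corresponding sample covariance matrix.

```r
samples <- lapply(1:3, function(i)
  MASS::mvrnorm(20, mu = rep(0, 3), Sigma = Sigma))
sample_planes <- lapply(samples, function(S)
  eigen(cov(S))$vectors[, 1:2])   # 3 x 2 orthonormal basis of each sample 2-PC plane
```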


Consistency

Next we take 3 choices of \(n\), viz., \(n=20,60\) and \(100\), and see how the sample 2-PC plane converges to its population counterpart as \(n\) increases.

The first plot compares the population 2-PC plot, and a sample 2-PC plot for \(n=20\).

The second plot compares the population 2-PC plot, and a sample 2-PC plot for \(n=60\).

The third plot compares the population 2-PC plot, and a sample 2-PC plot for \(n=100\).

Observe that for \(n=100\), the sample 2-PC plane is much closer to the population 2-PC plane in the region around the data points. As both planes extend up to infinity, there might be other regions where these two planes are quite far apart. However, as there are no data points in those regions, they are not of interest.
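One way to quantify this convergence (not shown in the plots above) is via the principal angles between the population and sample 2-PC planes; the sketch below computes the largest principal angle for the three sample sizes, reusing the \(\Sigma\) defined earlier.

```r
V_pop <- eigen(Sigma)$vectors[, 1:2]          # population 2-PC plane
largest_angle <- function(n) {
  S      <- MASS::mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
  V_samp <- eigen(cov(S))$vectors[, 1:2]
  s      <- svd(t(V_pop) %*% V_samp)$d        # cosines of the principal angles
  max(acos(pmin(s, 1)))                       # largest principal angle (radians)
}
sapply(c(20, 60, 100), largest_angle)         # typically shrinks towards 0 as n grows
```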



Real Data Analysis

Iris Data

The Iris flower data is a multivariate data set. The data was collected to quantify the morphologic variation of Iris flowers of three related species, viz., Iris setosa, Iris virginica and Iris versicolor. The data set consists of 50 samples from each of these three species. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The following figure shows two dimensional scatterplots of the data, where Iris setosa, Iris virginica and Iris versicolor are indicated in red, green and blue, respectively.

From the scatter plots, the following features are clearly visible:

  1. The main source of variability in the data is due to the fact that flowers of different populations are combined together.

  2. The variables petal length and petal width are most different across populations.

Next we find the spectral decomposition of the variance-covariance matrix \(S_n\) of the Iris data \(X\). Let \(S_n=G\mathrm{Diag}(L)G^{T}\). Then the PC loading matrix \(G\) and the vector of eigenvalues are as follows:
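A sketch of how this decomposition can be obtained in R, using the built-in `iris` data:

```r
X   <- as.matrix(iris[, 1:4])   # Sepal L., Sepal W., Petal L., Petal W.
Sn  <- cov(X)                   # sample variance-covariance matrix
eig <- eigen(Sn)
G   <- eig$vectors              # PC loading matrix (columns = PC directions)
L   <- eig$values               # eigenvalues, in decreasing order
```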

Loading matrix

              PC-1      PC-2      PC-3      PC-4
Sepal L.    0.3614   -0.6566    0.5820    0.3155
Sepal W.   -0.0845   -0.7302   -0.5979   -0.3197
Petal L.    0.8567    0.1734   -0.0762   -0.4798
Petal W.    0.3583    0.0755   -0.5458    0.7537

Eigenvalues: 4.2282417, 0.2426707, 0.0782095, 0.0238351

The PC-transformed data are stored in \(Y=\left(X-{\bf 1}\bar{\bf x}^{T}\right)G\).
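In R this is simply the centered data matrix post-multiplied by the loading matrix; a sketch:

```r
Y <- scale(X, center = TRUE, scale = FALSE) %*% G   # PC scores
round(diag(cov(Y)), 4)   # the variances of the PCs reproduce the eigenvalues L
```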

Observe that \[PC_{1}=0.36\times \mathrm{Sepal~ L.} -0.08 \times \mathrm{Sepal~ W.} + 0.86\times \mathrm{Petal~ L.} + 0.36 \times \mathrm{Petal~ W.},\] and it explains \(4.23\times 100/\sum_{j} L_{j}=92.46\%\) of the variation of the data.

Similarly, \[PC_{2}=-0.66\times \mathrm{Sepal~ L.} -0.73 \times \mathrm{Sepal~ W.} + 0.17\times \mathrm{Petal~ L.} + 0.08 \times \mathrm{Petal~ W.},\] and it explains \(5.31\%\) variation of the data.

Plotting the data with respect to \((PC_{1},PC_{2})\) coordinates we get:

Observe that \(PC_{1}\) explains the variation across the flower populations (i.e., the flowers of different populations have different \(PC_{1}\) coordinates on average). Further, the variable Petal L. has weight \(0.86\) in \(PC_{1}\). Thus, we can conclude that the variation of the data is highest with respect to the Petal L. variable. This coincides with the fact that Petal L. is one of the most distinguishing features across the flower populations.

The following figure shows the values of \(PC_{1}\), and those of Petal L., for the sample points. Observe that the two plots look almost identical, as Petal L. largely dominates \(PC_{1}\).

The correlation of \(PC_{1}\) and the variable Petal L. is \(\rho_{1,3}=0.9979\). Thus, \(PC_{1}\) explains more than \(99\%\) of the variation in Petal L. Similarly, it explains \(93\%\) of the variation in Petal W., \(80\%\) of the variation in Sepal L., and only \(16\%\) of the variation in Sepal W. \(PC_{2}\), on the other hand, explains \(68\%\) of the variation in Sepal W. This is consistent with the fact that \(PC_{2}\) captures the variability which is not explained by \(PC_{1}\).
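These percentages are the squared correlations between the PCs and the original variables; a sketch of how they can be computed:

```r
R2 <- cor(Y, X)^2        # rows = PCs, columns = original variables
round(R2[1, ], 2)        # share of each variable's variance explained by PC_1
round(R2[2, ], 2)        # share explained by PC_2
```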


Boston Housing Data

The Boston housing data contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has been used extensively throughout the literature to benchmark algorithms. The dataset has 506 cases and 14 variables, which are:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - \(1000(B_{k} - 0.63)^2\) where \(B_k\) is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000’s

We have previously worked with Boston housing data in the context of regression, where we set MEDV as the response and the remaining covariates as regressors.

In the following analysis we drop the variable CHAS, as it is binary.

Unlike the Iris data, the 13 remaining variables differ considerably in scale. The first few rows of the data and boxplots of the variables are given below:

##      crim zn indus   nox    rm  age    dis rad tax ptratio      b lstat medv
## 1 0.00632 18  2.31 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
## 2 0.02731  0  7.07 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
## 3 0.02729  0  7.07 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
## 4 0.03237  0  2.18 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
## 5 0.06905  0  2.18 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
## 6 0.02985  0  2.18 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

From the box plots it is clear that the variable TAX has the highest variance, and B the second highest, among all the variables. Moreover, this variability is largely due to the difference in scales of the variables. Therefore, if we apply the PC-transformation to this data, the first few PCs will be driven almost entirely by TAX and B.
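A sketch of the PC-transformation on the raw (unstandardized) data, assuming `boston` is the 13-column data frame whose first rows are printed above (CHAS already dropped):

```r
Xb   <- as.matrix(boston)
eigB <- eigen(cov(Xb))
round(eigB$vectors[, 1:4], 4)   # loadings of the first four PCs
round(eigB$values[1:4], 4)      # the corresponding eigenvalues
```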

Loading matrix

             PC-1      PC-2      PC-3      PC-4
crim       0.0293    0.0067   -0.0120   -0.0254
zn        -0.0436    0.0011    0.6322   -0.7632
indus      0.0283   -0.0049   -0.0884    0.0131
nox        0.0004    0.0000   -0.0018   -0.0007
rm        -0.0012    0.0004    0.0050   -0.0064
age        0.0836   -0.0057   -0.7525   -0.6406
dis       -0.0066    0.0004    0.0447   -0.0017
rad        0.0450   -0.0086    0.0033    0.0186
tax        0.9494   -0.2927    0.0952    0.0192
ptratio    0.0056   -0.0025   -0.0116    0.0329
b         -0.2912   -0.9561   -0.0248   -0.0032
lstat      0.0230    0.0058   -0.0948   -0.0399
medv      -0.0255   -0.0088    0.0734   -0.0544

Eigenvalues: 30910.0128, 6250.8142, 822.4633, 267.2963

From the loading matrix (above), observe that \(PC_{1}\), which explains \(80\%\) of the total variability (TV) of the data, is almost entirely driven by TAX (in the SLC the coefficient of TAX is \(0.95\), and that of B is \(-0.29\)), and \(PC_{2}\), which explains \(16\%\) of TV, is almost entirely driven by B. However, a mere change of scales of these two variables may change the entire scenario. Therefore, we standardize the variables of this data set, and then apply the PC-transformation.
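A sketch of the standardized analysis: scale each variable to unit variance and repeat the eigen-decomposition, which is equivalent to decomposing the correlation matrix.

```r
Zb   <- scale(Xb)          # every column now has mean 0 and variance 1
eigZ <- eigen(cov(Zb))     # same as eigen(cor(Xb))
round(eigZ$vectors[, 1:4], 4)
round(eigZ$values[1:4], 4)
```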

Loading matrix

             PC-1      PC-2      PC-3      PC-4
crim       0.2422   -0.0117    0.4087   -0.0625
zn        -0.2455   -0.1118    0.4343   -0.3014
indus      0.3319    0.1160   -0.0876    0.0186
nox        0.3253    0.2589   -0.0980   -0.1934
rm        -0.2027    0.5331    0.2477    0.1853
age        0.2971    0.2504   -0.2585   -0.0753
dis       -0.2983   -0.3683    0.2399   -0.0234
rad        0.3034    0.0893    0.4145    0.2131
tax        0.3240    0.0602    0.3409    0.1442
ptratio    0.2076   -0.3293    0.0637    0.7045
b         -0.1966   -0.0308   -0.3630    0.4009
lstat      0.3114   -0.2458   -0.1126   -0.2885
medv      -0.2665    0.4929    0.0699    0.1432

Eigenvalues: 6.5458499, 1.5226618, 1.3357904, 0.8640037

Now every variable has roughly equal weight in \(PC_{1}\); B has somewhat less weight than the others.

How many PCs should we consider? The scree plot and the table of percentage of variation explained by the PCs are given below.
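A sketch of how the scree plot and the table can be produced from the eigenvalues of the standardized data:

```r
L <- eigZ$values
plot(L, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
data.frame(Eigenvalue    = round(L, 7),
           Pct           = round(100 * L / sum(L), 7),
           CumulativePct = round(cumsum(100 * L / sum(L)), 5))
```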

Variation explained

        Eigenvalue   % explained   Cumulative % explained
PC1      6.5458499    50.3526914                 50.35269
PC2      1.5226618    11.7127828                 62.06547
PC3      1.3357904    10.2753105                 72.34078
PC4      0.8640037     6.6461825                 78.98697
PC5      0.6667516     5.1288582                 84.11583
PC6      0.5374569     4.1342835                 88.25011
PC7      0.4036395     3.1049196                 91.35503
PC8      0.2775037     2.1346438                 93.48967
PC9      0.2534452     1.9495784                 95.43925
PC10     0.2128616     1.6373970                 97.07665
PC11     0.1832646     1.4097276                 98.48638
PC12     0.1359778     1.0459834                 99.53236
PC13     0.0607934     0.4676412                100.00000

The first three PCs explain \(72\%\) of the total variation. The next table provides the \(\%\) of variation of each variable explained by the first 3 PCs.
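As before, these are squared correlations; a sketch of the computation on the standardized data:

```r
Yz <- Zb %*% eigZ$vectors          # PC scores of the standardized data
R2 <- cor(Yz, Zb)^2                # rows = PCs, columns = variables
round(t(R2[1:3, ]), 3)             # contributions of PC1-PC3 for each variable
round(100 * colSums(R2[1:3, ]), 2) # total % of each variable explained by PC1-PC3
```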

Variation explained

              PC1         PC2         PC3   TV explained (%)
crim    0.3841135   0.0002092   0.2231218           60.74445
zn      0.3944870   0.0190460   0.2519316           66.54645
indus   0.7212056   0.0205040   0.0102554           75.19650
nox     0.6926610   0.1020919   0.0128212           80.75740
rm      0.2690198   0.4326675   0.0819905           78.36777
age     0.5776919   0.0954678   0.0892449           76.24047
dis     0.5824077   0.2065645   0.0768488           86.58210
rad     0.6026166   0.0121513   0.2294577           84.42255
tax     0.6872189   0.0055205   0.1552696           84.80091
ptratio 0.2820251   0.1650755   0.0054192           45.25199
b       0.2530094   0.0014443   0.1759736           43.04273
lstat   0.6345641   0.0919926   0.0169226           74.34793
medv    0.4648292   0.3699268   0.0065335           84.12894

Observe that about \(40\%\)–\(90\%\) of the variation of each variable is explained by \(PC_{1}\)–\(PC_{3}\). Thus, it seems reasonable to consider the first three PCs.

Next, we plot the correlations of the covariates with first 2 PCs.

Almost every variable is highly correlated with \(PC_{1}\): it is positively correlated with CRIM, AGE, NOX, RAD, TAX, PTRATIO, LSTAT and INDUS, and negatively correlated with MEDV, RM, B, ZN and DIS. The smallest correlation in absolute value is about \(0.5\). The first PC axis could be interpreted as a quality-of-life and housing indicator (lower is better), as low values of CRIM, NOX, TAX and PTRATIO and high values of MEDV and RM indicate a high quality of living. Similarly, the second PC can be interpreted as a social factor.

Principal Component Regression

Finally, we demonstrate the appropriateness of principal component regression (PCR) using this data set. To do so, we drop the variable MEDV from the data and calculate the PCs of the remaining covariates.

We split the samples of houses into two parts based on the variable MEDV:

  1. houses with higher prices (with price higher than the median price, indicated in red)

  2. houses with lower prices (with price lower than the median price, indicated in blue).

The following plot explains the association of MEDV with the first two PCs.
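A sketch of how this plot can be produced: recompute the PCs of the standardized covariates without MEDV, and colour the first two PC scores by the median split of MEDV (again assuming the `boston` data frame used above).

```r
Z2  <- scale(as.matrix(boston[, setdiff(names(boston), "medv")]))
pc2 <- Z2 %*% eigen(cov(Z2))$vectors[, 1:2]
grp <- ifelse(boston$medv > median(boston$medv), "red", "blue")  # high / low price
plot(pc2, col = grp, pch = 16, xlab = "PC1", ylab = "PC2")
```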

Observe in the first plot that almost all the blue (red) points are in the positive (negative) part of the \(PC_{1}\) axis. This indicates that the first PC has a strong (negative) correlation with MEDV, as the average price of houses is high in locations where the quality of life is high.


USPS Data

Finally, we demonstrate PC projection with USPS data.

USPS data: The US Postal (USPS) handwritten digit dataset is derived from a project on recognizing handwritten digits on envelopes. The digits were downscaled to \(16\times 16\) pixels at a 1:1 aspect ratio.

There are 11000 samples of handwritten digits in the USPS data. Each sample is \(16\times 16=256\)-dimensional, giving the gray-scale index (GSI, ranging from 0 to 255) of each pixel: the first variable is the GSI of the \((1,1)\)-th pixel, the second variable that of the \((1,2)\)-th pixel, and so on. Further, GSI\(=0\) indicates black and GSI\(=255\) indicates white.

The following figure shows 4 samples from USPS data.
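A sketch of how such an image can be drawn, assuming `usps` is the \(11000\times 256\) matrix of gray-scale values in the row-major pixel order described above (the orientation handling is only indicative):

```r
show_digit <- function(x) {
  img <- matrix(x, nrow = 16, ncol = 16, byrow = TRUE)  # pixel (i, j) in row i, column j
  image(t(img[16:1, ]), col = gray(seq(0, 1, length.out = 256)), axes = FALSE)
}
show_digit(usps[1, ])   # display the first sample
```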

The dimension of the USPS data is high, and therefore we apply PCA to reduce the dimension of the data in a way that loses as little information as possible.

Note that the samples of each digit correspond to a separate population. In what follows we only consider the samples corresponding to the digit \(5\).

Note that the variables are on the same scale, so we do not standardize the data.

First we choose a suitable number of PCs, \(k\).

From the scree plot, it is clear that the first \(50\) PCs explain a large part of the variance. The following table makes this clearer.

Variation explained

        Eigenvalue    % explained    Cumulative % explained
PC1 196840.441 12.7347946 12.73479
PC2 156401.493 10.1185553 22.85335
PC3 129618.194 8.3857823 31.23913
PC4 70855.329 4.5840583 35.82319
PC5 66893.686 4.3277558 40.15095
PC6 58473.028 3.7829726 43.93392
PC7 55338.250 3.5801650 47.51408
PC8 48515.359 3.1387510 50.65283
PC9 41523.842 2.6864276 53.33926
PC10 36178.801 2.3406247 55.67989
PC11 35664.586 2.3073570 57.98724
PC12 29132.953 1.8847863 59.87203
PC13 28226.146 1.8261195 61.69815
PC14 26636.820 1.7232964 63.42145
PC15 24795.064 1.6041421 65.02559
PC16 23883.069 1.5451397 66.57073
PC17 21401.286 1.3845782 67.95531
PC18 18605.880 1.2037265 69.15903
PC19 18351.184 1.1872487 70.34628
PC20 16911.933 1.0941349 71.44042
PC21 16139.204 1.0441424 72.48456
PC22 14666.421 0.9488592 73.43342
PC23 14069.703 0.9102539 74.34367
PC24 12801.424 0.8282013 75.17187
PC25 12720.581 0.8229711 75.99484
PC26 11922.122 0.7713139 76.76616
PC27 11221.789 0.7260052 77.49216
PC28 10893.494 0.7047658 78.19693
PC29 10320.463 0.6676929 78.86462
PC30 10030.269 0.6489186 79.51354
PC31 9858.062 0.6377774 80.15132
PC32 9474.849 0.6129851 80.76430
PC33 8755.733 0.5664611 81.33076
PC34 8562.854 0.5539827 81.88475
PC35 8016.228 0.5186181 82.40337
PC36 7634.492 0.4939213 82.89729
PC37 7458.022 0.4825044 83.37979
PC38 7158.446 0.4631230 83.84291
PC39 6837.811 0.4423792 84.28529
PC40 6643.433 0.4298037 84.71510
PC41 6203.069 0.4013140 85.11641
PC42 6182.841 0.4000052 85.51642
PC43 5879.942 0.3804089 85.89683
PC44 5749.885 0.3719947 86.26882
PC45 5658.287 0.3660687 86.63489
PC46 5345.777 0.3458505 86.98074
PC47 5207.048 0.3368753 87.31761
PC48 5051.979 0.3268429 87.64446
PC49 4682.811 0.3029593 87.94742
PC50 4610.168 0.2982596 88.24568

We set \(k=30\), as it explains about \(80\%\) of the total variation. The PC projections of the sample points, defined as \[\hat{\bf x}^{*}=\boldsymbol{\mu}+\sum_{j=1}^{k} {\bf v}_{(j)} {\bf v}_{(j)}^{T} \left( {\bf x}-\boldsymbol{\mu}\right), \] are stored in the \(n\times p\) matrix \(\hat{\bf X}^{*}\).

Note that \(\hat{\bf x}^{*}\), defined above, is the projection of \({\bf x}\) onto a \(k\)-dimensional subspace. However, the point \(\hat{\bf x}^{*}\) itself lies in the ambient space, i.e., in the \(p=256\)-dimensional space. So, it is possible to compare the original points \(({\bf x})\) with the projected points \((\hat{\bf x}^{*})\).
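A sketch of this reconstruction, assuming `X5` holds the digit-\(5\) samples (one row per image) and taking \(\boldsymbol{\mu}\) to be the sample mean:

```r
k    <- 30
mu   <- colMeans(X5)
Vk   <- eigen(cov(X5))$vectors[, 1:k]            # first k PC directions
Xc   <- sweep(X5, 2, mu)                         # centered data
Xhat <- sweep(Xc %*% Vk %*% t(Vk), 2, mu, "+")   # projections, mapped back to R^256
```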

The following figure shows four different samples (in upper panel) and their projections (lower panel).

Observe that, even after dropping \(226\) dimensions, the main features of the data are still present in the projected points. So, the digits can be recognized clearly.

