Synthetic Data Analysis

Motivation of PCA

Let us first generate a (centered) data set to explain the theoretical properties of principal components.
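A minimal sketch of how such a data set might be generated in R is given below, assuming a trivariate normal with covariance \(\Sigma=VDV^{\prime}\), \(D=\mathrm{Diag}(10,5,1)\), an arbitrary orthogonal \(V\), and sample size \(n=500\) (the same setup used in the later sections).

```r
# Sketch only: generate n = 500 centered observations from N(0, Sigma),
# where Sigma = V D V' with D = diag(10, 5, 1) and V an arbitrary orthogonal matrix.
set.seed(1)
n <- 500
D <- diag(c(10, 5, 1))
V <- qr.Q(qr(matrix(rnorm(9), 3, 3)))        # an orthogonal 3 x 3 matrix
Sigma <- V %*% D %*% t(V)
X <- MASS::mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
X <- scale(X, center = TRUE, scale = FALSE)  # center the sample
```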

The motivation of Principal Component Analysis (PCA) is twofold:

  1. The \(1\)st principal component (PC) is the standardized linear combination (SLC) that has the maximum variance among all SLCs, and the \(j\)-th PC has the maximum variance among all SLCs that are uncorrelated with the \(1\)st, \(\ldots\), \((j-1)\)-th PCs, \(j=2,\ldots\).

  2. The plane spanned by the first \(k\) PCs has the property that if we project the data points onto it, the average (squared) distance between the original points and their projections is minimized over all \(k\)-dimensional hyperplanes.

First, let us connect these two notions:

  • Consider a plane \(P\) passing through \({\bf 0}\); thus, \(P\) is a subspace. To project the data points on \(P\), we express each data point \({\bf x}\) as \({\bf y}+{\bf z}\) where \({\bf y}\) is on \(P\) and \({\bf z}\) is orthogonal to \(P\), and \({\bf y}\) is the projection of \({\bf x}\) on \(P\).

  • Now if \(B^{3\times 2}=[{\bf b}_{1}, {\bf b}_{2}]\) is an orthonormal basis of \(P\), then \({\bf y}=BB^{\prime} {\bf x}\). Being a point on the plane spanned by \(B\), \({\bf y}=u_{1}{\bf b}_{1}+u_{2}{\bf b}_{2}\), where \(u_{i}={\bf b}_{i}^{\prime} {\bf y}={\bf b}_{i}^{\prime} {\bf x}\), \(i=1,2\), which are also standardized linear combinations (SLCs) of \({\bf x}\).

The figure below shows the plane \(P\).

In the figure above, we have considered the SLC corresponding to the mean of the \(3\) variables (i.e., \({\bf b}_{1}=(1,1,1)^{\prime}/\sqrt{3}\)), and another orthogonal SLC, viz., \(x-y\) (i.e., \({\bf b}_{2}=(1,-1,0)^{\prime}/\sqrt{2}\)).
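The projection described above can be computed directly. The following sketch (reusing the centered data matrix \(X\) generated earlier) forms \(B=[{\bf b}_{1},{\bf b}_{2}]\) and obtains both the projections \({\bf y}=BB^{\prime}{\bf x}\) and the plane coordinates \((u_{1},u_{2})\).

```r
b1 <- c(1, 1, 1) / sqrt(3)
b2 <- c(1, -1, 0) / sqrt(2)
B  <- cbind(b1, b2)          # orthonormal basis of the plane P
Yp <- X %*% B %*% t(B)       # projected points y, in the original 3-d coordinates
U  <- X %*% B                # coordinates (u1, u2) of the projections on P
```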

Next we plot the projected points \({\bf y}\)s, and try to perceive the variability explained by the pair of SLCs visually. (Note that, if we rotate the 3D plot and visualize from a point which is orthogonal to \(P\), then the visible variation of the points will be due to variability in \((u_{1},u_{2})\) only.)

Observe that the variability of the projected points is considerably smaller than the variability of the original data.

Next, let us consider the first two principal component directions. Observe that the \(2\)-PC plane passes through the middle of the data cloud.

Now consider the projections of the data points on the \(2\)-PC plane.

If we rotate the plot and view it from a direction perpendicular to the \(2\)-PC plane, we can see that the variation of the projected (red) points is much higher than that of the projections onto the arbitrary plane \(P\) considered before.

Further, it is also evident that the average distance of the data points from their projections is much smaller now.



Asymptotic properties of the PCs

In the previous section, we considered a random sample of size \(n=500\), and projected the sample points onto the population 2-PC plane. Next we will see how the sample 2-PC plane differs from sample to sample, and how it converges to the population 2-PC plane as \(n\rightarrow\infty\).

Note that, in this synthetic framework, the population eigenvalues are all distinct and positive.

Sampling fluctuations:

To visualize the sampling fluctuations we consider \(3\) different samples of size \(20\) from the normal population with mean \((0,0,0)^{\prime}\) and \(\Sigma= VDV^{\prime}\) where \(D=\mathrm{Diag}(10,5,1)\) and \(V\) is an appropriate orthogonal matrix.

The population 2-PC plane is indicated in green, and the \(3\) samples and their corresponding sample 2-PC planes are indicated in red, blue and orange, respectively.

We calculate the sample PCs and superimpose the sample 2-PC planes.
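A sketch of this computation, assuming the \(\Sigma\) constructed earlier: each sample 2-PC plane is spanned by the first two eigenvectors of the corresponding sample covariance matrix.

```r
samples <- lapply(1:3, function(i)
  MASS::mvrnorm(20, mu = rep(0, 3), Sigma = Sigma))
sample_planes <- lapply(samples, function(S)
  eigen(cov(S))$vectors[, 1:2])   # 3 x 2 orthonormal basis of each sample 2-PC plane
```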


Consistency

Next we take 3 choices of \(n\), viz., \(n=20,60\) and \(100\), and see how the sample 2-PC plane converges to its population counterpart as \(n\) increases.

The first plot compares the population 2-PC plot, and a sample 2-PC plot for \(n=20\).

The second plot compares the population 2-PC plot, and a sample 2-PC plot for \(n=60\).

The third plot compares the population 2-PC plot, and a sample 2-PC plot for \(n=100\).

Observe that for \(n=100\), the sample 2-PC plane is much closer to the population 2-PC plane in the region around the data points. As both planes extend up to infinity, there might be other regions where these two planes are quite far apart. However, as there are no data points in those regions, they are not of interest.
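One way to quantify this convergence (not shown in the plots above) is via the principal angles between the population and sample 2-PC planes; the sketch below computes the largest principal angle for the three sample sizes, reusing the \(\Sigma\) defined earlier.

```r
V_pop <- eigen(Sigma)$vectors[, 1:2]          # population 2-PC plane
largest_angle <- function(n) {
  S      <- MASS::mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
  V_samp <- eigen(cov(S))$vectors[, 1:2]
  s      <- svd(t(V_pop) %*% V_samp)$d        # cosines of the principal angles
  max(acos(pmin(s, 1)))                       # largest principal angle (radians)
}
sapply(c(20, 60, 100), largest_angle)         # typically shrinks towards 0 as n grows
```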



Real Data Analysis

Iris Data

The Iris flower data is a multivariate data set. The data was collected to quantify the morphologic variation of Iris flowers of three related species, viz., Iris setosa, Iris virginica and Iris versicolor. The data set consists of 50 samples from each of these three species. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The following figure shows two dimensional scatterplots of the data, where Iris setosa, Iris virginica and Iris versicolor are indicated in red, green and blue, respectively.

From the scatter plots, the following features are clearly visible:

  1. The main source of variability in the data is due to the fact that flowers of different populations are combined together.

  2. The variables petal length and petal width are most different across populations.

Next we find the spectral decomposition of the variance-covariance matrix \(S_n\) of the Iris data \(X\). Let \(S_n=G\mathrm{Diag}(L)G^{T}\). Then the PC loading matrix \(G\) and the vector of eigenvalues are as follows:
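A sketch of how this decomposition can be obtained in R, using the built-in `iris` data:

```r
X   <- as.matrix(iris[, 1:4])   # Sepal L., Sepal W., Petal L., Petal W.
Sn  <- cov(X)                   # sample variance-covariance matrix
eig <- eigen(Sn)
G   <- eig$vectors              # PC loading matrix (columns = PC directions)
L   <- eig$values               # eigenvalues, in decreasing order
```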

Loading matrix

              PC-1      PC-2      PC-3      PC-4
Sepal L.    0.3614   -0.6566    0.5820    0.3155
Sepal W.   -0.0845   -0.7302   -0.5979   -0.3197
Petal L.    0.8567    0.1734   -0.0762   -0.4798
Petal W.    0.3583    0.0755   -0.5458    0.7537

Eigenvalues: 4.2282417, 0.2426707, 0.0782095, 0.0238351

The PC-transformed data are stored in \(Y=\left(X-{\bf 1}\bar{\bf x}^{T}\right)G\).
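In R this is simply the centered data matrix post-multiplied by the loading matrix; a sketch:

```r
Y <- scale(X, center = TRUE, scale = FALSE) %*% G   # PC scores
round(diag(cov(Y)), 4)   # the variances of the PCs reproduce the eigenvalues L
```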

Observe that \[PC_{1}=0.36\times \mathrm{Sepal~ L.} -0.08 \times \mathrm{Sepal~ W.} + 0.86\times \mathrm{Petal~ L.} + 0.36 \times \mathrm{Petal~ W.},\] and it explains \(4.23\times 100/\sum_{j} L_{j}=92.46\%\) of the variation of the data.

Similarly, \[PC_{2}=-0.66\times \mathrm{Sepal~ L.} -0.73 \times \mathrm{Sepal~ W.} + 0.17\times \mathrm{Petal~ L.} + 0.08 \times \mathrm{Petal~ W.},\] and it explains \(5.31\%\) variation of the data.

Plotting the data with respect to \((PC_{1},PC_{2})\) coordinates we get:

Observe that \(PC_{1}\) explains the variation across the flower populations (i.e., the flowers of different populations have different \(PC_{1}\) coordinates on average). Further, the variable Petal L. has weight \(0.86\) in \(PC_{1}\). Thus, we can conclude that the variation of the data is highest with respect to the Petal L. variable. This coincides with the fact that Petal L. is one of the most distinguishing features across the flower populations.

The following figure shows the values of \(PC_{1}\), and those of Petal L., for the sample points. Observe that the two plots look almost identical, as Petal L. largely dominates \(PC_{1}\).

The correlation of \(PC_{1}\) and the variable Petal L. is \(\rho_{1,3}=0.9979\). Thus, \(PC_{1}\) explains more than \(99\%\) of the variation in Petal L. Similarly, it explains \(93\%\) of the variation in Petal W., \(80\%\) of the variation in Sepal L., and only \(16\%\) of the variation in Sepal W. \(PC_{2}\), on the other hand, explains \(68\%\) of the variation in Sepal W. This is consistent with the fact that \(PC_{2}\) captures the variability which is not explained by \(PC_{1}\).
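These percentages are the squared correlations between the PCs and the original variables; a sketch of how they can be computed:

```r
R2 <- cor(Y, X)^2        # rows = PCs, columns = original variables
round(R2[1, ], 2)        # share of each variable's variance explained by PC_1
round(R2[2, ], 2)        # share explained by PC_2
```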


Boston Housing Data

The Boston housing data contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has been used extensively throughout the literature to benchmark algorithms. The dataset has 506 cases and 14 variables, which are:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - \(1000(B_{k} - 0.63)^2\) where \(B_k\) is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000’s

We have previously worked with Boston housing data in the context of regression, where we set MEDV as the response and the remaining covariates as regressors.

In the following analysis we drop the variable CHAS, as it is binary.

Unlike the Iris data, the 13 remaining variables differ considerably in scale. The first few rows of the data and boxplots of the variables are given below:

##      crim zn indus   nox    rm  age    dis rad tax ptratio      b lstat medv
## 1 0.00632 18  2.31 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
## 2 0.02731  0  7.07 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
## 3 0.02729  0  7.07 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
## 4 0.03237  0  2.18 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
## 5 0.06905  0  2.18 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
## 6 0.02985  0  2.18 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

From the box plots it is clear that the variable TAX has the highest variance, and B the second highest, among all the variables. Moreover, this variability is largely due to the difference in scales of the variables. Therefore, if we apply the PC-transformation to this data, the first few PCs will be driven almost entirely by TAX and B.
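A sketch of the PC-transformation on the raw (unstandardized) data, assuming `boston` is the 13-column data frame whose first rows are printed above (CHAS already dropped):

```r
Xb   <- as.matrix(boston)
eigB <- eigen(cov(Xb))
round(eigB$vectors[, 1:4], 4)   # loadings of the first four PCs
round(eigB$values[1:4], 4)      # the corresponding eigenvalues
```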

Loading matrix

             PC-1      PC-2      PC-3      PC-4
crim       0.0293    0.0067   -0.0120   -0.0254
zn        -0.0436    0.0011    0.6322   -0.7632
indus      0.0283   -0.0049   -0.0884    0.0131
nox        0.0004    0.0000   -0.0018   -0.0007
rm        -0.0012    0.0004    0.0050   -0.0064
age        0.0836   -0.0057   -0.7525   -0.6406
dis       -0.0066    0.0004    0.0447   -0.0017
rad        0.0450   -0.0086    0.0033    0.0186
tax        0.9494   -0.2927    0.0952    0.0192
ptratio    0.0056   -0.0025   -0.0116    0.0329
b         -0.2912   -0.9561   -0.0248   -0.0032
lstat      0.0230    0.0058   -0.0948   -0.0399
medv      -0.0255   -0.0088    0.0734   -0.0544

Eigenvalues: 30910.0128, 6250.8142, 822.4633, 267.2963

From the loading matrix (above), observe that \(PC_{1}\), which explains \(80\%\) of the total variability (TV) of the data, is almost entirely driven by TAX (in the SLC the coefficient of TAX is \(0.95\), and that of B is \(-0.29\)), and \(PC_{2}\), which explains \(16\%\) of TV, is almost entirely driven by B. However, a mere change of scales of these two variables may change the entire scenario. Therefore, we standardize the variables of this data set, and then apply the PC-transformation.
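A sketch of the standardized analysis: scale each variable to unit variance and repeat the eigen-decomposition, which is equivalent to decomposing the correlation matrix.

```r
Zb   <- scale(Xb)          # every column now has mean 0 and variance 1
eigZ <- eigen(cov(Zb))     # same as eigen(cor(Xb))
round(eigZ$vectors[, 1:4], 4)
round(eigZ$values[1:4], 4)
```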

Loading matrix

             PC-1      PC-2      PC-3      PC-4
crim       0.2422   -0.0117    0.4087   -0.0625
zn        -0.2455   -0.1118    0.4343   -0.3014
indus      0.3319    0.1160   -0.0876    0.0186
nox        0.3253    0.2589   -0.0980   -0.1934
rm        -0.2027    0.5331    0.2477    0.1853
age        0.2971    0.2504   -0.2585   -0.0753
dis       -0.2983   -0.3683    0.2399   -0.0234
rad        0.3034    0.0893    0.4145    0.2131
tax        0.3240    0.0602    0.3409    0.1442
ptratio    0.2076   -0.3293    0.0637    0.7045
b         -0.1966   -0.0308   -0.3630    0.4009
lstat      0.3114   -0.2458   -0.1126   -0.2885
medv      -0.2665    0.4929    0.0699    0.1432

Eigenvalues: 6.5458499, 1.5226618, 1.3357904, 0.8640037

Now every variable has roughly equal weight in \(PC_{1}\); B has somewhat less weight than the others.

How many PCs should we consider? The scree plot and the table of percentage of variation explained by the PCs are given below.
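A sketch of how the scree plot and the table can be produced from the eigenvalues of the standardized data:

```r
L <- eigZ$values
plot(L, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
data.frame(Eigenvalue    = round(L, 7),
           Pct           = round(100 * L / sum(L), 7),
           CumulativePct = round(cumsum(100 * L / sum(L)), 5))
```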

Variation explained

        Eigenvalue   % explained   Cumulative % explained
PC1      6.5458499    50.3526914                 50.35269
PC2      1.5226618    11.7127828                 62.06547
PC3      1.3357904    10.2753105                 72.34078
PC4      0.8640037     6.6461825                 78.98697
PC5      0.6667516     5.1288582                 84.11583
PC6      0.5374569     4.1342835                 88.25011
PC7      0.4036395     3.1049196                 91.35503
PC8      0.2775037     2.1346438                 93.48967
PC9      0.2534452     1.9495784                 95.43925
PC10     0.2128616     1.6373970                 97.07665
PC11     0.1832646     1.4097276                 98.48638
PC12     0.1359778     1.0459834                 99.53236
PC13     0.0607934     0.4676412                100.00000

The first three PCs explain \(72\%\) of the total variation. The next table provides the \(\%\) of variation of each variable explained by the first 3 PCs.
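As before, these are squared correlations; a sketch of the computation on the standardized data:

```r
Yz <- Zb %*% eigZ$vectors          # PC scores of the standardized data
R2 <- cor(Yz, Zb)^2                # rows = PCs, columns = variables
round(t(R2[1:3, ]), 3)             # contributions of PC1-PC3 for each variable
round(100 * colSums(R2[1:3, ]), 2) # total % of each variable explained by PC1-PC3
```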

Variation explained

              PC1         PC2         PC3   TV explained (%)
crim    0.3841135   0.0002092   0.2231218           60.74445
zn      0.3944870   0.0190460   0.2519316           66.54645
indus   0.7212056   0.0205040   0.0102554           75.19650
nox     0.6926610   0.1020919   0.0128212           80.75740
rm      0.2690198   0.4326675   0.0819905           78.36777
age     0.5776919   0.0954678   0.0892449           76.24047
dis     0.5824077   0.2065645   0.0768488           86.58210
rad     0.6026166   0.0121513   0.2294577           84.42255
tax     0.6872189   0.0055205   0.1552696           84.80091
ptratio 0.2820251   0.1650755   0.0054192           45.25199
b       0.2530094   0.0014443   0.1759736           43.04273
lstat   0.6345641   0.0919926   0.0169226           74.34793
medv    0.4648292   0.3699268   0.0065335           84.12894

Observe that about \(40\%\)–\(90\%\) of the variation of each variable is explained by \(PC_{1}\)–\(PC_{3}\). Thus, it seems reasonable to consider the first three PCs.

Next, we plot the correlations of the covariates with first 2 PCs.

Almost every variable is highly correlated with \(PC_{1}\): it is positively correlated with CRIM, AGE, NOX, RAD, TAX, PTRATIO, LSTAT and INDUS, and negatively correlated with MEDV, RM, B, ZN and DIS. The smallest correlation in absolute value is about \(0.5\). The first PC axis could be interpreted as a quality-of-life and housing indicator (lower is better), as low values of CRIM, NOX, TAX and PTRATIO and high values of MEDV and RM indicate a high quality of living. Similarly, the second PC can be interpreted as a social factor.

Principal Component Regression

Finally, we demonstrate the appropriateness of principal component regression (PCR) using this data set. To do so, we drop the variable MEDV from the data and calculate the PCs of the remaining covariates.

We split the samples of houses into two parts based on the variable MEDV:

  1. houses with higher prices (with price higher than the median price, indicated in red)

  2. houses with lower prices (with price lower than the median price, indicated in blue).

The following plot explains the association of MEDV with the first two PCs.
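A sketch of how this plot can be produced: recompute the PCs of the standardized covariates without MEDV, and colour the first two PC scores by the median split of MEDV (again assuming the `boston` data frame used above).

```r
Z2  <- scale(as.matrix(boston[, setdiff(names(boston), "medv")]))
pc2 <- Z2 %*% eigen(cov(Z2))$vectors[, 1:2]
grp <- ifelse(boston$medv > median(boston$medv), "red", "blue")  # high / low price
plot(pc2, col = grp, pch = 16, xlab = "PC1", ylab = "PC2")
```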

Observe in the first plot that almost all the blue (red) points are in the positive (negative) part of the \(PC_{1}\) axis. This indicates that the first PC has a strong (negative) correlation with MEDV, as the average price of houses is high in locations where the quality of life is high.


USPS Data

Finally, we demonstrate PC projection with USPS data.

USPS data: The US Postal (USPS) handwritten digit dataset is derived from a project on recognizing handwritten digits on envelopes. The digits were downscaled to \(16\times 16\) pixels at a 1:1 aspect ratio.

There are 11000 samples of handwritten digits in the USPS data. Each sample is \(16\times 16=256\)-dimensional, giving the gray-scale index (GSI, ranging from 0 to 255) of each pixel: the first variable is the GSI of the \((1,1)\)-th pixel, the second variable that of the \((1,2)\)-th pixel, and so on. Further, GSI\(=0\) indicates black and GSI\(=255\) indicates white.

The following figure shows 4 samples from USPS data.
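A sketch of how such an image can be drawn, assuming `usps` is the \(11000\times 256\) matrix of gray-scale values in the row-major pixel order described above (the orientation handling is only indicative):

```r
show_digit <- function(x) {
  img <- matrix(x, nrow = 16, ncol = 16, byrow = TRUE)  # pixel (i, j) in row i, column j
  image(t(img[16:1, ]), col = gray(seq(0, 1, length.out = 256)), axes = FALSE)
}
show_digit(usps[1, ])   # display the first sample
```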

The dimension of the USPS data is high, and therefore we apply PCA to reduce the dimension of the data in a way that loses as little information as possible.

Note that the samples of each digit correspond to a separate population. In what follows we only consider the samples corresponding to the digit \(5\).

Note that the variables are on the same scale, so we do not standardize the data.

First we choose a suitable number of PCs, \(k\).

From the scree plot, it is clear that the first \(50\) PCs explain a large part of the variance. The following table makes this clearer.

Variation explained

        Eigenvalue    % explained    Cumulative % explained
PC1 196840.441 12.7347946 12.73479
PC2 156401.493 10.1185553 22.85335
PC3 129618.194 8.3857823 31.23913
PC4 70855.329 4.5840583 35.82319
PC5 66893.686 4.3277558 40.15095
PC6 58473.028 3.7829726 43.93392
PC7 55338.250 3.5801650 47.51408
PC8 48515.359 3.1387510 50.65283
PC9 41523.842 2.6864276 53.33926
PC10 36178.801 2.3406247 55.67989
PC11 35664.586 2.3073570 57.98724
PC12 29132.953 1.8847863 59.87203
PC13 28226.146 1.8261195 61.69815
PC14 26636.820 1.7232964 63.42145
PC15 24795.064 1.6041421 65.02559
PC16 23883.069 1.5451397 66.57073
PC17 21401.286 1.3845782 67.95531
PC18 18605.880 1.2037265 69.15903
PC19 18351.184 1.1872487 70.34628
PC20 16911.933 1.0941349 71.44042
PC21 16139.204 1.0441424 72.48456
PC22 14666.421 0.9488592 73.43342
PC23 14069.703 0.9102539 74.34367
PC24 12801.424 0.8282013 75.17187
PC25 12720.581 0.8229711 75.99484
PC26 11922.122 0.7713139 76.76616
PC27 11221.789 0.7260052 77.49216
PC28 10893.494 0.7047658 78.19693
PC29 10320.463 0.6676929 78.86462
PC30 10030.269 0.6489186 79.51354
PC31 9858.062 0.6377774 80.15132
PC32 9474.849 0.6129851 80.76430
PC33 8755.733 0.5664611 81.33076
PC34 8562.854 0.5539827 81.88475
PC35 8016.228 0.5186181 82.40337
PC36 7634.492 0.4939213 82.89729
PC37 7458.022 0.4825044 83.37979
PC38 7158.446 0.4631230 83.84291
PC39 6837.811 0.4423792 84.28529
PC40 6643.433 0.4298037 84.71510
PC41 6203.069 0.4013140 85.11641
PC42 6182.841 0.4000052 85.51642
PC43 5879.942 0.3804089 85.89683
PC44 5749.885 0.3719947 86.26882
PC45 5658.287 0.3660687 86.63489
PC46 5345.777 0.3458505 86.98074
PC47 5207.048 0.3368753 87.31761
PC48 5051.979 0.3268429 87.64446
PC49 4682.811 0.3029593 87.94742
PC50 4610.168 0.2982596 88.24568

We set \(k=30\), as it explains about \(80\%\) of the total variation. The PC projections of the sample points, defined as \[\hat{\bf x}^{*}=\boldsymbol{\mu}+\sum_{j=1}^{k} {\bf v}_{(j)} {\bf v}_{(j)}^{T} \left( {\bf x}-\boldsymbol{\mu}\right), \] are stored in the \(n\times p\) matrix \(\hat{\bf X}^{*}\).

Note that \(\hat{\bf x}^{*}\), defined above, is the projection of \({\bf x}\) onto a \(k\)-dimensional subspace. However, the point \(\hat{\bf x}^{*}\) itself lies in the ambient space, i.e., in the \(p=256\)-dimensional space. So, it is possible to compare the original points \(({\bf x})\) with the projected points \((\hat{\bf x}^{*})\).
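A sketch of this reconstruction, assuming `X5` holds the digit-\(5\) samples (one row per image) and taking \(\boldsymbol{\mu}\) to be the sample mean:

```r
k    <- 30
mu   <- colMeans(X5)
Vk   <- eigen(cov(X5))$vectors[, 1:k]            # first k PC directions
Xc   <- sweep(X5, 2, mu)                         # centered data
Xhat <- sweep(Xc %*% Vk %*% t(Vk), 2, mu, "+")   # projections, mapped back to R^256
```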

The following figure shows four different samples (in upper panel) and their projections (lower panel).

Observe that, even after dropping \(226\) dimensions, the main features of the data are still present in the projected points. So, the digits can be recognized clearly.

