If doing research on a social issue like homelessness in the U.S., we could have a \( 3\times 50 \) matrix of data:
Homelessness in the U.S. is related to many factors: poverty levels, unemployment rates, median rent, median income, social services [1].
Question: What if I want to consider more than \( p=3 \) factors?
Answer: To plot \( n \) data vectors with \( p\geq 4 \) entries, we must select only \( 3 \) entries to view at a time.
Plotting the whole matrix of data to gain the necessary intuition is no longer possible.
Overview of Presentation:
[1] Theory behind Principal Component Analysis
[2] Applying PCA to a small dataset for each of the 50 states
To retain most of the “information” in our data, we want to preserve the majority of its variance.
Goal of PCA is to find orthogonal directions (unit vectors) of maximum variance in our data.
Recall our general \( p\times n \) matrix of data, \[ \small{ W=\begin{bmatrix}\mathbf{x}_1&\mathbf{x}_2&\cdots&\mathbf{x}_n\end{bmatrix} } \]
Column vectors are elements of \( \mathbb{R}^p \), \[ \small{ \begin{aligned} &\mathbf{x}_i=\begin{bmatrix}x_{i1}\\\vdots\\x_{ip}\end{bmatrix},&1\leq i\leq n \end{aligned} } \]
Mean of original data is, \[ \small{ \mathbf{\bar{x}}=\frac{1}{n}\sum_{i=1}^n\mathbf{x}_i } \]
Variance of original data is captured by covariance matrix, \[ \small{ S=\frac{1}{n-1}\sum_{i=1}^n(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T } \]
Covariance matrix is necessarily a \( p\times p \) symmetric matrix.
Variances of the \( p \) variates are on the diagonal, covariances on the off-diagonal.
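These quantities are easy to compute directly; a minimal sketch in R, using illustrative random data rather than the homelessness dataset:
# Sketch: sample mean vector and covariance matrix from a p x n data matrix W
# (columns are observations); W here is illustrative random data.
set.seed(1)
p <- 3; n <- 50
W <- matrix(rnorm(p * n), nrow = p, ncol = n)
x_bar <- rowMeans(W)               # mean of the n column vectors
B <- W - x_bar                     # mean-deviation form (x_i - x_bar in each column)
S <- (B %*% t(B)) / (n - 1)        # p x p sample covariance matrix
all.equal(S, cov(t(W)))            # matches R's built-in cov(), which expects rows = observations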
To find the first unit vector \( \mathbf{w}\in\mathbb{R}^p \) pointing in the direction of maximum variance, project the data onto \( \mathbf{w} \): let \( y_i=\mathbf{w}^T\mathbf{x}_i \), so that \( \bar{y}=\mathbf{w}^T\bar{\mathbf{x}} \).
Formula for variance [8]: \[ \small{ \mathrm{Var}(\mathbf{y})=\frac{1}{n-1}\sum_{i=1}^n(y_i-\bar{y})^2 } \]
A nice result: since \( y_i-\bar{y}=\mathbf{w}^T(\mathbf{x}_i-\bar{\mathbf{x}}) \), \[ \small{ \begin{aligned} \mathrm{Var}(\mathbf{y})&=\frac{1}{n-1}\sum_{i=1}^n\mathbf{w}^T(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T\mathbf{w}\\ &=\mathbf{w}^T\left(\frac{1}{n-1}\sum_{i=1}^n(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T\right)\mathbf{w}\\ &=\mathbf{w}^TS\mathbf{w} \end{aligned} } \]
To find the unit \( \mathbf{w} \) that gives the direction of maximum variance in our data, we maximize, \[ \boxed{ \begin{aligned} &\mathbf{w}^TS\mathbf{w},&\left\lVert\mathbf{w}\right\rVert=1 \end{aligned} } \]
For an \( n\times n \) symmetric matrix \( A \), an important theorem tells us how to find the maximum of the quadratic form \( \mathbf{x}^TA\mathbf{x} \) over unit vectors, \[ \small{ M=\mathrm{Max}\{\mathbf{x}^TA\mathbf{x}\, | \, \left\lVert\mathbf{x}\right\rVert=1\} } \]
Pertinent to economics, physics, signal processing [4].
( Include picture of Quadratic form here! )
Theorem: Maximizing \( \{\mathbf{w}^TS\mathbf{w}\,| \,\left\lVert\mathbf{w}\right\rVert=1 \} \)
Let \( \lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n \) be the eigenvalues of \( S \) in descending order, and let \( \mathbf{u}_1,\mathbf{u}_2,\ldots,\mathbf{u}_n \) be the corresponding orthonormal eigenvectors of \( S \).
Then, \[ \small{ M=\mathrm{Max}\{\mathbf{w}^TS\mathbf{w} \, | \, \left\lVert\mathbf{w}\right\rVert=1\}=\lambda_1 } \]
and the maximum is attained at \( \mathbf{w}=\mathbf{u}_1 \): \[ \small{ \lambda_1=\mathbf{u}_1^TS\mathbf{u}_1 } \]
Conclusion: The direction of maximum variance is given by the unit eigenvector of \( S \) corresponding to the largest eigenvalue of \( S \).
For \( k=2,\ldots,n \) the maximum value of \( \mathbf{w}^TS\mathbf{w} \) where, \[ \small{ \begin{aligned} &\mathbf{w}^T\mathbf{w}=1,&\mathbf{w}^T\mathbf{u}_1=0&,\ldots,&\mathbf{w}^T\mathbf{u}_{k-1}=0 \end{aligned} } \]
is \( \lambda_k \) and occurs when \( \mathbf{w}=\mathbf{u}_k \) [4].
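A quick numerical illustration of the theorem in R (a sketch; \( S \) here is just an arbitrary symmetric positive semi-definite matrix, not the homelessness covariance):
# Sketch: the maximum of w' S w over unit vectors w is the largest eigenvalue,
# attained at the corresponding unit eigenvector.
set.seed(2)
M0 <- matrix(rnorm(9), 3, 3)
S  <- crossprod(M0)                      # arbitrary symmetric PSD matrix, t(M0) %*% M0
e  <- eigen(S, symmetric = TRUE)         # eigenvalues returned in decreasing order
u1 <- e$vectors[, 1]
drop(t(u1) %*% S %*% u1)                 # equals e$values[1]
# random unit vectors never exceed lambda_1:
max(replicate(10000, { w <- rnorm(3); w <- w / sqrt(sum(w^2)); drop(t(w) %*% S %*% w) }))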
Main Idea: The orthonormal eigenvectors of \( S \) are the data's orthogonal directions of maximum variance, i.e., the principal components.
To find Principal Components, use SVD, \[ \small{ \begin{aligned} \boxed{A=U\Sigma V^T} \end{aligned} } \]
Singular values \( \sigma_1,\ldots,\sigma_n \) of \( A \) are the square roots of the eigenvalues of \( A^TA \), \[ \small{ \sigma_i=\sqrt{\lambda_i},\quad i=1,\ldots,n } \] where \( A^TA\mathbf{v}_i=\lambda_i\mathbf{v}_i \).
Left singular vectors of \( A \) are columns of \( U \)
Right singular vectors of \( A \) are the columns of \( V \) and are orthonormal eigenvectors of \( A^TA \),
\[ \small{ \begin{aligned} &\mathbf{v}_1,\ldots,\mathbf{v}_n, & A^TA\mathbf{v}_i=\lambda_i\mathbf{v}_i \end{aligned} } \]
If we consider data \( W \) in mean-deviation form, \[ \small{ B=\begin{bmatrix}\mathbf{x}_1-\bar{\mathbf{x}}&\mathbf{x}_2-\bar{\mathbf{x}}&\cdots&\mathbf{x}_n-\bar{\mathbf{x}}\end{bmatrix} } \]
Apply the SVD to \( A=\frac{1}{\sqrt{n-1}}B^T \), so that we get the eigenvalues and eigenvectors of \[ \small{ \begin{aligned} A^TA&=\left(\frac{1}{\sqrt{n-1}}B^T\right)^T\left(\frac{1}{\sqrt{n-1}}B^T\right)\\ &=\frac{1}{n-1}BB^T\\ &=S \end{aligned} } \]
Thus, applying the SVD to \( \frac{1}{\sqrt{n-1}}B^T \), \[ \small{ \frac{1}{\sqrt{n-1}}B^T=U\Sigma V^T } \] gives the eigenvalues and eigenvectors of \( S \).
Singular values of \( \frac{1}{\sqrt{n-1}}B^T \) are square roots of eigenvalues of \( S \).
Right singular vectors are the columns of \( V \) and are orthonormal eigenvectors of \( S \).
Thus, the right singular vectors (the columns of \( V \)) are the principal components of the data.
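This relationship is easy to verify in R; a sketch using the same kind of illustrative random data as above:
# Sketch: the SVD of (1/sqrt(n-1)) * t(B) recovers the eigenvalues and
# eigenvectors of S (illustrative data, not the homelessness dataset).
set.seed(1)
p <- 3; n <- 50
W <- matrix(rnorm(p * n), nrow = p, ncol = n)
B <- W - rowMeans(W)
S <- (B %*% t(B)) / (n - 1)

sv <- svd(t(B) / sqrt(n - 1))            # A = (1/sqrt(n-1)) * B^T, an n x p matrix
sv$d^2                                   # squared singular values ...
eigen(S, symmetric = TRUE)$values        # ... equal the eigenvalues of S
sv$v                                     # columns of V: orthonormal eigenvectors of S
                                         # (up to sign), i.e. the principal components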
Sources: U.S. Census Bureau and U.S. Department of Housing and Urban Development.
Data: \( n=50 \) vectors of form, \[ \small{ \mathbf{x}_i=\begin{bmatrix}h\\cb\\p\end{bmatrix} } \]
\( h=\mathrm{percent\,of\,state\,population\,that\,is\,homeless} \)
\( cb=\mathrm{percent\,of\,state\,population\,that\,is\,cost\,burdened} \)
\( p=\mathrm{percent\,of\,state\,population\,in\,deep\,poverty} \)
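The printed output below could be produced by a few lines of R; a sketch, where the object W is assumed to hold the \( 3\times 50 \) matrix with rows \( h \), \( cb \), \( p \) (the data-loading steps from the Census/HUD tables are omitted):
# Sketch of the analysis; `W` (3 x 50, rows h, cb, p; one column per state)
# is an assumed object -- reading it from the source files is not shown.
n  <- ncol(W)
B  <- W - rowMeans(W)                         # mean-deviation form
sv <- svd(t(B) / sqrt(n - 1))                 # SVD of (1/sqrt(n-1)) * B^T

print("d=");                   print(sv$d)    # singular values
print("Eigenvalues of S=");    print(sv$d^2)  # eigenvalues of the covariance matrix
print("Total Variance");       print(sum(sv$d^2))
print("First six rows of u="); print(head(sv$u))
print("v=");                   print(sv$v)    # columns are the principal components
for (k in 1:3) {
  print(paste0("% total variance explained by PC", k, "="))
  print(100 * sv$d[k]^2 / sum(sv$d^2))
}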
Using R's built-in svd() function to find the principal components of the data:
[1] "d="
[1] 4.10582391 1.23541736 0.09071855
[1] "Eigenvalues of S="
[1] 16.857790020 1.526256047 0.008229856
[1] "Total Variance"
[1] 18.39228
[1] "First six rows of u="
[,1] [,2] [,3]
[1,] 0.11619489 0.14915260 -0.11649125
[2,] -0.12755129 -0.21413139 0.15892807
[3,] 0.08573686 0.17879062 0.03870884
[4,] -0.04074753 0.16857778 -0.09683053
[5,] 0.21843789 -0.03495111 0.14053457
[6,] 0.04246217 -0.05934145 0.17601419
[1] "v="
[,1] [,2] [,3]
[1,] 0.00948655 -0.008718115 0.999916996
[2,] 0.99333641 -0.114778510 -0.010424853
[3,] 0.11485987 0.993352852 0.007571169
[1] "% total variance explained by PC1="
[1] 91.6569
[1] "% total variance explained by PC2="
[1] 8.298353
[1] "% total variance explained by PC3="
[1] 0.04474626
We can create a homelessness index based on our data's first principal component, \[ \small{ \mathbf{v}_1=\begin{bmatrix}0.00949\\0.99334\\0.11486\end{bmatrix} } \]
The index provides a quick scalar quantification of the “strength” of each state's relationship to the factors related to homelessness.
\[ \small{ h_{\mathrm{index}} = 0.00949\hat{h}+0.99334\hat{cb}+0.11486\hat{p} } \] where \( \hat{h},\hat{cb},\hat{p} \) are the values of our data in mean-deviation form.
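A sketch of how the index could be computed for every state in R, continuing from the objects in the sketch above (W and sv are assumed; state labels are omitted):
# Sketch: project the mean-deviation data onto the first principal component
# to get one index value per state. `W` and `sv` are as in the sketch above.
B       <- W - rowMeans(W)                 # hat(h), hat(cb), hat(p) for each state
v1      <- sv$v[, 1]                       # first principal component
h_index <- drop(t(B) %*% v1)               # length-50 vector of index scores

# States with the largest positive index sit furthest above average along the
# cost-burden / deep-poverty dominated direction of the data.
head(sort(h_index, decreasing = TRUE))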
[1] Chris Glynn, Thomas H. Byrne, Dennis P. Culhane. “Inflection points in community-level homeless rates.” The Annals of Applied Statistics, 15(2) 1037-1053 June 2021. https://doi.org/10.1214/20-AOAS1414
[2] “2007 - 2021 Point-in-Time Estimates by State.” PIT and HIC Data Since 2007, U.S. Department of Housing and Urban Development, 2022.
[3] Jolliffe IT, Cadima J. 2016 Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374: 20150202. http://dx.doi.org/10.1098/rsta.2015.0202
[4] Lay, David C., et al. Linear Algebra and Its Applications. Sixth ed., Pearson, 2022.
[5] “DECENNIALCD1132010.” P1: Census Bureau Table, United States Census Bureau, 2010, https://tinyurl.com/h8nzjpmv. Accessed 27 Nov. 2022.
[6] American Community Survey. “ACSDP1Y2010.” DP04: Census Bureau Table, United States Census Bureau, 2010, https://tinyurl.com/y2xzxt2r. Accessed 27 Nov. 2022.
[7] American Community Survey. “ACSST1Y2010.” S1701: Poverty Status in the Past 12 Months, United States Census Bureau, 2010, https://tinyurl.com/5etsm7xm. Accessed 27 Nov. 2022.
[8] Li, Xiaodong. “STA135 Lecture 2 : Sample Mean Vector and Sample Covariance Matrix.” STA135. Davis, California, https://www.stat.ucdavis.edu/~xdgli/Xiaodong_Li_Teaching_files/135Note2.pdf. Accessed 24 Oct. 2022.
[9] Henderson, Thomas C. “A Geometric Interpretation of the Covariance Matrix.” CS 4640 Fall Semester 2019. https://www.cs.utah.edu/~tch/CS4640F2019/resources/A%20geometric%20interpretation%20of%20the%20covariance%20matrix.pdf. Accessed 30 Nov. 2022.
[10] “What Is ‘Deep Poverty’?” Center for Poverty and Inequality Research, University of California, Davis, https://poverty.ucdavis.edu/faq/what-deep-poverty.
[11] “Rental Burdens: Rethinking Affordability Measures: HUD User.” Rental Burdens: Rethinking Affordability Measures | HUD USER, PD&R Edge, https://www.huduser.gov/portal/pdredge/pdr_edge_featd_article_092214.html.
[12] “Principal Component Analysis from Scratch in Python.” AskPython, AskPython, 19 Oct. 2020, https://www.askpython.com/python/examples/principal-component-analysis.
[13] Newsom, Jason T. “Principal Component Analysis.” Psy 523/623 Structural Equation Modeling, Spring 2020. https://web.pdx.edu/~newsomj/semclass/ho_pca.pdf. Accessed 30 Nov. 2022.
[14] “Homelessness: A State of Emergency.” Homelessness: A State of Emergency, National Alliance to End Homelessness, 18 Oct. 2018, https://endhomelessness.org/resource/homelessness-a-state-of-emergency/.
[15] West, Mary, and Karin Gepp. “Maslow's Hierarchy of Needs Pyramid: Uses and Criticism.” Medical News Today, MediLexicon International, 28 July 2022, https://www.medicalnewstoday.com/articles/maslows-hierarchy-of-needs.
[16] US Census Bureau. “Poverty Thresholds.” Census.gov, United States Census Bureau, 13 Sept. 2022, https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-poverty-thresholds.html.
[17] Chartier, Timothy P. When Life Is Linear: From Computer Graphics to Bracketology. Mathematical Association of America, 2015.
[18] “RGB Image.” DataGenetics, https://datagenetics.com/blog/august32013/index.html. Accessed 2 Dec. 2022.
Theorem: Finding \( M=\mathrm{Max}\{\mathbf{w}^TS\mathbf{w} \, | \, \left\lVert\mathbf{w}\right\rVert=1\} \):
By the Spectral Theorem for Symmetric Matrices [4], we can orthogonally diagonalize \( S \): \[ \begin{aligned} S&=PDP^{-1}\\ &=\begin{bmatrix}\mathbf{u}_1&\cdots&\mathbf{u}_n\end{bmatrix}\begin{bmatrix}\lambda_1&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&\lambda_n\end{bmatrix}\begin{bmatrix}\mathbf{u}_1^T\\\vdots\\\mathbf{u}_n^T\end{bmatrix} \end{aligned} \]
\( \lambda_1\geq\lambda_2\geq\ldots\geq\lambda_n \) are eigenvalues of \( S \)
\( \mathbf{u}_1,\ldots,\mathbf{u}_n \) are corresponding orthonormal eigenvectors of \( S \)
\[ \begin{aligned} &S=PDP^{-1} &D=P^{-1}SP \end{aligned} \]
Perform the orthogonal change of variable, \[ \begin{aligned} &\mathbf{w}=P\mathbf{y}, &P^T=P^{-1} \end{aligned} \] Then the quadratic form becomes, \[ \begin{aligned} \mathbf{w}^TS\mathbf{w}&=(P\mathbf{y})^TS(P\mathbf{y})\\ &=\mathbf{y}^T(P^TSP)\mathbf{y}\\ &=\mathbf{y}^TD\mathbf{y} \end{aligned} \] Since \( P \) is orthogonal, \( \left\lVert\mathbf{w}\right\rVert=\left\lVert P\mathbf{y}\right\rVert=\left\lVert\mathbf{y}\right\rVert \), so the constraint \( \left\lVert\mathbf{w}\right\rVert=1 \) becomes \( \left\lVert\mathbf{y}\right\rVert=1 \).
For concreteness, suppose \( S \) is a \( 3\times 3 \) matrix; the same argument gives \( M=\mathrm{Max}\{\mathbf{w}^TS\mathbf{w}\, | \, \left\lVert\mathbf{w}\right\rVert=1\} \) for any \( n \).
If \( S \) is a \( 3\times 3 \) matrix, \[ \begin{aligned} \mathbf{y}^TD\mathbf{y}&=\begin{bmatrix}y_1&y_2&y_3\end{bmatrix}\begin{bmatrix}\lambda_1&0&0\\0&\lambda_2&0\\0&0&\lambda_3\end{bmatrix}\begin{bmatrix}y_1\\y_2\\y_3\end{bmatrix}\\ &=\lambda_1y_1^2+\lambda_2y_2^2+\lambda_3y_3^2 \end{aligned} \]
Since \[ \lambda_1\geq\lambda_2\geq\lambda_3 \]
We have, \[ \begin{aligned} \mathbf{y}^TD\mathbf{y}&=\lambda_1y_1^2+\lambda_2y_2^2+\lambda_3y_3^2\\ &\leq\lambda_1y_1^2+\lambda_1y_2^2+\lambda_1y_3^2\\ &=\lambda_1(y_1^2+y_2^2+y_3^2)\\ &=\lambda_1\left\lVert\mathbf{y}\right\rVert^2\\ &=\lambda_1 \end{aligned} \]
For \( \mathbf{y}=\begin{bmatrix}1\\0\\0\end{bmatrix} \), \[ \begin{aligned} \mathbf{y}^TD\mathbf{y}&=\begin{bmatrix}1&0&0\end{bmatrix}\begin{bmatrix}\lambda_1&0&0\\0&\lambda_2&0\\0&0&\lambda_3\end{bmatrix}\begin{bmatrix}1\\0\\0\end{bmatrix}\\ &=\lambda_1\\ \end{aligned} \]
Thus, \( \mathbf{y}^TD\mathbf{y}=\mathrm{Max}\{\mathbf{y}^TD\mathbf{y} \, | \, \left\lVert\mathbf{y}\right\rVert=1\}=\lambda_1 \) when \( \mathbf{y}=\mathbf{e}_1 \).
For \( \mathbf{y}=\mathbf{e}_1 \), \[ \begin{aligned} \mathbf{w}&=P\mathbf{y}\\ &=\begin{bmatrix}\mathbf{u}_1&\mathbf{u}_2&\mathbf{u}_3\end{bmatrix}\begin{bmatrix}1\\0\\0\end{bmatrix}\\ &=\mathbf{u}_1 \end{aligned} \]
Thus, the unit \( \mathbf{w} \) that maximizes \( \mathbf{w}^TS\mathbf{w} \) is the unit eigenvector of \( S \) corresponding to its largest eigenvalue.
SVD answers the question of how much a matrix “stretches” a unit vector \( \mathbf{x} \).
Note: the same unit vector \( \mathbf{x} \) that maximizes \( \left\lVert A\mathbf{x}\right\rVert \) also maximizes
\[ \begin{aligned} \left\lVert A\mathbf{x}\right\rVert^2&=(A\mathbf{x})^T(A\mathbf{x})\\ &=\mathbf{x}^TA^TA\mathbf{x}\\ &=\mathbf{x}^T(A^TA)\mathbf{x} \end{aligned} \]
We know from the previous theorem that the maximum of \( \mathbf{x}^T(A^TA)\mathbf{x} \) is given by the greatest eigenvalue of \( A^TA \).
Maximum occurs when \( \mathbf{x}=\mathbf{v}_1 \) where \( A^TA\mathbf{v}_1=\lambda_1\mathbf{v}_1 \)
\( A \) stretches \( \mathbf{v}_1 \) the most, by the amount \( \sigma_1=\sqrt{\lambda_1} \), and then stretches in mutually orthogonal directions by the remaining singular values \( \sigma_i=\sqrt{\lambda_i} \), the square roots of the eigenvalues of \( A^TA \) in decreasing order.
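A small numerical check in R (a sketch; \( A \) here is an arbitrary random matrix, not the data):
# Sketch: A stretches the first right singular vector the most, by sigma_1 = sqrt(lambda_1).
set.seed(3)
A  <- matrix(rnorm(12), nrow = 4, ncol = 3)
sv <- svd(A)
sqrt(sum((A %*% sv$v[, 1])^2))                         # ||A v_1|| ...
sv$d[1]                                                # ... equals sigma_1 ...
sqrt(eigen(crossprod(A), symmetric = TRUE)$values[1])  # ... = sqrt of the largest eigenvalue of A^T A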
Detail: verifying that \( \frac{1}{n-1}BB^T=S \): \[ \begin{aligned} \frac{1}{n-1}BB^T&=\frac{1}{n-1}\begin{bmatrix}\mathbf{x}_1-\bar{\mathbf{x}}&\mathbf{x}_2-\bar{\mathbf{x}}&\cdots&\mathbf{x}_n-\bar{\mathbf{x}}\end{bmatrix}\begin{bmatrix}(\mathbf{x}_1-\bar{\mathbf{x}})^T\\(\mathbf{x}_2-\bar{\mathbf{x}})^T\\\vdots\\(\mathbf{x}_n-\bar{\mathbf{x}})^T\end{bmatrix}\\ &=\frac{1}{n-1}\sum_{i=1}^n\begin{bmatrix}x_{1i}-\bar{x}_1\\\vdots\\x_{pi}-\bar{x}_p\end{bmatrix}\begin{bmatrix}x_{1i}-\bar{x}_1&\cdots&x_{pi}-\bar{x}_p\end{bmatrix}\\ &=\frac{1}{n-1}\sum_{i=1}^n(\mathbf{x}_i-\mathbf{\bar{x}})(\mathbf{x}_i-\mathbf{\bar{x}})^T\\ &=S \end{aligned} \]
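The identity can also be checked numerically; a sketch in R comparing the matrix product with the explicit sum of rank-one outer products (illustrative random data):
# Sketch: B %*% t(B) equals the sum of the outer products (x_i - xbar)(x_i - xbar)^T.
set.seed(1)
W <- matrix(rnorm(3 * 50), nrow = 3)
B <- W - rowMeans(W)
S_sum <- Reduce(`+`, lapply(seq_len(ncol(B)), function(i) tcrossprod(B[, i]))) / (ncol(W) - 1)
all.equal(S_sum, (B %*% t(B)) / (ncol(W) - 1))   # TRUE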