class: center, middle, inverse, title-slide .title[ # Multivariate Analysis ] .subtitle[ ## Introduction & PCA ] .author[ ### Stéphane Dray ] .date[ ### 2024-01-25 ] --- $$ %center text/code \newcommand{\bcenter}{\begin{center}} \newcommand{\ecenter}{\end{center}} $$ $$ \newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}} \newcommand{\tr}{\hspace{-0.05cm}^{\top}\hspace{-0.05cm}}% transpose d'une matrice \newcommand{\mb}[1]{\mathbf{#1}} \newcommand{\pt}{\mathsmaller \bullet}% petit point pour les indices $$ $$ \newcommand{\triplet}[3]{ % pour ecrire un triplet dans le texte $\left ( #1, #2, #3 \right )$ } $$ $$ \newcommand{\sqnorm}[2]{ %norme au carré \lVert #1 \rVert^2_{#2} } $$ $$ \newcommand{\norm}[2]{ %norme \lVert #1 \rVert_{#2} } $$ --- class: center, inverse # Introduction --- ## Multivariate data We consider cases where several variables are measured for a number of individuals .pull-left[ <img src="img/onetable.png" width="156" style="display: block; margin: auto;" /> ] .pull-right[ `\(n\)` statistical units x `\(p\)` variables Examples in ecology * individuals x traits * species x traits * sites x species * sites x environment ] ```r library(ade4) data(doubs) names(doubs$env) ``` ``` ## [1] "dfs" "alt" "slo" "flo" "pH" "har" "pho" "nit" "amm" "oxy" "bdo" ``` --- ## Univariate analysis .pull-left[ ```r summary(doubs$env$slo) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.099 1.831 2.565 2.758 3.390 6.176 ``` ```r mean(doubs$env$slo) ``` ``` ## [1] 2.757733 ``` ```r var(doubs$env$slo) ``` ``` ## [1] 1.16724 ``` ] .pull-right[ ```r hist(doubs$env$slo) ``` <img src="fig/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ] --- ## Bivariate analysis .pull-left[ ```r cov(doubs$env$slo, doubs$env$nit) ``` ``` ## [1] -93.28123 ``` ```r cor(doubs$env$slo, doubs$env$nit) ``` ``` ## [1] -0.6108798 ``` ] .pull-right[ ```r plot(doubs$env$slo, doubs$env$nit) ``` <img src="fig/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> ] --- ## Pairwise analysis ```r pairs(doubs$env) ``` <img src="fig/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> --- ## Pairwise analysis ```r round(cor(doubs$env), 2) ``` ``` ## dfs alt slo flo pH har pho nit amm oxy bdo ## dfs 1.00 -0.94 -0.76 0.95 0.00 0.70 0.48 0.75 0.41 -0.51 0.40 ## alt -0.94 1.00 0.76 -0.87 -0.04 -0.74 -0.44 -0.76 -0.38 0.36 -0.34 ## slo -0.76 0.76 1.00 -0.72 -0.27 -0.65 -0.40 -0.61 -0.35 0.46 -0.32 ## flo 0.95 -0.87 -0.72 1.00 0.02 0.70 0.39 0.61 0.29 -0.36 0.25 ## pH 0.00 -0.04 -0.27 0.02 1.00 0.09 -0.08 -0.05 -0.12 0.18 -0.15 ## har 0.70 -0.74 -0.65 0.70 0.09 1.00 0.36 0.51 0.29 -0.38 0.34 ## pho 0.48 -0.44 -0.40 0.39 -0.08 0.36 1.00 0.80 0.97 -0.72 0.89 ## nit 0.75 -0.76 -0.61 0.61 -0.05 0.51 0.80 1.00 0.80 -0.63 0.64 ## amm 0.41 -0.38 -0.35 0.29 -0.12 0.29 0.97 0.80 1.00 -0.72 0.89 ## oxy -0.51 0.36 0.46 -0.36 0.18 -0.38 -0.72 -0.63 -0.72 1.00 -0.84 ## bdo 0.40 -0.34 -0.32 0.25 -0.15 0.34 0.89 0.64 0.89 -0.84 1.00 ``` --- ## Multivariate analysis Avoid pairwise analysis to provide a global summary of the full data set. The objective is to answer simultaneously both questions: - what are the main similarities and differences between the individuals ? - what are the main relationships between the variables ? Multivariate analysis allows to: - summarize linear relationships between variables - identify structures among individuals - reduce the number of variables before new analyses - replace original variables by new synthetic ones --- class: center, inverse # Geometric perspective --- # Two geometric views <img src="img/viewpoints.png" width="640" style="display: block; margin: auto;" /> .pull-left[ In `\(\mathbb{R}^p\)`, what are the main similarities and differences between the individuals ? ] .pull-right[ In `\(\mathbb{R}^n\)`, what are the main relationships between the variables ? ] --- ## A 3D example (space of individuals) .pull-left[ ```r tab <- doubs$env[, c(3, 8, 10)] color <- rainbow(30) pairs(tab, col = color) ``` <img src="fig/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> ] .pull-right[ ```r source("../../R/3D-utils.R") plot3d(tab, type = "s", col = color) ```
] --- ## Change the viewpoint <img src="img/chameau.png" width="1767" style="display: block; margin: auto;" /> --- ## Change the viewpoint
.center[ What is the best viewpoint? ] --- ## Scale the data ``` ## Warning: 'rgl.bbox' is deprecated. ## Use 'bbox3d' instead. ## See help("Deprecated") ```
--- ## Axes of main variation Axes of main variations describe directions where the projections of individuals are the most scattered ``` ## Warning: 'rgl.bbox' is deprecated. ## Use 'bbox3d' instead. ## See help("Deprecated") ```
--- ## New system of axes Lastly, we can use the principal axes as a new system of coordinates and represent the data in this new system ``` ## Warning: 'rgl.bbox' is deprecated. ## Use 'bbox3d' instead. ## See help("Deprecated") ```
--- ## Dimension reduction Only the first few axes (axes of main variation) can be used to describe the main structures of the data. Very useful when a big number of variables/individuals are considered. ```r s.label(pca$li) ``` <img src="fig/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> --- ## Space of variables The same approach can be used to search for the best representation of variables. .pull-left[ ``` ## Warning: 'rgl.bbox' is deprecated. ## Use 'bbox3d' instead. ## See help("Deprecated") ```
] .pull-right[ <img src="fig/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" /> ] --- ## Multivariate analysis in short - Data transformation (e.g., centering, scaling) - Best viewpoint (rotation, projection) - Summarize (dimension reduction) Multivariate methods seek for small dimension hyperspaces (few axes) where the representations of individuals and variables are as close as possible to the original ones. To achieve these goals, we need to define some ways to compute distances: * `\(\mb{Q}\)`, a `\(p \times p\)` positive symmetric matrix, used as an inner product in `\(\mathbb{R}^p\)` and thus allows to measure distances between the `\(n\)` individuals * `\(\mb{D}\)`, a `\(n \times n\)` positive symmetric matrix, used as an inner product in `\(\mathbb{R}^n\)` and thus allows to measure relationships between the `\(p\)` variables. All methods consider these different steps but differ in the transformation of the data ( `\(\mb{X}\)` ), metrics ( `\(\mb{Q}\)` and `\(\mb{D}\)` ) and thus on the mathematical criteria that are maximized --- class: center, inverse # Principal component analysis --- ## Data - PCA can be applied when the data table contains only quantitative variables - Data are usually centered and can be scaled - Two metrics (scalar products) are defined to compute distances in the spaces of variables and individuals * `\(\mb{Q} = \mb{I}_p\)` allows to measure distances between the `\(n\)` individuals in `\(\mathbb{R}^p\)`. * `\(\mb{D} = \frac{1}{n} \mb{I}_n\)` allows to measure relationships between the `\(p\)` variables in `\(\mathbb{R}^n\)`. --- ## From geometry to statistics Several geometric operations translate into statistical concepts: * the mean is computed by a scalar product and corresponds to a Euclidean projection `$$\textrm{m}(\mathbf{x})= \left\langle \mathbf{x}|\mathbf{1}_n\right\rangle_{\frac{1}{n}\mathbf{I}_n}$$` * the variance is equal to the squared norm of the centred vector `\(\mathbf{x}^*\)` `$$\textrm{var}(\mathbf{x})= \sqnorm{\mathbf{x} - \textrm{m}(\mathbf{x}) \mathbf{1}_n}{\frac{1}{n}\mathbf{I}_n} = \sqnorm{\mathbf{x}^*}{\frac{1}{n}\mathbf{I}_n}$$` * the covariance is a scalar product between two centred vectors `$$\textrm{cov}(\mathbf{x},\mathbf{y})=\left\langle \mathbf{x}^*|\mathbf{y}^*\right\rangle_{\frac{1}{n}\mathbf{I}_n}= \norm{\mathbf{x}}{\frac{1}{n}\mathbf{I}_n} \norm{\mathbf{y}}{\frac{1}{n}\mathbf{I}_n} \cos(\theta_{\mathbf{xy}})$$` * the correlation is a cosine of the angle formed by the two vectors `$$\textrm{cor}(\mathbf{x},\mathbf{y}) = \cos(\theta_{\mathbf{xy}})$$` --- ## Geometry for individuals <img src="fig/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" /> We aim to find a vector `\(\mb{a}\)` of `\(\mathbb{R}^p\)` maximizing the sum of the norms of the projections of the rows of `\(\mb{X}\)`: `$$\sum_{i=1}^n d_i \sqnorm{\mb{P}_{\mb{a}} \mb{X}_i}{\mb{Q}}$$` where `\(d_i\)` is the `\(i\)`-th diagonal element of `\(\mathbf{D}\)` and `\(\mathbf{P}_{\mb{a}}\)` is the projection operator onto the vector `\(\mb{a}\)` --- ## Projection on a (normed) vector The projection step is obtained by `$$\mathbf{P}_{\mb{a}} \mathbf{X}_i = \frac{ \left \langle \mathbf{a}|\mathbf{X}_i \right \rangle_{\mb{Q}} }{ \left \langle \mathbf{a} | \mathbf{a} \right \rangle_{\mb{Q}} } \mathbf{a}$$` If `\(\mathbf{a}\)` is `\(\mathbf{Q}\)`-normed ( `\(\mb{a}\tr\mb{Qa}=1\)` ), it simplifies to `$$\mathbf{P}_{\mb{a}} \mathbf{X}_i = \left \langle \mathbf{a}|\mathbf{X}_i \right \rangle_{\mb{Q}} \mathbf{a}$$` So that the maximized quantity can be rewritten as: `$$\sum_{i=1}^n d_i \sqnorm{ \mb{P}_{\mb{a}} \mb{X}_i}{\mb{Q}} = {\sum_{i=1}^n d_i \left \langle \mathbf{a}|\mathbf{X}_i \right \rangle^2_{\mb{Q}}} = \sqnorm{\mb{XQa}}{\mb{D}} = \textrm{var}(\mb{XQa})$$` --- ## Diagonalization We denote `\(\lambda\)` as the maximum possible value: `$$\lambda = \sqnorm{\mb{XQa}}{\mb{D}} = \mb{a}\tr\mb{QX}\tr\mb{DXQa}$$` As `\(\mb{a}\)` is `\(\mb{Q}\)`-normed, we can write: `$$(\mb{a}\tr\mb{Qa}) \lambda = \mb{a}\tr\mb{QX}\tr\mb{DXQa}$$` and it follows that: `$$\lambda \mb{a} = \mb{X}\tr\mb{DXQa}$$` This corresponds to matrix diagonalization. Hence, the best axis can be identified as the eigenvector ( `\(\mb{a}\)` ) associated to the highest eigenvalue ( `\(\lambda\)` ) of `\(\mb{X}\tr\mb{DXQ}\)` Other axes can be obtained and correspond to the next eigenvectors/eigenvalues. They maximize the same quantity but should be orthogonal to the previous ones. --- ## PCA in the space of individuals As `\(\mb{D} = \frac{1}{n} \mb{I}_n\)` and `\(\mb{Q} = \mb{I}_p\)`, we have: `$$\lambda \mb{a} = \frac{1}{n}\mb{X}\tr\mb{Xa}$$` * If data have been centered, this corresponds to the diagonalization of the covariance matrix. * If data have been centered and scaled, this corresponds to the diagonalization of the correlation matrix. * In both cases, we have the following interpretations: - geometric: PCA seeks for an axis ( `\(\mb{a}\)` ), called the **principal axis**, on which individuals are projected ( `\(\mb{XQa}\)` ) so that the points are the most scattered ( `\(\sqnorm{\mb{XQa}}{\mb{D}}\)` ). - statistical: PCA seeks for coefficients for variables ( `\(\mb{a}\)` ) to compute a score for individuals ( `\(\mb{XQa}\)` ) with maximal variance ( `\(\textrm{var}(\mb{XQa})\)` ). --- ## Summary <img src="fig/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" /> The last plot corresponds to the PCA factorial map and is usually called a **distance biplot** as it best preserves the distances between individuals. - individuals are represented by `\(\mb{XQa}\)` - variables are represented by `\(\mb{a}\)` - if more variables (here 2) are considered, only the first few axes can be kept so that dimension reduction is performed. The first axes preserves the most important part of information (maximized projected inertia/variance). --- ## Space of variables In `\(\mathbb{R}^n\)`, we can follow the same rationale for variables than for individuals in `\(\mathbb{R}^p\)`. <img src="fig/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" /> --- ## PCA in the space of individuals `$$\lambda \mb{b} = \frac{1}{n}\mb{XX}\tr\mb{b}$$` * geometric interpretation: PCA seeks for an axis ( `\(\mb{b}\)` ), called the **principal component** on which variables are projected ( `\(\mb{X}\tr\mb{Db}\)` ) so that they are the most scattered ( `\(\sqnorm{\mb{X}\tr\mb{Db}}{\mb{Q}}\)` ) or collinear. * statistical interpretation: - If data have been centered, PCA seeks for a principal component ( `\(\mb{b}\)` ). Variables are represented by their covariances with the component ( `\(\textrm{cov}(\mathbf{x}_j,\mathbf{b})\)` ) whose sum of squares is maximized ( `\(\sum_{j=1}^{p}\textrm{cov}^2(\mathbf{x}_j,\mathbf{b})\)` ). - If data have been scaled, PCA seeks for a principal component ( `\(\mb{b}\)` ). Variables are represented by their correlations with the component ( `\(\textrm{cor}(\mathbf{x}_j,\mathbf{b})\)` ) whose sum of squares is maximized ( `\(\sum_{j=1}^{p}\textrm{cor}^2(\mathbf{x}_j,\mathbf{b})\)` ). --- ## PCA with ade4 ```r args(dudi.pca) ``` ``` ## function (df, row.w = rep(1, nrow(df))/nrow(df), col.w = rep(1, ## ncol(df)), center = TRUE, scale = TRUE, scannf = TRUE, nf = 2) ## NULL ``` * `df` is a `data.frame` with the data * `row.w` and `col.w` are optional vectors of weights * `center` and `scale` define the standardization of the data * `scannf` and `nf` allow to set the number of dimensions to interpret ```r pca.s <- dudi.pca(doubs$env, scannf = FALSE) ``` --- ## Eigenvalues ```r library(adegraphics) screeplot(pca.s) ``` <img src="fig/unnamed-chunk-23-1.svg" style="display: block; margin: auto;" /> --- ## Inertia statistics ```r summary(pca.s) ``` ``` ## Class: pca dudi ## Call: dudi.pca(df = doubs$env, scannf = FALSE) ## ## Total inertia: 11 ## ## Eigenvalues: ## Ax1 Ax2 Ax3 Ax4 Ax5 ## 6.3216 2.2316 1.0042 0.5007 0.3752 ## ## Projected inertia (%): ## Ax1 Ax2 Ax3 Ax4 Ax5 ## 57.469 20.287 9.129 4.552 3.411 ## ## Cumulative projected inertia (%): ## Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5 ## 57.47 77.76 86.89 91.44 94.85 ## ## (Only 5 dimensions (out of 11) are shown) ``` --- ## Graphical representations As we have *two* analyses (individuals and variables spaces), two representations can be defined: * **distance biplot** where `\(\mb{A}\)` and `\(\mathbf{XQA}\)` are superimposed. * **correlation biplot** where `\(\mb{B}\)` and `\(\mb{X}\tr\mb{DB}\)` are superimposed. .pull-left[ ```r biplot(pca.s) ``` <img src="fig/unnamed-chunk-25-1.svg" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ ```r biplot(pca.s, permute = TRUE) ``` <img src="fig/unnamed-chunk-26-1.svg" width="65%" style="display: block; margin: auto;" /> ] --- ## To scale or not to scale Scaling should be performed when we do not want that differences in variances affect the results ```r pca.c <- dudi.pca(doubs$env, scannf = FALSE, scale = FALSE) pca.s <- dudi.pca(doubs$env, scannf = FALSE, scale = TRUE) ``` .pull-left[ <img src="fig/unnamed-chunk-28-1.svg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig/unnamed-chunk-29-1.svg" width="70%" style="display: block; margin: auto;" /> ] In this case, we must scale the data as differences in variances are mainly due to differences in units --- class: center, inverse # Conclusions --- ## PCA as a particular case in the duality diagram theory `$$\mathbf{X} \mathbf{Q} \mathbf{X}\tr \mathbf{D} \mathbf{B} = \mathbf{B} \bm{\Lambda}$$` `$$\mathbf{X}\tr\mathbf{D}\mathbf{X} \mathbf{Q} \mathbf{A} = \mathbf{A} \bm{\Lambda}$$` * `\(\mathbf{B}\)` contains the principal components ( `\(\mathbf{B}\tr\mathbf{D}\mathbf{B}=\mathbf{I}_r\)` ). * `\(\mathbf{A}\)` contains the principal axis ( `\(\mathbf{A}\tr\mathbf{Q}\mathbf{A}=\mathbf{I}_r\)` ). * `\(\mb{L}=\mb{X}\mb{Q}\mb{A}\)` contains the row scores (projection of the rows of `\(\mb{X}\)` onto the principal axes) * `\(\mb{C}=\mb{X}\tr\mb{D}\mb{B}\)` contains the column scores (projection of the columns of `\(\mb{X}\)` onto the principal components) Maximization of: `\(Q(\mathbf{a})=\mathbf{a}\tr\mathbf{Q}\tr\mathbf{X}\tr \mathbf{DXQa} = \lambda\)` and `\(S(\mathbf{b})=\mathbf{b}\tr\mathbf{D}\tr\mathbf{XQX}\tr\mathbf{Db}=\lambda\)` `$$\left\langle {\mathbf{XQa}} | \mathbf{k} \right\rangle _\mathbf{D}=\left\langle {\mathbf{X}^t\mathbf{Db}} | \mathbf{a} \right\rangle _\mathbf{Q} = \sqrt{\lambda}$$` --- ## Available methods Different definitions of a statistical triplet correspond to different methods. Several are available in `ade4` | Function name | Analysis name | |----------------|---------------------------------------| | dudi.pca | Principal component analysis | | dudi.pco | Principal coordinate analysis | | dudi.coa | Correspondence analysis | | dudi.acm | Multiple correspondence analysis | | dudi.dec | Decentered correspondence analysis | | dudi.fca | Fuzzy correspondence analysis | | dudi.fpca | Fuzzy PCA | | dudi.mix | Mixed nalysis | | dudi.hillsmith | Hill-Smith analysis | | dudi.nsc | Non-symmetric correspondence analysis |