Principal components analysis

Indirect gradient analysis. PCA summarizes, in a low-dimensional space, the variance in a multivariate scatter of points. It provides an overview of linear relationships between your objects and variables.

Uses:

  • Data exploration - trends, groupings, key variables, outliers
  • Many variables, few objects - collapse into a few components for further analyses

The first PC is constructed in the direction of maximum scatter/variability. Each subsequent PC is constructed in the same manner, in the direction of maximum remaining variability, and is orthogonal to (uncorrelated with) the other PCs. Original variables are rescaled and can be represented in a biplot. If a few PCs capture most (70-90%) of the variance in the original scatter, PCA represents the variability well.
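The construction described above can be sketched with an eigen decomposition of the covariance matrix. This is a minimal illustration on the numeric columns of `iris`, not necessarily how `princomp` computes its result internally; the object names (`X`, `eig`, `prop_var`, `pc1`) are chosen here for illustration.

```r
# Numeric columns of iris (Species dropped)
X <- as.matrix(iris[, 1:4])

# Eigen decomposition of the sample covariance matrix
eig <- eigen(cov(X))

# Eigenvectors are the PC directions; their cross-product is
# (numerically) the identity matrix, i.e. they are orthogonal
round(crossprod(eig$vectors), 10)

# Eigenvalues are the variances along each PC, largest first
prop_var <- eig$values / sum(eig$values)
round(prop_var, 4)

# Projecting the centered data on the first direction gives a
# variable whose variance equals the first (largest) eigenvalue
pc1 <- scale(X, center = TRUE, scale = FALSE) %*% eig$vectors[, 1]
var(as.numeric(pc1))
eig$values[1]
```

Note that `cov()` divides by n - 1 whereas `princomp()` divides by n, so the standard deviations reported by `princomp()` are slightly smaller; the proportions of variance are the same.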


Get data:

data(iris)  
head(iris)  
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Look at relationships between variables:

plot(iris)

PCA returns a set of eigenvalues, principal components (PCs), and loadings (coefficients relating the variables to the PCs). Eigenvalues describe the variability in the data, PCs describe the structure of the observations, and loadings indicate relationships between variables and their associations with the PCs.

library(dplyr)  # provides %>% and select()
iris.pca <- princomp(iris %>% select(-Species))
summary(iris.pca)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3      Comp.4
## Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
## Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
## Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000
# loadings (eigenvectors)
iris.pca$loadings
## 
## Loadings:
##              Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length  0.361 -0.657 -0.582  0.315
## Sepal.Width         -0.730  0.598 -0.320
## Petal.Length  0.857  0.173        -0.480
## Petal.Width   0.358         0.546  0.754
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00
#means
iris.pca$center
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
# scores (coordinates of the observations on the PCs)
head(iris.pca$scores)
##         Comp.1     Comp.2      Comp.3       Comp.4
## [1,] -2.684126 -0.3193972 -0.02791483  0.002262437
## [2,] -2.714142  0.1770012 -0.21046427  0.099026550
## [3,] -2.888991  0.1449494  0.01790026  0.019968390
## [4,] -2.745343  0.3182990  0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545  0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304  0.16867766 -0.024200858
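To make the link between the pieces of the output concrete, the sketch below rebuilds the scores by hand: center the data with `$center`, then multiply by the loadings. It uses base R indexing (`iris[, 1:4]`) instead of `select()` so that it runs on its own; `Xc` and `manual_scores` are names introduced here for illustration.

```r
# Same PCA as above, on the four numeric columns
X <- as.matrix(iris[, 1:4])
pca <- princomp(X)

# Center each variable by the means stored in the PCA object
Xc <- sweep(X, 2, pca$center)

# Scores are the centered data projected on the loadings;
# unclass() drops the "loadings" class to get a plain matrix
manual_scores <- Xc %*% unclass(pca$loadings)

# Largest absolute difference from the stored scores (close to zero)
max(abs(manual_scores - pca$scores))
```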

Results:

  • Total variance
  • Proportion of variance attributed to each PC
  • Scores - new coordinates in the space described by the PC axes. In a biplot, variable vectors run from the origin to the variable scores.
  • Variable loadings - how much each variable ‘contributed’ to PC (absolute value)
  • Proportion of object’s total variance captured by PC
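Two of the quantities in this list are not printed by `summary()` and can be computed from the fitted object. The sketch below is one way to do it, assuming the usual definition of an object's quality of representation (squared score on a PC divided by the object's squared distance from the centroid); `total_var` and `cos2` are names introduced here.

```r
X <- as.matrix(iris[, 1:4])
pca <- princomp(X)

# Total variance: the sum of the eigenvalues (squared standard deviations)
total_var <- sum(pca$sdev^2)
total_var

# Proportion of each object's total variance captured by each PC:
# squared score on that PC over the sum of its squared scores
cos2 <- pca$scores^2 / rowSums(pca$scores^2)
head(round(cos2, 3))
```

Each row of `cos2` sums to 1; objects with a low value on the first two PCs are poorly represented in a two-dimensional plot.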

Plotting

Reading a biplot

A biplot informs about distances between objects (type I scaling) and correlative relationships between variables (type II scaling). For more, see Legendre and Legendre (1998).
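A minimal sketch of the two scalings with the base `biplot()` method for `princomp` objects follows. Mapping `scale = 0` to type I and `scale = 1` to type II is an interpretation of how that argument rescales scores and loadings, not terminology used by the function itself.

```r
pca <- princomp(iris[, 1:4])

# scale = 0: scores and loadings are plotted unscaled, so distances
# between objects approximate their Euclidean distances (type I)
biplot(pca, scale = 0)

# scale = 1 (the default): scores are standardized and loadings are
# stretched by the component standard deviations, so angles between
# variable vectors reflect their correlations (type II)
biplot(pca, scale = 1)
```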

Assumptions

  • variables are linearly related
  • multivariate normal distribution
  • type II scaling - covariances/correlations are linear
  • can be represented in Euclidean space; non-Euclidean dissimilarity measures need appropriate transformations

Troubleshooting

Horseshoe effect - variables have unimodal rather than linear relationships; consider correspondence analysis.
Many zeros - consider a Hellinger transformation to linearise relationships, or remove variables with concentrated zeros.
Origin of biplot is not zero - it is the center of the standardized variation captured.
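The Hellinger transformation mentioned above is the square root of each row's relative abundances. The sketch below applies it in base R to a small made-up count matrix (`counts`, six hypothetical sites by three hypothetical species, invented purely for illustration); packages such as vegan offer the same transformation through `decostand()`.

```r
# Hypothetical count data with many zeros: rows are sites, columns are species
counts <- matrix(c(10, 0,  0,
                    0, 8,  0,
                    0, 0, 12,
                    3, 0,  1,
                    0, 5,  5,
                    7, 2,  0),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(paste0("site", 1:6), paste0("sp", 1:3)))

# Hellinger transformation: divide each row by its total, then take
# the square root (assumes no site has a total of zero)
hell <- sqrt(counts / rowSums(counts))

# PCA on the transformed data
pca_hell <- princomp(hell)
summary(pca_hell)
```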

Rotation = loadings = coefficients of linear combinations of continuous variables

plot(iris.pca, type='l')
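The scree plot can be complemented by a numeric check of how many PCs are needed to reach a chosen share of the variance. The 90% cut-off below is an assumption taken from the upper end of the 70-90% range given earlier, and `threshold` and `n_keep` are names introduced here.

```r
pca <- princomp(iris[, 1:4])

# Proportion and cumulative proportion of variance per component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(prop_var)
round(cum_var, 4)

# Smallest number of components whose cumulative share reaches the cut-off
threshold <- 0.90  # assumed cut-off; adjust to your own criterion
n_keep <- which(cum_var >= threshold)[1]
n_keep
```

For `iris` the first component alone already exceeds the cut-off, which matches the steep drop after the first point of the scree plot.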