Indirect gradient analysis. PCA summarizes, in a low-dimensional space, the variance in a multivariate scatter of points. It provides an overview of linear relationships between your objects and variables.
The first PC is constructed in the direction of maximum scatter/variability. Subsequent PCs are constructed in the same manner, each orthogonal to (uncorrelated with) the other PCs. The original variables are rescaled and can be represented in a biplot. If a few PCs capture most (70-90%) of the variance in the original scatter, the PCA represents the variability well.
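The construction described above can be sketched by hand. This is a minimal illustration, assuming the four numeric columns of the built-in iris data and a covariance-based (unstandardized) PCA; the object names (X, Xc, e, pc.scores) are chosen here for illustration only.

```r
# Minimal sketch: PCA via eigen-decomposition of the covariance matrix
X  <- as.matrix(iris[, 1:4])                   # numeric variables only
Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each variable on its mean
e  <- eigen(cov(Xc))                           # eigenvalues = variance along each PC
prop.var  <- e$values / sum(e$values)          # share of total variance per PC
pc.scores <- Xc %*% e$vectors                  # observations projected onto the PCs
round(prop.var, 4)
round(cor(pc.scores), 10)                      # off-diagonals are zero: PCs are uncorrelated
```

The eigenvalues come back in decreasing order, which is why the first PC carries the most variance.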
Get data:
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Look at relationships between variables:
plot(iris)
PCA returns a set of eigenvalues, principal components (PCs), and loadings (the coefficients relating the original variables to the PCs). Eigenvalues describe how much of the data variability each PC captures, PC scores show the structure of the observations, and loadings indicate relationships between variables and their associations with the PCs.
library(dplyr) # provides %>% and select()
iris.pca<-princomp(iris%>%select(-Species))
summary(iris.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 2.0494032 0.49097143 0.27872586 0.153870700
## Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
## Cumulative Proportion 0.9246187 0.97768521 0.99478782 1.000000000
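The rows of this summary can be reproduced from the component standard deviations. A small sketch, refitting the model with base R indexing so that the block runs on its own; pc.var and prop are illustrative names. Note that princomp() computes variances with divisor n rather than n - 1.

```r
# Sketch: rebuild "Proportion of Variance" and "Cumulative Proportion"
iris.pca <- princomp(iris[, 1:4])   # same fit as above, without dplyr
pc.var   <- iris.pca$sdev^2         # variance (eigenvalue) of each PC
prop     <- pc.var / sum(pc.var)    # proportion of total variance
round(rbind(proportion = prop, cumulative = cumsum(prop)), 4)
```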
#loadings (eigenvectors)
iris.pca$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.361 -0.657 -0.582 0.315
## Sepal.Width -0.730 0.598 -0.320
## Petal.Length 0.857 0.173 -0.480
## Petal.Width 0.358 0.546 0.754
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00
#means
iris.pca$center
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
#scores (coordinates of the observations on the PCs)
head(iris.pca$scores)
## Comp.1 Comp.2 Comp.3 Comp.4
## [1,] -2.684126 -0.3193972 -0.02791483 0.002262437
## [2,] -2.714142 0.1770012 -0.21046427 0.099026550
## [3,] -2.888991 0.1449494 0.01790026 0.019968390
## [4,] -2.745343 0.3182990 0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 0.16867766 -0.024200858
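The scores are the centred data multiplied by the loadings. The following sketch checks this, assuming the default covariance-based fit (cor = FALSE); Xc and manual are illustrative names.

```r
# Sketch: scores = (data - column means) %*% loadings
iris.pca <- princomp(iris[, 1:4])
Xc     <- sweep(as.matrix(iris[, 1:4]), 2, iris.pca$center)  # subtract variable means
manual <- Xc %*% unclass(iris.pca$loadings)                  # project onto the eigenvectors
all.equal(unname(manual), unname(iris.pca$scores))
```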
Biplots inform on distances between objects (scaling type I) or on correlative relationships between variables (scaling type II). For more, see Legendre and Legendre (1998).
Horseshoe effect - variables have unimodal rather than linear relationships; consider correspondence analysis.
Many zeros - consider a Hellinger transformation to linearise relationships, or remove variables dominated by zeros.
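A Hellinger transformation takes the square root of each row's relative abundances. A minimal base R sketch, using a small made-up abundance matrix Y (hypothetical data, not from this tutorial):

```r
# Sketch: Hellinger transformation before PCA on zero-heavy count data
# Y is a hypothetical site-by-species abundance matrix
Y <- matrix(c(10, 0, 0,  5,
               0, 8, 2,  0,
               3, 0, 0, 12),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("site", 1:3), paste0("sp", 1:4)))
Y.hel <- sqrt(Y / rowSums(Y))   # square root of row-wise proportions
rowSums(Y.hel^2)                # each row's squared values sum to 1
pca.hel <- prcomp(Y.hel)        # PCA on the transformed data
```

If the vegan package is installed, decostand(Y, method = "hellinger") should give the same transformation.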
Origin of biplot is not zero in the original units - it is the center (multivariate mean) of the data, because the variables are centered before the PCs are computed.
Rotation = loadings = coefficients of the linear combinations of the continuous variables that define each PC.
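"Rotation" is the name prcomp() uses for what princomp() calls loadings. A short sketch comparing the two fits on the iris measurements; the object names are illustrative, and the sign of any column may differ between the two functions.

```r
# Sketch: prcomp()$rotation and princomp()$loadings hold the same coefficients
p.princomp <- princomp(iris[, 1:4])
p.prcomp   <- prcomp(iris[, 1:4])
# Compare absolute values, since eigenvector signs are arbitrary
all.equal(abs(unclass(p.princomp$loadings)), abs(p.prcomp$rotation),
          check.attributes = FALSE)
# Standard deviations differ only by the divisor (n for princomp, n - 1 for prcomp)
n <- nrow(iris)
all.equal(unname(p.princomp$sdev) * sqrt(n / (n - 1)), p.prcomp$sdev)
```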
plot(iris.pca, type='l')
Plot of the variances (y-axis) associated with the PCs (x-axis). Use it to decide how many PCs to retain. Here the first two are enough: together they account for more than 95% of the variance (about 97.8%).
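The same decision can be made numerically. A sketch assuming a 95% cumulative-variance cut-off, which is an illustrative choice rather than a fixed rule; cum.prop, threshold and n.keep are names introduced here.

```r
# Sketch: smallest number of PCs whose cumulative variance reaches a threshold
iris.pca  <- princomp(iris[, 1:4])
cum.prop  <- cumsum(iris.pca$sdev^2) / sum(iris.pca$sdev^2)
threshold <- 0.95                               # assumed cut-off
n.keep    <- which(cum.prop >= threshold)[1]    # index of first PC reaching it
n.keep
```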
library(ggplot2)
library(dplyr) # provides %>% and mutate()
##extract data of interest
#scores for each observation, with species attached for colouring
dr<-as.data.frame(iris.pca$scores)%>%
  mutate(sp = iris$Species)
#loadings on the first two PCs, used to draw the variable arrows
load<-as.data.frame(iris.pca$loadings[,1:2])
#theme_mooney() is a custom theme assumed to be defined elsewhere; swap in theme_bw() if it is unavailable
ggplot()+
  geom_point(data=dr, aes(x=Comp.1, y=Comp.2, color=sp, shape=sp),size=3)+
  geom_segment(data=load, aes(xend=Comp.1*3, yend=Comp.2*3, x=0, y=0), arrow=arrow(length = unit(.1, 'cm')))+
  stat_ellipse(data=dr, aes(x=Comp.1, y=Comp.2, fill=sp), geom='polygon',alpha=.41)+
  theme_mooney()+
  geom_text(data=load, aes(x=Comp.1*3, y=Comp.2*3, label=rownames(load)), vjust=0, hjust=1)+
  scale_shape_manual(values=c(17,21,3))